Machine learning

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” (Mitchell, 1997)

The goal of most learning tasks is to enable a model to infer a functional relationship between inputs and outputs relating to the task. Whether it be classifying images or learning to play chess, there is a model learning about a process through observations, typically in the form of (input, output) pairs. Most importantly, we want our model to perform well on data it’s never seen before, i.e. true data in the real world; classifying random images across the internet or playing good chess moves against various play styles.

To find such a model, we want to balance how well it performs on the data we have available (training data) and the complexity of the model. Models with greater complexity are typically more powerful and can fit more functionally complex data, but are much more sensitive to variance in the data and can perform terribly if overfitting has occurred. Simpler models often have the advantage of being more explainable (Occam’s razor) and are less likely to overfit on noise in the data, better modeling the underlying patterns.

Let’s take a look at how these principles can be generalized and used across a variety of model formulations via empirical and structural risk minimization.

Setup

Suppose we have a dataset of training examples

$D = \{(x_1,y_1),\dots,(x_n,y_n)\}$

and our goal is to learn some functional relationship

$y = h(x), h:\mathcal{X} \rightarrow \mathcal{Y}$

where $h$ is our hypothesis function, mapping from the input space X to the output space $Y$. Here $h$ refers to the whole model function, whether it be classification or regression. $h$ represents a perfect “dream” function that always gives the correct output for any input.

Risk

Risk is a term heavily correlated with loss or cost. We define the risk associated with a hypothesis $h$ as the expected loss:

$R(h) = \mathbb{E} \, [ \, L(h(x),y) \, ] = \int{L(h(x),y) \, dP(x,y)}$

where $P(x, y)$ is the joint probability distribution over the input and output space from which we assume our training examples are sampled i.i.d, and $L(h(x), y)$ is the loss function. The goal of a learning algorithm is to find the hypothesis $h$ from the hypothesis space $H$ which minimizes risk $R$. That is,

$h^* = \text{argmin}_{h \, \in \, \mathcal{H}}R(h)$

Note that risk here is often called true risk or test error.

Empirical Risk Minimization

We don’t know the joint probability over the data $P(x, y)$, so we cannot compute the risk of hypotheses directly. We can instead use empirical risk:

$R_{\text{emp}}(h) = \frac{1}{n}\sum_{i=1}^{n}{ L(h(x_i),y_i) }$

which is another term for training error. This is the best approximation of the true risk associated with $h$ with respect to our finite samples from $P(x, y)$.

Structural Risk Minimization

Strictly minimizing the training error without checks on model complexity often lead to extreme overfitting and thus poor performance on test data (i.e. large true risk). Structural risk minimization incorporates regularization into the formulation to help penalize higher model complexity, reducing the chance we overfit on training data. Structural risk is defined as follows:

$R_{\text{srm}}(h) = \frac{1}{n}\sum_{i=1}^{n}{ L(h(x_i),y_i) } \, + \, \lambda R(w)$

where $R$ is a regularizer that penalizes large model complexity, $w$ is the vector of model parameters, and $\lambda$ is a hyperparameter used to control the trade-off between training loss and regularization. Minimizing this structural risk is theoretically better than simply minimizing the empirical risk, leading to a lower true risk. Structural risk minimization is the generalized problem that most classical machine learning models attempt to solve, differing primarily by their choice of loss function and regularizer.