Solving hard real-world problems with machine learning demands advanced algorithmic designs. These designs should be based on a solid understanding of the principles that determine the behavior of learning algorithms. Statistical learning theory provides mathematical tools to develop such an understanding. It seeks answers to fundamental questions about the nature of learning from data such as
- Under which conditions can a learning algorithm generalize from the training data to new data?
- How can we measure the complexity of a learning task?
- How can we control the complexity of a learning algorithm?
- How can we estimate the generalization performance of a learning algorithm?
Based on what we covered in the probability theory chapter, let us start by investigating the last question.
Generalization bounds
Remember the basic definitions below.
Definition 1 (Generalization error) Given a hypothesis $h \in \mathcal{H}$, a loss function $\ell: Y \times Y \rightarrow [0,1]$, and a data distribution $x,y \sim D$, the generalization error of $h$ is defined as
$R(h) = E_{x,y \sim D}[\ell(h(x),y)]$.
Definition 2 (Empirical error) Given a hypothesis $h \in \mathcal{H}$, a loss function $\ell: Y \times Y \rightarrow [0,1]$, and a data set $S = \{(x_1,y_1),...,(x_m,y_m)\}$, the empirical error of $h$ is defined as
$\widehat{R}_S(h) = \frac{1}{m} \sum_{i=1}^m \ell(h(x_i),y_i)$.
Using the definitions above and the concentration inequalities we covered in the probability theory chapter, we can derive the following theorem.
Theorem 1 (Generalization bound) Let $\mathcal{H}$ be a finite hypothesis set. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the choice of $S \sim D^m$, for all $h \in \mathcal{H}$
$R(h) \leq \widehat{R}_S(h) + \sqrt{\frac{\log |\mathcal{H}| + \log \frac{1}{\delta}}{2m}}$.
Proof. Let $\mathcal{H} = \{h_1,...,h_{|\mathcal{H}|}\}$. Then we have
\begin{align*} P_{S \sim D^m}\left[\exists h_i \in \mathcal{H}: R(h_i) - \widehat{R}_S(h_i) > \epsilon \right ] &=P_{S \sim D^m}\left [\cup_{i=1}^{|\mathcal{H}|} \Big \{ R(h_i) - \widehat{R}_S(h_i) > \epsilon \Big \} \right ] \\ &\leq \sum_{i=1}^{|\mathcal{H}|} P_{S \sim D^m}\left[ R(h_i) - \widehat{R}_S(h_i) > \epsilon \right] \\ &\leq |\mathcal{H}| \exp(-2m\epsilon^2), \end{align*} where the first inequality follows from the union bound and the second from Hoeffding's inequality. Setting the right-hand side to $\delta$ and solving for $\epsilon$ yields the stated bound. $\square$
This inequality is called a generalization bound, because it bounds the generalization error of any hypothesis $h \in \mathcal{H}$ in terms of its empirical error. We can derive a number of interesting consequences from this bound:
- The bound is uniform in the sense that it holds simultaneously for all hypotheses in $\mathcal{H}$.
- The bound is independent of the data distribution $D$; it holds for any loss function $\ell$ bounded in $[0,1]$.
- The bound increases logarithmically with the size of the hypothesis set $|\mathcal{H}|$, that is, a richer hypothesis set is more likely to overfit.
- The bound decreases with the size of the training set $m$, that is, more training data makes overfitting less likely.
- For a fixed training set size $m$ and two different hypothesis sets $\mathcal{H}_1$ and $\mathcal{H}_2$, the bound prefers the smaller hypothesis set. This is known as Occam's razor principle, attributed to the 14th-century theologian William of Ockham: among competing hypotheses, the one with the fewest assumptions should be selected.
- The bound gives a Probably Approximately Correct (PAC) performance guarantee: with probability at least $1-\delta$ (probably), every hypothesis in $\mathcal{H}$ is approximately correct in the sense that its generalization error exceeds its empirical error by at most $\epsilon$.
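To make the bound concrete, below is a minimal Monte Carlo sanity check of Theorem 1, assuming a toy setup invented for illustration: a finite set of 50 threshold classifiers $h_t(x) = 1_{[x \geq t]}$ and a noisy threshold distribution. The theorem predicts that the event $R(h) > \widehat{R}_S(h) + \epsilon$ for some $h \in \mathcal{H}$ occurs with probability at most $\delta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite hypothesis set: thresholds h_t(x) = 1[x >= t], |H| = 50.
thresholds = np.linspace(0.0, 1.0, 50)
m, delta, n_trials = 200, 0.05, 1000

def sample(n):
    """Toy distribution: x ~ U[0,1], y = 1[x >= 0.5] with labels flipped w.p. 0.1."""
    x = rng.uniform(0, 1, n)
    base = (x >= 0.5).astype(int)
    flip = rng.uniform(0, 1, n) < 0.1
    return x, np.where(flip, 1 - base, base)

def risks(x, y):
    """0-1 risk of every threshold hypothesis on the sample (x, y)."""
    return np.array([np.mean((x >= t).astype(int) != y) for t in thresholds])

# Approximate the true risks R(h) on a large held-out sample.
true_risk = risks(*sample(200_000))

# Complexity term of Theorem 1: sqrt((log|H| + log(1/delta)) / (2m)).
bound = np.sqrt((np.log(len(thresholds)) + np.log(1 / delta)) / (2 * m))

violations = sum(np.max(true_risk - risks(*sample(m))) > bound for _ in range(n_trials))
print(f"bound = {bound:.3f}, violation rate = {violations / n_trials:.3f} (target <= {delta})")
```

The observed violation rate is typically far below $\delta$, a reminder that the union bound can be quite loose.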
Definition 3 (Bayes Error) Given a data distribution $x,y \sim D$, the Bayes error is defined as
$R^* = \inf_{h} R(h)$,
where the infimum is taken over all measurable functions $h: X \rightarrow Y$, not only the members of $\mathcal{H}$.
A hypothesis $h^*$ that achieves the Bayes error is called a Bayes (optimal) hypothesis.
For a binary classification problem with $Y = \{0,1\}$, the Bayes hypothesis would be
$h^*(x) = \arg \max_{y \in \{0,1\}} P[y|x]$.
Then the related Bayes error is given by
$R^* = E_x \left[ 1 - \max_{y \in \{0,1\}} P[y|x] \right] = E_x \left[ \min_{y \in \{0,1\}} P[y|x] \right]$.
The pointwise quantity $\min_{y \in \{0,1\}} P[y|x]$ is also denoted as the noise at $x$. For example, if $x \sim U[0,1]$ and $P[y=1|x] = x$, then $h^*(x) = 1_{[x \geq 1/2]}$ and $R^* = E_x[\min(x, 1-x)] = 1/4$.
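As a quick numerical check of this example, here is a short Monte Carlo sketch (the uniform-$x$ distribution with $P[y=1|x] = x$ is the illustrative one above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Illustrative distribution from the text: x ~ U[0,1], P[y=1|x] = x.
x = rng.uniform(0, 1, n)
y = (rng.uniform(0, 1, n) < x).astype(int)

# Bayes hypothesis h*(x) = 1[x >= 1/2] and its empirical error.
h_star = (x >= 0.5).astype(int)
print(f"estimated Bayes error: {np.mean(h_star != y):.4f}  (theory: 0.25)")
```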
Bias-Variance Tradeoff
Theorem 2 (No free lunch theorem) Let $A$ be any learning algorithm for binary classification under the 0-1 loss, and let $h = A(S)$ denote the hypothesis it returns on a training set $S$. Then there exists a distribution $D$ over $\mathcal{X} \times \{0,1\}$ such that $P_{(x,y) \sim D}[h(x) \neq y] \geq \frac{1}{2}$.
The take-home message of the theorem is that there is no universal learner that works well for all data distributions: for every learner there exists a task on which it fails. This is a very important result. It tells us that we need to make assumptions about the data distribution in order to design a good learner. These assumptions constitute the inductive bias of the learner. The no free lunch theorem only tells us that we need to introduce some degree of bias into the learner; it says nothing about the consequences of doing so. Introducing too much bias limits the ability of the learner to explain the training observations, leading to underfitting. Introducing too little bias leads to overfitting. The goal is to find the right balance between the two. This dilemma is known as the bias-variance tradeoff. Let us describe it in more formal terms.
Consider a regression problem where observations $(x,y) \sim D$. Our quantity of interest is $E[y|x]$.
We aim to estimate this quantity from a training set $S = \{(x_1,y_1),...,(x_m,y_m)\}$. For a given test-time sample $x$, our estimator of $E[y|x]$ is a hypothesis $h_S \in \mathcal{H}$ that minimizes the mean squared error loss. The expected squared error of the prediction made by this estimator, taken over the noisy label $y$ and the training sample $S$, is given by
\begin{align*} E_{S, y|x} \Big [ \Big(y - h_{S}(x) \Big)^2 \Big ]&=E_{S, y|x} \Big [ y^2 - 2 y h_{S}(x) + h_{S}(x)^2 \Big ]\\ &=E_{y|x} [ y^2] + E_{S} [ h_{S}(x)^2 ] - 2 E_{y|x} [ y ] E_{S} [ h_{S}(x) ] \\ &=E_{y|x} [ y ]^2 + Var_{y|x}[y] + E_{S} [ h_{S}(x) ]^2 + Var_{S} [ h_{S}(x) ]- 2 E_{y|x} [ y ]E_{S} [ h_{S}(x) ]\\ &= \Big ( \underbrace{E_{y|x} [ y ] -E_{S} [ h_{S}(x) ]}_{\text{Estimator~Bias}} \Big )^2 + \underbrace{Var_{S} [ h_{S}(x) ]}_{\text{Estimator~Variance}}+\underbrace{Var_{y|x}[y]}_{\text{Label~noise~variance}} \end{align*}
Note that $E_{S|x}[h_S(x)] = E_S[h_S(x)]$ and $Var_{S|x}[h_S(x)] = Var_S[h_S(x)]$ since $S$ is collected independently of $x$.
Let us take the k-nearest neighbors regressor as an example and choose the hypothesis (i.e. estimator) to be $h_S(x) = \dfrac{1}{k} \sum_{i=1}^k y_i$, where $(x_1, y_1),\cdots,(x_k, y_k) \in S$ are the $k$ nearest neighbors of $x$. Then, we have
\begin{align*} E_{S,y|x} \Big [ \Big(y - h_{S}(x) \Big)^2 \Big ]= \left (E_{y|x} [ y ] - E_{S} \left [ \dfrac{1}{k} \sum_{i=1}^k y_i \right ] \right )^2 + Var_{S} \left [ \dfrac{1}{k} \sum_{i=1}^k y_i \right ] + Var_{y|x}[y] \end{align*}
For a training set $S$ of size $m$, consider the following two extreme cases:
- $k=m$: The estimator is the sample mean of all training labels. Then $h_S$ will predict the sample mean regardless of $x$, causing high estimator bias and zero estimator variance. The model is likely to underfit because the sample mean does not capture the way $x$ affects $y$.
- $k=1$: The estimator is the nearest neighbor. Then $h_S$ will predict the label of the single nearest neighbor of $x$, causing low estimator bias and high estimator variance. The model is likely to overfit because the hypothesis incurs no error on $S$, while its error on the test set depends on how representative the particular training sample $S$ is. The simulation sketch below illustrates both regimes.
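The following simulation sketch estimates the squared estimator bias and the estimator variance at a single test point by repeatedly resampling the training set. The ground-truth function, noise level, and sample sizes are assumed values chosen for illustration; it uses scikit-learn's KNeighborsRegressor as the estimator:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # assumed ground-truth regression function E[y|x]
m, n_trials, x0 = 50, 2000, 0.3       # training size, number of resamples, test point
noise_std = 0.3                       # assumed label noise level

for k in (1, m):
    preds = np.empty(n_trials)
    for t in range(n_trials):
        # Draw a fresh training set S and fit the k-NN estimator h_S.
        x = rng.uniform(0, 1, m)
        y = f(x) + noise_std * rng.normal(size=m)
        model = KNeighborsRegressor(n_neighbors=k).fit(x[:, None], y)
        preds[t] = model.predict(np.array([[x0]]))[0]
    bias2 = (f(x0) - preds.mean()) ** 2   # squared estimator bias at x0
    var = preds.var()                     # estimator variance at x0
    print(f"k={k:3d}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

With these settings, $k=1$ typically shows a small squared bias but a variance on the order of the label noise, while $k=m$ shows the opposite pattern.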
Vapnik-Chervonenkis (VC) Dimension
The generalization bound developed above was for a finite hypothesis set. However, in practice we often deal with infinite hypothesis sets. In order to develop generalization bounds for infinite hypothesis sets, we need to define a measure of complexity for these sets. Vapnik and Chervonenkis developed such a measure based on a remarkable observation: in a binary classification problem, a hypothesis set can report only a finite number of distinct labelings on a finite data set, even though it contains an infinite number of hypotheses. The VC dimension builds on the notion of shattering, which in turn builds on the notion of the growth function. Let us describe these notions stepwise and arrive at the definition of the VC dimension.
Definition 4 (Growth function) Given a hypothesis set $\mathcal{H}$, the growth function of $\mathcal{H}$ is defined as
$\Pi_{\mathcal{H}}(m) = \max_{x_1,...,x_m \in \mathcal{X}} |\{(h(x_1),...,h(x_m)): h \in \mathcal{H}\}|$.
In words, the growth function counts the maximum number of distinct labelings that $\mathcal{H}$ can report on any data set with $m$ data points.
Definition 5 (Shattering) A hypothesis set $\mathcal{H}$ shatters a data set $S = \{x_1,...,x_m\}$ of size $m$ if $\mathcal{H}$ realizes all possible labelings on it, that is, $|\{(h(x_1),...,h(x_m)): h \in \mathcal{H}\}| = 2^m$. Equivalently, $\mathcal{H}$ shatters some data set of size $m$ if and only if $\Pi_{\mathcal{H}}(m) = 2^m$.
Definition 6 (VC dimension) The VC dimension of a hypothesis set $\mathcal{H}$ is the size of the largest data set that $\mathcal{H}$ can shatter:
$d_{VC}(\mathcal{H}) = \max \{m: \Pi_{\mathcal{H}}(m) = 2^m\}$.
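To see these definitions in action, here is a small enumeration sketch for the illustrative hypothesis set of 1-D threshold classifiers $h_t(x) = 1_{[x \geq t]}$: although the set is infinite, only $m+1$ of the $2^m$ possible labelings are realizable on $m$ distinct points, so it shatters sets of size 1 but not of size 2, giving $d_{VC} = 1$.

```python
import numpy as np

def distinct_labelings(points):
    """All distinct labelings that threshold classifiers produce on the given points."""
    pts = np.sort(points)
    # One representative threshold below, between, and above the sorted points.
    candidates = np.concatenate(([pts[0] - 1.0], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1.0]))
    return {tuple((points >= t).astype(int)) for t in candidates}

rng = np.random.default_rng(0)
for m in (1, 2, 3):
    points = rng.uniform(0, 1, m)
    labelings = distinct_labelings(points)
    print(f"m={m}: {len(labelings)} of {2**m} labelings -> shattered: {len(labelings) == 2**m}")
```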
Let us conclude the chapter by developing a generalization bound for infinite hypothesis sets. As the first step, we need to bound the growth function of a hypothesis set in terms of its VC dimension. The following lemma does exactly that.
Lemma 1 (Sauer's lemma) Let $\mathcal{H}$ be a hypothesis set with VC dimension $d_{VC}$. Then for all $m \geq d_{VC}$, we have
$\Pi_{\mathcal{H}}(m) \leq \sum_{i=0}^{d_{VC}} {m \choose i}$.
Corollary 1 Let $\mathcal{H}$ be a hypothesis set with VC dimension $d_{VC}$. Then for all $m \geq d_{VC}$, we have
$\Pi_{\mathcal{H}}(m) \leq \left(\frac{em}{d_{VC}}\right)^{d_{VC}}$.
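A quick numerical comparison (values chosen arbitrarily for illustration) shows that both Sauer's lemma and its relaxation in Corollary 1 grow only polynomially in $m$, in stark contrast to $2^m$:

```python
import math

def sauer(m, d):
    """Sauer's lemma bound: sum_{i=0}^{d} C(m, i)."""
    return sum(math.comb(m, i) for i in range(d + 1))

def corollary(m, d):
    """Corollary 1 bound: (e m / d)^d."""
    return (math.e * m / d) ** d

for m, d in [(10, 3), (100, 3), (1000, 10)]:
    print(f"m={m:4d}, d_VC={d:2d}: Sauer={sauer(m, d):.3e}, "
          f"Corollary={corollary(m, d):.3e}, 2^m={2.0**m:.3e}")
```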
Using this corollary together with the generalization bound for finite hypothesis sets, we can derive the following generalization bound for infinite hypothesis sets.
Theorem 3 (Generalization bound for infinite hypothesis sets) Let $\mathcal{H}$ be a hypothesis set with VC dimension $d_{VC}$. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the choice of $S \sim D^m$, for all $h \in \mathcal{H}$
$R(h) \leq \widehat{R}_S(h) + \sqrt{\dfrac{2 d_{VC} \log \frac{em}{d_{VC}} }{m}} + \sqrt{\dfrac{\log \frac{1}{\delta}}{2m}}$.
The interpretation of this bound is similar to the one for finite hypothesis sets. The bound is uniform and independent of the data distribution. It grows with the VC dimension, roughly as $\sqrt{d_{VC} \log(m/d_{VC})/m}$, and decreases with the size of the training set $m$. The bound is also PAC in the sense that it gives a performance guarantee for any hypothesis in $\mathcal{H}$.
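To get a feel for the sample sizes this bound demands, the following sketch evaluates the complexity term of Theorem 3 for increasing $m$, with assumed values $d_{VC} = 10$ and $\delta = 0.05$:

```python
import numpy as np

def vc_gap(m, d_vc, delta=0.05):
    """Complexity term of Theorem 3: the allowed gap between R(h) and the empirical risk."""
    return np.sqrt(2 * d_vc * np.log(np.e * m / d_vc) / m) + np.sqrt(np.log(1 / delta) / (2 * m))

for m in (100, 1_000, 10_000, 100_000):
    print(f"m={m:>6}: R(h) <= empirical risk + {vc_gap(m, d_vc=10):.3f}")
```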