Chapter 5: Statistical Learning Theory¶
Solving hard real-world problems with machine learning demands advanced algorithmic designs. These designs should be based on a solid understanding of the principles that determine the behavior of learning algorithms. Statistical learning theory provides mathematical tools to develop such an understanding. It seeks answers to fundamental questions about the nature of learning from data, such as:
- Under which conditions can a learning algorithm generalize from the training data to new data?
- How can we measure the complexity of a learning task?
- How can we control the complexity of a learning algorithm?
- How can we estimate the generalization performance of a learning algorithm?
Based on what we covered in the probability theory chapter, let us start by investigating the last question.
Generalization bounds¶
Remember the basic definitions below.
Definition 5.1 (Generalization error) Given a hypothesis $h \in \mathcal{H}$, a loss function $\ell: Y \times Y \rightarrow [0,1]$, and a data distribution $x,y \sim D$, the generalization error of $h$ is defined as
$R(h) = E_{x,y \sim D}[\ell(h(x),y)]$.
Definition 5.2 (Empirical error) Given a hypothesis $h \in \mathcal{H}$, a loss function $\ell: Y \times Y \rightarrow [0,1]$, and a data set $S = \{(x_1,y_1),...,(x_m,y_m)\}$, the empirical error of $h$ is defined as
$\widehat{R}_S(h) = \frac{1}{m} \sum_{i=1}^m \ell(h(x_i),y_i)$.
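As an illustration, here is a minimal Python sketch of Definition 5.2 under the 0/1 loss; the hypothesis `h` and the toy data are hypothetical:

```python
import numpy as np

def empirical_error(h, X, y):
    """Empirical error R_S(h): average 0/1 loss of hypothesis h on the data set S = (X, y)."""
    predictions = np.array([h(x) for x in X])
    return float(np.mean(predictions != y))

# Toy data set and a hypothetical threshold hypothesis h(x) = 1[x > 0].
X = np.array([-2.0, -1.0, 0.5, 1.5])
y = np.array([0, 1, 1, 1])
print(empirical_error(lambda x: int(x > 0), X, y))  # 0.25: one of four points is misclassified
```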
Using the definitions above and the concentration inequalities we covered in the probability theory chapter, we can derive the following theorem.
Theorem 5.1 (Generalization bound) Let $\mathcal{H}$ be a finite hypothesis set. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the choice of $S \sim D^m$, for all $h \in \mathcal{H}$
$R(h) \leq \widehat{R}_S(h) + \sqrt{\frac{\log |\mathcal{H}| + \log \frac{1}{\delta}}{2m}}$.
Proof. Let $\mathcal{H} = \{h_1,...,h_{|\mathcal{H}|}\}$. Then we have
\begin{align*} P_{S \sim D^m}\left[\exists h_i \in \mathcal{H}: R(h_i) - \widehat{R}_S(h_i) > \epsilon \right ] &=P_{S \sim D^m}\left [\cup_{i=1}^{|\mathcal{H}|} \Big \{ R(h_i) - \widehat{R}_S(h_i) > \epsilon \Big \} \right ] \\ &\leq \sum_{i=1}^{|\mathcal{H}|} P_{S \sim D^m}[ R(h_i) - \widehat{R}_S(h_i) > \epsilon] \\ &\leq |\mathcal{H}| \exp(-2m\epsilon^2), \end{align*} where the first inequality follows from the union bound and the second from Hoeffding's inequality. Setting the right-hand side to $\delta$ and solving for $\epsilon$ yields the final result $\square$
This inequality is called a generalization bound, because it bounds the generalization error of any hypothesis $h \in \mathcal{H}$ in terms of its empirical error. We can derive a number of interesting consequences from this bound:
- The bound is uniform in the sense that it holds simultaneously for all hypotheses in $\mathcal{H}$.
- The bound holds for any data distribution $D$ and any loss function $\ell$ bounded in $[0,1]$; in this sense it is distribution-free.
- The bound grows logarithmically with the size of the hypothesis set $|\mathcal{H}|$; that is, a richer hypothesis set is more prone to overfitting.
- The bound shrinks as the training set size $m$ grows; that is, with more data, overfitting becomes less likely.
- For a fixed training set size $m$ and two different hypothesis sets $\mathcal{H}_1$ and $\mathcal{H}_2$, the bound prefers the smaller hypothesis set. This is known as Occam's razor principle. The principle was introduced by the 14th-century theologian William of Ockham. It states that among competing hypotheses, the one with the fewest assumptions should be selected.
- The bound gives a Probably Approximately Correct (PAC) performance guarantee: with probability at least $1-\delta$ (probably), every hypothesis in $\mathcal{H}$ is approximately correct in the sense that its generalization error exceeds its empirical error by at most $\epsilon$.
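To see the bound in action, here is a minimal Monte Carlo sketch (assuming a toy setup: a finite class of 1-D threshold classifiers and a synthetic noisy-label distribution, both hypothetical) that checks how often the uniform deviation exceeds the bound of Theorem 5.1:

```python
import numpy as np

rng = np.random.default_rng(0)
m, delta = 200, 0.05
thresholds = np.linspace(-2, 2, 50)          # finite hypothesis class: h_t(x) = 1[x > t]
bound = np.sqrt((np.log(len(thresholds)) + np.log(1 / delta)) / (2 * m))

def risks(X, y):
    # Error of every threshold classifier on (X, y), computed in one vectorized pass.
    preds = (X[:, None] > thresholds[None, :]).astype(int)
    return (preds != y[:, None]).mean(axis=0)

# True risk approximated on a large held-out sample; labels are noisy signs of x.
X_big = rng.normal(size=100_000)
y_big = ((X_big + rng.normal(scale=0.5, size=X_big.shape)) > 0).astype(int)
R_true = risks(X_big, y_big)

violations = 0
for _ in range(1000):
    X = rng.normal(size=m)
    y = ((X + rng.normal(scale=0.5, size=m)) > 0).astype(int)
    gap = R_true - risks(X, y)
    violations += np.any(gap > bound)
print(f"bound: {bound:.3f}, violation rate: {violations / 1000:.3f} (should be <= {delta})")
```

In practice the observed violation rate is far below $\delta$, reflecting the looseness introduced by the union bound.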
PAC Learnability¶
Definition 5.3 (Realizability). A hypothesis space $\mathcal{H}$ is realizable with respect to a loss $\ell$ and data distribution $D$ if there exists $h^* \in \mathcal{H}$ such that $R(h^*) = 0$.
It follows trivially from this definition that $\min_{h \in \mathcal{H}} \widehat{R}_S(h)=0$ with probability 1 for a sample set $S$ collected i.i.d. from $D$. Hence, under the realizability assumption, all ERM solutions $h_S \in \arg \min_{h \in \mathcal{H}} \widehat{R}_S(h)$ achieve zero empirical error.
Definition 5.4 (Representativeness) A training set $S$ is called $\epsilon$-representative if
$\forall h \in \mathcal{H}, |R(h) - \widehat{R}_S(h)| \leq \epsilon$.
Lemma 5.1 Assume that a training set $S$ is $\frac{\epsilon}{2}$-representative. Then any ERM solution $h_S \in \arg \min_{h \in \mathcal{H}} \widehat{R}_S(h)$ satisfies
$R(h_S) \leq \min_{h \in \mathcal{H}} R(h) + \epsilon$.
Proof. For every $h \in \mathcal{H}$, $R(h_S) \le \widehat{R}_S(h_S) + \frac{\epsilon}{2} \le \widehat{R}_S(h) + \frac{\epsilon}{2} \le R(h) + \frac{\epsilon}{2} + \frac{\epsilon}{2} = R(h) + \epsilon ~~\square$
Definition 5.5 (Uniform Convergence). A hypothesis class $\mathcal{H}$ has the uniform convergence property if there exists a function $m_{\mathcal{H}}^{\text{UC}} : (0, 1)^2 \to \mathbb{N}$ such that for every $\epsilon, \delta \in (0, 1)$ and every distribution $D$, any sampled data set $S=\{(x_i,y_i) \overset{i.i.d.}{\sim} D : i = 1, \ldots, m\}$ with $m \ge m_{\mathcal{H}}^{\text{UC}}(\epsilon, \delta)$ is $\epsilon$-representative with probability at least $1-\delta$.
Definition 5.6 (Agnostic PAC Learnability) A hypothesis class $\mathcal{H}$ is agnostic PAC learnable if there exist a function $m_{\mathcal{H}}: (0, 1)^2 \to \mathbb{N}$ and a learning algorithm $A$ such that for every $\epsilon, \delta \in (0, 1)$ and every distribution $D$, running the learning algorithm on data set $S=\{(x_i,y_i) \overset{i.i.d.}{\sim} D : i = 1, \ldots, m\}$ with $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ satisfies
$P(\{S \sim D: R(A(S)) \le \min_{h' \in \mathcal{H}} R(h') + \epsilon\}) \geq 1-\delta.$
Corollary 5.1 If $|\mathcal{H}| < \infty$ (finite hypothesis class), then it has the uniform convergence property with sample complexity
$m_{\mathcal{H}}^{\text{UC}}(\epsilon, \delta) \le \left\lceil \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2} \right\rceil$
and is agnostically PAC learnable using the ERM algorithm with sample complexity
$m_{\mathcal{H}}(\epsilon, \delta) \le m_{\mathcal{H}}^{\text{UC}}(\epsilon/2, \delta) \le \left\lceil \frac{2\log(2|\mathcal{H}|/\delta)}{\epsilon^2} \right\rceil.$
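The two ceilings above are easy to tabulate; a small sketch (the function names are ours) showing how weakly the sample size depends on $|\mathcal{H}|$:

```python
import math

def m_uc(H_size, eps, delta):
    """Uniform convergence sample complexity of a finite class (Corollary 5.1)."""
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps**2))

def m_agnostic(H_size, eps, delta):
    """Agnostic PAC sample complexity via ERM: m_UC evaluated at eps/2."""
    return m_uc(H_size, eps / 2, delta)

for H_size in (10, 10**3, 10**6):
    print(H_size, m_uc(H_size, 0.05, 0.05), m_agnostic(H_size, 0.05, 0.05))
# Sample size grows only logarithmically in |H| but quadratically in 1/eps.
```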
Definition 5.7 (Bayes Error) Given a data distribution $x,y \sim D$, the Bayes error is the minimum generalization error achievable over all measurable functions $h': X \rightarrow Y$:
$R^* = \min_{h'} R(h')$.
A hypothesis $h^*$ that achieves the Bayes error is called a Bayes (optimal) hypothesis.
For a binary classification problem with $Y = \{0,1\}$, the Bayes hypothesis would be
$h^*(x) = \arg \max_{y \in \{0,1\}} P[y|x]$.
Then the related Bayes error is given by
$R^* = E_x\left[1 - \max_{y \in \{0,1\}} P[y|x]\right] = E_x\left[\min_{y \in \{0,1\}} P[y|x]\right]$. The quantity $\min_{y \in \{0,1\}} P[y|x]$ is also referred to as the noise at $x$.
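A minimal numeric sketch (assuming a hypothetical toy distribution over a three-point domain) that computes the Bayes error directly from $P[y|x]$:

```python
import numpy as np

# Hypothetical toy distribution: P(x) and P(y=1 | x) on a three-point domain.
p_x = np.array([0.5, 0.3, 0.2])
p_y1_given_x = np.array([0.9, 0.5, 0.1])

# The Bayes hypothesis predicts argmax_y P[y|x]; its error at x is min(P[1|x], P[0|x]).
noise = np.minimum(p_y1_given_x, 1 - p_y1_given_x)
bayes_error = np.sum(p_x * noise)
print(bayes_error)  # 0.5*0.1 + 0.3*0.5 + 0.2*0.1 = 0.22
```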
Definition 5.8 (PAC learnability) A hypothesis class $\mathcal{H}$ is PAC learnable with respect to a data distribution $D$ if it is agnostic PAC learnable and the realizability assumption holds, i.e., $\min_{h \in \mathcal{H}} R(h) = 0$.
Bias-Complexity Dilemma (Trade-off)¶
Theorem 5.2 (No free lunch) Let $A$ be any learning algorithm for the task of binary classification with respect to the 0/1 loss over a domain $X$ and $m < |X|/2$ be a positive integer representing a training set size. Then, there exists a distribution $D$ over $X \times \{0, 1\}$ such that:
- There exists a function $f : X \to \{0, 1\}$ with $R(f) = 0$.
- With probability of at least $1/7$ over the choice of $S \sim D^m$ we have that $R(A(S)) \ge 1/8$.
Corollary 5.2 Let $X$ be an infinite domain ($|X|=\infty$, e.g. a real-valued feature space) and $\mathcal{H}$ be the set of all functions from $X$ to $\{0,1\}$. Then $\mathcal{H}$ is not PAC learnable.
Proof (Shalev-Shwartz book). Assume, by way of contradiction, that the class is learnable. Choose some $\epsilon < 1/8$ and $\delta < 1/7$. By the definition of PAC learnability, there must be some learning algorithm $A$ and an integer $m = m(\epsilon, \delta)$ such that, for any data-generating distribution over $X \times \{0, 1\}$, if $R(f)=0$ for some function $f : X \to \{0, 1\}$, then with probability at least $1 - \delta$ when $A$ is applied to samples $S$ of size $m$, generated i.i.d. by $D$, $R(A(S)) \le \epsilon$. However, applying the No Free Lunch theorem, since $|X| > 2m$, for every learning algorithm (and in particular for the algorithm $A$), there exists a distribution $D$ such that with probability at least $1/7 > \delta$, $R(A(S)) \ge 1/8 > \epsilon$, which leads to the desired contradiction $\square$
The take-home message of the above theorem and its corollary is that there is no universal learner that works well for all data distributions; there always exists a task on which the learner fails. This is a very important result. It tells us that we need to make assumptions about the data distribution in order to design a good learner. These assumptions constitute the inductive bias of the learner. The no free lunch theorem only tells us that we must introduce some degree of bias into the learner; it says nothing about the consequences of that choice. Introducing too much bias limits the ability of the learner to explain the training observations, causing underfitting. Introducing too little bias leads to overfitting. The goal is to find the right balance between the two. This dilemma is known as the bias-complexity dilemma.
Let us describe the bias-complexity dilemma in more formal terms. Given an ERM solution $h_S \in \arg \min_{h \in \mathcal{H}} \widehat{R}_S(h)$ for a hypothesis space $\mathcal{H}$, we can decompose its generalization error as below
$\underbrace{R(h_S)}_{\text{Generalization error}} = \underbrace{\min_{h \in \mathcal{H}} R(h)}_{\text{Approximation error}} + \underbrace{\epsilon_{est}}_{\text{Estimation error}}$
where the first term, called the approximation error, is the minimum risk achievable within the chosen hypothesis space. It measures how well $\mathcal{H}$ can fit the data distribution and does not depend on the training sample. The second term is called the estimation error and is denoted by $\epsilon_{est}$; it is the remainder of the generalization error after the approximation error. Since for any $\mathcal{H}' \supset \mathcal{H}$ it holds that $\min_{h' \in \mathcal{H}'} R(h') \leq \min_{h \in \mathcal{H}} R(h)$, we can influence the approximation error via our choice of hypothesis space. Other things, such as the learning algorithm and data distribution, being equal, the generalization error gives us a fixed budget; hence approximation and estimation error must be traded off against each other. Enlarging the hypothesis space, thereby making our solution more complex, reduces the approximation error but increases the estimation error: the training error shrinks while its gap to the generalization error grows, which is the signature of overfitting. The opposite holds when we restrict the hypothesis space.
Bias-variance decomposition. The bias-complexity dilemma can also be observed in the bias and variance of the estimated values of a regression output. Consider a regression problem with observations $(x,y) \sim D$. Our quantity of interest is $E[y|x]$, which we estimate from a training set $S = \{(x_1,y_1),...,(x_m,y_m)\}$. For a given test-time sample $x$, our estimator of $E[y|x]$ is a hypothesis $h_S \in \mathcal{H}$ that minimizes the mean squared error loss. The expected squared error of this estimator's prediction, over the noisy label $y$ and the training sample $S$, is given by
\begin{align*} E_{S, y|x} \Big [ \Big(y - h_{S}(x) \Big)^2 \Big ]&=E_{S, y|x} \Big [ y^2 - 2 y h_{S}(x) + h_{S}(x)^2 \Big ]\\ &=E_{y|x} [ y^2] + E_{S} [ h_{S}(x)^2 ] - 2 E_{y|x} [ y ] E_{S} [ h_{S}(x) ] \\ &=E_{y|x} [ y]^2 + Var_{y|x}[y] + E_{S} [ h_{S}(x) ]^2 + Var_{S} [ h_{S}(x) ]- 2 E_{y|x} [ y ]E_{S} [ h_{S}(x) ]\\ &= \Big ( \underbrace{E_{y|x} [ y ] -E_{S} [ h_{S}(x) ]}_{\text{Estimator~Bias}} \Big )^2 + \underbrace{Var_{S} [ h_{S}(x) ]}_{\text{Estimator~Variance}}+\underbrace{Var_{y|x}[y]}_{\text{Label~noise~variance}} \end{align*}
Note that $E_{S|x}[h_S(x)] = E_S[h_S(x)]$ and $Var_{S|x}[h_S(x)] = Var_S[h_S(x)]$ since $S$ is collected independently from $x$.
Example. Let us take the k-nearest neighbors regressor as an example and choose the hypothesis (i.e. estimator) to be $h_S(x) = \dfrac{1}{k} \sum_{i=1}^k y_i$, where $\{(x_1, y_1),\cdots,(x_k, y_k)\} \subseteq S$ are the $k$ nearest neighbors of $x$. Then, we have
\begin{align*} E_{S,y|x} \Big [ \Big(y - h_{S}(x) \Big)^2 \Big ]= \left (E_{y|x} [ y ] - E_{S} \left [ \dfrac{1}{k} \sum_{i=1}^k y_i \right ] \right )^2 + Var_{S} \left [ \dfrac{1}{k} \sum_{i=1}^k y_i \right ] + Var_{y|x}[y] \end{align*}
Consider the following two extreme choices of $k$ for a training set $S$ of size $m$; a simulation sketch follows the list.
- $k=m$: The estimator is the sample mean of all training labels. Then $h_S$ predicts the sample mean regardless of $x$, causing high estimator bias and low estimator variance. The model is likely to underfit because the sample mean cannot explain the variation in how $x$ affects $y$.
- $k=1$: The estimator is the label of the nearest neighbor. Then $h_S$ predicts the label of the single nearest neighbor of $x$, causing low estimator bias and high estimator variance. The model is likely to overfit because the hypothesis incurs no error on $S$, while its error on the test set depends on how representative the particular training sample $S$ is.
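A minimal simulation sketch (assuming a toy sine-curve regression problem with Gaussian label noise, all hypothetical) that estimates the bias and variance of the k-NN regressor at a fixed query point for several values of $k$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials, x_query = 50, 2000, 0.5
f = np.sin  # hypothetical true regression function; labels are f(x) + noise

def knn_predict(X, y, x, k):
    # Average the labels of the k training points nearest to x.
    nearest = np.argsort(np.abs(X - x))[:k]
    return y[nearest].mean()

for k in (1, 5, m):
    preds = np.empty(trials)
    for t in range(trials):
        X = rng.uniform(0, np.pi, size=m)
        y = f(X) + rng.normal(scale=0.3, size=m)
        preds[t] = knn_predict(X, y, x_query, k)
    bias2 = (preds.mean() - f(x_query)) ** 2
    print(f"k={k:3d}  bias^2={bias2:.4f}  variance={preds.var():.4f}")
# Expected pattern: k=1 gives low bias and high variance; k=m gives high bias
# (it predicts the global mean of f) and low variance.
```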
Vapnik - Chervonenkis (VC) Dimension¶
The generalization bound developed above was for a finite hypothesis set. However, in practice we often deal with infinite hypothesis sets. In order to develop generalization bounds for infinite hypothesis sets, we need to define a measure of complexity for these sets. Vapnik and Chervonenkis developed such a measure, starting from a remarkable observation: in a binary classification problem, a hypothesis set can produce only a finite number of distinct labelings of a finite sample, even though it contains an infinite number of hypotheses. The VC dimension builds on the notion of shattering, which in turn builds on the notion of restriction. Let us describe these notions step by step and arrive at the definition of the VC dimension.
Definition 5.9 (Restriction). Given $S = \{x_1, \ldots, x_m \} \subset X$, the following set
$\mathcal{H}_S = \{ (h(x_1), \ldots, h(x_m)) : h \in \mathcal{H} \}$
is called a restriction of $\mathcal{H}$ to $S$.
In words, a restriction is the set of all unique label vectors a hypothesis class $\mathcal{H}$ can generate from a data set $S$.
Definition 5.10 (Shattering). $\mathcal{H}$ is said to shatter $S$ if $|\mathcal{H}_S|= 2^{|S|}$.
In words, a hypothesis class $\mathcal{H}$ shatters a data set $S$ if the restriction of $\mathcal{H}$ to $S$ is the set of all functions from $S$ to $\{0,1\}$.
Corollary 5.3 If a data set $S \subset X^{2m}$ with $2m$ data points (i.e. $|S|=2m$) is shattered by $\mathcal{H}$, then for all learning algorithms $A$, there exists a distribution $D$ such that for some $h \in \mathcal{H}$, we have $R(h)=0$ and $P(\{S' \sim D: R(A(S')) \ge 1/8\}) \geq 1/7$ with $|S'|=m$.
In words, if $\mathcal{H}$ is large enough to shatter a data set of size $2m$ on a domain $X$, then we cannot find an arbitrarily good learning algorithm that can be run on data sets of size $m$ or smaller. The intuitive explanation is that, due to the No Free Lunch theorem, under such circumstances the hypothesis space $\mathcal{H}$ is so large that for any $2m$ data points, the true labels of the half not used in training can be perfectly explained by a hypothesis inside $\mathcal{H}$. This is a manifestation of Karl Popper's falsifiability principle in machine learning theory. This principle sets a demarcation point between science and other types of knowledge generation: an explanation is scientific if there exists a set of observations that can falsify it.
Definition 5.11 (Growth function) The growth function $\tau_{\mathcal{H}}: \mathbb{N}^+ \rightarrow \mathbb{N}^+$ of $\mathcal{H}$ is defined as
$\tau_{\mathcal{H}}(m) = \max_{S \subset X^m} |\mathcal{H}_S|$.
In words, the growth function counts the maximum number of distinct labelings that $\mathcal{H}$ can report on any data set on domain $X$ that contains $m$ data points.
Note that $\tau_{\mathcal{H}}(m)=2^m$ if and only if $\mathcal{H}$ shatters some data set $S$ of size $m$.
Definition 5.12 (VC dimension) The VC dimension of a hypothesis set $\mathcal{H}$ is the size of the largest data set that $\mathcal{H}$ can shatter:
$d_{VC}(\mathcal{H}) = \max \{m: \tau_{\mathcal{H}}(m) = 2^m\}$.
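A brute-force sketch (assuming the class of 1-D threshold classifiers $h_t(x) = \mathbb{1}[x > t]$ as a toy example) that enumerates restrictions to estimate the growth function and read off the VC dimension:

```python
import numpy as np

def num_labelings(points):
    """Size of the restriction of the threshold class {1[x > t]} to the given points."""
    labelings = set()
    # Sweeping t over -inf and the points themselves produces every distinct labeling.
    for t in np.concatenate(([-np.inf], np.sort(points))):
        labelings.add(tuple((points > t).astype(int)))
    return len(labelings)

rng = np.random.default_rng(0)
for m in (1, 2, 3, 5):
    tau = max(num_labelings(rng.normal(size=m)) for _ in range(100))
    print(f"m={m}  tau(m)={tau}  2^m={2**m}")
# tau(1) = 2 = 2^1, so a single point is shattered, but tau(m) = m + 1 < 2^m
# for m >= 2; hence the VC dimension of 1-D thresholds is 1.
```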
Let us conclude the chapter by developing a generalization bound for infinite hypothesis sets. As a first step, the following theorem shows that a finite VC dimension is a necessary condition for learnability.
Theorem 5.3 (VC and PAC learnability). If $d_{VC}(\mathcal{H})=\infty$ then $\mathcal{H}$ is not PAC learnable.
Proof. For any training set size $m$, the infinite VC dimension allows $\mathcal{H}$ to shatter a data set of size $2m$. Then the result follows from Corollary 5.3 $\square$
This theorem is the analytical counterpart of the argument presented above after the No Free Lunch theorem: learning is only possible with some inductive bias. The bias-complexity dilemma stems from this obligation.
Theorem 5.4 (The Fundamental Theorem of Statistical Learning). Assume that $d_{VC}(\mathcal{H}) = d < \infty$. Then, there exist $C_1, C_2 \in \mathbb{R}^+$ such that
- $\mathcal{H}$ has the uniform convergence property with sample complexity
$C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \le m_{\mathcal{H}}^{\text{UC}}(\epsilon, \delta) \le C_2 \frac{d + \log(1/\delta)}{\epsilon^2}$,
- $\mathcal{H}$ is agnostic PAC learnable with sample complexity
$C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \le m_{\mathcal{H}}(\epsilon, \delta) \le C_2 \frac{d + \log(1/\delta)}{\epsilon^2}$,
- $\mathcal{H}$ is PAC learnable with sample complexity
$C_1 \frac{d + \log(1/\delta)}{\epsilon} \le m_{\mathcal{H}}(\epsilon, \delta) \le C_2 \frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon}$.
In words, the following statements are equivalent:
- $\mathcal{H}$ has the uniform convergence property.
- Any ERM rule is a successful agnostic PAC learner for $\mathcal{H}$.
- $\mathcal{H}$ is agnostic PAC learnable.
- $\mathcal{H}$ is PAC learnable.
- Any ERM rule is a successful PAC learner for $\mathcal{H}$.
- $\mathcal{H}$ has a finite VC-dimension.
This outcome is very useful because it allows one to obtain all six of these properties by establishing just one of them.
The VC dimension gives a binary characterization of a hypothesis class: either it is learnable or it is not. One may also want to analyze the capacity of a hypothesis class for an arbitrary data set size, smaller or larger than its VC dimension. The following lemma relates the growth function of a hypothesis class to the sample size.
Lemma 5.2 (Sauer-Shelah-Perles) If $d_{VC}(\mathcal{H}) = d < \infty$, then for all $m$, we have $\tau_{\mathcal{H}}(m) \leq \sum_{i=0}^d {m \choose i}$. In particular, for $m > d$, we have $\tau_{\mathcal{H}}(m) \leq \left(\frac{em}{d}\right)^d$.
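A quick numeric check of the lemma on the threshold class from the earlier sketch, whose VC dimension is $d=1$ and whose exact growth function is $\tau(m) = m+1$:

```python
import math

d = 1  # VC dimension of 1-D thresholds; their growth function is exactly m + 1
for m in (2, 5, 10, 100):
    tau = m + 1
    binom_sum = sum(math.comb(m, i) for i in range(d + 1))  # = 1 + m
    poly_bound = (math.e * m / d) ** d                      # = e * m
    print(f"m={m:3d}  tau={tau:3d}  binomial sum={binom_sum:3d}  (em/d)^d={poly_bound:.1f}")
# Once m exceeds the VC dimension, the growth function is polynomial in m,
# far below the 2^m labelings required for shattering.
```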
Theorem 5.5 (Sample complexity wrt growth function). Let $\mathcal{H}$ be a hypothesis class with growth function $\tau_{\mathcal{H}}$. Then for every distribution $D$ and every $\delta \in (0,1)$, the following holds on a sample set $S=\{(x_i,y_i) \overset{i.i.d.}{\sim} D : i = 1, \ldots, m\}$ of size $m$:
$P\left (\left \{S \sim D : |R(h) - \widehat{R}_S(h)| \leq \frac{4+\sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\delta \sqrt{2m}} \right \} \right ) \geq 1 - \delta$
for all $h \in \mathcal{H}$ simultaneously.
This result is consistent with the implications of the fundamental theorem of statistical learning. When $\tau_{\mathcal{H}}(m)$ grows polynomially in $m$, as the Sauer-Shelah-Perles lemma guarantees whenever $d_{VC}(\mathcal{H}) < \infty$, the deviation term above vanishes as $m \rightarrow \infty$. Uniform convergence therefore holds, and the hypothesis class is agnostic PAC learnable.
Nonuniform Learnability¶
Definition 5.13 (Nonuniform Learnability) A hypothesis class $\mathcal{H}$ is nonuniformly learnable if there exist a function $m_{\mathcal{H}}^{NU}: (0, 1)^2 \times \mathcal{H} \to \mathbb{N}$ and a learning algorithm $A$ such that for every $\epsilon, \delta \in (0, 1)$, every $h \in \mathcal{H}$, and every distribution $D$, running $A$ on data set $S=\{(x_i,y_i) \overset{i.i.d.}{\sim} D : i = 1, \ldots, m\}$ with $m \ge m_{\mathcal{H}}^{NU}(\epsilon, \delta, h)$ satisfies $P(\{S \sim D: R(A(S)) \le R(h) + \epsilon\}) \geq 1-\delta.$
An agnostic PAC learnable hypothesis class $\mathcal{H}$ is also nonuniformly learnable because $R(A(S)) \leq \min_{h' \in \mathcal{H}} R(h') + \epsilon$ trivially implies $R(A(S)) \leq R(h) + \epsilon$ for all $h \in \mathcal{H}$. Hence, the set of nonuniformly learnable hypothesis classes is at least as large as the set of agnostic PAC learnable ones. With this relaxation, our motivation is to increase the model capacity while sustaining its learnability.
Theorem 5.6 If $\mathcal{H} = \cup_{i \in \mathbb{N}} \mathcal{H}_i$ such that each $\mathcal{H}_i$ has the uniform convergence property, then $\mathcal{H}$ is nonuniformly learnable.
The converse also holds: if a hypothesis class $\mathcal{H}$ is nonuniformly learnable, then it can be expressed as a countable union of agnostic PAC learnable subclasses. Next, let us see how we can use this result to build a more powerful learning algorithm than ERM. Remember that we have sample complexity guarantees at the hypothesis subclass level, while we need to conduct our search at the hypothesis level. The following quantity bridges this gap:
$\epsilon_i(m,\delta) = \min\{\epsilon \in (0,1): m_{\mathcal{H}_i}^{UC}(\epsilon,\delta) \leq m\}$
which in effect inverts the sample complexity function $m_{\mathcal{H}_i}^{UC}$ of hypothesis subclass $i$. This quantity gives the best representativeness guarantee attainable within subclass $i$ when $m$ data points are available, i.e., $P(\{S: |R(h) - \widehat{R}_S(h)| \leq \epsilon_i(m,\delta), \forall h \in \mathcal{H}_i\}) \geq 1-\delta$ for a data set $S$ of size $m$ collected i.i.d. from any data distribution $D$.
Theorem 5.7 Let $w: \mathbb{N} \rightarrow [0,1]$ be a function with $\sum_{i=1}^\infty w(i) \leq 1$ and $\mathcal{H} = \cup_{i \in \mathbb{N}} \mathcal{H}_i$ such that each $\mathcal{H}_i$ has the uniform convergence property with sample complexity $m_{\mathcal{H}_i}^{UC}$. Then for every $\delta \in (0,1)$, every data distribution $D$, every $m \in \mathbb{N}$, every $i \in \mathbb{N}$, and every $h \in \mathcal{H}_i$, we have
$P\left ( \left \{S \sim D: |R(h) - \widehat{R}_S(h)| \leq \epsilon_i(m, w(i) \delta) \right \} \right ) \geq 1-\delta$
where $S$ is a data set of size $m$ sampled i.i.d. from $D$.
The goal of this theorem is to express the uniform convergence of a composite hypothesis class comprising subclasses with different uniform convergence guarantees. It merely rescales the confidence level of each subclass by a weight indexed on the subclasses. As we will see next, this index allows us to search the composite hypothesis space, thereby defining a learning rule that is more powerful than ERM. The above result can also be expressed as follows:
$P\left ( \left \{S \sim D: R(h) \leq \widehat{R}_S(h) + \epsilon_{i(h)}(m, w(i(h)) \cdot \delta) \right \} \right ) \geq 1-\delta, \forall h \in \mathcal{H}$
where $i(h)=\min\{i: h \in \mathcal{H}_i\}$ returns the most sample-efficient subclass to which a hypothesis $h$ belongs. This result prescribes a search rule over the extended hypothesis set that satisfies the learnability guarantee. We can use it to build a new learning algorithm
$h_{SRM} \in \arg \min_{h \in \mathcal{H}} \left[ \widehat{R}_S(h) + \epsilon_{i(h)}(m, w(i(h)) \cdot \delta) \right]$
which we call Structural Risk Minimization (SRM). Remember that for a singleton subclass $\{h\}$, Hoeffding's inequality gives $m_{\{h\}}^{UC}(\epsilon,\delta)=\left\lceil\frac{\log(2/\delta)}{2\epsilon^2}\right\rceil$, which implies $\epsilon_i(m,\delta) = \sqrt{\frac{\log(2/\delta)}{2m}}$. Defining $w'(h) := w(i(h))$, we get
$h_{SRM} \in \arg \min_{h \in \mathcal{H}} \widehat{R}_S(h) + \sqrt{\frac{-\log w'(h) + \log(2/\delta)}{2m}}$
where we use the identity $\log\left(\frac{2}{w'(h)\,\delta}\right) = -\log w'(h) + \log(2/\delta)$. In effect, SRM applies a preference over the hypothesis subclasses, represented in the learning rule by the weight $w'(h)$: the higher the weight, the stronger the preference. These preferences can be chosen to reflect prior knowledge coming from domain expertise regarding the collected data.
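A minimal sketch of the SRM rule (assuming a toy countable class of 1-D threshold classifiers enumerated with hypothetical weights $w(i) = 2^{-(i+1)}$, so that the weights sum to one):

```python
import numpy as np

rng = np.random.default_rng(0)
m, delta = 100, 0.05

# Countable hypothesis class: threshold h_i(x) = 1[x > t_i] with weight w(i) = 2^-(i+1),
# so the weights sum to 1 as Theorem 5.7 requires. (Toy enumeration, for illustration.)
thresholds = np.linspace(-2, 2, 40)
weights = 0.5 ** (np.arange(len(thresholds)) + 1)

X = rng.normal(size=m)
y = (X > 0.3).astype(int)  # hypothetical ground-truth threshold at t = 0.3

emp_risk = ((X[:, None] > thresholds[None, :]).astype(int) != y[:, None]).mean(axis=0)
penalty = np.sqrt((-np.log(weights) + np.log(2 / delta)) / (2 * m))

h_erm = thresholds[np.argmin(emp_risk)]
h_srm = thresholds[np.argmin(emp_risk + penalty)]
print(f"ERM threshold: {h_erm:.2f}, SRM threshold: {h_srm:.2f}")
# SRM accepts a slightly larger empirical risk in exchange for a hypothesis with a
# larger prior weight (smaller index), i.e., a stronger preference.
```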
In many real-world problems, we have little domain knowledge to incorporate into the learning process. It would therefore be useful to have a guideline for building machine learning algorithms without any domain knowledge. Remember that the only condition for combining individually agnostic PAC learnable hypothesis classes into a bigger class is to define a sequence of positive weights that sums to at most one. The following inequality provides an instrumental way of building such a sequence.
Lemma 5.3 (Kraft's inequality) Let $\Sigma=\{0,1\}$ and let $\Sigma^*$ be a prefix-free set of finite-length strings over $\Sigma$, in the sense that no string $\sigma \in \Sigma^*$ of length $k$ is identical to the first $k$ characters of a longer string in $\Sigma^*$. Then we have $\sum_{\sigma \in \Sigma^*} 1/2^{|\sigma|} \leq 1$, where $|\sigma|$ denotes the length of a string $\sigma$.
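A quick sketch verifying the inequality on a small hypothetical prefix-free code:

```python
# Hypothetical prefix-free code: no codeword is a prefix of another.
code = ["0", "10", "110", "1110", "1111"]

# Sanity-check the prefix-free property.
assert not any(a != b and b.startswith(a) for a in code for b in code)

kraft_sum = sum(2 ** -len(sigma) for sigma in code)
print(kraft_sum)  # 0.5 + 0.25 + 0.125 + 0.0625 + 0.0625 = 1.0 <= 1
```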
Hence, whenever we can find an alphabet that uniquely describes each hypothesis in a class with a prefix-free string $\sigma(h)$, we can plug Kraft's inequality into Theorem 5.7 by setting $w'(h) = 2^{-|\sigma(h)|}$ and get
$P\Big ( \Big \{S \sim D: R(h) \leq \widehat{R}_S(h) + \sqrt{\frac{|\sigma(h)| + \ln(2/\delta)}{2m}} \Big \} \Big ) \geq 1-\delta$
where we use the fact that $-\ln w'(h) = |\sigma(h)| \ln(2) \leq |\sigma(h)|$ since $\ln(2) < 1$. The right-hand side of this inequality prescribes the learning rule below
$h_{MDL} \in \arg \min_{h \in \mathcal{H}} \widehat{R}_S(h) + \sqrt{\frac{|\sigma(h)| + \ln(2/\delta)}{2m}}$.
Note that this rule selects the hypothesis that can be expressed with the minimum number of characters among those that fit the data best. Hence, it is referred to as the Minimum Description Length (MDL) rule. The MDL rule is yet another instance of the Occam's razor principle in machine learning, which also has overarching consequences for hypothesis development in scientific investigation. This rule also provides a theoretical foundation for the use of regularizers in machine learning.
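Finally, a minimal sketch of the MDL rule (all names hypothetical: `describe` serializes a threshold hypothesis, and its length in bits stands in for $|\sigma(h)|$):

```python
import numpy as np

rng = np.random.default_rng(0)
m, delta = 200, 0.05
X = rng.uniform(-1, 1, size=m)
y = (X > 0.25).astype(int)

def describe(t):
    """Hypothetical description of a threshold hypothesis: the threshold printed as a
    decimal string, so coarser thresholds get shorter descriptions."""
    return f"{t:.10g}"

candidates = [0.0, 0.2, 0.25, 0.2487361459]  # coarse to fine descriptions
best = None
for t in candidates:
    emp = np.mean((X > t).astype(int) != y)
    length_bits = 8 * len(describe(t))  # 8 bits per character of the description
    score = emp + np.sqrt((length_bits + np.log(2 / delta)) / (2 * m))
    if best is None or score < best[0]:
        best = (score, t)
print(f"MDL picks threshold {best[1]} with score {best[0]:.3f}")
# A longer description is only worthwhile if it buys a sufficiently lower empirical
# error; this penalty acts exactly like a regularizer.
```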