ML Theory: PAC Learnability

The Multi-Class Classification Problem: Let’s say, we have a domain $\mathcal{X}$ where each element $x = (x_{1}, x_{2}, \cdots, x_{k})$ is a $k$ size vector of features, a distribution $\mathcal{D}$ over $\mathcal{X}$, and a label set $\mathcal{Y} = \{1, 2, \cdots, n\}$ so that there exists a function $c: \mathcal{X} \rightarrow \mathcal{Y}$. But we do not have any knowledge of the true distribution $\mathcal{D}$ and the function $c$. But we do have a finite sample $S = ((x_{1}, y_{1}), (x_{2}, y_{2}), \cdots, (x_{m}, y_{m}))$ where each pair is an input output mapping from $\mathcal{X} \times \mathcal{Y}$, i.e., $c(x_{i}) = y_{i}$ and some prior knowledge of the problem at hand.

This prior knowledge of the problem is represented by a set of hypotheses called the Hypothesis Class $\mathcal{H}$. Every hypothesis $h \in \mathcal{H}$ is a conjectured solution to estimating the function $c$. The original function $c$ in this case, is called a concept. We are trying to learn the concept $c$ and $h$ is what we think of $c$. With the same domain and distribution, we could have a different concept $\bar c$ to learn and all of these different choices of $c$ creates the Concept Class we are trying to learn.

Generalization Risk: Given a hpyothesis $h \in \mathcal{H}$, a target concept $c$ in concept class $\mathcal{C}$, and a distribution $\mathcal{D}$ over domain $\mathcal{X}$, then the generalization risk or generalization error or true error of the hypothesis $h$ is defined to be,

$ R_{\mathcal{D}}(h) = Pr_{x \sim D} [h(x) \neq c(x)]$

Empirical Risk: Given a hpyothesis $h \in \mathcal{H}$, a target concept $c$ in concept class $\mathcal{C}$, and a sample $S = ((x_{1}, y_{1}), (x_{2}, y_{2}), \cdots, (x_{m}, y_{m}))$, then the empirical risk of the hypothesis $h$ with respect to this sample $S$ is defined to be,

$ R_{S}(h) = \frac{1}{m} \sum_{i = 1}^{m} \mathcal{I}\{h(x_{i}) \neq c(x_{i})\}$

The empirical risk is an unbiased estimator of the true risk. A hypothesis $h$ is said to be consistent with a sample $S$ if $R_{S}(h) = 0$.

Empirical Risk Minimization Algorithm: Given a sample $S$ and a hypothesis class $\mathcal{H}$, a learning algorithm tries to come up with a hypothesis that is supposed to be least error-prone. But since it typically has no knowledge about the distribution $\mathcal{D}$, so, an available notion of error is the empirical risk with respect to the given sample. The algorithm tries to find a hypothesis that minimizes the empirical risk in sample $S$. Such an ERM (Empirical Risk Minimization) algorithm could find a hypothesis that is specially designed to minimize the empirical risk in this sample but does very poor globally because the sample may not be a good representative of the whole domain and distribution.

ERM with Inductive Bias: With some prior knowledge of the problem, we could construct $\mathcal{H}$ in such a way that, it doesn’t lead to overfitting. The sample could be misleading. But by being biased to a set of hypotheses that were constructed without knowing the sample, we increase the chance of better learning.

The Realizability Assumption: It is the assumption that the concept $c$ we are trying to learn is contained in $\mathcal{H}$. So, $c$ will be consistent with any sample $S$ drawn from the distribution.

PAC Learning: A concept class $\mathcal{C}$ is said to be PAC-learnable if there exists an algorithm $\mathcal{A}$ which for any $\epsilon > 0$ and $\delta > 0$, returns a hypothesis $h_{S}$ from a hypothesis class $\mathcal{H}$ so that with at least $1-\delta$ confidence we can say that $R_{\mathcal{D}}(h_{S}) \leq \epsilon$ if it’s run on any identically and independently drawn sample of size $m$ where $m_{\mathcal{H}}$ is a polynomial function and $m \geq m_{\mathcal{H}}(\epsilon, \delta)$ and if $\mathcal{H}$ contains the target concept $c$.

PAC means probably approximately correct. We say, that $h_{S}$ is probably correct because the confidence $1-\delta$ can be close to one for near $0$ values of $\delta$ and it’s approximately correct because the error is less than $\epsilon$. This learning framework is due to Harvard Professor Leslie Valiant.

PAC-Learning Guarantees for Finite Hypothesis Classes with Realizability Assumption: Suppose, $\mathcal{H}$ is a finite hypothesis class that contains the target concept $c$, then for any $\delta > 0$ and $\epsilon > 0$ and any identically and independently drawn sample dataset $S$ with size $m \geq \frac{1}{\epsilon}(\log{|H|} - \log{\delta})$ and any learning algorithm $\mathcal{A}$ that returns a consistent hypothesis, it can be said with confidence at least $1 - \delta$ that $\mathcal{A}$ will return a hypothesis $h_{S}$ so that, $R_{\mathcal{D}}(h_{S}) \leq \epsilon$.

Proof: Let’s find the probability of the bad event where there exists a hypothesis that is consistent with the training sample but has error greater than $\epsilon$ in the true error.

$Pr[\exists h \in \mathcal{H} : R_{S}(h) = 0 \wedge R_{\mathcal{D}}(h) > \epsilon]$ $= Pr[\bigcup\limits_{h \in \mathcal{H}}^{} (R_{S}(h) = 0 \wedge R_{\mathcal{D}}(h) > \epsilon)]$ $\leq \sum_{h \in \mathcal{H}}^{} Pr[(R_{S}(h) = 0 \wedge R_{\mathcal{D}}(h) > \epsilon)]$ $\leq \sum_{h \in \mathcal{H}}^{} Pr[(R_{S}(h) = 0 | R_{\mathcal{D}}(h) > \epsilon)]$ $\leq \sum_{h \in \mathcal{H}}^{} (1 - \epsilon)^{m}$ $= |H| (1 - \epsilon)^{m} = \delta $

Solving for $m$, we find that, $m = \frac{1}{\epsilon}(\log{|\mathcal{H}|} - \log{\delta})$,
Since, $Pr[ \forall h \in \mathcal{H}, R_{\mathcal{D}}(h) \leq \epsilon | R_{S}(h) = 0 ] + Pr[\exists h \in \mathcal{H} : R_{S}(h) = 0 \wedge R_{\mathcal{D}}(h) > \epsilon] = 1$
So, $Pr[ \forall h \in \mathcal{H}, R_{\mathcal{D}}(h) \leq \epsilon | R_{S}(h) = 0 ] \geq 1 - \delta$ for $m_{\mathcal{H}}(\epsilon,\delta) = \frac{1}{\epsilon}(\log{|\mathcal{H}|} - \log{\delta})$

Scopes of Generalization:

Hypothesis class $\mathcal{H}$ may not contain the target concept $c$. So, we may waive the realizability assumption.
There may not even be a deterministic concept $c$ at all. The concept we are trying to learn could be stochastic. So instead of using a deterministic function $c$, we may want to zip the dataset $\mathcal{X}$ and the label set $\mathcal{Y}$ together as $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$
We may want to generalize beyond multi-class classification. So, we need to make adjustments for all kinds of loss functions. So, if $\mathcal{l}(h,z)$ denotes the loss of hypothesis $h$ against $z$, then the true risk of a hypothesis with respect to a probability distribution $\mathcal{D}$ is $R_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}}[\mathcal{l}(h,z)]$

Agnostic PAC Learning: A concept class $\mathcal{C}$ is agnostic PAC learnable with respect to set $Z$ and loss function $\mathcal{l}: \mathcal{H} \times {\mathcal{Z}} \rightarrow \mathbb{R_{+}}$, if there exists a polynomial function $m_{H}:(0,1)^{2} \rightarrow \mathbb{N}$ and a learning algorithm $\mathcal{A}$ with the following property: For every $\epsilon$, $\delta$, $\in (0,1)$, and for every distribution $\mathcal{D}$ over $\mathcal{Z}$, when running $\mathcal{A}$ on a sample of at least $m_{\mathcal{H}}(\epsilon, \delta)$ examples generated by $\mathcal{D}$ such that, with probability at least $1-\delta$, we can say that, $R_{\mathcal{D}}(h) \leq min_{\bar h \in \mathcal{H}} R_{\mathcal{D}}(\bar h) + \epsilon$ where $R_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}}[\mathcal{l}(h,z)]$.

Agnostic PAC Learning Guarantees for Finite Hypothesis Class Without Realizability Assumption: We need to show that,

$Pr[\forall h \in \mathcal{H}: |R_{S}(h) - R_{\mathcal{D}}(h)| \leq \epsilon] \geq 1 - \delta$

So, instead we can show that,

$Pr[\exists h \in \mathcal{H}: |R_{S}(h) - R_{\mathcal{D}}(h)| > \epsilon] \leq \delta$ $Pr[\exists h \in \mathcal{H}: |R_{S}(h) - R_{\mathcal{D}}(h)| > \epsilon]$ $= Pr[\bigcup\limits_{h \in \mathcal{H}}^{} {|R_{S}(h) - R_{\mathcal{D}}(h)| > \epsilon} ]$ $= \sum_{h \in \mathcal{H}}^{} Pr[{|R_{S}(h) - R_{\mathcal{D}}(h)| > \epsilon} ]$

But since, $\mathbb{E}[R_{S}(h)] = R_{\mathcal{D}}(h)$, so, using Hoeffding’s Inequality, we get,

$Pr[{|R_{S}(h) - R_{\mathcal{D}}(h)| > \epsilon} ] \leq 2 e^{-2m\epsilon^{2}}$

So,

$Pr[\exists h \in \mathcal{H}: |R_{S}(h) - R_{\mathcal{D}}(h)| > \epsilon] \leq 2 |\mathcal{H}| e^{-2m\epsilon^{2}} = \delta$

for $m_{\mathcal{H}}(\epsilon, \delta) = \frac{\log{\frac{2|\mathcal{H}|}{\delta}}}{2\epsilon^{2}}$.
Generalization bound - Fixed Hypothesis: For a hypothesis $h \in \mathcal{H}$, the following inequality holds with probability at least $1 - \delta$:

$R_{\mathcal{D}}(h) \leq R_{\mathcal{S}}(h) + \sqrt{\frac{\log{\frac{2}{\delta}}}{2m}}$

Generalization bound - All Hypothesis: For a hypothesis set $\mathcal{H}$, the following inequality holds with probability at least $1 - \delta$ for all $h$:

$R_{\mathcal{D}}(h) \leq R_{\mathcal{S}}(h) + \sqrt{\frac{\log{\frac{2|\mathcal{H}|}{\delta}}}{2m}}$

Estimation and Approximation Error: Let’s say, there exists a hypothesis $h^{*}$ which may or may not be present in $\mathcal{H}$ that incurs the lowest risk, then the relative loss of a hypothesis $h$ is given by,

$R_{\mathcal{D}}(h) - R_{\mathcal{D}}(h^{*}) = (R_{\mathcal{D}}(h) - R_{\mathcal{D}}(\bar h)) + (R_{\mathcal{D}}(\bar h) - R_{\mathcal{D}}(h^{*}))$

Where $\bar h$ is the best hypothesis in $\mathcal{H}$ that minimizes the true error.
The first summand on the right hand side is called estimation error and the second summand is called approximation error.