
Learning Theory Core

PAC Learning Framework

The foundational formalization of what it means to learn from data: a concept is PAC-learnable if and only if an algorithm can, with high probability, find a hypothesis that is approximately correct, using a polynomial number of samples.

Core · Tier 1 · Stable · Core spine · ~40 min

Why This Matters

The PAC (Probably Approximately Correct) framework, introduced by Leslie Valiant in 1984, is the mathematical foundation for answering the most basic question in machine learning: when is it possible to learn from data?

Figure (overview infographic): five panels covering the core idea of Probably Approximately Correct learning, the formal setup (instance space, concept class, distribution, sample, learner, hypothesis), the $(\epsilon, \delta)$ guarantee, sample complexity scaling polynomially in $1/\epsilon$ and $\log(1/\delta)$, and why PAC matters (the ERM connection, VC dimension, Rademacher complexity).

PAC asks: how many random examples does a learner need before it is likely to return a low-error hypothesis?

Figure: PAC learning turns samples into a guarantee. An i.i.d. sample from $\mathcal{D}$ feeds a learning algorithm $\mathcal{A}$, which outputs a hypothesis $h$. The learner does not need to identify the true rule exactly; it needs enough samples so the returned hypothesis is probably within an acceptable error band.

  • Probably: the training sample is random, so the guarantee may fail on unlucky samples.
  • Approximately: the target is low error, not exact recovery of the hidden rule.
  • Sample complexity: the theory asks how many examples make the guarantee reliable.

Before PAC theory, machine learning was a collection of algorithms without a unifying theory of learnability. PAC provides the definitions and theorems that tell you whether a learning problem is solvable in principle, and how many samples you need.

Every generalization bound you encounter, from ERM to VC dimension to Rademacher complexity, is ultimately answering a question that PAC theory formalized. The sample complexity bounds page covers the quantitative details.

Figure: Accuracy is expensive near zero error. PAC bounds are forgiving about confidence but harsh about accuracy: driving $\epsilon$ down usually costs a quadratic factor in sample size. The plot shows samples needed against target error $\epsilon$, with curves for small, medium, and large hypothesis classes.

Finite-class shape: complexity enters through the class-size term, while accuracy enters through the denominator, so halving the target error can require about four times as many samples.

Mental Model

Think of a learner as a detective. The detective wants to identify an unknown rule (the target concept) but can only observe examples. PAC theory asks: how many examples does the detective need to be probably ($1 - \delta$) approximately ($\epsilon$) correct about the rule?

The "probably" accounts for the randomness of the training sample: you might get an unlucky draw. The "approximately" accounts for the fact that you do not need to identify the rule exactly; getting close is enough.

Formal Setup and Notation

Let $\mathcal{X}$ be an instance space (e.g., $\{0,1\}^n$). A concept is a function $c: \mathcal{X} \to \{0, 1\}$. A concept class $\mathcal{C}$ is a collection of concepts.

There is an unknown distribution $\mathcal{D}$ over $\mathcal{X}$ and an unknown target concept $c^* \in \mathcal{C}$.

The learner receives a training sample

$$S = \{(x_1, c^*(x_1)), \ldots, (x_n, c^*(x_n))\}$$

where each $x_i \sim \mathcal{D}$ independently.

The learner outputs a hypothesis $h: \mathcal{X} \to \{0, 1\}$.

Definition

Generalization Error

The generalization error (or risk) of a hypothesis $h$ with respect to target $c^*$ and distribution $\mathcal{D}$ is:

$$R(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq c^*(x)]$$

This is the probability that $h$ disagrees with $c^*$ on a random example.

Core Definitions

Definition

PAC Learnability (Realizable Case)

A concept class $\mathcal{C}$ is PAC-learnable if and only if there exists an algorithm $\mathcal{A}$ and a function $n_{\mathcal{C}}: (0,1)^2 \to \mathbb{N}$ such that for every target concept $c^* \in \mathcal{C}$, every distribution $\mathcal{D}$ over $\mathcal{X}$, every accuracy parameter $\epsilon \in (0, 1)$, and every confidence parameter $\delta \in (0, 1)$:

when $\mathcal{A}$ is given a sample $S$ of size $n \geq n_{\mathcal{C}}(\epsilon, \delta)$ drawn i.i.d. from $\mathcal{D}$ and labeled by $c^*$, it outputs $h$ satisfying:

$$\Pr_S[R(h) \leq \epsilon] \geq 1 - \delta$$

The class is polynomial-sample PAC-learnable if $n_{\mathcal{C}}(\epsilon, \delta)$ is polynomial in $1/\epsilon$ and $1/\delta$. It is efficiently PAC-learnable (also called polynomial-time PAC-learnable) if, in addition, $\mathcal{A}$ runs in time polynomial in $1/\epsilon$, $1/\delta$, and $\text{size}(c^*)$. This follows the standard usage in computational learning theory (Valiant 1984, Kearns and Vazirani 1994), where "efficient" always refers to time, not sample, complexity.

The key features of this definition:

  1. For all distributions. The learner must work regardless of $\mathcal{D}$
  2. For all target concepts. The learner must work for any $c^* \in \mathcal{C}$
  3. Probably: the guarantee holds with probability $\geq 1 - \delta$
  4. Approximately: the error is at most $\epsilon$, not necessarily zero

Definition

Sample Complexity

The sample complexity $n_{\mathcal{C}}(\epsilon, \delta)$ is the minimum number of training examples needed to achieve error $\leq \epsilon$ with confidence $\geq 1 - \delta$. This is the fundamental quantity that PAC theory characterizes.

Realizable vs. Agnostic PAC

Definition

Realizable Setting

In the realizable setting, we assume $c^* \in \mathcal{C}$. The target concept is in our hypothesis class. The learner can achieve zero training error. The question is whether zero training error implies low test error.

Definition

Agnostic PAC Learning

In the agnostic setting, we make no assumption about the target. Labels may be generated by any joint distribution over $\mathcal{X} \times \{0,1\}$. The goal is to compete with the best hypothesis in $\mathcal{C}$:

$$\Pr_S\!\left[R(h) \leq \min_{h' \in \mathcal{C}} R(h') + \epsilon\right] \geq 1 - \delta$$

This is harder because even the best hypothesis in $\mathcal{C}$ may have nonzero error (the approximation error). Agnostic PAC learning requires more samples.

The agnostic setting is more realistic: in practice, you never know if the true relationship is exactly representable by your model class.

What breaks when realizability is dropped. Under realizability, the single-hypothesis tail used $\Pr[h \text{ consistent}] \leq (1 - \epsilon)^n$, which is a Bernoulli tail on clean labels: each independent example has probability $\geq \epsilon$ of exposing a bad $h$. In the agnostic setting there is no $h$ with zero population error to anchor against, so the relevant question becomes "how close is $\hat{R}_n(h)$ to $R(h)$ uniformly?" That is a concentration question, answered by Hoeffding: for a single bounded $h$, $\Pr[|\hat{R}_n(h) - R(h)| > \epsilon] \leq 2 e^{-2n\epsilon^2}$. The exponent flips from $-n\epsilon$ to $-n\epsilon^2$, and inverting it gives $n = O(\log(M/\delta)/\epsilon^2)$ instead of $O(\log(M/\delta)/\epsilon)$. The square is the unavoidable cost of estimating an unknown mean rather than detecting a single bad consistent hypothesis. See the realizable vs. agnostic comparison for the side-by-side derivation.
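A quick numeric illustration of this exponent flip (a minimal sketch, not from the source; the explicit constants come from the standard geometric-tail and Hoeffding-plus-union-bound derivations, so they differ slightly from the asymptotic forms quoted above):

```python
import math

def realizable_n(M, eps, delta):
    """Smallest n with M * exp(-n * eps) <= delta (realizable finite class)."""
    return math.ceil((math.log(M) + math.log(1 / delta)) / eps)

def agnostic_n(M, eps, delta):
    """Smallest n with 2 * M * exp(-2 * n * eps**2) <= delta (Hoeffding + union bound)."""
    return math.ceil((math.log(2 * M) + math.log(1 / delta)) / (2 * eps ** 2))

M, delta = 1024, 0.05
for eps in (0.10, 0.05, 0.01):
    print(f"eps={eps:4.2f}  realizable n >= {realizable_n(M, eps, delta):5d}  "
          f"agnostic n >= {agnostic_n(M, eps, delta):6d}")
```

The agnostic column bounds the uniform estimation error rather than the excess risk directly, but the $1/\epsilon$ versus $1/\epsilon^2$ scaling is the point.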

Main Theorems

Theorem

PAC Bound for Finite Hypothesis Classes

Statement

Let $\mathcal{C}$ be a finite concept class with $|\mathcal{C}| = M$. In the realizable setting, the ERM algorithm (which outputs any $h \in \mathcal{C}$ consistent with the training data) satisfies: for any $\epsilon, \delta > 0$, if the sample size satisfies:

$$n \geq \frac{1}{\epsilon}\left(\log M + \log \frac{1}{\delta}\right)$$

then with probability $\geq 1 - \delta$, the output $h$ has $R(h) \leq \epsilon$.

Intuition

Each "bad" hypothesis (one with error >ϵ> \epsilon) has probability at most (1ϵ)nenϵ(1 - \epsilon)^n \leq e^{-n\epsilon} of being consistent with all nn training examples. There are at most MM bad hypotheses. By a union bound, the probability that any bad hypothesis survives is at most MenϵM e^{-n\epsilon}. Setting this δ\leq \delta and solving for nn gives the bound.

Proof Sketch

The proof is a clean three-step union-bound argument. We give every step explicitly because this template recurs throughout statistical learning theory.

Step 1: Single-hypothesis tail. Fix one hypothesis $h \in \mathcal{C}$ with $R(h) > \epsilon$. Call $h$ bad. For one random example $(x_i, c^*(x_i))$ with $x_i \sim \mathcal{D}$, the probability that $h(x_i) = c^*(x_i)$ is at most $1 - \epsilon$ (because $h$ disagrees with $c^*$ on at least an $\epsilon$ fraction of $\mathcal{D}$). Independence across the $n$ training points gives:

$$\Pr_S[h \text{ consistent with } S] = \prod_{i=1}^n \Pr[h(x_i) = c^*(x_i)] \leq (1 - \epsilon)^n$$

The convexity inequality $1 - x \leq e^{-x}$ for $x \geq 0$ (the tangent line to $e^{-x}$ at $x = 0$) relaxes this to the cleaner exponential form:

$$\Pr_S[h \text{ consistent with } S] \leq (1 - \epsilon)^n \leq e^{-n\epsilon}$$

Step 2: Union bound over the bad set. Let $\mathcal{C}_{\text{bad}} = \{h \in \mathcal{C} : R(h) > \epsilon\}$, with $|\mathcal{C}_{\text{bad}}| \leq |\mathcal{C}| = M$. ERM fails when some bad hypothesis is consistent with $S$ (because then ERM may return it). The union bound gives:

$$\Pr_S\!\left[\exists h \in \mathcal{C}_{\text{bad}} : h \text{ consistent with } S\right] \leq \sum_{h \in \mathcal{C}_{\text{bad}}} e^{-n\epsilon} \leq M e^{-n\epsilon}$$

The factor of $M$ is the price for searching over the whole class. Note it enters additively across hypotheses, which is what allows the logarithm to absorb it in the next step.

Step 3: Solve for $n$. We want the failure probability $\leq \delta$:

$$M e^{-n\epsilon} \leq \delta \iff e^{-n\epsilon} \leq \frac{\delta}{M} \iff n\epsilon \geq \log\frac{M}{\delta}$$

Rearranging gives the sample complexity:

$$n \geq \frac{1}{\epsilon}\left(\log M + \log\frac{1}{\delta}\right)$$

The logarithmic dependence on $M$ is the key. A class with $2^d$ hypotheses needs only $O(d/\epsilon)$ samples, not $O(2^d/\epsilon)$. This is what makes ERM tractable even for combinatorially large finite classes. For infinite classes, $\log M = \infty$ and this argument fails; VC theory replaces $\log M$ with the VC dimension $d_{\text{VC}}$ via symmetrization and the Sauer-Shelah lemma.
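To see the union-bound argument numerically, here is a small simulation (a toy sketch with made-up parameters, not from the source): for the finite class of threshold functions on $\{0, \ldots, 99\}$ under the uniform distribution, it estimates how often some $\epsilon$-bad threshold remains consistent with $n$ samples and compares that failure rate with $M e^{-n\epsilon}$.

```python
import math
import random

random.seed(0)

# Toy setup: instance space {0,...,99}, uniform D, thresholds h_t(x) = 1 iff x >= t.
# Target threshold t* = 50, so M = 101 hypotheses and realizability holds.
POINTS = list(range(100))
THRESHOLDS = list(range(101))
T_STAR = 50

def risk(t):
    """Population error of threshold t against t* under the uniform distribution."""
    return abs(t - T_STAR) / len(POINTS)

def erm_fails(n, eps):
    """Draw n i.i.d. examples; return True if some eps-bad threshold stays consistent."""
    sample = [(x, x >= T_STAR) for x in random.choices(POINTS, k=n)]
    return any(
        risk(t) > eps and all((x >= t) == y for x, y in sample)
        for t in THRESHOLDS
    )

M, eps, trials = len(THRESHOLDS), 0.1, 2000
for n in (10, 30, 60, 100):
    empirical = sum(erm_fails(n, eps) for _ in range(trials)) / trials
    bound = min(1.0, M * math.exp(-n * eps))
    print(f"n={n:3d}  empirical failure rate={empirical:.3f}  union bound={bound:.3f}")
```

The empirical failure rate sits well below the bound, as expected: the union bound double-counts overlapping bad hypotheses.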

Why It Matters

This is the simplest PAC bound and the template for all subsequent results. The sample complexity is $O(\frac{1}{\epsilon}\log M)$: logarithmic in the size of the hypothesis class. This logarithmic dependence is what makes learning from finite classes tractable even when $M$ is exponentially large.

Failure Mode

This bound requires the realizable assumption ($c^* \in \mathcal{C}$). In the agnostic case, the rate changes from $O(1/\epsilon)$ to $O(1/\epsilon^2)$. The square is the cost of not knowing the target is in your class.

Theorem

Fundamental Theorem of PAC Learning

Statement

For binary classification with 0-1 loss, the following are equivalent:

  1. $\mathcal{C}$ has finite VC dimension
  2. $\mathcal{C}$ is agnostic PAC-learnable
  3. $\mathcal{C}$ is PAC-learnable (realizable case)
  4. ERM over $\mathcal{C}$ is a successful agnostic PAC learner

Moreover, the sample complexity of agnostic PAC learning is:

$$n_{\text{agnostic}} = \Theta\!\left(\frac{d_{\text{VC}} + \log(1/\delta)}{\epsilon^2}\right)$$

and in the realizable case (where some $c^* \in \mathcal{C}$ has zero risk):

$$n_{\text{realizable}} = O\!\left(\frac{d_{\text{VC}} \log(1/\epsilon) + \log(1/\delta)}{\epsilon}\right)$$

where $d_{\text{VC}}$ is the VC dimension of $\mathcal{C}$. The realizable case scales as $1/\epsilon$ rather than $1/\epsilon^2$, the same quadratic-versus-linear gap that appeared in the finite-class bound, now expressed in terms of $d_{\text{VC}}$ instead of $\log|\mathcal{C}|$. The $\log(1/\epsilon)$ factor in the realizable upper bound can be removed by an improper learner (Hanneke 2016, JMLR), giving the optimal $\Theta((d_{\text{VC}} + \log(1/\delta))/\epsilon)$ rate.

Intuition

The VC dimension is the exact measure of hypothesis class complexity that determines learnability. A class is learnable if and only if it cannot shatter arbitrarily large sets of points. The VC dimension replaces the crude $\log M$ of the finite case with a geometric measure of complexity that works for infinite classes.

Why It Matters

This is the most important theorem in statistical learning theory. It completely characterizes PAC learnability: finite VC dimension is both necessary and sufficient. Every other complexity measure (Rademacher complexity, covering numbers, fat-shattering dimension) can be related back to this fundamental characterization.

Failure Mode

This theorem applies only to binary classification with 0-1 loss. For regression, multi-class classification, or other loss functions, the characterization is more complex and involves different complexity measures.

Proof Ideas and Templates Used

The proofs in PAC theory repeatedly use two techniques:

  1. Union bound + pointwise concentration: Bound the probability that a single hypothesis is misleading (pointwise), then take a union bound over all hypotheses. This works for finite classes.

  2. Symmetrization + growth function: For infinite classes, the union bound over hypotheses is replaced by a union bound over behaviors of hypotheses on a ghost sample. The number of distinct behaviors is bounded by the growth function $\Pi_{\mathcal{C}}(n)$, which is polynomial in $n$ when the VC dimension is finite (Sauer's lemma); the short computation after this list illustrates the polynomial growth.
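A minimal sketch of that growth gap (assuming the Sauer-Shelah form $\Pi_{\mathcal{C}}(n) \le \sum_{i=0}^{d_{\text{VC}}} \binom{n}{i}$, which is $O(n^{d_{\text{VC}}})$):

```python
import math

def sauer_bound(n, d_vc):
    """Sauer-Shelah bound on the growth function: sum_{i=0}^{d_vc} C(n, i)."""
    return sum(math.comb(n, i) for i in range(min(d_vc, n) + 1))

d_vc = 3
for n in (5, 10, 20, 40):
    print(f"n={n:3d}  Sauer bound={sauer_bound(n, d_vc):7d}  all labelings 2^n={2 ** n:>14d}")
```

The number of behaviors a finite-VC class can realize grows polynomially in $n$, while the number of possible labelings grows exponentially; symmetrization exploits exactly this gap.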

Canonical Examples

Example

Learning conjunctions

The class of conjunctions over $n$ Boolean variables contains $3^n$ hypotheses (each variable can appear positive, negative, or absent). The PAC bound gives sample complexity $O(\frac{n}{\epsilon}\log 3) = O(\frac{n}{\epsilon})$. There is also a simple polynomial-time algorithm: start with all literals and remove any literal falsified by a positive example.
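A minimal sketch of that elimination algorithm (toy code, not from the source; literals are represented as `(index, polarity)` pairs):

```python
def learn_conjunction(samples):
    """Learn a conjunction of literals from labeled Boolean vectors (realizable case).

    samples: list of (x, y) with x a tuple of 0/1 values and y the 0/1 label.
    Start with every literal x_i and not-x_i, then drop any literal falsified
    by a positive example. Negative examples are never needed here.
    """
    n = len(samples[0][0])
    # (i, True) means "x_i must be 1"; (i, False) means "x_i must be 0".
    literals = {(i, pol) for i in range(n) for pol in (True, False)}
    for x, y in samples:
        if y == 1:
            literals -= {(i, pol) for i, pol in literals if bool(x[i]) != pol}
    return literals

def predict(literals, x):
    return int(all(bool(x[i]) == pol for i, pol in literals))

# Toy target: x_0 AND (not x_2) over 4 variables.
data = [((1, 0, 0, 1), 1), ((1, 1, 0, 0), 1), ((0, 1, 0, 1), 0), ((1, 1, 1, 0), 0)]
h = learn_conjunction(data)
print(h)                                   # {(0, True), (2, False)}
print([predict(h, x) for x, _ in data])    # matches the labels
```

Positive examples can only delete literals that the target conjunction also excludes, so the learned hypothesis never errs on negatives; the PAC bound controls how many positives are needed before the surviving extra literals stop causing false negatives.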

Example

Learning intervals on the real line

The class of intervals $[a, b] \subset \mathbb{R}$ has VC dimension 2. By the fundamental theorem, the sample complexity for agnostic PAC learning is $O(1/\epsilon^2)$, independent of the number of possible intervals (which is uncountably infinite).
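In the realizable case, a natural ERM learner for this class returns the tightest interval containing the positive examples. A toy sketch (assumptions: realizable labels, and any consistent answer is acceptable when no positives are seen):

```python
def learn_interval(samples):
    """ERM for intervals in the realizable case: tightest interval around the positives.

    samples: list of (x, y) with x a float and y in {0, 1}.
    Returns (a, b); predict 1 iff a <= x <= b.
    """
    positives = [x for x, y in samples if y == 1]
    if not positives:
        return (0.0, -1.0)  # empty interval: predicts 0 everywhere, still consistent
    return (min(positives), max(positives))

data = [(0.20, 0), (0.35, 1), (0.50, 1), (0.62, 1), (0.90, 0)]
print(learn_interval(data))  # (0.35, 0.62): tightest interval consistent with the sample
```

By the standard two-strip argument, this learner needs on the order of $\frac{2}{\epsilon}\log\frac{2}{\delta}$ samples in the realizable case, even though the class is uncountable.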

Worked Example: Numeric Walkthrough

Take a concrete set of accuracy targets and compare what the realizable finite-class bound and the agnostic VC bound actually demand. Fix $d_{\text{VC}} = 10$ (a small parametric class such as a 10-dimensional halfspace family with margin), and set $\delta = 0.05$ throughout.

Realizable finite-class bound (from $n \ge \frac{1}{\epsilon}(\log M + \log(1/\delta))$, with $M = 2^{d_{\text{VC}}} = 1024$):

| $\epsilon$ | $\log M$ | $\log(1/\delta)$ | Sample size $n$ |
|---|---|---|---|
| 0.10 | 6.93 | 3.00 | 99 |
| 0.05 | 6.93 | 3.00 | 199 |
| 0.01 | 6.93 | 3.00 | 993 |

The bound scales as $1/\epsilon$. Halving $\epsilon$ doubles the sample requirement.

Agnostic VC bound (from $n = \Theta((d_{\text{VC}} + \log(1/\delta))/\epsilon^2)$, dropping $\Theta$ constants):

| $\epsilon$ | $d_{\text{VC}} + \log(1/\delta)$ | Sample size $n$ |
|---|---|---|
| 0.10 | 13 | 1300 |
| 0.05 | 13 | 5200 |
| 0.01 | 13 | 130000 |

The agnostic bound scales as $1/\epsilon^2$. Halving $\epsilon$ quadruples the sample requirement. At $\epsilon = 0.01$, the agnostic case demands roughly $130\times$ more data than the realizable finite-class bound: the quadratic price of not knowing whether the target is in your class. This factor blows up further as $\epsilon$ shrinks, which is why agnostic guarantees at very small error tolerances are expensive in practice.

The ratio of agnostic-to-realizable sample complexity is $\Theta(1/\epsilon)$, a ratio that grows as you demand smaller error. At $\epsilon = 0.001$, the agnostic bound is already about $1000\times$ larger than the realizable bound. This gap is the formal content of the agnostic-vs-realizable comparison.
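The table entries can be reproduced in a few lines (a sketch; the agnostic column drops the unspecified $\Theta$ constants, exactly as the table above does):

```python
import math

def finite_class_realizable(M, eps, delta):
    """Realizable finite-class bound: (log M + log(1/delta)) / eps."""
    return (math.log(M) + math.log(1 / delta)) / eps

def vc_agnostic(d_vc, eps, delta):
    """Agnostic VC rate with Theta-constants dropped: (d_vc + log(1/delta)) / eps^2."""
    return (d_vc + math.log(1 / delta)) / eps ** 2

d_vc, delta = 10, 0.05
M = 2 ** d_vc
for eps in (0.10, 0.05, 0.01):
    n_real = finite_class_realizable(M, eps, delta)
    n_agn = vc_agnostic(d_vc, eps, delta)
    print(f"eps={eps:4.2f}  realizable n ~ {n_real:6.0f}  agnostic n ~ {n_agn:9.0f}  "
          f"ratio ~ {n_agn / n_real:5.1f}")
```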

Lower Bounds: The Bound Is Tight

The upper bounds above are not loose. For binary classification, the agnostic VC sample complexity is tight up to logarithmic factors: there exist matching lower bounds.

Theorem

Sample-Complexity Lower Bound for VC Classes

Statement

For every PAC learning algorithm $\mathcal{A}$ and every concept class $\mathcal{C}$ with $d_{\text{VC}}(\mathcal{C}) = d$, there exist distributions and target concepts on which $\mathcal{A}$ requires at least

$$n \ge c \cdot \frac{d + \log(1/\delta)}{\epsilon^2}$$

samples to achieve $\Pr[R(h) \le \epsilon] \ge 1 - \delta$ in the agnostic setting, for an absolute constant $c > 0$.

Intuition

The lower bound is constructed by exhibiting a hard distribution: pick $d$ shattered points and randomize their labels. Any algorithm with too few samples must miss at least one of the $2^d$ possible labellings, and for that labelling its hypothesis is forced to error $\Theta(\epsilon)$. The $1/\epsilon^2$ rate emerges from the variance of empirical-risk estimates on the hard distribution; it cannot be improved without distribution-specific assumptions.

Why It Matters

This says the agnostic PAC sample complexity is tight up to constants and logarithmic factors. You cannot beat $1/\epsilon^2$ in the worst case, no matter what algorithm you design. Any improvement must come from distribution-specific structure (margin, low noise, smoothness), which is the content of fast-rate generalization theory.

Failure Mode

The lower bound requires an adversarial distribution. Real data has structure (margin, smoothness, low noise) that allows faster rates. The $1/\epsilon^2$ floor is for the worst-case PAC framework, not for any specific distribution.

The realizable PAC bound has an analogous matching lower bound at $n = \Omega(d/\epsilon)$, also up to log factors. So the entire PAC framework is essentially complete: VC dimension determines sample complexity, and the bounds are tight.

Connection to Uniform Convergence

The PAC framework and the uniform convergence framework are two sides of the same coin. PAC asks: how many samples does a learner need? Uniform convergence asks: how fast does the empirical risk converge to the true risk uniformly over the hypothesis class? The two are equivalent for binary classification with 0-1 loss.

The bridge: a class is agnostic PAC-learnable if and only if its loss class is a Glivenko-Cantelli class, exactly the classes for which the uniform deviation $\sup_h |\hat{R}_n(h) - R(h)|$ converges to 0 in probability. ERM is a successful agnostic PAC learner because uniform convergence ensures the empirical-risk minimizer is close to the population-risk minimizer.

For finite classes: uniform convergence follows from a union bound and Hoeffding's inequality. For infinite classes with finite VC dimension: uniform convergence follows from symmetrization plus the Sauer-Shelah polynomial growth bound on the number of distinct hypothesis behaviors on $n$ points. Both routes yield the same $O(1/\epsilon^2)$ agnostic rate.
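To see the finite-class route in action, here is a toy simulation (a sketch, not from the source): for a small class of threshold classifiers under label noise, it estimates the uniform deviation $\sup_h |\hat{R}_n(h) - R(h)|$ and compares it with the Hoeffding-style $\sqrt{\log M / n}$ scale.

```python
import math
import random

random.seed(1)

# Thresholds h_t(x) = 1 iff x >= t on {0,...,99}; labels are (x >= 50) flipped w.p. 0.2.
POINTS = list(range(100))
THRESHOLDS = list(range(101))
NOISE = 0.2

def true_risk(t):
    """Population risk of threshold t under uniform D and NOISE label flips."""
    clean_err = abs(t - 50) / len(POINTS)
    return clean_err * (1 - NOISE) + (1 - clean_err) * NOISE

def uniform_deviation(n):
    sample = [(x, (x >= 50) ^ (random.random() < NOISE))
              for x in random.choices(POINTS, k=n)]
    return max(
        abs(sum((x >= t) != y for x, y in sample) / n - true_risk(t))
        for t in THRESHOLDS
    )

M = len(THRESHOLDS)
for n in (100, 400, 1600, 6400):
    avg = sum(uniform_deviation(n) for _ in range(50)) / 50
    print(f"n={n:5d}  avg sup deviation={avg:.3f}  sqrt(log M / n)={math.sqrt(math.log(M) / n):.3f}")
```

The sup deviation shrinks at roughly the $1/\sqrt{n}$ rate that uniform convergence predicts, which is exactly why ERM's excess risk shrinks at the same rate.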

Common Confusions

Watch Out

PAC learnability is distribution-free

The PAC guarantee must hold for every distribution $\mathcal{D}$. This is a very strong requirement. It means the sample complexity bounds are worst-case over distributions. For specific, well-behaved distributions, you might need far fewer samples. This is both the strength (universality) and weakness (conservatism) of PAC theory.

Watch Out

Efficient learnability requires polynomial time

PAC learnability as defined above only requires polynomial sample complexity. It says nothing about computational efficiency. A class can be PAC-learnable (finite VC dimension) but not efficiently PAC-learnable (finding the ERM hypothesis is NP-hard). For example, learning intersections of halfspaces is PAC-learnable but believed to be computationally hard.

Watch Out

The realizable assumption is very strong

Assuming $c^* \in \mathcal{C}$ means you know the true concept is representable by your hypothesis class. In practice, this is almost never true: real model classes are almost always misspecified to some degree. The agnostic setting is more realistic but requires $O(1/\epsilon^2)$ samples instead of $O(1/\epsilon)$. The quadratic cost of agnosticism is a fundamental price you pay for not knowing the true model.

Summary

  • PAC = Probably ($1-\delta$) Approximately ($\epsilon$) Correct
  • Sample complexity for finite class (realizable): $n = O(\frac{1}{\epsilon}\log|\mathcal{C}|)$
  • Sample complexity for finite VC dimension (agnostic): $n = O(\frac{d_{\text{VC}}}{\epsilon^2})$
  • The fundamental theorem: finite VC dimension is equivalent to PAC learnability for binary classification
  • PAC is distribution-free: bounds hold for the worst-case distribution
  • Computational complexity is a separate question from sample complexity

Exercises

ExerciseCore

Problem

A concept class $\mathcal{C}$ has 256 concepts. In the realizable setting, how many samples do you need to guarantee that with probability at least $0.9$, the ERM hypothesis has error at most $0.05$?

ExerciseCore

Problem

Explain in one sentence why the sample complexity for the agnostic case is $O(1/\epsilon^2)$ while the realizable case is $O(1/\epsilon)$.

ExerciseAdvanced

Problem

The class of all Boolean functions on $n$ bits has $2^{2^n}$ concepts. What is the PAC sample complexity in the realizable case? Is this class efficiently PAC-learnable?


References

Canonical:

  • Valiant, "A Theory of the Learnable," Communications of the ACM (1984)
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning, Chapters 2-6
  • Kearns & Vazirani, An Introduction to Computational Learning Theory (1994), Chapters 1-3

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 2
  • Wainwright, High-Dimensional Statistics (2019), Chapters 4-6
  • Bartlett & Mendelson, "Rademacher and Gaussian Complexities: Risk Bounds and Structural Results," JMLR 3 (2002), 463-482. The Rademacher route to PAC bounds; often tighter than the VC route and the standard reference for data-dependent generalization analysis. jmlr.org
  • Hopkins, Kane, Lovett, Mahajan, "Realizable Learning is All You Need," COLT (2022), arXiv:2111.04746 — proves that for any concept class, agnostic learnability is equivalent to realizable learnability up to a polynomial sample-complexity blowup
  • Brukhim, Carmon, Dinur, Moran, Yehudayoff, "A Characterization of Multiclass Learnability," FOCS (2022), arXiv:2203.01550 — closes the multiclass case with the DS dimension, settling a 30-year open problem about which multiclass classes are PAC-learnable

Next Topics

The natural next steps from PAC learning:

  • Empirical risk minimization: the algorithmic principle that implements PAC learning
  • VC dimension: the complexity measure that characterizes PAC learnability

Last reviewed: May 4, 2026
