
Learning Theory Core

Uniform Convergence

Uniform convergence of empirical risk to population risk over an entire hypothesis class: the key property that makes ERM provably work.


Why This Matters

theorem visual

ERM works when the whole landscape is close

Pointwise convergence controls one fixed hypothesis. Uniform convergence controls the worst gap over the whole hypothesis class, including the hypothesis selected after seeing data.

Uniform gap: the important quantity is the worst mismatch across the class, not the mismatch at one fixed h.

ERM transfer: if the two landscapes are close everywhere, the empirical minimizer cannot be much worse than the true minimizer.

Excess risk: the two \epsilon losses come from moving from true risk to empirical risk and back again.

You already know that ERM minimizes empirical risk \hat{R}_n(h) as a proxy for population risk R(h). But why should this work? If the empirical risk is close to the population risk for one particular hypothesis, that is not enough. ERM searches over the entire hypothesis class \mathcal{H}, so we need the approximation to hold simultaneously for every h \in \mathcal{H}.

This is exactly what uniform convergence gives you. For any class \mathcal{H} small enough that uniform convergence holds at rate \epsilon, "training looks good" provably implies "the model generalizes within 2\epsilon" — it is the single most important conceptual bridge between ERM on a finite sample and population risk.

A caveat worth stating up front: uniform convergence is a sufficient route to generalization, not a necessary one. Modern learning theory has identified several alternatives that can certify generalization even when \sup_h |\hat{R}_n(h) - R(h)| is large or vacuous:

  • Algorithmic stability (Bousquet-Elisseeff 2002; Hardt-Recht-Singer 2016): if the learner's output changes little when one training point is swapped, generalization follows directly, without controlling the whole class.
  • PAC-Bayes (McAllester 1999; Catoni 2007; Dziugaite-Roy 2017): bounds the risk of a posterior over hypotheses in terms of a KL divergence to any data-independent prior, often giving non-vacuous bounds for deep networks where uniform convergence is vacuous.
  • Margin- and norm-based bounds (Bartlett 1998; Bartlett-Mendelson 2002; Bartlett-Foster-Telgarsky 2017; Neyshabur et al. 2017): control generalization via data-dependent complexities of the learned predictor rather than the worst case over \mathcal{H}.
  • Benign / tempered overfitting (Belkin et al. 2019; Bartlett et al. 2020; Mallinar et al. 2022): in highly overparameterized regimes some estimators interpolate noise yet still generalize, which Nagarajan-Kolter 2019 and Negrea et al. 2020 show cannot be explained by any uniform convergence bound over the relevant hypothesis class.
  • Implicit bias of optimization (Soudry et al. 2018; Gunasekar et al. 2018): SGD restricts the effective hypothesis class to a much smaller data-dependent subset, so worst-case complexity over \mathcal{H} is the wrong object.

In short: without uniform convergence, low training loss is not automatically meaningless — there are other routes to generalization — but uniform convergence remains the cleanest sufficient condition and the right starting point.

Mental Model

For a fixed hypothesis h, the law of large numbers gives \hat{R}_n(h) \to R(h) as n \to \infty. This is pointwise convergence and says nothing about the worst-case h \in \mathcal{H}. The estimator that ERM actually returns is the empirical minimizer, which is itself a function of the noise.

Uniform convergence is the stronger statement that the two functions h \mapsto \hat{R}_n(h) and h \mapsto R(h) are within \epsilon of each other simultaneously over all h \in \mathcal{H}. If that holds, the empirical minimizer is at most 2\epsilon above the true minimizer — a triangle inequality on the noise. The hard work is showing when uniform convergence holds and at what rate; the answer depends on the complexity of \mathcal{H} measured by VC dimension, Rademacher complexity, or covering numbers.
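The pointwise-versus-uniform distinction can be seen numerically. The sketch below uses a setup of my own choosing (not from the text): the threshold class h_t(x) = 1[x <= t] with x ~ Uniform(0, 1) and true labels 1[x <= 0.3], so the population risk has the closed form R(h_t) = |t - 0.3|. The deviation at one fixed hypothesis is systematically smaller than the worst-case deviation over the whole class:

```python
import random

# Illustrative setup: x ~ Uniform(0,1), y = 1[x <= 0.3], class = thresholds
# h_t(x) = 1[x <= t] on a grid. Population risk: R(h_t) = |t - 0.3|.
def emp_risk(t, sample):
    # fraction of sample points where h_t disagrees with the true label
    return sum((x <= t) != (x <= 0.3) for x in sample) / len(sample)

random.seed(0)
grid = [i / 100 for i in range(101)]
n, trials = 200, 200
fixed_gaps, sup_gaps = [], []
for _ in range(trials):
    s = [random.random() for _ in range(n)]
    gaps = [abs(abs(t - 0.3) - emp_risk(t, s)) for t in grid]
    fixed_gaps.append(gaps[50])   # deviation at one fixed hypothesis, t = 0.5
    sup_gaps.append(max(gaps))    # worst-case deviation over the whole class

print("mean fixed-h gap:", sum(fixed_gaps) / trials)
print("mean uniform gap:", sum(sup_gaps) / trials)
```

By construction the uniform gap dominates the fixed-hypothesis gap on every trial; the richer the class, the larger the sup grows relative to any single hypothesis.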

Formal Setup and Notation

We work in the standard supervised learning setting. \mathcal{D} is a distribution over \mathcal{X} \times \mathcal{Y}, \ell is a loss function bounded in [0, 1], and S = \{(x_1, y_1), \ldots, (x_n, y_n)\} is drawn i.i.d. from \mathcal{D}.

Definition

Uniform Convergence Property

A hypothesis class \mathcal{H} has the uniform convergence property if and only if for every \epsilon, \delta > 0, there exists m(\epsilon, \delta) such that for all n \geq m(\epsilon, \delta) and all distributions \mathcal{D}:

\Pr_S\!\left[\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| > \epsilon\right] < \delta

The crucial point is the \sup over \mathcal{H}. The bound must hold for every hypothesis simultaneously, not just for one at a time.

Definition

Epsilon-Representative Sample

A training sample S is \epsilon-representative with respect to \mathcal{H}, \ell, and \mathcal{D} if and only if:

\forall h \in \mathcal{H}: \; |R(h) - \hat{R}_n(h)| \leq \epsilon

In words: the empirical risk of every hypothesis in the class is within \epsilon of its population risk. When your sample is \epsilon-representative, you can trust empirical risk as a proxy for true risk.

Core Definitions

The distinction between pointwise and uniform convergence is the heart of this topic.

Pointwise convergence says: for each fixed h, as n \to \infty, \hat{R}_n(h) \to R(h) in probability. This follows directly from the law of large numbers and requires no assumptions on \mathcal{H}.

Uniform convergence says: the worst-case deviation over the class vanishes:

\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \xrightarrow{p} 0

The gap between pointwise and uniform is the gap between "each hypothesis individually has small generalization error" and "the hypothesis selected by ERM has small generalization error." ERM selects h based on the data, creating a dependence that pointwise convergence cannot handle.

The sample complexity of uniform convergence for \mathcal{H} is the function m_{\mathcal{H}}^{\text{UC}}(\epsilon, \delta): the smallest n guaranteeing that S is \epsilon-representative with probability at least 1 - \delta.

Main Theorems

Lemma

Epsilon-Representative Implies ERM Works

Statement

If S is \epsilon-representative with respect to \mathcal{H}, \ell, and \mathcal{D}, then the ERM hypothesis h_S = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h) satisfies:

R(h_S) \leq \min_{h \in \mathcal{H}} R(h) + 2\epsilon

Intuition

If every hypothesis has empirical risk within \epsilon of its population risk, then minimizing empirical risk gets you within 2\epsilon of the best possible population risk in the class. You lose \epsilon going from population to empirical (for the best hypothesis), and another \epsilon going back (for the ERM hypothesis).

Proof Sketch

Let h^* = \arg\min_{h \in \mathcal{H}} R(h). Since S is \epsilon-representative:

R(h_S) \leq \hat{R}_n(h_S) + \epsilon \leq \hat{R}_n(h^*) + \epsilon \leq R(h^*) + 2\epsilon

The first inequality uses |R(h_S) - \hat{R}_n(h_S)| \leq \epsilon. The second uses the fact that h_S minimizes empirical risk. The third uses |R(h^*) - \hat{R}_n(h^*)| \leq \epsilon.
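Because the argument is purely deterministic once the \epsilon-representative property holds, it can be sanity-checked mechanically. The sketch below (my own construction, not from the text) draws synthetic population risks for a finite class, perturbs each by at most \epsilon to manufacture an \epsilon-representative empirical risk table, and verifies the 2\epsilon excess-risk bound:

```python
import random

# Sanity check of the lemma on synthetic risk tables: Rhat is
# eps-representative by construction, so ERM's excess risk is at most 2*eps.
random.seed(1)
eps, K = 0.1, 20
for _ in range(1000):
    R = [random.random() for _ in range(K)]                  # population risks
    Rhat = [r + random.uniform(-eps, eps) for r in R]        # |R - Rhat| <= eps
    h_S = min(range(K), key=lambda i: Rhat[i])               # ERM: empirical minimizer
    assert R[h_S] <= min(R) + 2 * eps + 1e-12                # the 2*eps bound holds
print("lemma verified on 1000 random risk tables")
```

Note that no distributional assumption appears anywhere: the check exercises exactly the deterministic triangle-inequality step of the proof.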

Why It Matters

This is the only lemma you need to reduce the problem of "does ERM learn?" to "does uniform convergence hold?" It cleanly separates the statistical question (uniform convergence) from the algorithmic question (ERM).

Failure Mode

The factor of 2\epsilon is tight. You cannot improve it to \epsilon without additional assumptions, because ERM can select a hypothesis whose empirical risk is artificially low.

Theorem

Uniform Convergence is Sufficient for Learnability

Statement

If \mathcal{H} has the uniform convergence property with sample complexity m_{\mathcal{H}}^{\text{UC}}(\epsilon, \delta), then \mathcal{H} is agnostically PAC learnable by ERM with sample complexity:

m_{\mathcal{H}}(\epsilon, \delta) \leq m_{\mathcal{H}}^{\text{UC}}(\epsilon/2, \delta)

Intuition

Uniform convergence with accuracy \epsilon/2 gives an (\epsilon/2)-representative sample. By the previous lemma, ERM on such a sample achieves excess risk at most 2 \cdot (\epsilon/2) = \epsilon. So uniform convergence directly implies PAC learnability.

Proof Sketch

Set the uniform convergence guarantee at level \epsilon/2. With probability \geq 1 - \delta, the sample is (\epsilon/2)-representative. Apply the \epsilon-representative lemma to get R(h_S) \leq R(h^*) + \epsilon.

Why It Matters

This is the fundamental theorem connecting uniform convergence to learning. It says: to prove ERM works, it suffices to prove uniform convergence. All classical generalization bounds (finite class, VC dimension, Rademacher complexity) work by establishing uniform convergence.

Failure Mode

Uniform convergence is sufficient but not necessary for learnability in general. There exist learnable classes where uniform convergence fails, but they require algorithms other than ERM. For ERM specifically in binary classification, the fundamental theorem of statistical learning theory shows uniform convergence is both necessary and sufficient.

Theorem

Uniform Convergence for Finite Classes

Statement

If \mathcal{H} is finite, then for any \epsilon, \delta > 0, with probability at least 1 - \delta:

\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2n}}

Equivalently, a sample of size n \geq \frac{\log|\mathcal{H}| + \log(2/\delta)}{2\epsilon^2} is \epsilon-representative with probability at least 1 - \delta.

Intuition

Apply Hoeffding to each hypothesis independently, then union bound over the class. The key cost of the union bound is the \log|\mathcal{H}| term: you pay logarithmically in the size of the class, not linearly. This is why exponentially large hypothesis classes can still have reasonable sample complexity.

Proof Sketch

For any fixed h, Hoeffding's inequality gives:

P(|R(h) - \hat{R}_n(h)| > \epsilon) \leq 2\exp(-2n\epsilon^2)

By the union bound over all h \in \mathcal{H}:

P\!\left(\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| > \epsilon\right) \leq 2|\mathcal{H}|\exp(-2n\epsilon^2)

Set the right side equal to \delta and solve for \epsilon.

Why It Matters

This is the concrete instantiation of uniform convergence for the simplest case. It immediately implies ERM learns any finite hypothesis class with sample complexity O(\log|\mathcal{H}|/\epsilon^2).

Failure Mode

Fails for infinite hypothesis classes: \log|\mathcal{H}| = \infty. The union bound is too wasteful because it treats every hypothesis as independent, ignoring the correlation structure in \mathcal{H}. VC dimension and Rademacher complexity exploit this structure.

Theorem

VC-Style Two-Sided Uniform Deviation Bound

Statement

For a hypothesis class with effective-class cardinality bounded by the Sauer-Shelah growth function \sum_{k \leq d} \binom{n}{k} (for both \ell and -\ell), with B-bounded loss and an i.i.d. sample S \sim \mu^n:

\mu^n\!\left\{S \mid \sup_h |R(h) - \hat{R}_n(h)| \geq 2B\sqrt{\frac{2d\log(en/d)}{n}} + \varepsilon\right\} \leq 2\exp\!\left(-\frac{\varepsilon^2 n}{8B^2}\right)

This is the two-sided uniform deviation bound via the Rademacher route: Sauer-Shelah, Massart, Rademacher symmetrization, Azuma concentration, and union bound over the upper and lower tails.

Intuition

The VC parameter d controls the effective complexity of the hypothesis class on the sample. The bound replaces \log|\mathcal{H}| in the finite-class Hoeffding bound with d \log(en/d), which can be much smaller when the class has limited shattering power. The Azuma constant (8B^2 in the exponent) is a factor of 4 looser than the sharp McDiarmid constant.
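To get a feel for the rate, the bound can be evaluated numerically. The sketch below (values are illustrative; the choice d = 10 and the inversion of the tail at confidence \delta are my own) computes the main term 2B\sqrt{2d\log(en/d)/n} plus the deviation \varepsilon obtained by setting 2\exp(-\varepsilon^2 n / 8B^2) = \delta:

```python
import math

def vc_uniform_bound(d, n, B=1.0, delta=0.05):
    # Main Rademacher/Sauer-Shelah term plus the Azuma deviation at confidence delta.
    main = 2 * B * math.sqrt(2 * d * math.log(math.e * n / d) / n)
    eps = math.sqrt(8 * B ** 2 * math.log(2 / delta) / n)
    return main + eps

for n in (10**3, 10**4, 10**5, 10**6):
    print(n, round(vc_uniform_bound(d=10, n=n), 4))
```

The printed values shrink roughly like \sqrt{d \log n / n}: unlike the finite-class bound, the complexity term grows only logarithmically in n, so the bound goes to zero for any fixed d.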

Proof Ideas and Templates Used

The proofs in this topic use two key patterns:

  1. Triangle inequality on risks: the 2\epsilon bound comes from chaining R(h_S) \leq \hat{R}_n(h_S) + \epsilon and \hat{R}_n(h_S) \leq \hat{R}_n(h^*) \leq R(h^*) + \epsilon. This is a purely deterministic argument once you have the \epsilon-representative property.

  2. Hoeffding + union bound: for finite classes, apply concentration to each hypothesis individually, then pay a \log|\mathcal{H}| penalty to make the bound simultaneous. This is the simplest proof template in learning theory. It breaks for infinite classes because the union bound over uncountably many hypotheses gives infinity.

For establishing uniform convergence for infinite classes, the tools are:

  • VC dimension: combinatorial complexity leading to polynomial growth
  • Rademacher complexity: data-dependent complexity via symmetrization
  • Covering numbers: metric entropy arguments

Canonical Examples

Example

Pointwise holds but uniform fails: the memorizer class

Let \mathcal{X} = \mathbb{R}, \mathcal{Y} = \{0, 1\}, and \mathcal{H}_{\text{all}} = \{0, 1\}^{\mathcal{X}} (all binary functions). For any fixed h \in \mathcal{H}_{\text{all}}, the law of large numbers gives \hat{R}_n(h) \to R(h). Pointwise convergence holds trivially.

But for any sample S of size n, there exists h_S \in \mathcal{H}_{\text{all}} that memorizes S perfectly (\hat{R}_n(h_S) = 0) while predicting incorrectly on all unseen data, so R(h_S) can be arbitrarily close to 1 when \mathcal{D} is atomless. The ERM principle can select this memorizer, so \sup_h |R(h) - \hat{R}_n(h)| stays large.

Uniform convergence fails because the class is too rich. It has infinite VC dimension.
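The memorizer is easy to simulate. In the sketch below (setup is my own: x ~ Uniform(0, 1), which is atomless, and the true label is always 1), h_S returns the memorized label on training points and the wrong label everywhere else:

```python
import random

# Memorizer simulation: perfect on the training sample, wrong on fresh points.
random.seed(2)
train = [(random.random(), 1) for _ in range(100)]   # true label is always 1
memo = {x: y for x, y in train}

def h_S(x):
    # memorized label on S; the wrong label (0) on essentially all other x
    return memo.get(x, 0)

emp = sum(h_S(x) != y for x, y in train) / len(train)
fresh = [(random.random(), 1) for _ in range(10_000)]
test_err = sum(h_S(x) != y for x, y in fresh) / len(fresh)
print(emp, test_err)   # empirical risk 0.0, population-risk estimate near 1.0
```

Empirical risk is exactly zero while the estimated population risk is essentially one: the uniform gap stays at 1 no matter how large n grows, exactly as the example claims.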

Example

Finite class: uniform convergence by union bound

If |\mathcal{H}| = N and the loss is in [0,1], Hoeffding gives for each h: \Pr[|R(h) - \hat{R}_n(h)| > \epsilon] \leq 2e^{-2n\epsilon^2}.

Union bound: \Pr[\exists h: |R(h) - \hat{R}_n(h)| > \epsilon] \leq 2N e^{-2n\epsilon^2}.

Setting this \leq \delta and solving: n \geq \frac{\log(2N/\delta)}{2\epsilon^2}. So finite classes always have the uniform convergence property, with sample complexity O(\log N / \epsilon^2). For N = 10^6, \epsilon = 0.05, and \delta = 0.05:

n \geq \frac{\log(2 \cdot 10^6 / 0.05)}{2 \cdot 0.05^2} = \frac{\log(4 \cdot 10^7)}{0.005} \approx 3{,}500.

Tightening the failure probability is cheap: \delta = 0.01 gives n \approx 3{,}820 and \delta = 0.001 gives n \approx 4{,}280. The \log(1/\delta) dependence is the reason high-confidence learning costs only logarithmically more data.
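The arithmetic above is a one-liner; the helper name below is my own, but the formula is exactly the finite-class bound n \geq \log(2N/\delta) / (2\epsilon^2), rounded up:

```python
import math

def uc_sample_size(N, eps, delta):
    # Smallest n with 2*N*exp(-2*n*eps^2) <= delta (finite-class Hoeffding + union bound).
    return math.ceil(math.log(2 * N / delta) / (2 * eps ** 2))

print(uc_sample_size(10**6, 0.05, 0.05))    # 3501
print(uc_sample_size(10**6, 0.05, 0.01))    # 3823
print(uc_sample_size(10**6, 0.05, 0.001))   # 4284
```

Note how halving \epsilon quadruples n (the 1/\epsilon^2 term), while shrinking \delta tenfold adds only a constant number of samples.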

Example

Threshold classifiers on R: infinite class, finite VC dimension

Let \mathcal{H} = \{h_t : t \in \mathbb{R}\} where h_t(x) = \mathbf{1}[x \leq t]. This is an infinite class, so the finite-class bound does not apply directly. But \text{VCdim}(\mathcal{H}) = 1, and the growth function is \Pi_{\mathcal{H}}(m) = m + 1. For any m points, there are only m + 1 distinct labelings (one for each interval between consecutive points, plus the two extremes). Replacing \log|\mathcal{H}| with \log(m + 1) in the symmetrization argument gives a finite uniform convergence bound despite the infinite class.
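The growth-function count \Pi_{\mathcal{H}}(m) = m + 1 can be verified by enumeration. The sketch below (function name and sample points are my own) picks one representative threshold t per interval, which suffices because every t in the same interval induces the same labeling:

```python
def threshold_labelings(points):
    # Distinct labelings of the sample by h_t(x) = 1[x <= t], over all real t:
    # one representative t below all points, then one at each point.
    cuts = sorted(points)
    reps = [cuts[0] - 1.0] + cuts
    return {tuple(int(x <= t) for x in points) for t in reps}

pts = [0.1, 0.4, 0.7, 0.9, 1.3]
print(len(threshold_labelings(pts)))   # m + 1 = 6
```

Each labeling is a prefix of the sorted sample (possibly empty), so for m distinct points there are exactly m + 1 of them, in sharp contrast to the 2^m labelings an unrestricted class could produce.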

Common Confusions

Watch Out

Pointwise convergence is not enough for ERM

Students often think: "for each h, the empirical risk converges to the population risk, so the minimum of the empirical risks converges to the minimum of the population risks." This is wrong. The min operation and the limit do not commute in general. The issue is that ERM selects which h to evaluate based on the data, introducing dependence. Uniform convergence is precisely the condition that makes this swap valid.

Watch Out

Uniform convergence is not about uniform distributions

Despite sharing the word "uniform," these are unrelated concepts. Uniform convergence means the convergence bound holds simultaneously for all h \in \mathcal{H}. The "uniform" is over the hypothesis class, not over any probability distribution.

Watch Out

The factor of 2 in the excess risk bound is not an artifact

The factor of 2\epsilon in R(h_S) \leq R(h^*) + 2\epsilon is real, not a proof artifact. You lose \epsilon when the best hypothesis appears worse than it is (its empirical risk overestimates its population risk), and another \epsilon when the ERM hypothesis appears better than it is (its empirical risk underestimates its population risk). Both directions of error contribute.

Summary

  • Pointwise convergence (\hat{R}_n(h) \to R(h) for each fixed h) is free from the law of large numbers; uniform convergence (\sup_h |R(h) - \hat{R}_n(h)| \to 0) is the hard part
  • An \epsilon-representative sample guarantees ERM achieves excess risk \leq 2\epsilon
  • To prove ERM works, it suffices to prove uniform convergence
  • Sample complexity for uniform convergence = sample complexity for ERM learning (up to constant factors)
  • Finite classes: uniform convergence holds with n = O(\log|\mathcal{H}|/\epsilon^2)
  • Infinite classes require VC dimension or Rademacher complexity to establish uniform convergence

Exercises

ExerciseCore

Problem

Prove the \epsilon-representative lemma: if \forall h \in \mathcal{H}, |R(h) - \hat{R}_n(h)| \leq \epsilon, then R(h_{\text{ERM}}) \leq R(h^*) + 2\epsilon. Write out each step explicitly and identify where the ERM property is used.

ExerciseCore

Problem

A hypothesis class has |\mathcal{H}| = 10^{12} (one trillion hypotheses). The loss is bounded in [0, 1]. How many i.i.d. samples n do you need so that with probability at least 0.99, the sample is 0.01-representative?

ExerciseAdvanced

Problem

Construct an explicit example of an infinite hypothesis class \mathcal{H} where pointwise convergence holds for every fixed h but uniform convergence fails. That is, exhibit a distribution \mathcal{D} for which \sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| does not converge to zero in probability as n \to \infty.


References

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 3
  • Nagarajan & Kolter, "Uniform convergence may be unable to explain generalization in deep learning," NeurIPS (2019), arXiv:1902.04742. Explicit constructions where every reasonable UC bound on the trained network class is provably loose.
  • Negrea, Dziugaite, Roy, "In Defense of Uniform Convergence: Generalization via Derandomization," ICML (2020), arXiv:1912.04265. Counter-defense: the class on which UC must hold is the (typically smaller) derandomized class, not the full hypothesis class.
  • Wainwright, High-Dimensional Statistics (2019), Chapters 4-6

Next Topics

From uniform convergence, the key question becomes: how do you establish it for infinite hypothesis classes?

  • VC dimension: a combinatorial measure that characterizes uniform convergence for binary classification
  • Rademacher complexity: a data-dependent measure that gives tighter, more general uniform convergence bounds

Last reviewed: April 26, 2026
