
Learning Theory Core

Uniform Convergence

Uniform convergence of empirical risk to population risk over an entire hypothesis class: the key property that makes ERM provably work.


Why This Matters

theorem visual

ERM works when the whole landscape is close

Pointwise convergence controls one fixed hypothesis. Uniform convergence controls the worst gap over the whole hypothesis class, including the hypothesis selected after seeing data.

Uniform gap: the important quantity is the worst mismatch across the class, not the mismatch at one fixed h.

ERM transfer: if the two landscapes are close everywhere, the empirical minimizer cannot be much worse than the true minimizer.

Excess risk: the two \epsilon losses come from moving from true risk to empirical risk and back again.

You already know that ERM minimizes empirical risk \hat{R}_n(h) as a proxy for population risk R(h). But why should this work? If the empirical risk is close to the population risk for one particular hypothesis, that is not enough. ERM searches over the entire hypothesis class \mathcal{H}, so we need the approximation to hold simultaneously for every h \in \mathcal{H}.

This is exactly what uniform convergence gives you. For any class \mathcal{H} small enough that uniform convergence holds at rate \epsilon, "training looks good" provably implies "the model generalizes within 2\epsilon" — it is the single most important conceptual bridge between ERM on a finite sample and population risk.

A caveat worth stating up front: uniform convergence is a sufficient route to generalization, not a necessary one. Modern learning theory has identified several alternatives that can certify generalization even when \sup_h |\hat{R}_n(h) - R(h)| is large or vacuous:

  • Algorithmic stability (Bousquet-Elisseeff 2002; Hardt-Recht-Singer 2016): if the learner's output changes little when one training point is swapped, generalization follows directly, without controlling the whole class.
  • PAC-Bayes (McAllester 1999; Catoni 2007; Dziugaite-Roy 2017): bounds the risk of a posterior over hypotheses in terms of a KL divergence to any data-independent prior, often giving non-vacuous bounds for deep networks where uniform convergence is vacuous.
  • Margin- and norm-based bounds (Bartlett 1998; Bartlett-Mendelson 2002; Bartlett-Foster-Telgarsky 2017; Neyshabur et al. 2017): control generalization via data-dependent complexities of the learned predictor rather than the worst case over \mathcal{H}.
  • Benign / tempered overfitting (Belkin et al. 2019; Bartlett et al. 2020; Mallinar et al. 2022): in highly overparameterized regimes some estimators interpolate noise yet still generalize, which Nagarajan-Kolter 2019 and Negrea et al. 2020 show cannot be explained by any uniform convergence bound over the relevant hypothesis class.
  • Implicit bias of optimization (Soudry et al. 2018; Gunasekar et al. 2018): SGD restricts the effective hypothesis class to a much smaller data-dependent subset, so worst-case complexity over \mathcal{H} is the wrong object.

In short: without uniform convergence, low training loss is not automatically meaningless — there are other routes to generalization — but uniform convergence remains the cleanest sufficient condition and the right starting point.

Mental Model

For a fixed hypothesis h, the law of large numbers gives \hat{R}_n(h) \to R(h) as n \to \infty. This is pointwise convergence and says nothing about the worst-case h \in \mathcal{H}. The estimator that ERM actually returns is the empirical minimizer, which is itself a function of the noise.

Uniform convergence is the stronger statement that the two functions h \mapsto \hat{R}_n(h) and h \mapsto R(h) are within \epsilon of each other simultaneously over all h \in \mathcal{H}. If that holds, the empirical minimizer is at most 2\epsilon above the true minimizer — a triangle inequality on the noise. The hard work is showing when uniform convergence holds and at what rate; the answer depends on the complexity of \mathcal{H} measured by VC dimension, Rademacher complexity, or covering numbers.
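The pointwise-versus-uniform distinction can be seen numerically. The sketch below uses a setup of my own choosing (not from the text): the threshold class h_t(x) = 1[x <= t] with x ~ Uniform(0, 1) and true labels 1[x <= 0.3], so the population risk has the closed form R(h_t) = |t - 0.3|. The deviation at one fixed hypothesis is systematically smaller than the worst-case deviation over the whole class:

```python
import random

# Illustrative setup: x ~ Uniform(0,1), y = 1[x <= 0.3], class = thresholds
# h_t(x) = 1[x <= t] on a grid. Population risk: R(h_t) = |t - 0.3|.
def emp_risk(t, sample):
    # fraction of sample points where h_t disagrees with the true label
    return sum((x <= t) != (x <= 0.3) for x in sample) / len(sample)

random.seed(0)
grid = [i / 100 for i in range(101)]
n, trials = 200, 200
fixed_gaps, sup_gaps = [], []
for _ in range(trials):
    s = [random.random() for _ in range(n)]
    gaps = [abs(abs(t - 0.3) - emp_risk(t, s)) for t in grid]
    fixed_gaps.append(gaps[50])   # deviation at one fixed hypothesis, t = 0.5
    sup_gaps.append(max(gaps))    # worst-case deviation over the whole class

print("mean fixed-h gap:", sum(fixed_gaps) / trials)
print("mean uniform gap:", sum(sup_gaps) / trials)
```

By construction the uniform gap dominates the fixed-hypothesis gap on every trial; the richer the class, the larger the sup grows relative to any single hypothesis.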

Formal Setup and Notation

We work in the standard supervised learning setting. \mathcal{D} is a distribution over \mathcal{X} \times \mathcal{Y}, \ell is a loss function bounded in [0, 1], and S = \{(x_1, y_1), \ldots, (x_n, y_n)\} is drawn i.i.d. from \mathcal{D}.

Definition

Uniform Convergence Property

A hypothesis class \mathcal{H} has the uniform convergence property if and only if for every \epsilon, \delta > 0, there exists m(\epsilon, \delta) such that for all n \geq m(\epsilon, \delta) and all distributions \mathcal{D}:

\Pr_S\!\left[\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| > \epsilon\right] < \delta

The crucial point is the \sup over \mathcal{H}. The bound must hold for every hypothesis simultaneously, not just for one at a time.

Definition

Epsilon-Representative Sample

A training sample S is \epsilon-representative with respect to \mathcal{H}, \ell, and \mathcal{D} if and only if:

\forall h \in \mathcal{H}: \; |R(h) - \hat{R}_n(h)| \leq \epsilon

In words: the empirical risk of every hypothesis in the class is within \epsilon of its population risk. When your sample is \epsilon-representative, you can trust empirical risk as a proxy for true risk.

Core Definitions

The distinction between pointwise and uniform convergence is the heart of this topic.

Pointwise convergence says: for each fixed h, as n \to \infty, \hat{R}_n(h) \to R(h) in probability. This follows directly from the law of large numbers and requires no assumptions on \mathcal{H}.

Uniform convergence says: the worst-case deviation over the class vanishes:

\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \xrightarrow{p} 0

The gap between pointwise and uniform is the gap between "each hypothesis individually has small generalization error" and "the hypothesis selected by ERM has small generalization error." ERM selects h based on the data, creating a dependence that pointwise convergence cannot handle.

The sample complexity of uniform convergence for \mathcal{H} is the function m_{\mathcal{H}}^{\text{UC}}(\epsilon, \delta): the smallest n guaranteeing that S is \epsilon-representative with probability at least 1 - \delta.

Main Theorems

Lemma

Epsilon-Representative Implies ERM Works

Statement

If S is \epsilon-representative with respect to \mathcal{H}, \ell, and \mathcal{D}, then the ERM hypothesis h_S = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h) satisfies:

R(h_S) \leq \min_{h \in \mathcal{H}} R(h) + 2\epsilon

Intuition

If every hypothesis has empirical risk within \epsilon of its population risk, then minimizing empirical risk gets you within 2\epsilon of the best possible population risk in the class. You lose \epsilon going from population to empirical (for the best hypothesis), and another \epsilon going back (for the ERM hypothesis).

Proof Sketch

Let h^* = \arg\min_{h \in \mathcal{H}} R(h). Since S is \epsilon-representative:

R(h_S) \leq \hat{R}_n(h_S) + \epsilon \leq \hat{R}_n(h^*) + \epsilon \leq R(h^*) + 2\epsilon

The first inequality uses |R(h_S) - \hat{R}_n(h_S)| \leq \epsilon. The second uses the fact that h_S minimizes empirical risk. The third uses |R(h^*) - \hat{R}_n(h^*)| \leq \epsilon.
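Because the argument is purely deterministic once the \epsilon-representative property holds, it can be sanity-checked mechanically. The sketch below (my own construction, not from the text) draws synthetic population risks for a finite class, perturbs each by at most \epsilon to manufacture an \epsilon-representative empirical risk table, and verifies the 2\epsilon excess-risk bound:

```python
import random

# Sanity check of the lemma on synthetic risk tables: Rhat is
# eps-representative by construction, so ERM's excess risk is at most 2*eps.
random.seed(1)
eps, K = 0.1, 20
for _ in range(1000):
    R = [random.random() for _ in range(K)]                  # population risks
    Rhat = [r + random.uniform(-eps, eps) for r in R]        # |R - Rhat| <= eps
    h_S = min(range(K), key=lambda i: Rhat[i])               # ERM: empirical minimizer
    assert R[h_S] <= min(R) + 2 * eps + 1e-12                # the 2*eps bound holds
print("lemma verified on 1000 random risk tables")
```

Note that no distributional assumption appears anywhere: the check exercises exactly the deterministic triangle-inequality step of the proof.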

Why It Matters

This is the only lemma you need to reduce the problem of "does ERM learn?" to "does uniform convergence hold?" It cleanly separates the statistical question (uniform convergence) from the algorithmic question (ERM).

Failure Mode

The factor of 2\epsilon is tight. You cannot improve it to \epsilon without additional assumptions, because ERM can select a hypothesis whose empirical risk is artificially low.

Theorem

Uniform Convergence is Sufficient for Learnability

Statement

If \mathcal{H} has the uniform convergence property with sample complexity m_{\mathcal{H}}^{\text{UC}}(\epsilon, \delta), then \mathcal{H} is agnostically PAC learnable by ERM with sample complexity:

m_{\mathcal{H}}(\epsilon, \delta) \leq m_{\mathcal{H}}^{\text{UC}}(\epsilon/2, \delta)

Intuition

Uniform convergence with accuracy \epsilon/2 gives an (\epsilon/2)-representative sample. By the previous lemma, ERM on such a sample achieves excess risk at most 2 \cdot (\epsilon/2) = \epsilon. So uniform convergence directly implies PAC learnability.

Proof Sketch

Set the uniform convergence guarantee at level \epsilon/2. With probability \geq 1 - \delta, the sample is (\epsilon/2)-representative. Apply the \epsilon-representative lemma to get R(h_S) \leq R(h^*) + \epsilon.

Why It Matters

This is the fundamental theorem connecting uniform convergence to learning. It says: to prove ERM works, it suffices to prove uniform convergence. All classical generalization bounds (finite class, VC dimension, Rademacher complexity) work by establishing uniform convergence.

Failure Mode

Uniform convergence is sufficient but not necessary for learnability in general. There exist learnable classes where uniform convergence fails, but they require algorithms other than ERM. For ERM specifically in binary classification, the fundamental theorem of statistical learning theory shows uniform convergence is both necessary and sufficient.

Theorem

Uniform Convergence for Finite Classes

Statement

If \mathcal{H} is finite, then for any \epsilon, \delta > 0, with probability at least 1 - \delta:

\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2n}}

Equivalently, a sample of size n \geq \frac{\log|\mathcal{H}| + \log(2/\delta)}{2\epsilon^2} is \epsilon-representative with probability at least 1 - \delta.

Intuition

Apply Hoeffding to each hypothesis independently, then union bound over the class. The key cost of the union bound is the \log|\mathcal{H}| term: you pay logarithmically in the size of the class, not linearly. This is why exponentially large hypothesis classes can still have reasonable sample complexity.

Proof Sketch

For any fixed h, Hoeffding's inequality gives:

P(|R(h) - \hat{R}_n(h)| > \epsilon) \leq 2\exp(-2n\epsilon^2)

By the union bound over all h \in \mathcal{H}:

P\!\left(\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| > \epsilon\right) \leq 2|\mathcal{H}|\exp(-2n\epsilon^2)

Set the right side equal to \delta and solve for \epsilon.

Why It Matters

This is the concrete instantiation of uniform convergence for the simplest case. It immediately implies ERM learns any finite hypothesis class with sample complexity O(\log|\mathcal{H}|/\epsilon^2).

Failure Mode

Fails for infinite hypothesis classes: \log|\mathcal{H}| = \infty. The union bound is too wasteful because it treats every hypothesis as independent, ignoring the correlation structure in \mathcal{H}. VC dimension and Rademacher complexity exploit this structure.

Theorem

VC-Style Two-Sided Uniform Deviation Bound

Statement

For a hypothesis class with effective-class cardinality bounded by the Sauer-Shelah growth function \sum_{k \leq d} \binom{n}{k} (for both \ell and -\ell), with B-bounded loss and an i.i.d. sample S \sim \mu^n:

\mu^n\!\left\{S \mid \sup_h |R(h) - \hat{R}_n(h)| \geq 2B\sqrt{\frac{2d\log(en/d)}{n}} + \varepsilon\right\} \leq 2\exp\!\left(-\frac{\varepsilon^2 n}{8B^2}\right)

This is the two-sided uniform deviation bound via the Rademacher route: Sauer-Shelah, Massart, Rademacher symmetrization, Azuma concentration, and union bound over the upper and lower tails.

Intuition

The VC parameter d controls the effective complexity of the hypothesis class on the sample. The bound replaces \log|\mathcal{H}| in the finite-class Hoeffding bound with d \log(en/d), which can be much smaller when the class has limited shattering power. The Azuma constant (8B^2 in the exponent) is a factor of 4 looser than the sharp McDiarmid constant.
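To get a feel for the rate, the bound can be evaluated numerically. The sketch below (values are illustrative; the choice d = 10 and the inversion of the tail at confidence \delta are my own) computes the main term 2B\sqrt{2d\log(en/d)/n} plus the deviation \varepsilon obtained by setting 2\exp(-\varepsilon^2 n / 8B^2) = \delta:

```python
import math

def vc_uniform_bound(d, n, B=1.0, delta=0.05):
    # Main Rademacher/Sauer-Shelah term plus the Azuma deviation at confidence delta.
    main = 2 * B * math.sqrt(2 * d * math.log(math.e * n / d) / n)
    eps = math.sqrt(8 * B ** 2 * math.log(2 / delta) / n)
    return main + eps

for n in (10**3, 10**4, 10**5, 10**6):
    print(n, round(vc_uniform_bound(d=10, n=n), 4))
```

The printed values shrink roughly like \sqrt{d \log n / n}: unlike the finite-class bound, the complexity term grows only logarithmically in n, so the bound goes to zero for any fixed d.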

Proof Ideas and Templates Used

The proofs in this topic use two key patterns:

  1. Triangle inequality on risks: the 2\epsilon bound comes from chaining R(h_S) \leq \hat{R}_n(h_S) + \epsilon and \hat{R}_n(h_S) \leq \hat{R}_n(h^*) \leq R(h^*) + \epsilon. This is a purely deterministic argument once you have the \epsilon-representative property.

  2. Hoeffding + union bound: for finite classes, apply concentration to each hypothesis individually, then pay a \log|\mathcal{H}| penalty to make the bound simultaneous. This is the simplest proof template in learning theory. It breaks for infinite classes because the union bound over uncountably many hypotheses gives infinity.

For establishing uniform convergence for infinite classes, the tools are:

  • VC dimension: combinatorial complexity leading to polynomial growth
  • Rademacher complexity: data-dependent complexity via symmetrization
  • Covering numbers: metric entropy arguments

Canonical Examples

Example

Pointwise holds but uniform fails: the memorizer class

Let \mathcal{X} = \mathbb{R}, \mathcal{Y} = \{0, 1\}, and \mathcal{H}_{\text{all}} = \{0, 1\}^{\mathcal{X}} (all binary functions). For any fixed h \in \mathcal{H}_{\text{all}}, the law of large numbers gives \hat{R}_n(h) \to R(h). Pointwise convergence holds trivially.

But for any sample S of size n, there exists h_S \in \mathcal{H}_{\text{all}} that memorizes S perfectly (\hat{R}_n(h_S) = 0) while predicting incorrectly on all unseen data, so R(h_S) can be arbitrarily close to 1 when \mathcal{D} is atomless. The ERM principle can select this memorizer, so \sup_h |R(h) - \hat{R}_n(h)| stays large.

Uniform convergence fails because the class is too rich. It has infinite VC dimension.
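The memorizer is easy to simulate. In the sketch below (setup is my own: x ~ Uniform(0, 1), which is atomless, and the true label is always 1), h_S returns the memorized label on training points and the wrong label everywhere else:

```python
import random

# Memorizer simulation: perfect on the training sample, wrong on fresh points.
random.seed(2)
train = [(random.random(), 1) for _ in range(100)]   # true label is always 1
memo = {x: y for x, y in train}

def h_S(x):
    # memorized label on S; the wrong label (0) on essentially all other x
    return memo.get(x, 0)

emp = sum(h_S(x) != y for x, y in train) / len(train)
fresh = [(random.random(), 1) for _ in range(10_000)]
test_err = sum(h_S(x) != y for x, y in fresh) / len(fresh)
print(emp, test_err)   # empirical risk 0.0, population-risk estimate near 1.0
```

Empirical risk is exactly zero while the estimated population risk is essentially one: the uniform gap stays at 1 no matter how large n grows, exactly as the example claims.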

Example

Finite class: uniform convergence by union bound

If |\mathcal{H}| = N and the loss is in [0,1], Hoeffding gives for each h: \Pr[|R(h) - \hat{R}_n(h)| > \epsilon] \leq 2e^{-2n\epsilon^2}.

Union bound: \Pr[\exists h: |R(h) - \hat{R}_n(h)| > \epsilon] \leq 2N e^{-2n\epsilon^2}.

Setting this \leq \delta and solving: n \geq \frac{\log(2N/\delta)}{2\epsilon^2}. So finite classes always have the uniform convergence property, with sample complexity O(\log N / \epsilon^2). For N = 10^6, \epsilon = 0.05, and \delta = 0.05:

n \geq \frac{\log(2 \cdot 10^6 / 0.05)}{2 \cdot 0.05^2} = \frac{\log(4 \cdot 10^7)}{0.005} \approx 3{,}500.

Tightening the failure probability is cheap: \delta = 0.01 gives n \approx 3{,}820 and \delta = 0.001 gives n \approx 4{,}280. The \log(1/\delta) dependence is the reason high-confidence learning costs only logarithmically more data.
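The arithmetic above is a one-liner; the helper name below is my own, but the formula is exactly the finite-class bound n \geq \log(2N/\delta) / (2\epsilon^2), rounded up:

```python
import math

def uc_sample_size(N, eps, delta):
    # Smallest n with 2*N*exp(-2*n*eps^2) <= delta (finite-class Hoeffding + union bound).
    return math.ceil(math.log(2 * N / delta) / (2 * eps ** 2))

print(uc_sample_size(10**6, 0.05, 0.05))    # 3501
print(uc_sample_size(10**6, 0.05, 0.01))    # 3823
print(uc_sample_size(10**6, 0.05, 0.001))   # 4284
```

Note how halving \epsilon quadruples n (the 1/\epsilon^2 term), while shrinking \delta tenfold adds only a constant number of samples.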

Example

Threshold classifiers on R: infinite class, finite VC dimension

Let \mathcal{H} = \{h_t : t \in \mathbb{R}\} where h_t(x) = \mathbf{1}[x \leq t]. This is an infinite class, so the finite-class bound does not apply directly. But \text{VCdim}(\mathcal{H}) = 1, and the growth function is \Pi_{\mathcal{H}}(m) = m + 1. For any m points, there are only m + 1 distinct labelings (one for each interval between consecutive points, plus the two extremes). Replacing \log|\mathcal{H}| with \log(m + 1) in the symmetrization argument gives a finite uniform convergence bound despite the infinite class.
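The growth-function count \Pi_{\mathcal{H}}(m) = m + 1 can be verified by enumeration. The sketch below (function name and sample points are my own) picks one representative threshold t per interval, which suffices because every t in the same interval induces the same labeling:

```python
def threshold_labelings(points):
    # Distinct labelings of the sample by h_t(x) = 1[x <= t], over all real t:
    # one representative t below all points, then one at each point.
    cuts = sorted(points)
    reps = [cuts[0] - 1.0] + cuts
    return {tuple(int(x <= t) for x in points) for t in reps}

pts = [0.1, 0.4, 0.7, 0.9, 1.3]
print(len(threshold_labelings(pts)))   # m + 1 = 6
```

Each labeling is a prefix of the sorted sample (possibly empty), so for m distinct points there are exactly m + 1 of them, in sharp contrast to the 2^m labelings an unrestricted class could produce.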

Common Confusions

Watch Out

Pointwise convergence is not enough for ERM

Students often think: "for each h, the empirical risk converges to the population risk, so the minimum of the empirical risks converges to the minimum of the population risks." This is wrong. The min operation and the limit do not commute in general. The issue is that ERM selects which h to evaluate based on the data, introducing dependence. Uniform convergence is precisely the condition that makes this swap valid.

Watch Out

Uniform convergence is not about uniform distributions

Despite sharing the word "uniform," these are unrelated concepts. Uniform convergence means the convergence bound holds simultaneously for all h \in \mathcal{H}. The "uniform" is over the hypothesis class, not over any probability distribution.

Watch Out

The factor of 2 in the excess risk bound is not an artifact

The factor of 2\epsilon in R(h_S) \leq R(h^*) + 2\epsilon is real, not a proof artifact. You lose \epsilon when the best hypothesis appears worse than it is (its empirical risk overestimates its population risk), and another \epsilon when the ERM hypothesis appears better than it is (its empirical risk underestimates its population risk). Both directions of error contribute.

Summary

  • Pointwise convergence (\hat{R}_n(h) \to R(h) for each fixed h) is free from the law of large numbers; uniform convergence (\sup_h |R(h) - \hat{R}_n(h)| \to 0) is the hard part
  • An \epsilon-representative sample guarantees ERM achieves excess risk \leq 2\epsilon
  • To prove ERM works, it suffices to prove uniform convergence
  • Sample complexity for uniform convergence = sample complexity for ERM learning (up to constant factors)
  • Finite classes: uniform convergence holds with n = O(\log|\mathcal{H}|/\epsilon^2)
  • Infinite classes require VC dimension or Rademacher complexity to establish uniform convergence

Exercises

ExerciseCore

Problem

Prove the \epsilon-representative lemma: if \forall h \in \mathcal{H}, |R(h) - \hat{R}_n(h)| \leq \epsilon, then R(h_{\text{ERM}}) \leq R(h^*) + 2\epsilon. Write out each step explicitly and identify where the ERM property is used.

ExerciseCore

Problem

A hypothesis class has |\mathcal{H}| = 10^{12} (one trillion hypotheses). The loss is bounded in [0, 1]. How many i.i.d. samples n do you need so that with probability at least 0.99, the sample is 0.01-representative?

ExerciseAdvanced

Problem

Construct an explicit example of an infinite hypothesis class \mathcal{H} where pointwise convergence holds for every fixed h but uniform convergence fails. That is, exhibit a distribution \mathcal{D} for which \sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| does not converge to zero in probability as n \to \infty.


References

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 3
  • Nagarajan & Kolter, "Uniform convergence may be unable to explain generalization in deep learning," NeurIPS (2019), arXiv:1902.04742. Explicit constructions where every reasonable UC bound on the trained network class is provably loose.
  • Negrea, Dziugaite, Roy, "In Defense of Uniform Convergence: Generalization via Derandomization," ICML (2020), arXiv:1912.04265. Counter-defense: the class on which UC must hold is the (typically smaller) derandomized class, not the full hypothesis class.
  • Wainwright, High-Dimensional Statistics (2019), Chapters 4-6

Next Topics

From uniform convergence, the key question becomes: how do you establish it for infinite hypothesis classes?

  • VC dimension: a combinatorial measure that characterizes uniform convergence for binary classification
  • Rademacher complexity: a data-dependent measure that gives tighter, more general uniform convergence bounds

Last reviewed: April 26, 2026
