
Learning Theory Core

Empirical Risk Minimization

The foundational principle of statistical learning: minimize average loss on training data as a proxy for minimizing true population risk.


Why This Matters

Many core supervised learning methods are empirical risk minimizers, or regularized variants thereof. When you train a neural network by minimizing cross-entropy loss on a training set, you are doing ERM. When you fit a linear regression by minimizing squared error, you are doing ERM. Other supervised methods sit outside the ERM frame: k-nearest neighbors and Nadaraya-Watson kernel regression are local averaging rules with no global empirical-risk objective; full Bayesian posterior predictives average over a posterior rather than minimize a fixed loss; conformal prediction wraps a base learner and is calibrated on hold-out data instead of fitted by minimization. ERM is the dominant frame, not a universal one.

Overview infographic (five panels): ERM as picking the lowest-empirical-risk hypothesis as a proxy for true risk; the population vs. empirical risk distinction; the ERM rule (argmin over \mathcal{H}); the generalization route via uniform convergence; and worked examples (squared loss, logistic, NLL, SGD).

ERM turns 'low training loss' into a bet about future data. Generalization theory studies when that bet is justified.

Understanding ERM is the first step toward understanding why learning from data works at all, and where it fails.

ERM turns training loss into a bet about future data

ERM chooses the hypothesis with the lowest empirical risk. Generalization theory asks when that training-set winner is also close to the population winner.

population target

The risk we care about, but cannot compute directly because the data distribution is unknown.

training proxy

The computable average loss on the sample.

ERM guarantee

If the gap between empirical and population risk is small uniformly over the class, minimizing training loss transfers to future data.

Mental Model

Imagine you want to find a function h that predicts well on future, unseen data. You cannot directly optimize over data you have not seen. So instead, you optimize over the data you have, your training set, and hope that good performance on training data translates to good performance on new data.

ERM formalizes this hope. The central question of learning theory is: when and why does this hope actually work?

Formal Setup and Notation

Let \mathcal{X} be an input space and \mathcal{Y} an output space. Let \mathcal{D} be an unknown probability distribution over \mathcal{X} \times \mathcal{Y}.

We have a loss function \ell: \mathcal{Y} \times \mathcal{Y} \to [0, \infty) measuring prediction quality.

Notation reconciliation. TheoremPath uses R(h) for population risk and \hat{R}_n(h) for empirical risk. Shalev-Shwartz and Ben-David use L_{(\mathcal{D}, f)}(h) and L_S(h) respectively, with f in the subscript encoding the realizable target. The statements coincide when the loss is 0-1 and f is the labeling function: L_{(\mathcal{D}, f)}(h) = \mathbb{P}_{x \sim \mathcal{D}}[h(x) \ne f(x)] matches R(h), and L_S(h) = |\{i : h(x_i) \ne y_i\}| / m matches \hat{R}_n(h) on training data labelled by f. We use the SSBD notation when the realizability assumption is the analytical anchor, and the R / \hat{R}_n form otherwise. The proof structure is the same; only the symbols change.

The phrase inductive bias from SSBD refers to the prior choice of hypothesis class \mathcal{H} that constrains the search space before any data is seen. Restricting \mathcal{H} is what protects ERM from overfitting on finite samples; the realizability assumption and the choice of class together determine when the protection is enough.

Definition

Population Risk

The population risk (or true risk) of a hypothesis h is:

R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]

This is the quantity we actually want to minimize, but we cannot compute it directly because \mathcal{D} is unknown.

Definition

Empirical Risk

Given a training sample S = \{(x_1, y_1), \ldots, (x_n, y_n)\} drawn i.i.d. from \mathcal{D}, the empirical risk is:

\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)

This is computable. It is the average loss on training data.

Definition

Empirical Risk Minimization

Given a hypothesis class \mathcal{H}, the ERM principle selects:

h_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)

The key question: when is R(h_{\text{ERM}}) close to \min_{h \in \mathcal{H}} R(h)?
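As a concrete illustration of the argmin above, here is a minimal sketch of ERM over a small finite hypothesis class with 0-1 loss. The threshold class, the noisy data-generating process, and the helper names are illustrative choices, not part of the formal setup.

```python
import numpy as np

def erm(hypotheses, loss, X, y):
    """Return the hypothesis in a finite class with the lowest empirical risk on (X, y)."""
    risks = [np.mean(loss(h(X), y)) for h in hypotheses]
    best = int(np.argmin(risks))
    return hypotheses[best], risks[best]

# Illustrative finite class: threshold classifiers h_t(x) = 1[x >= t] on 1-d inputs.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=200)
y = ((X >= 0.5).astype(int) + (rng.uniform(size=200) < 0.1)) % 2  # true threshold 0.5, 10% label noise

hypotheses = [lambda x, t=t: (x >= t).astype(int) for t in np.linspace(0.0, 1.0, 101)]
zero_one = lambda y_pred, y_true: (y_pred != y_true).astype(float)

h_erm, train_risk = erm(hypotheses, zero_one, X, y)
print(train_risk)  # empirical risk of the selected hypothesis
```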

Core Definitions

The generalization gap is the difference between population risk and empirical risk:

R(h) - \hat{R}_n(h)

If this gap is small uniformly over all h \in \mathcal{H}, then minimizing empirical risk also approximately minimizes population risk.

The approximation error is:

\min_{h \in \mathcal{H}} R(h) - R^*

where R^* = \inf_h R(h) is the Bayes risk. This measures how well the best hypothesis in \mathcal{H} can do, regardless of sample size.

The estimation error is:

R(h_{\text{ERM}}) - \min_{h \in \mathcal{H}} R(h)

This is what ERM theory actually controls.
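Putting the two error terms together gives the standard decomposition of the excess risk of ERM over the Bayes risk:

R(h_{\text{ERM}}) - R^* = \underbrace{\left(R(h_{\text{ERM}}) - \min_{h \in \mathcal{H}} R(h)\right)}_{\text{estimation error}} + \underbrace{\left(\min_{h \in \mathcal{H}} R(h) - R^*\right)}_{\text{approximation error}}

Uniform convergence bounds control the first term; the second is fixed by the choice of \mathcal{H} and does not shrink with more data.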

Main Theorems

Theorem

ERM Generalization via Uniform Convergence

Statement

If |\mathcal{H}| is finite and the loss is bounded in [0,1], then for any \delta > 0, with probability at least 1 - \delta over the draw of S:

\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2n}}

Intuition

The bound says: if the hypothesis class is not too large (measured by \log|\mathcal{H}|) and the sample is large enough, then empirical risk uniformly approximates population risk. The ERM hypothesis therefore has population risk close to the best in class.

Proof Sketch

Apply Hoeffding's inequality to each h \in \mathcal{H} to get |R(h) - \hat{R}_n(h)| \leq \epsilon with probability \geq 1 - 2e^{-2n\epsilon^2}. Then take a union bound over all |\mathcal{H}| hypotheses. Solve for \epsilon.

Why It Matters

This is the simplest generalization bound in learning theory. It shows why ERM works for finite hypothesis classes. The dependence on \log|\mathcal{H}| rather than |\mathcal{H}| is crucial. It means the sample complexity grows only logarithmically in the number of hypotheses.
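Inverting the bound makes the sample-complexity reading explicit: to make the uniform gap at most \epsilon with probability at least 1 - \delta, it suffices to take

n \geq \frac{\log|\mathcal{H}| + \log(2/\delta)}{2\epsilon^2}

samples; the logarithmic dependence on |\mathcal{H}| is what makes large but finite classes learnable.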

Failure Mode

This bound is vacuous for infinite hypothesis classes (e.g., all linear classifiers). You need more sophisticated complexity measures: VC dimension, Rademacher complexity, or covering numbers.

Lean-verified finite-class ERM tail

The Lean proof covers the finite-class Hoeffding chain: pointwise tail, finite-class union bound, closed-form rearrangement, sample-complexity bound, and ERM oracle inequality composed into a single excess-risk tail statement. Broader ERM claims for infinite or algorithm-specific hypothesis families stay source-checked until matching Lean statements are present.

Theorem

Finite-class Hoeffding ERM excess-risk tail (Lean-verified)

Statement

For a finite nonempty hypothesis class \mathcal{H} with i.i.d. loss bounded in [a, b], an exact ERM hypothesis \hat{h} chosen by minimizing the empirical risk, and a best-in-class hypothesis h^* \in \arg\min_{h \in \mathcal{H}} R(h), if

n \;\geq\; \frac{(b-a)^2 \,\log(2|\mathcal{H}|/\delta)}{2\,\varepsilon^2}

then

\mu_\Omega\bigl(\bigl\{\omega : 2\varepsilon \;\leq\; R(\hat{h}(\omega)) - R(h^*)\bigr\}\bigr) \;\leq\; \delta.

Proof Sketch

Apply the closed-form Hoeffding tail to the uniform deviation, then the ERM oracle inequality R(\hat{h}) \leq R(h^*) + 2\,\mathrm{uniformDev}, then algebra on the bad event. The Lean proof is in TheoremPath.LearningTheory.ERMGeneralization.finiteClass_hoeffding_exactERM_excessRisk_tail (see the Lean manifest entry claim:empirical-risk-minimization::finite-class-hoeffding-erm-excess-risk-tail). The chain depends only on propext, Classical.choice, and Quot.sound.

Failure Mode

The Lean verification covers only the finite-class Hoeffding ERM excess-risk tail. It does not verify VC bounds, Rademacher bounds, infinite hypothesis classes, or generic PAC learnability. The page-level claim that "ERM works" still requires source review for the standard textbook treatment; the bound above is one specific tail statement on one specific hypothesis class shape.
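As a quick numeric companion to the sample-size condition in the statement above, here is a small helper (illustrative only, not part of the Lean development) that computes the smallest n satisfying it:

```python
import math

def finite_class_sample_size(num_hypotheses, loss_range, epsilon, delta):
    """Smallest n with n >= (b - a)^2 * log(2|H| / delta) / (2 * eps^2)."""
    return math.ceil(loss_range**2 * math.log(2 * num_hypotheses / delta) / (2 * epsilon**2))

# Illustrative numbers: 1000 hypotheses, loss in [0, 1], epsilon = delta = 0.05.
print(finite_class_sample_size(1000, 1.0, 0.05, 0.05))  # 2120
```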

Theorem

Finite-class Hoeffding approximate-ERM excess-risk tail (Lean-verified)

Statement

Same setup as above but with \hat{h} only an approximate ERM with optimization slack \eta \geq 0 (i.e., \hat{R}_n(\hat{h}) \leq \hat{R}_n(h) + \eta for all h \in \mathcal{H}). Then

\mu_\Omega\bigl(\bigl\{\omega : 2\varepsilon + \eta \;\leq\; R(\hat{h}(\omega)) - R(h^*)\bigr\}\bigr) \;\leq\; \delta.

Proof Sketch

The optimization slack \eta enters additively from the approximate-ERM oracle inequality R(\hat{h}) \leq R(h^*) + 2\,\mathrm{uniformDev} + \eta. Same Hoeffding tail as the exact case. Lean proof: TheoremPath.LearningTheory.ERMGeneralization.finiteClass_hoeffding_approxERM_excessRisk_tail (manifest entry claim:empirical-risk-minimization::finite-class-hoeffding-approx-erm-excess-risk-tail).

Failure Mode

Same scope limits as the exact-ERM tail. Tying \eta to a concrete optimization algorithm (e.g., SGD convergence rate) is a separate verification step that has not been done.

Theorem

Effective-Growth VC-Style ERM Excess-Risk Tail

Statement

For a hypothesis class with effective-class cardinality bounded by the Sauer-Shelah growth function \sum_{k \leq d} \binom{n}{k} (for both \ell and -\ell), B-bounded loss, i.i.d. sample S \sim \mu^n, and any exact ERM \hat{h}(S):

\mu^n\!\left\{S \mid R(\hat{h}(S)) - R(h^*) \geq 4B\sqrt{\frac{2d\log(en/d)}{n}} + 2\varepsilon\right\} \leq 2\exp\!\left(-\frac{\varepsilon^2 n}{8B^2}\right)

Chains the deterministic ERM inequality (excess risk \leq 2 \cdot uniform deviation) with the VC-style two-sided uniform deviation bound. The leading term scales as O(\sqrt{d \log(n/d) / n}).

Intuition

This is the standard VC sample-complexity theorem for ERM, expressed through the Rademacher route: Sauer-Shelah, Massart, Rademacher symmetrization, Azuma concentration, and ERM oracle inequality. The parameter d plays the role of the VC dimension. For generic real-valued losses, the user supplies the effective-growth assumption. For binary classifiers with 0-1 loss, the binary-class VC bridge is now Lean-verified and closes that effective-loss-pattern step.

Failure Mode

The Azuma constant (8B^2) is a factor of 4 looser than the sharp McDiarmid constant. The theorem applies only to finite hypothesis classes with bounded loss. The broader project has separate Lean support for the one-Lipschitz contraction lemma, but this ERM tail still does not prove infinite-class ERM or the full PAC-learnability equivalence.
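To get a feel for how the effective-growth tail above scales, here is a small illustrative calculator for the excess-risk level it certifies at confidence 1 - \delta; the function name and default numbers are ours, not part of the Lean development.

```python
import math

def vc_style_excess_risk_level(d, n, B=1.0, delta=0.05):
    """Leading term 4B*sqrt(2d*log(en/d)/n) plus 2*eps, with eps chosen so 2*exp(-eps^2 n / (8 B^2)) = delta."""
    leading = 4.0 * B * math.sqrt(2.0 * d * math.log(math.e * n / d) / n)
    eps = math.sqrt(8.0 * B**2 * math.log(2.0 / delta) / n)
    return leading + 2.0 * eps

for n in (1_000, 10_000, 100_000):
    print(n, round(vc_style_excess_risk_level(d=10, n=n), 3))  # shrinks roughly like sqrt(d log(n/d) / n)
```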

Proof Ideas and Templates Used

The proof of the finite-class ERM bound uses two standard tools:

  1. Hoeffding's inequality: to control the deviation of the empirical mean from the true mean for a single hypothesis
  2. Union bound: to extend the guarantee from a single hypothesis to all hypotheses in the class simultaneously

This "Hoeffding + union bound" pattern is the simplest proof template in learning theory. It breaks for infinite classes because summing failure probabilities over uncountably many hypotheses gives a vacuous bound.
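A small simulation can make the pattern concrete. The sketch below uses an illustrative class of threshold classifiers with a noisy threshold target and compares the observed (1 - \delta)-quantile of the uniform deviation against the Hoeffding-plus-union-bound value; the class, noise level, and constants are our choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0.0, 1.0, 51)          # finite class: h_t(x) = 1[x >= t]
noise, n, trials, delta = 0.1, 500, 2000, 0.05

# Population risk of h_t when x ~ Uniform[0,1] and y = 1[x >= 0.5] flipped w.p. noise:
# R(h_t) = noise + (1 - 2*noise) * |t - 0.5|
true_risk = noise + (1.0 - 2.0 * noise) * np.abs(thresholds - 0.5)

sup_dev = np.empty(trials)
for i in range(trials):
    x = rng.uniform(0.0, 1.0, size=n)
    y = ((x >= 0.5).astype(int) + (rng.uniform(size=n) < noise)) % 2
    preds = (x[None, :] >= thresholds[:, None]).astype(int)   # |H| x n prediction matrix
    emp_risk = (preds != y).mean(axis=1)
    sup_dev[i] = np.max(np.abs(emp_risk - true_risk))

bound = np.sqrt((np.log(len(thresholds)) + np.log(2.0 / delta)) / (2.0 * n))
print(np.quantile(sup_dev, 1 - delta), bound)   # observed 95th percentile vs. the (looser) bound
```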

Canonical Examples

Example

Finite binary classifiers

Suppose \mathcal{H} contains 1000 binary classifiers and we use 0-1 loss. With n = 10{,}000 training examples and \delta = 0.05:

\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq \sqrt{\frac{\log(1000) + \log(40)}{20000}} \approx 0.023

So, with probability at least 0.95, the ERM hypothesis has population risk within about 0.023 of its training risk.
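The arithmetic is easy to reproduce. A one-function sketch of the finite-class bound (the function name is ours):

```python
import math

def finite_class_uniform_bound(num_hypotheses, n, delta):
    """Hoeffding + union bound on sup_h |R(h) - R_hat_n(h)| for a finite class with loss in [0, 1]."""
    return math.sqrt((math.log(num_hypotheses) + math.log(2.0 / delta)) / (2.0 * n))

print(finite_class_uniform_bound(1000, 10_000, 0.05))  # ~0.023
```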

Example

Memorization is the prototypical ERM failure

Take \mathcal{X} = [0, 1], binary labels y \in \{0, 1\}, and a noisy target: \Pr[y = 1 \mid x] = 0.5 (pure noise, Bayes risk R^* = 0.5). Let \mathcal{H} be the class of all functions h: \mathcal{X} \to \{0, 1\}.

Draw n training points (x_i, y_i). Define the memorizing hypothesis:

h_{\text{mem}}(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some } i \\ 0 & \text{otherwise} \end{cases}

Then \hat{R}_n(h_{\text{mem}}) = 0 exactly: every training point is classified correctly.

But under any continuous distribution on \mathcal{X}, a freshly drawn x hits a training x_i with probability zero, so h_{\text{mem}}(x) = 0 almost surely. The population risk is:

R(h_{\text{mem}}) = \Pr[y = 1] = 0.5

The empirical risk is zero; the population risk is the Bayes risk. The gap is 0.5, as large as it can get in this pure-noise setting. ERM picks h_{\text{mem}} because it achieves \hat{R}_n = 0, but the choice is useless on new data.

The fix is not "pick a different ERM"; it is to constrain \mathcal{H} so memorizing functions are excluded. The finite-class bound above shows that \sqrt{\log|\mathcal{H}|/n} is the price for this constraint when \mathcal{H} is finite. The class of all functions has \log|\mathcal{H}| = \infty on a continuous \mathcal{X}, so the bound is vacuous and overfitting is unavoidable. This is exactly when VC dimension and Rademacher complexity take over.
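The failure is easy to reproduce numerically. A minimal sketch of the memorizing hypothesis under pure-noise labels (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x_train = rng.uniform(0.0, 1.0, size=n)
y_train = rng.integers(0, 2, size=n)                 # pure noise: Pr[y = 1 | x] = 0.5

memory = dict(zip(x_train.tolist(), y_train.tolist()))
h_mem = lambda x: np.array([memory.get(xi, 0) for xi in x.tolist()])

train_risk = np.mean(h_mem(x_train) != y_train)      # 0.0: every training point is memorized

x_test = rng.uniform(0.0, 1.0, size=100_000)         # fresh draws almost surely miss the memory
y_test = rng.integers(0, 2, size=100_000)
test_risk = np.mean(h_mem(x_test) != y_test)         # ~0.5, the Bayes risk

print(train_risk, test_risk)
```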

Common Confusions

Watch Out

ERM is not the same as training loss minimization

ERM minimizes over a fixed hypothesis class \mathcal{H}. In practice, when you train a neural network, the class is parameterized and the optimization landscape matters. The ERM framework abstracts away optimization difficulty and asks: if you could find the exact minimizer, would it generalize? Real training adds optimization error on top of the estimation error ERM theory controls.

Watch Out

Small training loss does not imply small test loss

A model can memorize the training set (achieving zero empirical risk) while having terrible population risk. This happens when \mathcal{H} is too rich relative to n. The generalization bound tells you when you can trust low training risk as a signal of low population risk.

Watch Out

Training loss and training error are not the same

In classification you typically optimize a convex surrogate loss (cross-entropy, hinge, logistic) while caring about the 0-1 classification error. "Minimizing training loss" (surrogate) and "minimizing training error" (0-1) are not equivalent. The connection is classification-calibration: a surrogate \phi is calibrated if and only if minimizing its population risk implies minimizing the 0-1 population risk (Bartlett, Jordan, McAuliffe 2006). Cross-entropy, hinge, and logistic are calibrated; squared loss on \pm 1 targets is calibrated for binary classification; but not every convex surrogate is. Pages that use "training loss" and "training error" interchangeably are eliding this distinction.
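A tiny numeric illustration of the distinction, using margins m = y f(x) with y \in \{-1, +1\} (the specific margin values are arbitrary):

```python
import numpy as np

margins = np.array([-2.0, -0.1, 0.1, 2.0])        # m = y * f(x)

zero_one = (margins <= 0).astype(float)           # 0-1 classification error
logistic = np.log1p(np.exp(-margins))             # logistic surrogate (cross-entropy up to base)
hinge = np.maximum(0.0, 1.0 - margins)            # hinge surrogate

print(zero_one)   # [1. 1. 0. 0.]
print(logistic)   # smooth; still penalizes correct but low-confidence predictions
print(hinge)      # zero only once the margin exceeds 1
```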

Summary

  • Population risk R(h) is what you want to minimize; empirical risk \hat{R}_n(h) is what you can compute
  • The generalization gap R(h) - \hat{R}_n(h) must be controlled uniformly
  • For finite \mathcal{H}: the gap scales as \sqrt{\log|\mathcal{H}|/n}
  • Total excess risk = approximation error + estimation error
  • Richer \mathcal{H} reduces approximation error but increases estimation error. This is the bias-variance tradeoff

Exercises

ExerciseCore

Problem

Suppose \mathcal{H} has 100 hypotheses and the loss is bounded in [0,1]. How many samples n do you need so that with probability at least 0.95, the uniform convergence gap is at most 0.1?

ExerciseAdvanced

Problem

Why can't you apply the finite-class ERM bound to the class of all linear classifiers in \mathbb{R}^d? What goes wrong, and what tool do you need instead?

Related Comparisons

References

Canonical:

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 2
  • Wainwright, High-Dimensional Statistics (2019), Chapters 4-6
  • Bartlett, Jordan, McAuliffe, "Convexity, Classification, and Risk Bounds," JASA (2006), arXiv:math/0508276 — classification-calibration of convex surrogate losses
  • Bartlett, Long, Lugosi, Tsigler, "Benign overfitting in linear regression," PNAS 117 (2020), arXiv:1906.11300 — interpolating ERM solutions can still generalize when the spectrum decays appropriately
  • Belkin, Hsu, Ma, Mandal, "Reconciling modern machine-learning practice and the classical bias-variance trade-off," PNAS 116 (2019), arXiv:1812.11118 — empirical "double descent" curves for ERM beyond the interpolation threshold

Next Topics

The natural next steps from ERM:

Last reviewed: April 26, 2026
