
Learning Theory Core

Empirical Risk Minimization

The foundational principle of statistical learning: minimize average loss on training data as a proxy for minimizing true population risk.


Why This Matters

Many core supervised learning methods are empirical risk minimizers, or regularized variants thereof. When you train a neural network by minimizing cross-entropy loss on a training set, you are doing ERM. When you fit a linear regression by minimizing squared error, you are doing ERM. Other supervised methods sit outside the ERM frame: k-nearest neighbors and Nadaraya-Watson kernel regression are local averaging rules with no global empirical-risk objective; full Bayesian posterior predictives average over a posterior rather than minimize a fixed loss; conformal prediction wraps a base learner and is calibrated on hold-out data instead of fitted by minimization. ERM is the dominant frame, not a universal one.

Overview infographic (five panels): ERM as picking the lowest-empirical-risk hypothesis as a proxy for true risk; the population vs. empirical risk distinction; the ERM rule (argmin over \mathcal{H}); the generalization route via uniform convergence; and worked examples (squared loss, logistic, NLL, SGD).

ERM turns 'low training loss' into a bet about future data. Generalization theory studies when that bet is justified.

Understanding ERM is the first step toward understanding why learning from data works at all, and where it fails.

ERM turns training loss into a bet about future data

ERM chooses the hypothesis with the lowest empirical risk. Generalization theory asks when that training-set winner is also close to the population winner.

population target

The risk we care about, but cannot compute directly because the data distribution is unknown.

training proxy

The computable average loss on the sample.

ERM guarantee

If the gap between empirical and population risk is small uniformly over the class, minimizing training loss transfers to future data.

Mental Model

Imagine you want to find a function h that predicts well on future, unseen data. You cannot directly optimize over data you have not seen. So instead, you optimize over the data you have, your training set, and hope that good performance on training data translates to good performance on new data.

ERM formalizes this hope. The central question of learning theory is: when and why does this hope actually work?

Formal Setup and Notation

Let \mathcal{X} be an input space and \mathcal{Y} an output space. Let \mathcal{D} be an unknown probability distribution over \mathcal{X} \times \mathcal{Y}.

We have a loss function \ell: \mathcal{Y} \times \mathcal{Y} \to [0, \infty) measuring prediction quality.

Notation reconciliation. TheoremPath uses R(h) for population risk and \hat{R}_n(h) for empirical risk. Shalev-Shwartz and Ben-David use L_{(\mathcal{D}, f)}(h) and L_S(h) respectively, with f in the subscript encoding the realizable target. The statements coincide when the loss is 0-1 and f is the labeling function: L_{(\mathcal{D}, f)}(h) = \mathbb{P}_{x \sim \mathcal{D}}[h(x) \ne f(x)] matches R(h), and L_S(h) = |\{i : h(x_i) \ne y_i\}| / m matches \hat{R}_n(h) on training data labelled by f. We use the SSBD notation when the realizability assumption is the analytical anchor, and the R / \hat{R}_n form otherwise. The proof structure is the same; only the symbols change.

The phrase inductive bias from SSBD refers to the prior choice of hypothesis class \mathcal{H} that constrains the search space before any data is seen. Restricting \mathcal{H} is what protects ERM from overfitting on finite samples; the realizability assumption and the choice of class together determine when the protection is enough.

Definition

Population Risk

The population risk (or true risk) of a hypothesis h is:

R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]

This is the quantity we actually want to minimize, but we cannot compute it directly because \mathcal{D} is unknown.

Definition

Empirical Risk

Given a training sample S = \{(x_1, y_1), \ldots, (x_n, y_n)\} drawn i.i.d. from \mathcal{D}, the empirical risk is:

\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)

This is computable. It is the average loss on training data.

Definition

Empirical Risk Minimization

Given a hypothesis class \mathcal{H}, the ERM principle selects:

h_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)

The key question: when is R(h_{\text{ERM}}) close to \min_{h \in \mathcal{H}} R(h)?
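As a concrete illustration of the argmin above, here is a minimal sketch of ERM over a small finite hypothesis class with 0-1 loss. The threshold class, the noisy data-generating process, and the helper names are illustrative choices, not part of the formal setup.

```python
import numpy as np

def erm(hypotheses, loss, X, y):
    """Return the hypothesis in a finite class with the lowest empirical risk on (X, y)."""
    risks = [np.mean(loss(h(X), y)) for h in hypotheses]
    best = int(np.argmin(risks))
    return hypotheses[best], risks[best]

# Illustrative finite class: threshold classifiers h_t(x) = 1[x >= t] on 1-d inputs.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=200)
y = ((X >= 0.5).astype(int) + (rng.uniform(size=200) < 0.1)) % 2  # true threshold 0.5, 10% label noise

hypotheses = [lambda x, t=t: (x >= t).astype(int) for t in np.linspace(0.0, 1.0, 101)]
zero_one = lambda y_pred, y_true: (y_pred != y_true).astype(float)

h_erm, train_risk = erm(hypotheses, zero_one, X, y)
print(train_risk)  # empirical risk of the selected hypothesis
```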

Core Definitions

The generalization gap is the difference between population risk and empirical risk:

R(h) - \hat{R}_n(h)

If this gap is small uniformly over all h \in \mathcal{H}, then minimizing empirical risk also approximately minimizes population risk.

The approximation error is:

\min_{h \in \mathcal{H}} R(h) - R^*

where R^* = \inf_h R(h) is the Bayes risk. This measures how well the best hypothesis in \mathcal{H} can do, regardless of sample size.

The estimation error is:

R(h_{\text{ERM}}) - \min_{h \in \mathcal{H}} R(h)

This is what ERM theory actually controls.
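Putting the two error terms together gives the standard decomposition of the excess risk of ERM over the Bayes risk:

R(h_{\text{ERM}}) - R^* = \underbrace{\left(R(h_{\text{ERM}}) - \min_{h \in \mathcal{H}} R(h)\right)}_{\text{estimation error}} + \underbrace{\left(\min_{h \in \mathcal{H}} R(h) - R^*\right)}_{\text{approximation error}}

Uniform convergence bounds control the first term; the second is fixed by the choice of \mathcal{H} and does not shrink with more data.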

Main Theorems

Theorem

ERM Generalization via Uniform Convergence

Statement

If |\mathcal{H}| is finite and the loss is bounded in [0,1], then for any \delta > 0, with probability at least 1 - \delta over the draw of S:

\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2n}}

Intuition

The bound says: if the hypothesis class is not too large (measured by \log|\mathcal{H}|) and the sample is large enough, then empirical risk uniformly approximates population risk. The ERM hypothesis therefore has population risk close to the best in class.

Proof Sketch

Apply Hoeffding's inequality to each h \in \mathcal{H} to get |R(h) - \hat{R}_n(h)| \leq \epsilon with probability \geq 1 - 2e^{-2n\epsilon^2}. Then take a union bound over all |\mathcal{H}| hypotheses. Solve for \epsilon.

Why It Matters

This is the simplest generalization bound in learning theory. It shows why ERM works for finite hypothesis classes. The dependence on \log|\mathcal{H}| rather than |\mathcal{H}| is crucial. It means the sample complexity grows only logarithmically in the number of hypotheses.
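Inverting the bound makes the sample-complexity reading explicit: to make the uniform gap at most \epsilon with probability at least 1 - \delta, it suffices to take

n \geq \frac{\log|\mathcal{H}| + \log(2/\delta)}{2\epsilon^2}

samples; the logarithmic dependence on |\mathcal{H}| is what makes large but finite classes learnable.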

Failure Mode

This bound is vacuous for infinite hypothesis classes (e.g., all linear classifiers). You need more sophisticated complexity measures: VC dimension, Rademacher complexity, or covering numbers.

Lean-verified finite-class ERM tail

The Lean proof covers the finite-class Hoeffding chain: pointwise tail, finite-class union bound, closed-form rearrangement, sample-complexity bound, and ERM oracle inequality composed into a single excess-risk tail statement. Broader ERM claims for infinite or algorithm-specific hypothesis families stay source-checked until matching Lean statements are present.

Theorem

Finite-class Hoeffding ERM excess-risk tail (Lean-verified)

Statement

For a finite nonempty hypothesis class \mathcal{H} with i.i.d. loss bounded in [a, b], an exact ERM hypothesis \hat{h} chosen by minimizing the empirical risk, and a best-in-class hypothesis h^* \in \arg\min_{h \in \mathcal{H}} R(h), if

n \;\geq\; \frac{(b-a)^2 \,\log(2|\mathcal{H}|/\delta)}{2\,\varepsilon^2}

then

\mu_\Omega\bigl(\bigl\{\omega : 2\varepsilon \;\leq\; R(\hat{h}(\omega)) - R(h^*)\bigr\}\bigr) \;\leq\; \delta.

Proof Sketch

Apply the closed-form Hoeffding tail to the uniform deviation, then the ERM oracle inequality R(\hat{h}) \leq R(h^*) + 2\,\mathrm{uniformDev}, then algebra on the bad event. The Lean proof is in TheoremPath.LearningTheory.ERMGeneralization.finiteClass_hoeffding_exactERM_excessRisk_tail (see the Lean manifest entry claim:empirical-risk-minimization::finite-class-hoeffding-erm-excess-risk-tail). The chain depends only on propext, Classical.choice, and Quot.sound.

Failure Mode

The Lean verification covers only the finite-class Hoeffding ERM excess-risk tail. It does not verify VC bounds, Rademacher bounds, infinite hypothesis classes, or generic PAC learnability. The page-level claim that "ERM works" still requires source review for the standard textbook treatment; the bound above is one specific tail statement on one specific hypothesis class shape.
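As a quick numeric companion to the sample-size condition in the statement above, here is a small helper (illustrative only, not part of the Lean development) that computes the smallest n satisfying it:

```python
import math

def finite_class_sample_size(num_hypotheses, loss_range, epsilon, delta):
    """Smallest n with n >= (b - a)^2 * log(2|H| / delta) / (2 * eps^2)."""
    return math.ceil(loss_range**2 * math.log(2 * num_hypotheses / delta) / (2 * epsilon**2))

# Illustrative numbers: 1000 hypotheses, loss in [0, 1], epsilon = delta = 0.05.
print(finite_class_sample_size(1000, 1.0, 0.05, 0.05))  # 2120
```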

Theorem

Finite-class Hoeffding approximate-ERM excess-risk tail (Lean-verified)

Statement

Same setup as above but with \hat{h} only an approximate ERM with optimization slack \eta \geq 0 (i.e., \hat{R}_n(\hat{h}) \leq \hat{R}_n(h) + \eta for all h \in \mathcal{H}). Then

\mu_\Omega\bigl(\bigl\{\omega : 2\varepsilon + \eta \;\leq\; R(\hat{h}(\omega)) - R(h^*)\bigr\}\bigr) \;\leq\; \delta.

Proof Sketch

The optimization slack \eta enters additively from the approximate-ERM oracle inequality R(\hat{h}) \leq R(h^*) + 2\,\mathrm{uniformDev} + \eta. Same Hoeffding tail as the exact case. Lean proof: TheoremPath.LearningTheory.ERMGeneralization.finiteClass_hoeffding_approxERM_excessRisk_tail (manifest entry claim:empirical-risk-minimization::finite-class-hoeffding-approx-erm-excess-risk-tail).

Failure Mode

Same scope limits as the exact-ERM tail. Tying \eta to a concrete optimization algorithm (e.g., SGD convergence rate) is a separate verification step that has not been done.

Theorem

Effective-Growth VC-Style ERM Excess-Risk Tail

Statement

For a hypothesis class with effective-class cardinality bounded by the Sauer-Shelah growth function \sum_{k \leq d} \binom{n}{k} (for both \ell and -\ell), B-bounded loss, i.i.d. sample S \sim \mu^n, and any exact ERM \hat{h}(S):

\mu^n\!\left\{S \mid R(\hat{h}(S)) - R(h^*) \geq 4B\sqrt{\frac{2d\log(en/d)}{n}} + 2\varepsilon\right\} \leq 2\exp\!\left(-\frac{\varepsilon^2 n}{8B^2}\right)

Chains the deterministic ERM inequality (excess risk \leq 2 \cdot uniform deviation) with the VC-style two-sided uniform deviation bound. The leading term scales as O(\sqrt{d \log(n/d) / n}).

Intuition

This is the standard VC sample-complexity theorem for ERM, expressed through the Rademacher route: Sauer-Shelah, Massart, Rademacher symmetrization, Azuma concentration, and ERM oracle inequality. The parameter d plays the role of the VC dimension. For generic real-valued losses, the user supplies the effective-growth assumption. For binary classifiers with 0-1 loss, the binary-class VC bridge is now Lean-verified and closes that effective-loss-pattern step.

Failure Mode

The Azuma constant (8B^2) is a factor of 4 looser than the sharp McDiarmid constant. The theorem applies only to finite hypothesis classes with bounded loss. The broader project has separate Lean support for the one-Lipschitz contraction lemma, but this ERM tail still does not prove infinite-class ERM or the full PAC-learnability equivalence.
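To get a feel for how the effective-growth tail above scales, here is a small illustrative calculator for the excess-risk level it certifies at confidence 1 - \delta; the function name and default numbers are ours, not part of the Lean development.

```python
import math

def vc_style_excess_risk_level(d, n, B=1.0, delta=0.05):
    """Leading term 4B*sqrt(2d*log(en/d)/n) plus 2*eps, with eps chosen so 2*exp(-eps^2 n / (8 B^2)) = delta."""
    leading = 4.0 * B * math.sqrt(2.0 * d * math.log(math.e * n / d) / n)
    eps = math.sqrt(8.0 * B**2 * math.log(2.0 / delta) / n)
    return leading + 2.0 * eps

for n in (1_000, 10_000, 100_000):
    print(n, round(vc_style_excess_risk_level(d=10, n=n), 3))  # shrinks roughly like sqrt(d log(n/d) / n)
```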

Proof Ideas and Templates Used

The proof of the finite-class ERM bound uses two standard tools:

  1. Hoeffding's inequality: to control the deviation of the empirical mean from the true mean for a single hypothesis
  2. Union bound: to extend the guarantee from a single hypothesis to all hypotheses in the class simultaneously

This "Hoeffding + union bound" pattern is the simplest proof template in learning theory. It breaks for infinite classes because summing failure probabilities over uncountably many hypotheses gives a vacuous bound.
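A small simulation can make the pattern concrete. The sketch below uses an illustrative class of threshold classifiers with a noisy threshold target and compares the observed (1 - \delta)-quantile of the uniform deviation against the Hoeffding-plus-union-bound value; the class, noise level, and constants are our choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0.0, 1.0, 51)          # finite class: h_t(x) = 1[x >= t]
noise, n, trials, delta = 0.1, 500, 2000, 0.05

# Population risk of h_t when x ~ Uniform[0,1] and y = 1[x >= 0.5] flipped w.p. noise:
# R(h_t) = noise + (1 - 2*noise) * |t - 0.5|
true_risk = noise + (1.0 - 2.0 * noise) * np.abs(thresholds - 0.5)

sup_dev = np.empty(trials)
for i in range(trials):
    x = rng.uniform(0.0, 1.0, size=n)
    y = ((x >= 0.5).astype(int) + (rng.uniform(size=n) < noise)) % 2
    preds = (x[None, :] >= thresholds[:, None]).astype(int)   # |H| x n prediction matrix
    emp_risk = (preds != y).mean(axis=1)
    sup_dev[i] = np.max(np.abs(emp_risk - true_risk))

bound = np.sqrt((np.log(len(thresholds)) + np.log(2.0 / delta)) / (2.0 * n))
print(np.quantile(sup_dev, 1 - delta), bound)   # observed 95th percentile vs. the (looser) bound
```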

Canonical Examples

Example

Finite binary classifiers

Suppose \mathcal{H} contains 1000 binary classifiers and we use 0-1 loss. With n = 10{,}000 training examples and \delta = 0.05:

\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq \sqrt{\frac{\log(1000) + \log(40)}{20000}} \approx 0.023

So, with probability at least 0.95, the ERM hypothesis has population risk within about 0.023 of its training risk.
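The arithmetic is easy to reproduce. A one-function sketch of the finite-class bound (the function name is ours):

```python
import math

def finite_class_uniform_bound(num_hypotheses, n, delta):
    """Hoeffding + union bound on sup_h |R(h) - R_hat_n(h)| for a finite class with loss in [0, 1]."""
    return math.sqrt((math.log(num_hypotheses) + math.log(2.0 / delta)) / (2.0 * n))

print(finite_class_uniform_bound(1000, 10_000, 0.05))  # ~0.023
```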

Example

Memorization is the prototypical ERM failure

Take \mathcal{X} = [0, 1], binary labels y \in \{0, 1\}, and a noisy target: \Pr[y = 1 \mid x] = 0.5 (pure noise, Bayes risk R^* = 0.5). Let \mathcal{H} be the class of all functions h: \mathcal{X} \to \{0, 1\}.

Draw n training points (x_i, y_i). Define the memorizing hypothesis:

h_{\text{mem}}(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some } i \\ 0 & \text{otherwise} \end{cases}

Then \hat{R}_n(h_{\text{mem}}) = 0 exactly: every training point is classified correctly.

But under any continuous distribution on \mathcal{X}, a freshly drawn x hits a training x_i with probability zero, so h_{\text{mem}}(x) = 0 almost surely. The population risk is:

R(h_{\text{mem}}) = \Pr[y = 1] = 0.5

The empirical risk is zero; the population risk is the Bayes risk. The gap is 0.5, as large as it can get in this pure-noise setting. ERM picks h_{\text{mem}} because it achieves \hat{R}_n = 0, but the choice is useless on new data.

The fix is not "pick a different ERM"; it is to constrain \mathcal{H} so memorizing functions are excluded. The finite-class bound above shows that \sqrt{\log|\mathcal{H}|/n} is the price for this constraint when \mathcal{H} is finite. The class of all functions has \log|\mathcal{H}| = \infty on a continuous \mathcal{X}, so the bound is vacuous and overfitting is unavoidable. This is exactly when VC dimension and Rademacher complexity take over.
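The failure is easy to reproduce numerically. A minimal sketch of the memorizing hypothesis under pure-noise labels (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x_train = rng.uniform(0.0, 1.0, size=n)
y_train = rng.integers(0, 2, size=n)                 # pure noise: Pr[y = 1 | x] = 0.5

memory = dict(zip(x_train.tolist(), y_train.tolist()))
h_mem = lambda x: np.array([memory.get(xi, 0) for xi in x.tolist()])

train_risk = np.mean(h_mem(x_train) != y_train)      # 0.0: every training point is memorized

x_test = rng.uniform(0.0, 1.0, size=100_000)         # fresh draws almost surely miss the memory
y_test = rng.integers(0, 2, size=100_000)
test_risk = np.mean(h_mem(x_test) != y_test)         # ~0.5, the Bayes risk

print(train_risk, test_risk)
```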

Common Confusions

Watch Out

ERM is not the same as training loss minimization

ERM minimizes over a fixed hypothesis class \mathcal{H}. In practice, when you train a neural network, the class is parameterized and the optimization landscape matters. The ERM framework abstracts away optimization difficulty and asks: if you could find the exact minimizer, would it generalize? Real training adds optimization error on top of the estimation error ERM theory controls.

Watch Out

Small training loss does not imply small test loss

A model can memorize the training set (achieving zero empirical risk) while having terrible population risk. This happens when \mathcal{H} is too rich relative to n. The generalization bound tells you when you can trust low training risk as a signal of low population risk.

Watch Out

Training loss and training error are not the same

In classification you typically optimize a convex surrogate loss (cross-entropy, hinge, logistic) while caring about the 0-1 classification error. "Minimizing training loss" (surrogate) and "minimizing training error" (0-1) are not equivalent. The connection is classification-calibration: a surrogate \phi is calibrated if and only if minimizing its population risk implies minimizing the 0-1 population risk (Bartlett, Jordan, McAuliffe 2006). Cross-entropy, hinge, and logistic are calibrated; squared loss on \pm 1 targets is calibrated for binary classification; but not every convex surrogate is. Pages that use "training loss" and "training error" interchangeably are eliding this distinction.
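A tiny numeric illustration of the distinction, using margins m = y f(x) with y \in \{-1, +1\} (the specific margin values are arbitrary):

```python
import numpy as np

margins = np.array([-2.0, -0.1, 0.1, 2.0])        # m = y * f(x)

zero_one = (margins <= 0).astype(float)           # 0-1 classification error
logistic = np.log1p(np.exp(-margins))             # logistic surrogate (cross-entropy up to base)
hinge = np.maximum(0.0, 1.0 - margins)            # hinge surrogate

print(zero_one)   # [1. 1. 0. 0.]
print(logistic)   # smooth; still penalizes correct but low-confidence predictions
print(hinge)      # zero only once the margin exceeds 1
```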

Summary

  • Population risk R(h) is what you want to minimize; empirical risk \hat{R}_n(h) is what you can compute
  • The generalization gap R(h) - \hat{R}_n(h) must be controlled uniformly
  • For finite \mathcal{H}: the gap scales as \sqrt{\log|\mathcal{H}|/n}
  • Total excess risk = approximation error + estimation error
  • Richer \mathcal{H} reduces approximation error but increases estimation error. This is the bias-variance tradeoff

Exercises

ExerciseCore

Problem

Suppose \mathcal{H} has 100 hypotheses and the loss is bounded in [0,1]. How many samples n do you need so that with probability at least 0.95, the uniform convergence gap is at most 0.1?

ExerciseAdvanced

Problem

Why can't you apply the finite-class ERM bound to the class of all linear classifiers in \mathbb{R}^d? What goes wrong, and what tool do you need instead?

Related Comparisons

References

Canonical:

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 2
  • Wainwright, High-Dimensional Statistics (2019), Chapters 4-6
  • Bartlett, Jordan, McAuliffe, "Convexity, Classification, and Risk Bounds," JASA (2006), arXiv:math/0508276 — classification-calibration of convex surrogate losses
  • Bartlett, Long, Lugosi, Tsigler, "Benign overfitting in linear regression," PNAS 117 (2020), arXiv:1906.11300 — interpolating ERM solutions can still generalize when the spectrum decays appropriately
  • Belkin, Hsu, Ma, Mandal, "Reconciling modern machine-learning practice and the classical bias-variance trade-off," PNAS 116 (2019), arXiv:1812.11118 — empirical "double descent" curves for ERM beyond the interpolation threshold

Next Topics

The natural next steps from ERM:

Last reviewed: April 26, 2026
