Empirical Processes and Chaining
Bounding the supremum of empirical processes via covering numbers and chaining: Dudley's entropy integral and Talagrand's generic chaining, the sharpest tools in classical learning theory.
Prerequisites
Why This Matters
The central question of learning theory is: how fast does the empirical risk converge to the population risk, uniformly over a function class? This question reduces to bounding the supremum of an empirical process. Dudley's entropy integral and Talagrand's generic chaining are the two main tools for this task.
Dudley's bound converts covering number estimates (geometric information about the function class) into bounds on the expected supremum. Generic chaining improves on Dudley by a logarithmic factor in many cases. Together, they represent the culmination of the covering number approach to learning theory.
Formal Setup
Empirical Process
Given i.i.d. samples $X_1, \dots, X_n \sim P$ and a function class $\mathcal{F}$, the empirical process is:

$$(P_n - P)f \;=\; \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}\,f(X_1), \qquad f \in \mathcal{F},$$

where $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ is the empirical measure.
Supremum of the Empirical Process
The quantity of interest for uniform convergence is:

$$\|P_n - P\|_{\mathcal{F}} \;=\; \sup_{f \in \mathcal{F}} \bigl|(P_n - P)f\bigr|.$$

Bounding $\mathbb{E}\,\|P_n - P\|_{\mathcal{F}}$ controls how well empirical risk approximates population risk uniformly over $\mathcal{F}$.
Covering Number
The covering number $N(\varepsilon, \mathcal{F}, d)$ is the minimum number of balls of radius $\varepsilon$ (in metric $d$) needed to cover $\mathcal{F}$. The metric entropy is $\log N(\varepsilon, \mathcal{F}, d)$.
Packing Number
The packing number $M(\varepsilon, \mathcal{F}, d)$ is the maximum number of points in $\mathcal{F}$ that are pairwise at distance strictly greater than $\varepsilon$.
Covering and packing numbers are equivalent up to a factor of 2 in the radius:

$$M(2\varepsilon, \mathcal{F}, d) \;\le\; N(\varepsilon, \mathcal{F}, d) \;\le\; M(\varepsilon, \mathcal{F}, d).$$

The right-hand inequality follows because a maximal $\varepsilon$-packing is automatically an $\varepsilon$-covering (no point can be at distance greater than $\varepsilon$ from all packing points without contradicting maximality). For chaining bounds, packing and covering numbers can be used interchangeably up to constants. See van Handel (2016), Lemma 4.1.
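A quick numerical illustration of the maximality argument (a sketch; the point cloud, radius, and greedy order are arbitrary choices, not from the text):

```python
# Sanity check of the packing/covering relationship on a random point cloud:
# a maximal eps-packing is also an eps-covering, hence N(eps) <= M(eps).
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((2000, 2))   # the set to be packed/covered, a subset of [0,1]^2

def greedy_packing(pts, eps):
    """Maximal eps-packing: keep a point iff it lies > eps from every kept point."""
    centers = []
    for p in pts:
        if all(np.linalg.norm(p - c) > eps for c in centers):
            centers.append(p)
    return np.array(centers)

eps = 0.1
centers = greedy_packing(points, eps)
# Maximality forces every point to lie within eps of some center,
# so the packing doubles as an eps-covering of the point cloud.
gaps = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
assert gaps.max() <= eps
print(f"greedy packing at eps={eps}: {len(centers)} centers, max gap {gaps.max():.3f}")
```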
Metric Entropy: Catalog of Standard Classes
Before chaining can give a numerical bound, you need a covering number estimate for the function class. The following entries are the workhorses:
| Class | Metric entropy $\log N(\varepsilon)$ | Metric | Source |
|---|---|---|---|
| $\ell_2$ ball in $\mathbb{R}^d$, radius $R$ | $d \log(1 + 2R/\varepsilon)$ | $\ell_2$ | Vershynin HDP, Cor. 4.2.13 |
| Linear class $\{x \mapsto \langle w, x \rangle : \lVert w\rVert_2 \le 1\}$ on $\lVert x\rVert_2 \le 1$ | $\lesssim d \log(1 + 2/\varepsilon)$ | $L_2(P_n)$ | Wainwright HDS, Ex. 5.4 |
| VC class with $\mathrm{VC}(\mathcal{F}) = d$, envelope 1 | $\lesssim d \log(1/\varepsilon)$ | $L_2(Q)$, any $Q$ | van der Vaart-Wellner, Thm. 2.6.7 |
| $L$-Lipschitz functions on $[0,1]^d$ | $\asymp (L/\varepsilon)^d$ | sup-norm | Kolmogorov-Tikhomirov (1959) |
| $\alpha$-smooth functions on $[0,1]^d$ | $\asymp \varepsilon^{-d/\alpha}$ | sup-norm | Kolmogorov-Tikhomirov (1959) |
| RKHS ball, kernel eigenvalues decaying as $j^{-2\beta}$ | $\lesssim \varepsilon^{-1/\beta}$ | $L_2$ | Steinwart-Christmann, Thm. 6.27 |
The qualitative regimes for chaining:
- Polynomial covering numbers $N(\varepsilon) \asymp \varepsilon^{-d}$, i.e. $\log N(\varepsilon) \asymp d \log(1/\varepsilon)$: parametric regime, Dudley integral $\asymp \sqrt{d}$, generalization $O(\sqrt{d/n})$.
- Power entropy $\log N(\varepsilon) \asymp \varepsilon^{-p}$ with $p < 2$: Dudley integral converges, giving $\mathbb{E}\|P_n - P\|_{\mathcal{F}} = O(n^{-1/2})$; localized analysis sharpens this to the nonparametric ERM rate $n^{-2/(2+p)}$.
- Critical entropy $\log N(\varepsilon) \asymp \varepsilon^{-2}$: Dudley integral diverges (logarithmically); chaining alone is insufficient, you need localization.
- Heavy entropy $\log N(\varepsilon) \asymp \varepsilon^{-p}$ with $p > 2$: the minimax-rate problem requires entirely different machinery (Le Cam's two-point method, Fano). The numerical sketch below illustrates the $p = 2$ threshold.
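The convergence threshold is easy to see numerically. A small illustration (assumed entropy model $\log N(\varepsilon) = \varepsilon^{-p}$, evaluated in closed form; not from the text):

```python
# Behavior of the truncated Dudley integral  int_delta^1 sqrt(log N(eps)) d eps
# under log N(eps) = eps^{-p}: it stabilizes for p < 2 and blows up for p >= 2
# as the lower cutoff delta shrinks.
import numpy as np

def dudley(p, delta):
    """Closed form of int_delta^1 eps^{-p/2} d eps."""
    a = 1.0 - p / 2.0
    if abs(a) < 1e-12:              # p = 2: logarithmic divergence
        return -np.log(delta)
    return (1.0 - delta**a) / a

for p in [0.5, 1.0, 1.9, 2.0, 2.5]:
    vals = [dudley(p, d) for d in (1e-2, 1e-4, 1e-6)]
    print(f"p={p}: " + "  ".join(f"{v:12.1f}" for v in vals))
```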
The Chaining Idea
The naive approach is to discretize at a single resolution $\varepsilon$ and take a union bound over the $\varepsilon$-net. This gives:

$$\mathbb{E}\sup_{f \in \mathcal{F}} \bigl|(P_n - P)f\bigr| \;\lesssim\; \varepsilon \;+\; \sqrt{\frac{\log N(\varepsilon, \mathcal{F}, L_2(P))}{n}}.$$

The problem: choosing $\varepsilon$ involves a tradeoff. Small $\varepsilon$ controls approximation error but increases the covering number. Large $\varepsilon$ is the reverse.
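To see the tradeoff concretely, take a class with $\log N(\varepsilon) \asymp d \log(1/\varepsilon)$ (a standard computation, constants suppressed):

$$\mathbb{E}\sup_{f \in \mathcal{F}}\bigl|(P_n - P)f\bigr| \;\lesssim\; \varepsilon + \sqrt{\frac{d \log(1/\varepsilon)}{n}}, \qquad \varepsilon^{\star} \asymp \sqrt{\frac{d}{n}} \;\Longrightarrow\; \text{bound} \asymp \sqrt{\frac{d}{n}\log\frac{n}{d}}.$$

The optimized single-scale bound carries a spurious $\sqrt{\log(n/d)}$ factor that chaining removes.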
Chaining resolves this by discretizing at all scales simultaneously. Build a sequence of increasingly fine nets at scales $\varepsilon_k = 2^{-k}\delta$. For each $f$, write

$$f = \pi_0(f) + \sum_{k \ge 1}\bigl(\pi_k(f) - \pi_{k-1}(f)\bigr),$$

where $\pi_k(f)$ is the nearest point to $f$ in the $k$-th net. Bound each increment separately.
Main Theorems
Dudley's Entropy Integral
Statement
Let $\mathcal{F}$ be a function class with envelope $F$ (meaning $|f(x)| \le F(x)$ for all $f \in \mathcal{F}$) and let $\delta^2 = \sup_{f \in \mathcal{F}} P_n f^2$. Then:

$$\mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \Bigl|\frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i)\Bigr| \;\le\; \frac{C}{\sqrt{n}} \int_0^{\delta} \sqrt{\log N(\varepsilon, \mathcal{F}, L_2(P_n))}\,d\varepsilon,$$

where $C$ is a universal constant. The integral converges whenever the covering numbers grow at most polynomially as $\varepsilon \to 0$.
Intuition
At each scale $\varepsilon$, the contribution to the supremum is controlled by $\varepsilon \sqrt{\log N(\varepsilon)}$ (a union bound over the net at that scale). Chaining integrates these contributions across all scales. Coarse scales (large $\varepsilon$) contribute little because the net is small. Fine scales (small $\varepsilon$) contribute little because the increments are small. The integral balances these.
Proof Sketch
Fix a sequence of scales $\varepsilon_k = 2^{-k}\delta$ and corresponding $\varepsilon_k$-nets $T_k$. For each $f$, let $\pi_k(f)$ be the closest point in $T_k$. Write:

$$f = \pi_0(f) + \sum_{k \ge 1}\bigl(\pi_k(f) - \pi_{k-1}(f)\bigr).$$

The first term is bounded by $\delta\sqrt{\log |T_0|}$ (up to constants) via a union bound. Each increment has sub-Gaussian parameter $d(\pi_k(f), \pi_{k-1}(f)) \le \varepsilon_k + \varepsilon_{k-1} = 3\varepsilon_k$ and there are at most $|T_k|\,|T_{k-1}| \le N(\varepsilon_k)^2$ choices, giving $\varepsilon_k \sqrt{\log N(\varepsilon_k)}$ per scale. Summing over scales gives the integral.
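The final summation step is worth writing out. With $\varepsilon_k = 2^{-k}\delta$ the gaps satisfy $\varepsilon_k - \varepsilon_{k+1} = \varepsilon_k/2$, and since $\log N(\varepsilon)$ is nonincreasing in $\varepsilon$:

$$\sum_{k \ge 0} \varepsilon_k \sqrt{\log N(\varepsilon_k)} \;=\; 2\sum_{k \ge 0} (\varepsilon_k - \varepsilon_{k+1})\sqrt{\log N(\varepsilon_k)} \;\le\; 2\int_0^{\delta} \sqrt{\log N(\varepsilon)}\,d\varepsilon,$$

bounding each term by the integral of $\sqrt{\log N}$ over $[\varepsilon_{k+1}, \varepsilon_k]$.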
Why It Matters
This is the standard tool for converting covering number estimates into generalization bounds. For a class with $\log N(\varepsilon) \lesssim d \log(1/\varepsilon)$ (like VC classes or linear predictors), the integral evaluates to $O(\sqrt{d})$, recovering the standard $O(\sqrt{d/n})$ generalization bound without going through VC theory.
Failure Mode
Dudley's bound is not tight in general. It can be loose by a logarithmic factor compared to the true expected supremum (the canonical example is an ellipsoid in $\ell_2$). For classes where the covering numbers have complex structure at different scales, generic chaining gives tighter results.
Talagrand's Generic Chaining (Majorizing Measures)
Statement
For a centered Gaussian process $(X_t)_{t \in T}$ with canonical metric $d(s,t) = \sqrt{\mathbb{E}(X_s - X_t)^2}$, define the functional:

$$\gamma_2(T, d) \;=\; \inf_{(T_k)}\, \sup_{t \in T}\, \sum_{k \ge 0} 2^{k/2}\, d(t, T_k),$$

where the infimum is over all admissible sequences of sets $T_k \subseteq T$ with $|T_0| = 1$ and $|T_k| \le 2^{2^k}$. Then:

$$c\,\gamma_2(T, d) \;\le\; \mathbb{E}\sup_{t \in T} X_t \;\le\; C\,\gamma_2(T, d)$$

for universal constants $c, C > 0$. The functional $\gamma_2$ characterizes the expected supremum up to universal constants (the majorizing measure theorem).
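Dudley's integral is recovered as one particular choice of admissible sequence (a sketch, constants suppressed): let $T_k$ be an optimal $\varepsilon_k$-net, where $\varepsilon_k$ is the smallest radius with $N(\varepsilon_k, T, d) \le 2^{2^k}$. Then $d(t, T_k) \le \varepsilon_k$ for every $t$, so

$$\gamma_2(T, d) \;\le\; \sup_{t \in T}\sum_{k \ge 0} 2^{k/2}\, d(t, T_k) \;\le\; \sum_{k \ge 0} 2^{k/2}\,\varepsilon_k \;\lesssim\; \int_0^{\infty} \sqrt{\log N(\varepsilon, T, d)}\,d\varepsilon.$$

Generic chaining is therefore never worse than Dudley, and strictly better whenever no single sequence of nets is simultaneously near-optimal for all $t \in T$.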
Intuition
Generic chaining replaces the uniform covering at each scale (Dudley) with an optimized covering that can vary across the set $T$. Some points of $T$ may need fine resolution while others do not. The functional $\gamma_2$ captures this by allowing different "chains" for different points.
Proof Sketch
The upper bound follows the same chaining idea as Dudley, but with optimized nets instead of minimal coverings. The lower bound (which makes this a characterization, not just an upper bound) uses the Sudakov minoration argument and a sophisticated construction of the admissible sequence.
Why It Matters
This is the definitive result on suprema of Gaussian processes. Since empirical processes can be compared to Gaussian processes via symmetrization and comparison inequalities, this provides the tightest possible bounds on learning theory quantities. It resolves a long-standing question in probability theory about when chaining gives the right answer.
Failure Mode
The functional $\gamma_2$ is hard to compute in general. For most applications, Dudley's entropy integral is sufficient and much more practical. Generic chaining is most useful for proving theoretical optimality results rather than computing explicit bounds.
Lower Bounds: Sudakov Minoration
Dudley's integral is an upper bound. Without a matching lower bound there is no way to know whether the integral overstates the true expected supremum. Sudakov's minoration provides a single-scale lower bound for Gaussian processes.
Sudakov's Minoration
Statement
Let $(X_t)_{t \in T}$ be a centered Gaussian process with canonical metric $d(s,t) = \sqrt{\mathbb{E}(X_s - X_t)^2}$. Then for every $\varepsilon > 0$:

$$\mathbb{E}\sup_{t \in T} X_t \;\ge\; c\,\varepsilon\sqrt{\log M(\varepsilon, T, d)}$$

for a universal constant $c > 0$. Equivalently, since $N(\varepsilon, T, d) \le M(\varepsilon, T, d)$:

$$\mathbb{E}\sup_{t \in T} X_t \;\ge\; c\,\varepsilon\sqrt{\log N(\varepsilon, T, d)}.$$
Intuition
A packing of size $M = M(\varepsilon, T, d)$ in the canonical metric produces $M$ Gaussians whose pairwise differences have standard deviation at least $\varepsilon$. The maximum of $M$ independent or weakly correlated standard Gaussians grows like $\sqrt{\log M}$, so taking the maximum over the packing gives a contribution of order $\varepsilon\sqrt{\log M}$. The minoration extracts this single-scale lower bound; taking the supremum over $\varepsilon$ is the sharpest version.
Proof Sketch
Fix an $\varepsilon$-packing $\{t_1, \dots, t_M\}$ where $M = M(\varepsilon, T, d)$. Compare the Gaussian process to an independent collection of $\mathcal{N}(0, \varepsilon^2/2)$ random variables $(Y_i)$ using the Sudakov-Fernique inequality: since $\mathbb{E}(X_{t_i} - X_{t_j})^2 \ge \varepsilon^2 = \mathbb{E}(Y_i - Y_j)^2$, the original process has larger increments and hence at least as large an expected supremum. The expected supremum of $M$ i.i.d. $\mathcal{N}(0, \varepsilon^2/2)$ variables is at least $c\,\varepsilon\sqrt{\log M}$ for $M \ge 2$. See van Handel (2016), Theorem 5.30.
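A quick simulation of the growth rate that drives the minoration (an illustrative sketch; sample sizes are arbitrary):

```python
# Monte Carlo check that the expected maximum of M standard Gaussians
# tracks sqrt(2 log M), the engine behind Sudakov's lower bound.
import numpy as np

rng = np.random.default_rng(1)
for M in [10, 100, 1000, 10000]:
    emax = rng.standard_normal((500, M)).max(axis=1).mean()
    print(f"M={M:<6} E[max] ~ {emax:.3f}   sqrt(2 log M) = {np.sqrt(2 * np.log(M)):.3f}")
```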
Why It Matters
Combined with Dudley's upper bound, Sudakov gives:

$$c\,\sup_{\varepsilon > 0}\,\varepsilon\sqrt{\log N(\varepsilon, T, d)} \;\le\; \mathbb{E}\sup_{t \in T} X_t \;\le\; C\int_0^{\infty}\sqrt{\log N(\varepsilon, T, d)}\,d\varepsilon.$$

The two bounds can differ by a logarithmic factor in the worst case, and match up to constants when $\log N(\varepsilon)$ is regularly varying. This is why Talagrand's $\gamma_2$ functional was needed: it closes the gap and gives a characterization up to universal constants.
Failure Mode
Sudakov is sharp only at the single optimizing scale and only for Gaussian processes (or sub-Gaussian processes via a comparison theorem). For sub-exponential or heavy-tailed processes, neither Sudakov nor Dudley directly applies; you need separate tools (Bernstein-type concentration, generic chaining with the $\gamma_1$ functional).
Gaussian Comparison: Slepian-Sudakov-Fernique
Sudakov's proof uses a Gaussian comparison inequality. The same tool lets you compare the expected supremum of a complicated Gaussian process to a simpler reference process whose supremum is known.
Sudakov-Fernique Inequality
Statement
Let $(X_t)_{t \in T}$ and $(Y_t)_{t \in T}$ be centered Gaussian processes on a finite index set $T$. If

$$\mathbb{E}(X_s - X_t)^2 \;\le\; \mathbb{E}(Y_s - Y_t)^2 \quad \text{for all } s, t \in T,$$

then

$$\mathbb{E}\sup_{t \in T} X_t \;\le\; \mathbb{E}\sup_{t \in T} Y_t.$$
Intuition
Larger increments push points apart more, so the maximum spreads farther from zero in expectation. The comparison is on increments, not on marginal variances; only the geometry of the process matters.
Slepian's inequality is the stronger statement: under the same increment condition plus equal marginal variances ($\mathbb{E}X_t^2 = \mathbb{E}Y_t^2$ for all $t$), every tail probability of $\sup_t Y_t$ dominates the corresponding probability for $\sup_t X_t$. Sudakov-Fernique drops the equal-variance condition at the cost of giving only an expectation comparison.
Proof Sketch
Interpolate between the two processes via $Z_t(\theta) = \sqrt{1-\theta}\,X_t + \sqrt{\theta}\,Y_t$. Differentiate $\mathbb{E}\max_t Z_t(\theta)$ in $\theta$ using Gaussian integration by parts (Stein's lemma). The increment-domination condition makes the derivative non-negative, so the expected max increases as $\theta$ moves from $0$ (process $X$) to $1$ (process $Y$). See Vershynin HDP, Theorem 7.2.11.
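A small numerical check of the comparison (an illustrative sketch with equicorrelated vs. independent Gaussians; the parameters are arbitrary stand-ins, not from the text):

```python
# Equicorrelated X has smaller increments than independent Y, so
# Sudakov-Fernique predicts E max X <= E max Y.
import numpy as np

rng = np.random.default_rng(2)
m, trials, rho = 50, 20000, 0.6
g0 = rng.standard_normal((trials, 1))          # shared component
g = rng.standard_normal((trials, m))
X = rho * g0 + np.sqrt(1 - rho**2) * g         # E(X_s - X_t)^2 = 2(1 - rho^2) < 2
Y = rng.standard_normal((trials, m))           # E(Y_s - Y_t)^2 = 2
print("E max X ~", round(X.max(axis=1).mean(), 3))   # smaller
print("E max Y ~", round(Y.max(axis=1).mean(), 3))   # larger
```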
Why It Matters
This is the key tool to pass from a Gaussian process indexed by a function class to a canonical Gaussian process whose suprema are explicit. Examples:
- Random matrix operator norm: the operator norm of an $m \times n$ Gaussian matrix $G$ can be sandwiched between two reference processes, yielding $\mathbb{E}\|G\|_{\mathrm{op}} \le \sqrt{m} + \sqrt{n}$ (Gordon's theorem).
- Comparison to white noise: bounds on suprema of empirical processes often pass through comparison with a Gaussian process built from Brownian motion.
- Chevet-type inequalities: bilinear bounds for products of two function classes use Sudakov-Fernique on the bilinear Gaussian process $X_{s,t} = \sum_{i,j} g_{ij}\, s_i t_j$.
Failure Mode
Both processes must be Gaussian. The inequality fails for sub-Gaussian processes in general; the Talagrand-Borell-Tsirelson comparison theorems are the sub-Gaussian replacements but require additional structure.
Contraction (Ledoux-Talagrand)
In learning theory you often need to compare the Rademacher complexity of a loss class $\ell \circ \mathcal{H}$ to the Rademacher complexity of the hypothesis class $\mathcal{H}$. The Ledoux-Talagrand contraction inequality makes this exchange explicit when the loss is Lipschitz.
Ledoux-Talagrand Contraction Inequality
Statement
Let $\mathcal{F}$ be a class of real-valued functions on $\mathcal{X}$. Let $\varphi_i : \mathbb{R} \to \mathbb{R}$ be $L$-Lipschitz with $\varphi_i(0) = 0$. Let $\sigma_1, \dots, \sigma_n$ be i.i.d. Rademacher signs and $x_1, \dots, x_n \in \mathcal{X}$ fixed. Then:

$$\mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \Bigl|\sum_{i=1}^n \sigma_i\, \varphi_i(f(x_i))\Bigr| \;\le\; 2L\; \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \Bigl|\sum_{i=1}^n \sigma_i\, f(x_i)\Bigr|.$$

Dividing both sides by $n$ gives the empirical Rademacher-complexity form:

$$\hat{\mathfrak{R}}_n(\varphi \circ \mathcal{F}) \;\le\; 2L\, \hat{\mathfrak{R}}_n(\mathcal{F}).$$

The constant $2$ is the standard Ledoux-Talagrand constant; for the one-sided supremum without absolute values, or when $\mathcal{F}$ is symmetric ($-\mathcal{F} = \mathcal{F}$), the constant tightens to $1$ (Boucheron-Lugosi-Massart, Theorem 11.6).
Intuition
Pointwise Lipschitz application cannot stretch distances by more than the factor $L$. The Rademacher complexity measures how much the function class can correlate with random signs on the data. Contraction passes this correlation through $\varphi$ at a cost of at most the Lipschitz constant. The anchoring $\varphi_i(0) = 0$ removes a deterministic shift that would otherwise add a constant to the supremum without increasing the complexity.
Proof Sketch
Peel off one coordinate at a time. Conditional on $\sigma_1, \dots, \sigma_{n-1}$, the remaining randomness is in $\sigma_n$. Show that for any function $G$ of the first $n-1$ coordinates:

$$\mathbb{E}_{\sigma_n}\sup_{f \in \mathcal{F}}\bigl[G(f) + \sigma_n\,\varphi(f(x_n))\bigr] \;\le\; \mathbb{E}_{\sigma_n}\sup_{f \in \mathcal{F}}\bigl[G(f) + \sigma_n\, f(x_n)\bigr].$$

Expanding the expectation over $\sigma_n = \pm 1$ reduces this to a direct calculation for the maximizing pair $f, f'$, using $|\varphi(a) - \varphi(b)| \le L|a - b|$ and the symmetry of $\sigma_n$. Iterate over coordinates. The factor of $2$ enters when reducing the absolute-value supremum to a one-sided supremum over the symmetrized class $\mathcal{F} \cup (-\mathcal{F})$. See Ledoux-Talagrand (1991), Theorem 4.12; cleaner exposition in Mohri-Rostamizadeh-Talwalkar (2018), Lemma 5.7.
Why It Matters
This is the bridge from $\hat{\mathfrak{R}}_n(\mathcal{H})$ (the quantity you can compute or bound via covering numbers) to $\hat{\mathfrak{R}}_n(\ell \circ \mathcal{H})$ (the quantity that appears in the generalization bound).
- Margin loss: the ramp loss with margin $\gamma$ is $(1/\gamma)$-Lipschitz, so the margin-loss complexity is at most $(2/\gamma)\,\hat{\mathfrak{R}}_n(\mathcal{H})$.
- Logistic loss: $u \mapsto \log(1 + e^{-u})$ is $1$-Lipschitz, same bound.
- Hinge loss: $u \mapsto \max(0, 1 - u)$ is $1$-Lipschitz with value $1 \ne 0$ at $u = 0$; use a centering trick or absorb the constant into the bias.
For neural networks where the pre-activation is bounded in norm and the post-activation is squashed by a Lipschitz nonlinearity, this gives layer-by-layer Rademacher bounds. A numerical sanity check follows below.
For the full statement and proof discussion, see contraction inequality.
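Here is the promised sanity check of $\hat{\mathfrak{R}}_n(\varphi \circ \mathcal{F}) \le 2L\,\hat{\mathfrak{R}}_n(\mathcal{F})$ (an illustrative sketch: the unit-norm linear class, the shifted hinge $\varphi(u) = \max(0, 1-u) - 1$, and all sizes are arbitrary stand-ins, not from the text):

```python
# Empirical contraction check for a finite function class and a 1-Lipschitz
# phi with phi(0) = 0.
import numpy as np

rng = np.random.default_rng(3)
n, d, n_funcs, trials = 30, 5, 200, 20000
X = rng.standard_normal((n, d))
W = rng.standard_normal((n_funcs, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)    # finite proxy for the l2 ball
outputs = X @ W.T                                # outputs[i, j] = f_j(x_i)

phi = lambda u: np.maximum(0.0, 1.0 - u) - 1.0   # 1-Lipschitz, phi(0) = 0

def rademacher(vals):
    """E_sigma sup_f |(1/n) sum_i sigma_i f(x_i)| over the finite class."""
    sigma = rng.choice([-1.0, 1.0], size=(trials, vals.shape[0]))
    return np.abs(sigma @ vals).max(axis=1).mean() / vals.shape[0]

r_f, r_phi = rademacher(outputs), rademacher(phi(outputs))
print(f"R(F) ~ {r_f:.4f}   R(phi o F) ~ {r_phi:.4f}   2 R(F) = {2 * r_f:.4f}")
```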
Failure Mode
Two failure modes appear in practice:
- The anchoring condition matters. Without $\varphi_i(0) = 0$, the inequality picks up an additive term driven by the offsets $\varphi_i(0)$ that does not vanish even when the function class is centered. Most ML losses can be shifted to satisfy this without changing the learned predictor.
- Vector contraction is more subtle. For a class of vector-valued functions with a Lipschitz scalarization, the Maurer (2016) vector contraction inequality gives constant $\sqrt{2}L$ but replaces the single Rademacher sum with a doubly-indexed sum over output coordinates, which introduces $\sqrt{K}$-dependence in the worst case for $K$ outputs. This matters for multi-class classification, where the effective $K$ is the number of classes.
Connection to Learning Theory
The Rademacher complexity of a function class $\mathcal{F}$ is:

$$\mathfrak{R}_n(\mathcal{F}) \;=\; \mathbb{E}\,\sup_{f \in \mathcal{F}}\,\frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i).$$

By symmetrization, $\mathbb{E}\|P_n - P\|_{\mathcal{F}} \le 2\,\mathfrak{R}_n(\mathcal{F})$. Dudley's bound then gives:

$$\mathfrak{R}_n(\mathcal{F}) \;\le\; \frac{C}{\sqrt{n}}\,\mathbb{E}\int_0^{\delta}\sqrt{\log N(\varepsilon, \mathcal{F}, L_2(P_n))}\,d\varepsilon.$$

This is how covering numbers and chaining enter generalization bounds.
Canonical Examples
Linear classifiers in R^d
For $\mathcal{F} = \{x \mapsto \langle w, x\rangle : \|w\|_2 \le B\}$ with data satisfying $\|x_i\|_2 \le R$, the covering number satisfies $\log N(\varepsilon, \mathcal{F}, L_2(P_n)) \lesssim d \log(1 + BR/\varepsilon)$. Dudley's integral evaluates to $O(BR\sqrt{d})$, yielding $\mathfrak{R}_n(\mathcal{F}) = O(BR\sqrt{d/n})$ via the chaining route.
This bound is loose for the $\ell_2$-bounded class. A direct Cauchy-Schwarz argument gives a dimension-free bound: $\hat{\mathfrak{R}}_n(\mathcal{F}) \le BR/\sqrt{n}$, independent of $d$. Chaining recovers the dimension-dependent $\sqrt{d/n}$ rate appropriate for weight classes that genuinely fill out all $d$ dimensions; for the $\ell_2$ ball, the direct bound is tighter and is the one used in standard analyses of margin-based classifiers and kernel methods.
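The direct argument, written out (standard steps: dual norm, Jensen, and independence of the signs):

$$\hat{\mathfrak{R}}_n(\mathcal{F}) \;=\; \mathbb{E}_\sigma \sup_{\|w\|_2 \le B} \frac{1}{n}\Bigl\langle w, \sum_{i=1}^n \sigma_i x_i \Bigr\rangle \;=\; \frac{B}{n}\,\mathbb{E}_\sigma\Bigl\|\sum_{i=1}^n \sigma_i x_i\Bigr\|_2 \;\le\; \frac{B}{n}\Bigl(\sum_{i=1}^n \|x_i\|_2^2\Bigr)^{1/2} \;\le\; \frac{BR}{\sqrt{n}}.$$

The middle inequality is Jensen plus $\mathbb{E}\,\sigma_i \sigma_j = \mathbf{1}\{i = j\}$, which kills the cross terms.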
Lipschitz functions on the unit cube
Let $\mathcal{F} = \{f : [0,1]^d \to \mathbb{R} \mid f \text{ is } 1\text{-Lipschitz}\}$. Kolmogorov-Tikhomirov (1959) gives:

$$\log N(\varepsilon, \mathcal{F}, \|\cdot\|_\infty) \;\asymp\; \varepsilon^{-d}.$$

Substituting into Dudley's integral:

$$\mathfrak{R}_n(\mathcal{F}) \;\lesssim\; \frac{1}{\sqrt{n}}\int_0^1 \varepsilon^{-d/2}\,d\varepsilon.$$

The integral converges only when $d < 2$. For $d = 1$ it evaluates to $2$, and the Rademacher complexity is $O(n^{-1/2})$, the parametric rate. For $d \ge 2$, Dudley's integral diverges and you need localization: only small balls around the empirical risk minimizer matter, and the entropy of those small balls grows more slowly with $n$. Localized Rademacher complexity (Bartlett-Bousquet-Mendelson, 2005) gives the correct rate, $n^{-1/d}$ for $d > 2$.
This is why high-dimensional nonparametric regression suffers the curse of dimensionality even with Lipschitz smoothness: the metric entropy is just too large for naive chaining.
Function class with VC dimension d
For a binary VC class with $\mathrm{VC}(\mathcal{F}) = d$, Haussler's theorem gives:

$$N(\varepsilon, \mathcal{F}, L_2(Q)) \;\le\; e(d+1)\,(2e)^d\,\varepsilon^{-2d} \quad \text{for every probability measure } Q,$$

so $\log N(\varepsilon) \lesssim d \log(1/\varepsilon)$. Dudley's integral over $[0,1]$ evaluates to:

$$\int_0^1 \sqrt{d \log(1/\varepsilon)}\,d\varepsilon \;\lesssim\; \sqrt{d}.$$

(The substitution $\varepsilon = e^{-u}$ gives $\sqrt{d}\int_0^\infty \sqrt{u}\,e^{-u}\,du = \sqrt{d}\,\Gamma(3/2) = \sqrt{\pi d}/2$.)
This recovers the classical $O(\sqrt{d/n})$ generalization bound through chaining, without any combinatorial Sauer-Shelah argument. The chaining route generalizes to non-binary classes; the Sauer-Shelah route does not.
Common Confusions
Chaining is not a single-scale argument
The power of chaining comes from simultaneously using approximations at all scales. A single-scale argument (discretize at one $\varepsilon$) always involves a tradeoff between approximation error and the size of the net. Chaining eliminates this tradeoff by summing contributions across scales.
Dudley vs Talagrand: when does the difference matter?
For most function classes encountered in ML (VC classes, kernel classes, neural network classes with bounded norms), Dudley's integral gives bounds that are tight up to logarithmic factors. Generic chaining matters when the geometry of the function class has different complexity at different scales, which is rare in standard applications but important in probability theory.
The Lipschitz constant in contraction is not always 1
Many ML practitioners assume the contraction inequality is "free" because common losses are 1-Lipschitz. But the relevant Lipschitz constant is the one for the loss as a function of the hypothesis output $u = f(x)$, not as a function of the other loss inputs. For squared loss on bounded predictions $|u| \le B$ and labels $|y| \le B$, the Lipschitz constant in $u$ is $4B$, not 1. Forgetting this factor is a common source of vacuous bounds.
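Worked out (elementary calculus): for $\ell(u) = (u - y)^2$ with $|u| \le B$ and $|y| \le B$,

$$|\ell'(u)| \;=\; 2|u - y| \;\le\; 2(|u| + |y|) \;\le\; 4B,$$

so contraction must be applied with $L = 4B$, inflating the final bound by the same factor.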
Sudakov is for Gaussian, Dudley is for sub-Gaussian
Dudley's entropy integral applies to any process with sub-Gaussian increments (i.e., $\|X_s - X_t\|_{\psi_2} \lesssim d(s,t)$). Sudakov's minoration applies only to Gaussian processes; the lower bound can fail for sub-Gaussian processes. For empirical processes, you typically use Dudley directly (the symmetrized process is sub-Gaussian conditional on the data) and prove lower bounds via Le Cam's method or Fano's inequality rather than Sudakov.
Empirical process supremum vs uniform deviation
The supremum of the empirical process $\sup_{f \in \mathcal{F}} |(P_n - P)f|$ is the natural quantity for two-sided bounds. Some papers work with the one-sided $\sup_{f \in \mathcal{F}} (Pf - P_n f)$, which can be up to a factor of 2 smaller in expectation and admits slightly tighter constants. Both lead to the same generalization rate; only constants differ. Pay attention to which version a paper uses before plugging numbers in.
Summary
- Empirical process theory reduces generalization to bounding $\mathbb{E}\sup_{f \in \mathcal{F}} |(P_n - P)f|$
- Dudley's entropy integral: $\mathbb{E}\sup_{f \in \mathcal{F}} |(P_n - P)f| \lesssim \frac{1}{\sqrt{n}}\int_0^\delta \sqrt{\log N(\varepsilon, \mathcal{F}, L_2(P_n))}\,d\varepsilon$
- Chaining works by discretizing at all scales simultaneously
- Generic chaining ($\gamma_2$) is tight for Gaussian processes but hard to compute
- Sudakov-Fernique compares Gaussian processes via increment domination
- Ledoux-Talagrand contraction passes Rademacher bounds through Lipschitz losses with constant $2L$
- For practical ML bounds, Dudley's integral with covering number estimates suffices
Exercises
Problem
For a function class with $\log N(\varepsilon, \mathcal{F}, L_2(P_n)) \le d \log(2/\varepsilon)$, evaluate Dudley's entropy integral up to constants and state the resulting generalization bound.
Problem
Explain why a single-scale $\varepsilon$-net argument gives a bound of $\varepsilon + \sqrt{d \log(1/\varepsilon)/n}$ and why optimizing over $\varepsilon$ gives $\sqrt{(d/n)\log(n/d)}$, which is worse than Dudley's $\sqrt{d/n}$ by a $\sqrt{\log n}$ factor.
Problem
A neural network with one hidden layer of width $m$, sigmoid activation, and weights bounded by $B$ has covering number satisfying $\log N(\varepsilon) \lesssim m\,d\,\log(B/\varepsilon)$, where $d$ is the input dimension. Apply Dudley's integral plus the Ledoux-Talagrand contraction inequality to bound the Rademacher complexity of the loss class for the margin-loss surrogate with margin $\gamma$.
Problem
State Sudakov's minoration: a lower bound on the expected supremum of a Gaussian process in terms of a single-scale packing number. Explain why this shows Dudley's integral cannot be improved by more than a logarithmic factor.
References
Canonical:
- van der Vaart & Wellner (1996), Weak Convergence and Empirical Processes, Chapters 2.5-2.6 (Dudley) and 2.7 (chaining with bracketing)
- Vershynin (2018), High-Dimensional Probability, Chapter 8 (chaining), Theorem 7.2.11 (Sudakov-Fernique)
- van Handel (2016), Probability in High Dimension, Chapters 4-5 (Dudley, Sudakov, generic chaining)
- Ledoux & Talagrand (1991), Probability in Banach Spaces, Theorem 4.12 (contraction inequality)
Current:
- Talagrand (2014), Upper and Lower Bounds for Stochastic Processes (2nd ed.) — definitive treatment of generic chaining and the $\gamma_2$ functional
- Boucheron, Lugosi, Massart (2013), Concentration Inequalities, Chapters 11-13 (Rademacher contraction, chaining)
- Wainwright (2019), High-Dimensional Statistics, Chapter 5 (uniform laws via chaining)
- Mohri, Rostamizadeh, Talwalkar (2018), Foundations of Machine Learning (2nd ed.), Lemma 5.7 (contraction with constant $1$)
- Maurer (2016), "A vector-contraction inequality for Rademacher complexities," ALT
Frontier:
- Bartlett, Bousquet, Mendelson (2005), "Local Rademacher complexities," Ann. Stat. 33(4), 1497-1537 — localization for fast rates beyond Dudley
- Bartlett, Foster, Telgarsky (2017), "Spectrally-normalized margin bounds for neural networks," NeurIPS — chaining-based norm capacity bounds for deep nets
- Koltchinskii (2011), Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, Saint-Flour XXXVIII
Next Topics
- Matrix concentration: chaining with operator-norm increments
- Minimax lower bounds: using empirical process theory to prove optimality
Last reviewed: May 5, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Asymptotic Statistics: M-Estimators, Delta Method, LAN (layer 0B, tier 1)
- Epsilon-Nets and Covering Numbers (layer 3, tier 1)
- Measure Concentration and Geometric Functional Analysis (layer 3, tier 1)
- Rademacher Complexity (layer 3, tier 1)
- Glivenko-Cantelli Theorem (layer 2, tier 2)
Derived topics
- Matrix Concentration (layer 3, tier 1)
- Minimax Lower Bounds: Le Cam, Fano, Assouad, and the Reduction to Testing (layer 3, tier 1)