
Concentration Probability

Concentration Inequalities

Bounds on how far random variables deviate from their expectations: Markov, Chebyshev, Hoeffding, and Bernstein. Used throughout generalization theory, bandits, and sample complexity.

Core · Tier 1 · Stable · Core spine · ~70 min

Why This Matters

Concentration inequalities are the backbone of classical uniform-convergence statistical learning theory. Most PAC-style sample complexity results and the core PAC-learning theorems ultimately rest on a statement of the form: "the empirical average is close to the true expectation with high probability." Modern generalization bounds (PAC-Bayes, algorithmic stability, information-theoretic bounds, and algorithm-dependent analyses) sometimes route through concentration only indirectly or replace boundedness assumptions with weaker moment / sub-Gaussian conditions, so the dependence is universal in spirit but not literal in form.

Five-panel infographic on concentration inequalities: the goal (tail bounds for random quantities staying close to their typical value), Markov's inequality as the loosest baseline using only the mean, Chebyshev using the variance, Hoeffding for sums of bounded independent variables (with an exponential-decay tail probability plot), and a how-to-choose-a-bound table comparing Markov / Chebyshev / Hoeffding / Bernstein with practical applications.
Concentration inequalities convert assumptions like nonnegativity, bounded variance, or boundedness into explicit tail bounds. The right one depends on what you can assume about the random variable.

When you see a bound like \hat{R}_n(h) \approx R(h) in ERM theory, the proof always invokes a concentration inequality. Without these tools, you cannot prove anything about learning from finite data.

These are the four inequalities you will use most often, in order of increasing power and increasing assumptions:

  1. Markov: requires only non-negativity
  2. Chebyshev: requires finite variance
  3. Hoeffding: requires bounded random variables
  4. Bernstein: requires bounded range and uses variance information

Mental Model

Think of concentration inequalities as answering one question: how likely is it that the sample average \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i deviates from \mathbb{E}[X] by more than t?

Each inequality trades assumptions for power:

  • Markov/Chebyshev give polynomial tails (1/t^2): slow decay.
  • Hoeffding/Bernstein give exponential tails (e^{-cnt^2}): fast decay.

The exponential tail bounds are what make learning theory work. With polynomial tails, you need n = O(1/(\epsilon^2 \delta)) samples. With exponential tails, you need n = O(\log(1/\delta)/\epsilon^2): the dependence on the confidence \delta drops inside a logarithm.

Which Bound Should You Reach For?

This is the practical decision rule:

  • Only non-negativity → Markov. Tail shape: polynomial, 1/t. Weakest possible structural assumption; useful as a first sanity bound.
  • Finite variance but no bounded range → Chebyshev. Tail shape: polynomial, 1/t^2. Works when variance is known but boundedness or MGF control is unavailable.
  • Independent bounded summands → Hoeffding. Tail shape: exponential, roughly e^{-cnt^2}. Clean default for finite-sample learning bounds and bounded losses.
  • Independent bounded summands with genuinely small variance → Bernstein. Tail shape: Gaussian near the origin, then sub-exponential. Variance-adaptive; much tighter when the loss is sparse or low-noise.

Stronger assumptions do not mean the resulting numeric bound is always smaller at every (n,t)(n,t). The asymptotically stronger inequality can still be looser at modest sample sizes, which is why the worked example below compares Hoeffding and Chebyshev directly instead of repeating the slogan that "exponential beats polynomial."

Interactive: when Bernstein beats Hoeffding

Chebyshev, Hoeffding, and Bernstein all upper-bound the same tail probability. They do not agree on how much slack to leave. Plotted against the true binomial tail for S_n = \frac{1}{n}\sum X_i with X_i \sim \mathrm{Bernoulli}(p), the three bounds arrange themselves by the amount of variance information they use. Push p toward 0 and Bernstein splits away from Hoeffding because \sigma^2 = p(1-p) shrinks. Push p to 0.5 and the two exponential bounds nearly coincide.
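The crossover described above can be reproduced in a few lines. This is a minimal sketch, not part of the original page: the values p = 0.05, n = 100, t = 0.05 are illustrative, and the Monte Carlo estimate stands in for the exact binomial tail.

```python
import math, random

random.seed(0)
n, p, t = 100, 0.05, 0.05          # small p makes the variance small
sigma2 = p * (1 - p)

# Upper bounds on the one-sided tail Pr[mean - p >= t]
cheb = sigma2 / (n * t**2)                            # Chebyshev (two-sided form, still a valid upper bound)
hoef = math.exp(-2 * n * t**2)                        # Hoeffding, X_i in [0, 1]
bern = math.exp(-(n * t**2 / 2) / (sigma2 + t / 3))   # Bernstein, M = 1

# Monte Carlo estimate of the true tail
trials = 20000
hits = sum(
    sum(random.random() < p for _ in range(n)) / n - p >= t
    for _ in range(trials)
)
true_tail = hits / trials

print(f"Chebyshev {cheb:.3f}  Hoeffding {hoef:.3f}  Bernstein {bern:.3f}  true {true_tail:.4f}")
```

With p this far from 0.5, Bernstein lands below both Chebyshev and Hoeffding, exactly the splitting the plot shows.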

Formal Setup and Notation

Let X_1, X_2, \ldots, X_n be independent random variables. We write \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i for the sample mean and \mu = \mathbb{E}[\bar{X}_n] for its expectation (which equals \mathbb{E}[X_1] when the X_i are identically distributed).

We seek upper bounds on the tail probability:

\Pr[|\bar{X}_n - \mu| \geq t]

or equivalently the one-sided tail \Pr[\bar{X}_n - \mu \geq t].

Core Definitions

Definition

Tail Probability

The tail probability of a random variable X at level t is \Pr[X \geq t]. Concentration inequalities provide upper bounds on tail probabilities, showing that X is unlikely to deviate far from its mean. A bound is useful when it decays rapidly in t.

Definition

Sub-Gaussian Behavior

A random variable exhibits sub-Gaussian behavior if its tail probability decays like e^{-ct^2} for some constant c > 0; up to constants, this is equivalent to the MGF bound used in proofs. Bounded random variables are sub-Gaussian (this is the content of Hoeffding's lemma). The Gaussian decay rate is the "gold standard" for tail bounds: it means large deviations are exponentially unlikely.

Definition

Moment Generating Function

The moment generating function (MGF) of X is M_X(\lambda) = \mathbb{E}[e^{\lambda X}]. The MGF method — bounding the MGF and then optimizing over \lambda — is the standard technique for deriving exponential concentration inequalities. It is also called the Chernoff bounding method (Chernoff 1952); Wainwright (2019) uses this terminology. Cramér's theorem is a distinct, asymptotic result about the large-deviations rate function.

Main Theorems

Theorem

Markov's Inequality

Statement

If X0X \geq 0, then for any t>0t > 0:

Pr[Xt]E[X]t\Pr[X \geq t] \leq \frac{\mathbb{E}[X]}{t}

Exact statement

LaTeX source for copy/export

\Pr[X \geq t] \leq \frac{\mathbb{E}[X]}{t}

Intuition

If the average value of X is small, then X cannot be large too often. If \mathbb{E}[X] = 1, then X can be \geq 100 at most 1% of the time; otherwise the average would be too high.

Proof Sketch

Indicator form.

E[X]=E[X1[Xt]]+E[X1[X<t]]E[X1[Xt]]tPr[Xt].\mathbb{E}[X] = \mathbb{E}[X \cdot \mathbf{1}[X \geq t]] + \mathbb{E}[X \cdot \mathbf{1}[X < t]] \geq \mathbb{E}[X \cdot \mathbf{1}[X \geq t]] \geq t \cdot \Pr[X \geq t].

Divide both sides by tt.

Integral (layer-cake) form. For X0X \geq 0, the tail formula E[X]=0Pr[Xx]dx\mathbb{E}[X] = \int_0^\infty \Pr[X \geq x]\,dx gives

E[X]0aPr[Xx]dx0aPr[Xa]dx=aPr[Xa],\mathbb{E}[X] \geq \int_0^a \Pr[X \geq x]\,dx \geq \int_0^a \Pr[X \geq a]\,dx = a \cdot \Pr[X \geq a],

where the second inequality uses Pr[Xx]Pr[Xa]\Pr[X \geq x] \geq \Pr[X \geq a] for every x[0,a]x \in [0, a] (the tail is nonincreasing). Divide by aa. This is the form used in Shalev-Shwartz and Ben-David, Understanding Machine Learning, Appendix B.

Why It Matters

Markov's inequality is the foundation that all other concentration inequalities build upon. Chebyshev applies Markov to (Xμ)2(X - \mu)^2. Hoeffding and Bernstein apply Markov to eλ(Xμ)e^{\lambda(X - \mu)}. It is the weakest bound but requires the fewest assumptions.

Failure Mode

The bound \Pr[X \geq t] \leq \mu/t decays only as 1/t, polynomially. This is far too slow for most applications. If \mathbb{E}[X] = 1, Markov gives \Pr[X \geq 10] \leq 1/10, independent of any variance or boundedness structure. For well-behaved distributions the true probability is astronomically smaller. Markov uses only \mathbb{E}[X]; it cannot see \sigma. Chebyshev is the first inequality that brings variance information into the picture. You almost always want Hoeffding or Bernstein when those assumptions hold.
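The looseness is easy to quantify on a standard example (not specific to this page): an Exponential(1) variable has mean 1, so Markov and the exact tail can be compared directly.

```python
import math

# Markov vs. the exact tail of an Exponential(1) variable (mean 1):
#   Markov:  Pr[X >= t] <= E[X]/t = 1/t
#   Exact:   Pr[X >= t] = exp(-t)
for t in [2, 10, 100]:
    markov = 1 / t
    exact = math.exp(-t)
    print(f"t={t}: Markov bound {markov:.3g}, exact tail {exact:.3g}")
```

At t = 10 Markov says "at most 10%," while the true tail is e^{-10} ≈ 4.5e-5: the bound is valid but off by more than three orders of magnitude.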

Lemma

Lemma B.1: Lower-tail bound for [0,1]-valued variables

Statement

If X[0,1]X \in [0, 1] almost surely with E[X]=μ\mathbb{E}[X] = \mu, then for any a(0,1)a \in (0, 1):

Pr[X>1a]μ(1a)a.\Pr[X > 1 - a] \geq \frac{\mu - (1 - a)}{a}.

Exact statement

LaTeX source for copy/export

\Pr[X > 1 - a] \geq \frac{\mu - (1 - a)}{a}

Intuition

Markov bounds the upper tail of a nonnegative variable. The same idea applied to 1X1 - X bounds the lower tail of XX. The lemma is most useful when XX is something like an accuracy in [0,1][0, 1] and you want a guarantee that XX is rarely far below its mean.

Proof Sketch

1X[0,1]1 - X \in [0, 1] is nonnegative with E[1X]=1μ\mathbb{E}[1 - X] = 1 - \mu. Markov's inequality gives Pr[1Xa](1μ)/a\Pr[1 - X \geq a] \leq (1 - \mu)/a. Take complements:

Pr[X>1a]=Pr[1X<a]11μa=μ(1a)a.\Pr[X > 1 - a] = \Pr[1 - X < a] \geq 1 - \frac{1 - \mu}{a} = \frac{\mu - (1 - a)}{a}.

Why It Matters

This is the form of Markov used in the Shalev-Shwartz and Ben-David proof of agnostic PAC learnability of finite hypothesis classes. The argument needs a lower bound on the probability that the empirical risk is not too far below the true risk; Lemma B.1 supplies that bound from a one-sided expectation control.

Failure Mode

The bound is informative only when μ>1a\mu > 1 - a. If μ1a\mu \leq 1 - a, the right-hand side is non-positive and the lemma reduces to the trivial bound Pr[]0\Pr[\cdot] \geq 0. Like Markov itself, the bound uses only the mean and the boundedness range; it cannot see variance.

Theorem

Chebyshev's Inequality

Statement

For any random variable XX with mean μ\mu and finite variance σ2\sigma^2, for any t>0t > 0:

Pr[Xμt]σ2t2\Pr[|X - \mu| \geq t] \leq \frac{\sigma^2}{t^2}

Exact statement

LaTeX source for copy/export

\Pr[|X - \mu| \geq t] \leq \frac{\sigma^2}{t^2}

Intuition

Chebyshev uses variance information: if the spread of XX around its mean is small (σ2\sigma^2 is small), then large deviations are unlikely. Applied to the sample mean, the variance decreases as 1/n1/n, giving the familiar O(1/n)O(1/n) concentration.

Proof Sketch

Apply Markov's inequality to the nonnegative random variable (Xμ)2(X - \mu)^2. The square is monotone on nonnegative arguments, so Xμt|X - \mu| \geq t if and only if (Xμ)2t2(X - \mu)^2 \geq t^2:

Pr[Xμt]=Pr[(Xμ)2t2]E[(Xμ)2]t2=σ2t2.\Pr[|X - \mu| \geq t] = \Pr[(X - \mu)^2 \geq t^2] \leq \frac{\mathbb{E}[(X - \mu)^2]}{t^2} = \frac{\sigma^2}{t^2}.

Chebyshev is Markov applied to the squared deviation. This is the first place in the chain where the variance enters: Markov uses only E[X]\mathbb{E}[X], while Chebyshev squeezes one extra moment out of the same trick.

Why It Matters

Chebyshev gives the first quantitative form of the law of large numbers. For i.i.d. Z1,,ZmZ_1, \ldots, Z_m with mean μ\mu and Var(Zi)=σ2\operatorname{Var}(Z_i) = \sigma^2, the sample mean Zˉm=1miZi\bar Z_m = \frac{1}{m}\sum_i Z_i has variance

Var(Zˉm)=1m2i=1mVar(Zi)=σ2m\operatorname{Var}(\bar Z_m) = \frac{1}{m^2}\sum_{i=1}^m \operatorname{Var}(Z_i) = \frac{\sigma^2}{m}

(by independence the cross-terms vanish). Combining this with Chebyshev gives Pr[Zˉmμt]σ2/(mt2)\Pr[|\bar Z_m - \mu| \geq t] \leq \sigma^2/(m t^2). Equivalently, with probability 1δ\geq 1 - \delta, Zˉmμσ/mδ|\bar Z_m - \mu| \leq \sigma/\sqrt{m\delta}. Note the 1/δ1/\delta dependence, not log(1/δ)\log(1/\delta).

Failure Mode

The 1/t21/t^2 decay is still polynomial. Chebyshev cannot give you the log(1/δ)\log(1/\delta) dependence needed for efficient learning. For bounded random variables, Hoeffding gives exponentially better tails. Chebyshev is most useful when you know the variance but nothing about higher moments or boundedness.

Lemma

Lemma B.2: Chebyshev for the sample mean

Statement

Let Z1,,ZmZ_1, \ldots, Z_m be i.i.d. with E[Zi]=μ\mathbb{E}[Z_i] = \mu and Var(Zi)1\operatorname{Var}(Z_i) \leq 1. For any δ(0,1)\delta \in (0, 1), with probability at least 1δ1 - \delta:

1mi=1mZiμ1δm.\left|\frac{1}{m}\sum_{i=1}^m Z_i - \mu\right| \leq \sqrt{\frac{1}{\delta m}}.

Exact statement

LaTeX source for copy/export

\left|\frac{1}{m}\sum_{i=1}^m Z_i - \mu\right| \leq \sqrt{\frac{1}{\delta m}}

Intuition

Solve Chebyshev for the deviation. Setting the failure probability σ2/(mt2)=δ\sigma^2/(m t^2) = \delta and inverting for tt gives t=σ/mδt = \sigma/\sqrt{m \delta}. With the normalization σ21\sigma^2 \leq 1, the radius is 1/mδ1/\sqrt{m \delta}.

Proof Sketch

Apply Chebyshev to Zˉm\bar Z_m with variance σ2/m1/m\sigma^2/m \leq 1/m:

Pr ⁣[Zˉmμt]1mt2.\Pr\!\left[|\bar Z_m - \mu| \geq t\right] \leq \frac{1}{m t^2}.

Set the right-hand side equal to δ\delta and solve: t=1/δmt = 1/\sqrt{\delta m}. The complement event has probability 1δ\geq 1 - \delta.

Why It Matters

This is the explicit finite-mm form behind "the sample mean is close to μ\mu." It is the version of Chebyshev quoted as Lemma B.2 in Shalev-Shwartz and Ben-David, Understanding Machine Learning, Appendix B, where it sets up the polynomial sample-complexity baseline that exponential bounds later improve. The decay rate of the radius is 1/δm1/\sqrt{\delta m}: polynomial in mm, with a 1/δ1/\sqrt{\delta} dependence on the confidence parameter.

Failure Mode

The 1/δm1/\sqrt{\delta m} rate is too slow for PAC-style sample complexity. For confidence δ=0.05\delta = 0.05 and target accuracy ε=0.01\varepsilon = 0.01, the lemma needs m1/(δε2)=200,000m \geq 1/(\delta \varepsilon^2) = 200{,}000 samples; Hoeffding under the same boundedness assumption needs roughly log(2/δ)/(2ε2)18,500\log(2/\delta)/(2\varepsilon^2) \approx 18{,}500, an order of magnitude fewer. The 1/δ1/\delta dependence is the bottleneck.

From polynomial to exponential

Chebyshev gives a confidence interval of width 1/δm1/\sqrt{\delta m} — polynomial in mm, polynomial in 1/δ1/\delta. PAC learning needs better.

For accuracy ε=0.01\varepsilon = 0.01 at confidence δ=0.05\delta = 0.05, Lemma B.2 demands m1/(δε2)=200,000m \geq 1/(\delta \varepsilon^2) = 200{,}000 samples. The bottleneck is the 1/δ1/\delta dependence: halving the failure probability doubles the required mm.

Hoeffding's inequality, derived next, replaces this with exponential decay: for bounded random variables in [0,1][0, 1],

Pr ⁣[Zˉmμε]2exp(2mε2).\Pr\!\left[|\bar Z_m - \mu| \geq \varepsilon\right] \leq 2 \exp(-2 m \varepsilon^2).

Inverting: mlog(2/δ)/(2ε2)m \geq \log(2/\delta)/(2 \varepsilon^2). The same ε=0.01\varepsilon = 0.01, δ=0.05\delta = 0.05 now needs m18,500m \approx 18{,}500 — about 1010 times fewer samples than Chebyshev. More importantly, halving δ\delta adds only log2/(2ε2)3,500\log 2 / (2 \varepsilon^2) \approx 3{,}500 samples, a constant overhead instead of doubling.
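A sketch checking the arithmetic above, using the same ε = 0.01 and δ = 0.05 (the ceil rounding gives 18,445 rather than the rounded "≈ 18,500" in the text):

```python
import math

eps, delta = 0.01, 0.05

# Chebyshev (Var <= 1): failure prob <= 1/(m eps^2), so m >= 1/(delta eps^2)
m_cheb = math.ceil(1 / (delta * eps**2))

# Hoeffding (X in [0, 1]): failure prob <= 2 exp(-2 m eps^2), so m >= log(2/delta)/(2 eps^2)
m_hoef = math.ceil(math.log(2 / delta) / (2 * eps**2))

print(m_cheb, m_hoef)   # 200000 18445

# Halving delta: Chebyshev doubles m, Hoeffding adds only a constant
print(math.ceil(1 / (delta / 2 * eps**2)) - m_cheb)             # +200000
print(math.ceil(math.log(4 / delta) / (2 * eps**2)) - m_hoef)   # about +3466
```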

The mechanism is the same Markov trick, applied to a different function. Markov uses only E[X]\mathbb{E}[X]. Chebyshev applies Markov to (Xμ)2(X - \mu)^2 and picks up the variance. The Chernoff/MGF method (used by Hoeffding and Bernstein) applies Markov to eλ(Xμ)e^{\lambda (X - \mu)} — the moment generating function — and picks up all polynomial moments at once. The exponential function dominates every polynomial; that is why exponential tails dominate polynomial tails.

  • Markov — applied to X; no tail decay in m at fixed \varepsilon; 1/\delta confidence dependence.
  • Chebyshev — applied to (X - \mu)^2; 1/m decay; 1/\delta dependence.
  • Hoeffding — applied to e^{\lambda(X - \mu)}; \exp(-c m \varepsilon^2) decay; \log(1/\delta) dependence.

The polynomial-vs-exponential gap is what makes finite-sample learning theory work.

Lemma

Sub-Gaussian Finite-Sum Tail Bound

Statement

Let (Yi)iI(Y_i)_{i \in I} be a finite independent family of centered sub-Gaussian random variables with MGF parameters cic_i. Then for any ϵ0\epsilon \geq 0:

Pr ⁣[iIYiϵ]exp ⁣(ϵ22iIci).\Pr\!\left[\sum_{i \in I} Y_i \geq \epsilon\right] \leq \exp\!\left(-\frac{\epsilon^2}{2\sum_{i \in I} c_i}\right).
Exact statement

LaTeX source for copy/export

\Pr\!\left[\sum_{i \in I} Y_i \geq \epsilon\right] \leq \exp\!\left(-\frac{\epsilon^2}{2\sum_{i \in I} c_i}\right)

Intuition

Sub-Gaussian MGF control behaves like variance under independent sums: the parameters add. Markov's inequality applied to exp(λiYi)\exp(\lambda \sum_i Y_i) then converts that MGF bound into an exponential tail bound.

Proof Sketch

Independence factors the MGF of the sum into a product of individual MGFs. The sub-Gaussian assumption bounds each factor by exp(ciλ2/2)\exp(c_i \lambda^2 / 2), so the product is bounded by exp(λ2ici/2)\exp(\lambda^2 \sum_i c_i / 2). Apply Markov to the exponential transform and optimize λ\lambda.
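The three steps of the sketch in one chain, assuming each Y_i satisfies the sub-Gaussian MGF bound \mathbb{E}[e^{\lambda Y_i}] \leq e^{c_i \lambda^2 / 2}:

```latex
\Pr\Big[\textstyle\sum_{i} Y_i \ge \epsilon\Big]
  \le e^{-\lambda \epsilon}\,\mathbb{E}\Big[e^{\lambda \sum_i Y_i}\Big]   % Markov on the exponential transform
  = e^{-\lambda \epsilon} \prod_{i} \mathbb{E}\big[e^{\lambda Y_i}\big]   % independence factors the MGF
  \le \exp\Big(-\lambda \epsilon + \tfrac{\lambda^2}{2} \textstyle\sum_i c_i\Big)  % per-factor MGF bound
```

Minimizing the exponent over \lambda > 0 gives \lambda = \epsilon / \sum_i c_i and hence the stated bound \exp(-\epsilon^2 / (2\sum_i c_i)).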

Why It Matters

This is the exact formal bridge between concentration and finite-class learning bounds. Hoeffding for bounded losses first proves sub-Gaussian MGF control; the finite union bound then lifts the single-hypothesis tail bound to all hypotheses.

Failure Mode

The conclusion is a one-sided tail bound for a finite independent sum. It does not by itself prove the bounded-variable Hoeffding theorem or a uniform convergence theorem; those need additional assumptions and a union-bound step.

Theorem

Hoeffding One-Sided Finite-Sum Bound

Statement

Let (Xi)iI(X_i)_{i \in I} be a finite independent family of real random variables with aiXibia_i \leq X_i \leq b_i almost surely. Then for ϵ0\epsilon \geq 0:

Pr ⁣[iI(XiEXi)ϵ]exp ⁣(ϵ22iI((biai)/2)2).\Pr\!\left[\sum_{i \in I} (X_i-\mathbb{E}X_i) \geq \epsilon\right] \leq \exp\!\left(-\frac{\epsilon^2}{2\sum_{i \in I} ((b_i-a_i)/2)^2}\right).
Exact statement

LaTeX source for copy/export

\Pr\!\left[\sum_{i \in I} (X_i-\mathbb{E}X_i) \geq \epsilon\right] \leq \exp\!\left(-\frac{\epsilon^2}{2\sum_{i \in I} ((b_i-a_i)/2)^2}\right)

Intuition

Bounded variables are sub-Gaussian after centering. Once each centered variable has sub-Gaussian MGF parameter ((biai)/2)2((b_i-a_i)/2)^2, independence lets those parameters add across the finite sum.

Proof Sketch

Apply Hoeffding's lemma to each bounded variable to obtain sub-Gaussian MGF control for XiEXiX_i-\mathbb{E}X_i. Independence preserves that control under finite summation, and the sub-Gaussian finite-sum tail bound above converts the MGF statement into the displayed probability bound.

Why It Matters

This is the claim-facing bridge between bounded losses and finite-class uniform convergence. Setting ϵ=nt\epsilon = nt gives the one-sided average Hoeffding form; applying the same argument to Xi-X_i and union-bounding the two tails gives the common two-sided display.

Failure Mode

This statement covers the one-sided centered finite-sum bound. It does not by itself verify the two-sided display, the identical-bound special case, or a uniform convergence theorem; those need their own scoped claims or corollaries.

Theorem

Hoeffding's Inequality

Statement

Let X1,,XnX_1, \ldots, X_n be independent random variables with aiXibia_i \leq X_i \leq b_i almost surely. Then for any t>0t > 0:

Pr ⁣[Xˉnμt]exp ⁣(2n2t2i=1n(biai)2)\Pr\!\left[\bar{X}_n - \mu \geq t\right] \leq \exp\!\left(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)

For the two-sided version:

Pr ⁣[Xˉnμt]2exp ⁣(2n2t2i=1n(biai)2)\Pr\!\left[|\bar{X}_n - \mu| \geq t\right] \leq 2\exp\!\left(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)

Special case: If all Xi[a,b]X_i \in [a, b] (identically bounded):

Pr[Xˉnμt]2exp ⁣(2nt2(ba)2)\Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp\!\left(-\frac{2nt^2}{(b-a)^2}\right)

Exact statement

LaTeX source for copy/export

\Pr\!\left[|\bar{X}_n - \mu| \geq t\right] \leq 2\exp\!\left(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right)

Intuition

Bounded random variables have sub-Gaussian tails. The width of the bounding interval (biai)(b_i - a_i) controls the "effective variance" of each variable. The bound says: the probability of a deviation of size tt decays exponentially in nt2nt^2, with the rate determined by how spread out the bounds are.

Proof Sketch

The proof uses the Chernoff/MGF method:

  1. For any λ>0\lambda > 0: Pr[Xˉnμt]=Pr[eλ(Xˉnμ)eλt]eλtE[eλ(Xˉnμ)]\Pr[\bar{X}_n - \mu \geq t] = \Pr[e^{\lambda(\bar{X}_n - \mu)} \geq e^{\lambda t}] \leq e^{-\lambda t} \mathbb{E}[e^{\lambda(\bar{X}_n - \mu)}]

  2. By independence: E[eλ(Xˉnμ)]=i=1nE[eλ(Xiμi)/n]\mathbb{E}[e^{\lambda(\bar{X}_n - \mu)}] = \prod_{i=1}^n \mathbb{E}[e^{\lambda(X_i - \mu_i)/n}]

  3. Hoeffding's lemma: If aXba \leq X \leq b, then E[eλ(XE[X])]exp(λ2(ba)2/8)\mathbb{E}[e^{\lambda(X - \mathbb{E}[X])}] \leq \exp(\lambda^2(b-a)^2/8)

  4. Combine and optimize over \lambda (the minimizer is \lambda = 4n^2 t/\sum_i (b_i - a_i)^2, which for identical bounds [a, b] reduces to \lambda = 4nt/(b-a)^2).
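A quick numeric check of the optimization step, for the identical-bounds case a = 0, b = 1 with illustrative values n = 100, t = 0.1 (a sketch, not part of the original proof): the closed-form \lambda minimizes the Chernoff bound and recovers the \exp(-2nt^2) exponent.

```python
import math

# With identical bounds a=0, b=1, sum((b_i - a_i)^2) = n, and the one-sided
# Chernoff bound on the mean is exp(-lam*t + lam^2/(8n)) for every lam > 0.
n, t = 100, 0.1

def chernoff(lam):
    return math.exp(-lam * t + lam**2 / (8 * n))

lam_star = 4 * n * t                     # closed-form minimizer in this case
grid = [lam_star * s for s in (0.5, 0.9, 1.0, 1.1, 2.0)]
best = min(grid, key=chernoff)

print(best == lam_star)                                            # grid minimum sits at lam_star
print(abs(chernoff(lam_star) - math.exp(-2 * n * t**2)) < 1e-12)   # and it attains exp(-2 n t^2)
```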

Why It Matters

Hoeffding is the workhorse of learning theory. The ERM generalization bound for finite classes, the uniform convergence bound, and many other results use Hoeffding at their core. The exponential tail gives n=O((ba)2log(1/δ)/ϵ2)n = O((b-a)^2 \log(1/\delta)/\epsilon^2). The log(1/δ)\log(1/\delta) dependence is what makes high-confidence bounds feasible.

Failure Mode

Hoeffding uses only the range [ai,bi][a_i, b_i], not the actual variance. If the variance is much smaller than (biai)2/4(b_i - a_i)^2/4 (e.g., a variable that is usually 0 but can be as large as 1), Hoeffding is loose. Bernstein fixes this by incorporating variance information.

Theorem

Bernstein's Inequality

Statement

Let X1,,XnX_1, \ldots, X_n be independent with E[Xi]=μi\mathbb{E}[X_i] = \mu_i, Var(Xi)=σi2\text{Var}(X_i) = \sigma_i^2, and XiμiM|X_i - \mu_i| \leq M almost surely. Then for any t>0t > 0:

Pr ⁣[i=1n(Xiμi)t]exp ⁣(t2/2i=1nσi2+Mt/3)\Pr\!\left[\sum_{i=1}^n (X_i - \mu_i) \geq t\right] \leq \exp\!\left(-\frac{t^2/2}{\sum_{i=1}^n \sigma_i^2 + Mt/3}\right)

For the sample mean with i.i.d. variables (σ2=Var(Xi)\sigma^2 = \text{Var}(X_i)):

Pr[Xˉnμt]2exp ⁣(nt2/2σ2+Mt/3)\Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp\!\left(-\frac{nt^2/2}{\sigma^2 + Mt/3}\right)

Exact statement

LaTeX source for copy/export

\Pr\!\left[\sum_{i=1}^n (X_i - \mu_i) \geq t\right] \leq \exp\!\left(-\frac{t^2/2}{\sum_{i=1}^n \sigma_i^2 + Mt/3}\right)

Intuition

Bernstein interpolates between two regimes. For small deviations (tσ2/Mt \ll \sigma^2/M), the σ2\sigma^2 term dominates the denominator and the bound looks like exp(nt2/2σ2)\exp(-nt^2/2\sigma^2), a variance-dependent Gaussian tail. For large deviations (tσ2/Mt \gg \sigma^2/M), the Mt/3Mt/3 term dominates and the bound looks like exp(3nt/2M)\exp(-3nt/2M), a sub-exponential tail. The variance term makes Bernstein much tighter than Hoeffding when the variance is small relative to the range.
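The two regimes can be checked numerically. This sketch uses illustrative values (n = 1000, \sigma^2 = 10^{-4}, M = 1) and compares the full Bernstein exponent against its Gaussian and sub-exponential approximations:

```python
# Bernstein exponent (n t^2 / 2) / (sigma^2 + M t / 3) in its two regimes.
n, sigma2, M = 1000, 1e-4, 1.0

def bernstein_exp(t):
    return (n * t**2 / 2) / (sigma2 + M * t / 3)

t_small, t_large = 1e-5, 0.1     # t << sigma^2/M  vs.  t >> sigma^2/M

ratio_small = bernstein_exp(t_small) / (n * t_small**2 / (2 * sigma2))  # vs. Gaussian form
ratio_large = bernstein_exp(t_large) / (3 * n * t_large / (2 * M))      # vs. sub-exponential form

print(round(ratio_small, 3), round(ratio_large, 3))   # both close to 1
```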

Proof Sketch

Like Hoeffding, use the Chernoff method. The key improvement is a tighter bound on the MGF:

  1. If XμM|X - \mu| \leq M and Var(X)=σ2\text{Var}(X) = \sigma^2, then E[eλ(Xμ)]exp ⁣(λ2σ2/21λM/3)\mathbb{E}[e^{\lambda(X-\mu)}] \leq \exp\!\left(\frac{\lambda^2 \sigma^2/2}{1 - \lambda M/3}\right) for 0<λ<3/M0 < \lambda < 3/M.

  2. This bound uses the variance σ2\sigma^2 instead of the crude range bound (ba)2/4(b-a)^2/4 that Hoeffding uses.

  3. Multiply over independent variables and optimize λ\lambda.

Why It Matters

Bernstein is the right inequality when you have variance information. In learning theory, this matters for sparse or low-noise settings. For example, if the loss is usually close to zero (low empirical risk, low variance), Bernstein gives much tighter bounds than Hoeffding, because Hoeffding only knows the loss is in [0,1][0, 1] while Bernstein knows the variance is small.

Failure Mode

Bernstein requires knowing (or bounding) the variance, which is not always available. If you only know the range [a,b][a, b], Hoeffding is simpler and sufficient. In practice, you can sometimes use an empirical estimate of the variance and apply Bernstein, but this introduces additional technical complications (you need a concentration bound on the variance estimate itself).

Proof Ideas and Templates Used

All four inequalities follow a common escalation pattern:

  1. Markov's trick: the universal first step: convert a tail probability into an expectation bound via Pr[Xt]E[g(X)]/g(t)\Pr[X \geq t] \leq \mathbb{E}[g(X)]/g(t) for any increasing gg.

  2. Moment method: apply Markov to (Xμ)2k(X - \mu)^{2k} to get Pr[Xμt]E[(Xμ)2k]/t2k\Pr[|X - \mu| \geq t] \leq \mathbb{E}[(X-\mu)^{2k}]/t^{2k}. Chebyshev is the k=1k = 1 case.

  3. MGF/Chernoff method: apply Markov to eλXe^{\lambda X} for optimal λ\lambda. This is exponentially stronger because eλXe^{\lambda X} grows much faster than any polynomial. Both Hoeffding and Bernstein use this.

The key lemma in each exponential bound is controlling E[eλ(Xμ)]\mathbb{E}[e^{\lambda(X - \mu)}]:

  • Hoeffding's lemma bounds it using only boundedness: eλ2(ba)2/8\leq e^{\lambda^2(b-a)^2/8}
  • Bernstein's condition bounds it using variance: eλ2σ2/2(1λM/3)1\leq e^{\lambda^2\sigma^2/2 \cdot (1 - \lambda M/3)^{-1}}

Canonical Examples

Example

Estimating a coin's bias with Hoeffding

Flip a coin nn times. Each flip Xi{0,1}X_i \in \{0, 1\} with E[Xi]=p\mathbb{E}[X_i] = p. The sample mean Xˉn\bar{X}_n estimates pp. Hoeffding with a=0,b=1a = 0, b = 1 gives:

Pr[Xˉnpϵ]2e2nϵ2\Pr[|\bar{X}_n - p| \geq \epsilon] \leq 2e^{-2n\epsilon^2}

For ϵ=0.01\epsilon = 0.01 and δ=0.05\delta = 0.05: nlog(2/0.05)2(0.01)2=3.690.000218,445n \geq \frac{\log(2/0.05)}{2(0.01)^2} = \frac{3.69}{0.0002} \approx 18{,}445.

If we know p0.01p \approx 0.01 (rare event), Bernstein with σ2=p(1p)0.01\sigma^2 = p(1-p) \approx 0.01 and M1M \leq 1 gives roughly n985n \approx 985 — about 19 times fewer samples. Bernstein exploits the small variance.
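A sketch of the sample-size arithmetic in this example. Whether the Bernstein answer comes out near 977 or the text's ≈985 depends on whether \sigma^2 = p(1-p) is rounded to 0.01; the factor-of-~19 gap is the same either way.

```python
import math

eps, delta = 0.01, 0.05

# Hoeffding for X in [0, 1]: solve 2 exp(-2 n eps^2) <= delta for n
n_hoef = math.ceil(math.log(2 / delta) / (2 * eps**2))

# Bernstein with sigma^2 = p(1 - p) and |X - p| <= M = 1:
# solve 2 exp(-(n eps^2 / 2) / (sigma^2 + M eps / 3)) <= delta for n
p, M = 0.01, 1.0
sigma2 = p * (1 - p)
n_bern = math.ceil(2 * math.log(2 / delta) * (sigma2 + M * eps / 3) / eps**2)

print(n_hoef, n_bern, round(n_hoef / n_bern, 1))
```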

Example

Comparing Markov, Chebyshev, and Hoeffding

Let X1,,X100X_1, \ldots, X_{100} be i.i.d. uniform on [0,1][0, 1]. We want Pr[Xˉ1000.50.1]\Pr[|\bar{X}_{100} - 0.5| \geq 0.1].

Markov + Jensen (applied to Xˉ0.5|\bar{X} - 0.5|, crude): first note that

Var(Xˉ)=112008.33×104.\operatorname{Var}(\bar{X}) = \frac{1}{1200} \approx 8.33 \times 10^{-4}.

So the standard deviation of Xˉ\bar{X} is about 0.02890.0289.

Then

Pr[Xˉ1000.50.1]E[Xˉ0.5]0.1Var(Xˉ)0.10.02890.10.289.\Pr[|\bar{X}_{100} - 0.5| \geq 0.1] \leq \frac{\mathbb{E}[|\bar{X} - 0.5|]}{0.1} \leq \frac{\sqrt{\operatorname{Var}(\bar{X})}}{0.1} \approx \frac{0.0289}{0.1} \approx 0.289.

The first step is Markov on the non-negative random variable Xˉ0.5|\bar{X} - 0.5|; the second step is Jensen's inequality applied to zzz \mapsto \sqrt{z}, giving E[Xˉ0.5]E[(Xˉ0.5)2]=Var(Xˉ)\mathbb{E}[|\bar{X}-0.5|] \leq \sqrt{\mathbb{E}[(\bar{X}-0.5)^2]} = \sqrt{\operatorname{Var}(\bar{X})}. Pure Markov on Xˉ0.5|\bar{X} - 0.5| alone, without Jensen, would only give a bound in terms of EXˉ0.5\mathbb{E}|\bar{X} - 0.5| and not in terms of the standard deviation.

Chebyshev: Var(Xˉ)/0.01=(1/121/100)/0.01=0.083\leq \text{Var}(\bar{X})/0.01 = (1/12 \cdot 1/100)/0.01 = 0.083.

Hoeffding: 2e21000.01=2e20.27\leq 2e^{-2 \cdot 100 \cdot 0.01} = 2e^{-2} \approx 0.27. Hoeffding gives 0.27, which is worse than Chebyshev's 0.083.

Why: the Hoeffding exponent here is 2nt2=22nt^2 = 2, so e20.135e^{-2} \approx 0.135 and the two-sided bound is 2e20.272e^{-2} \approx 0.27. Chebyshev's σ2/(nt2)=(1/12)/10.083\sigma^2/(n t^2) = (1/12)/1 \approx 0.083 is tighter because the exponent is simply not large enough for e2nt2e^{-2nt^2} to beat 1/t21/t^2 at this (n,t)(n, t). Hoeffding's exponential form dominates once 2nt22nt^2 is sufficiently larger than 2log(1/t2)2\log(1/t^2). For n=400n = 400 at the same t=0.1t = 0.1, Hoeffding gives 2e86.7×1042e^{-8} \approx 6.7 \times 10^{-4}, beating Chebyshev's 1/480.0211/48 \approx 0.021.

No single inequality is universally best at all sample sizes.
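The numbers in this example, as a runnable sketch:

```python
import math

def chebyshev(n, t, sigma2=1/12):   # Var of Uniform[0,1] is 1/12
    return sigma2 / (n * t**2)

def hoeffding(n, t):                # two-sided, X_i in [0, 1]
    return 2 * math.exp(-2 * n * t**2)

t = 0.1
for n in (100, 400):
    print(n, round(chebyshev(n, t), 4), round(hoeffding(n, t), 5))
# Chebyshev wins at n = 100; Hoeffding wins at n = 400.
```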

Example

Hoeffding in learning theory: the finite-class ERM bound

The ERM generalization bound for H<|\mathcal{H}| < \infty:

For each fixed hh, the loss (h(xi),yi)\ell(h(x_i), y_i) is a bounded random variable in [0,1][0, 1]. The empirical risk R^n(h)\hat{R}_n(h) is the sample mean. Hoeffding gives:

Pr[R(h)R^n(h)>ϵ]2e2nϵ2\Pr[|R(h) - \hat{R}_n(h)| > \epsilon] \leq 2e^{-2n\epsilon^2}

Union bound over all hHh \in \mathcal{H}:

Pr[suphR(h)R^n(h)>ϵ]2He2nϵ2\Pr[\sup_h |R(h) - \hat{R}_n(h)| > \epsilon] \leq 2|\mathcal{H}|e^{-2n\epsilon^2}

This is how Hoeffding feeds directly into generalization bounds.
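Inverting the union bound for n gives the finite-class sample complexity n \geq \log(2|\mathcal{H}|/\delta)/(2\epsilon^2). A sketch (the values of \epsilon, \delta, |\mathcal{H}| are illustrative) showing why finite classes are cheap: |\mathcal{H}| enters only through a logarithm.

```python
import math

# Invert 2|H| exp(-2 n eps^2) <= delta for n
def erm_sample_size(H, eps, delta):
    return math.ceil(math.log(2 * H / delta) / (2 * eps**2))

n1 = erm_sample_size(1, 0.01, 0.05)        # a single hypothesis
n2 = erm_sample_size(10**6, 0.01, 0.05)    # a million hypotheses
print(n1, n2)   # |H| grows by a factor of 10^6, n by less than 5x
```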

Common Confusions

Watch Out

Hoeffding requires bounded random variables, not just bounded variance

Hoeffding's inequality assumes aiXibia_i \leq X_i \leq b_i almost surely. The random variables must be literally bounded. It does not apply to Gaussian random variables (which are unbounded), even though they have finite variance. For Gaussians, you use the exact sub-Gaussian tail Pr[Xμt]=Φ(t/σ)\Pr[X - \mu \geq t] = \Phi(-t/\sigma), or you can use the sub-Gaussian framework that generalizes Hoeffding. If someone applies Hoeffding to an unbounded variable, the result is invalid.

Watch Out

Bernstein is better than Hoeffding when variance is small, not always

Bernstein's bound is exp(nt2/(2σ2+2Mt/3))\exp(-nt^2/(2\sigma^2 + 2Mt/3)), while Hoeffding gives exp(2nt2/(ba)2)\exp(-2nt^2/(b-a)^2). If σ2(ba)2/4\sigma^2 \approx (b-a)^2/4 (i.e., the variance is as large as it can be for the given range), Bernstein offers little or no improvement. Bernstein shines when σ2(ba)2\sigma^2 \ll (b-a)^2. For example, when the random variable is usually near zero but has occasional large values. In the worst case over all distributions with given range, Hoeffding and Bernstein are comparable.

Watch Out

The two-sided Hoeffding bound is 2x the one-sided bound, not squared

The correct two-sided bound is \Pr[|\bar{X} - \mu| \geq t] \leq 2e^{-2nt^2/(b-a)^2}. The factor of 2 comes from the union bound over the two tails: \Pr[|\bar{X} - \mu| \geq t] \leq \Pr[\bar{X} - \mu \geq t] + \Pr[\mu - \bar{X} \geq t]. A common error is to drop the factor of 2, or to square the one-sided probability as if the two tail events were independent.
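A Monte Carlo sanity check of the two-sided union bound for uniform variables (n and t here are illustrative, chosen so the empirical tail is visible but small):

```python
import math, random

random.seed(1)
n, t, trials = 50, 0.15, 20000
upper = lower = 0
for _ in range(trials):
    m = sum(random.random() for _ in range(n)) / n   # mean of n Uniform[0,1] draws, mu = 0.5
    upper += (m - 0.5 >= t)
    lower += (0.5 - m >= t)

two_sided = (upper + lower) / trials
bound = 2 * math.exp(-2 * n * t**2)   # union bound over the two one-sided tails
print(two_sided, "<=", round(bound, 3))
```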

Summary

  • Markov: Pr[Xt]E[X]/t\Pr[X \geq t] \leq \mathbb{E}[X]/t. Weakest, needs only X0X \geq 0
  • Chebyshev: Pr[Xμt]σ2/t2\Pr[|X - \mu| \geq t] \leq \sigma^2/t^2. Uses variance, 1/t21/t^2 decay
  • Hoeffding: Pr[Xˉnμt]2exp(2nt2/(ba)2)\Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp(-2nt^2/(b-a)^2). Exponential decay, needs boundedness
  • Bernstein: like Hoeffding but uses variance σ2\sigma^2. Tighter when σ2(ba)2\sigma^2 \ll (b-a)^2
  • Exponential tails give log(1/δ)\log(1/\delta) dependence in sample complexity; polynomial tails give 1/δ1/\delta
  • The Chernoff/MGF method is the universal technique for deriving exponential bounds
  • In learning theory, Hoeffding is the default; use Bernstein when you have variance information

Exercises

ExerciseCore

Problem

Using Hoeffding's inequality, how many fair coin flips do you need to estimate the probability of heads to within ±0.02\pm 0.02 with probability at least 0.99?

ExerciseCore

Problem

Suppose X1,,XnX_1, \ldots, X_n are i.i.d. with Xi[0,1]X_i \in [0, 1], E[Xi]=0.01\mathbb{E}[X_i] = 0.01, and Var(Xi)=0.01\text{Var}(X_i) = 0.01. Compare the sample sizes needed by Hoeffding and Bernstein to guarantee Pr[Xˉn0.010.01]0.05\Pr[|\bar{X}_n - 0.01| \geq 0.01] \leq 0.05.

ExerciseAdvanced

Problem

Prove Hoeffding's lemma: if aXba \leq X \leq b and E[X]=0\mathbb{E}[X] = 0, then for all λ>0\lambda > 0:

E[eλX]exp ⁣(λ2(ba)28)\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\lambda^2(b-a)^2}{8}\right)


References

Reference note: this page is a pedagogical guide to the standard scalar concentration inequalities, not an exhaustive survey of martingale, matrix-valued, or geometric concentration.

Canonical:

  • Hoeffding, W. (1963). "Probability Inequalities for Sums of Bounded Random Variables." Journal of the American Statistical Association.
  • Boucheron, Lugosi, Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence (2013), Chapters 2-3
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning, Appendix B

Current:

  • Vershynin, High-Dimensional Probability (2018), Chapters 2-3
  • Wainwright, High-Dimensional Statistics (2019), Chapter 2
  • van Handel, Probability in High Dimension (2016), Chapters 1-3


Last reviewed: April 28, 2026

Claim evidence

Selected claims on this topic have machine-checked support.


4 claim matches · 0 dependency proofs · 0 incomplete markers

This is claim-level evidence, not a whole-page badge. The checked theorem must match the recorded claim scope. Supporting lemmas stay labeled as dependency proofs, not full claim matches.

Chebyshev inequality

Variance control gives a tail bound for deviations away from the mean.

Matches claim

Connects variance, moments, and concentration before stronger exponential bounds.

Formal record
Checked theorem
TheoremPath.Probability.Concentration.chebyshevInequalityVariance
Claim scope
real variance chebyshev inequality
Proof scope
exact mathlib wrapper for chebyshev
Mathlib theorem
ProbabilityTheory.meas_ge_le_variance_div_sq

Hoeffding one-sided finite-sum bound

A bounded independent finite sum has a one-sided exponential deviation bound.

Matches claim

A core theorem behind bounded-loss learning guarantees.

Formal record
Checked theorem
TheoremPath.Probability.Concentration.hoeffdingBoundedFiniteSumTail
Claim scope
finite centered sum bounded hoeffding one sided
Proof scope
exact mathlib wrapper for bounded hoeffding finite sum tail
Mathlib theorem
ProbabilityTheory.hasSubgaussianMGF_of_mem_Icc, ProbabilityTheory.iIndepFun.comp, ProbabilityTheory.HasSubgaussianMGF.measure_sum_ge_le_of_iIndepFun

Markov inequality

A nonnegative random variable with small expectation cannot be large very often.

Matches claim

First bridge from expectation control to a probability tail bound.

Formal record
Checked theorem
TheoremPath.Probability.Concentration.markovInequalityRealIntegrable
Claim scope
real integrable nonnegative markov inequality
Proof scope
exact mathlib wrapper for markov
Mathlib theorem
MeasureTheory.mul_meas_ge_le_integral_of_nonneg

Sub-Gaussian finite-sum tail bound

Independent sub-Gaussian variables have a finite-sum exponential tail bound.

Matches claim

Supports the concentration path toward finite-class generalization.

Formal record
Checked theorem
TheoremPath.Probability.Concentration.subGaussianFiniteSumTailBound
Claim scope
finite sum subgaussian tail bound
Proof scope
exact mathlib wrapper for subgaussian finite sum tail bound
Mathlib theorem
ProbabilityTheory.HasSubgaussianMGF.measure_sum_ge_le_of_iIndepFun

See the public evidence page for the display rules and representative Lean mapping examples.

Canonical graph

Required before and derived from this topic

These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.

Required prerequisites: 12 · Derived topics: 24 (+19 more on the derived-topics page).