
Concentration Probability

Bernstein Inequality

A variance-sensitive concentration inequality for independent bounded random variables. Bernstein sharpens Hoeffding when the variance is much smaller than the worst-case range.

Core · Tier 1 · Stable · Supporting · ~25 min

Why This Matters

Hoeffding's inequality uses only the range of each summand. That is safe, but it can be loose when the summands are usually small. Bernstein's inequality adds one more piece of information: the variance. The reward is a tighter tail bound in low-variance or sparse-loss regimes.

This is the concentration inequality behind many variance-sensitive learning bounds. If a classifier makes rare errors, or a loss is usually near zero, a range-only bound treats the problem as if worst-case losses happen often. Bernstein separates two facts:

  1. the variables are bounded, so very large jumps cannot happen;
  2. the variance is small, so moderate deviations are rarer than Hoeffding predicts.

Mental Model

Bernstein has two regimes.

| Deviation size | Dominant term | Tail shape | Interpretation |
| --- | --- | --- | --- |
| Small $t$ | variance $v$ | $\exp(-t^2/(2v))$ | Gaussian-like concentration with the true variance |
| Large $t$ | range $Mt$ | $\exp(-ct/M)$ | sub-exponential tail caused by bounded but nonzero jump size |

Hoeffding behaves as if the variance were always near the worst-case range variance. Bernstein uses the actual variance proxy.
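
One standard way to make the table precise is to bound the denominator of the Bernstein exponent (stated formally below) by twice its larger term:

$$\exp\!\left(-\frac{t^2/2}{v + Mt/3}\right) \leq \begin{cases} \exp\!\left(-\dfrac{t^2}{4v}\right) & \text{if } Mt/3 \leq v,\\[2mm] \exp\!\left(-\dfrac{3t}{4M}\right) & \text{if } Mt/3 \geq v. \end{cases}$$

The crossover between the Gaussian-like and sub-exponential regimes sits at $t \approx 3v/M$.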

Formal Setup

Let $X_1,\ldots,X_n$ be independent real-valued random variables. Write $\mu_i = \mathbb{E}[X_i]$, and assume each centered variable is bounded by $M$:

$$|X_i - \mu_i| \leq M \quad \text{almost surely},$$

and define the variance proxy

$$v = \sum_{i=1}^n \mathrm{Var}(X_i).$$

The quantity of interest is the centered sum

$$S_n = \sum_{i=1}^n (X_i - \mu_i).$$

Theorem

Scalar Bernstein Inequality

Statement

If $X_1,\ldots,X_n$ are independent, $\mathbb{E}[X_i]=\mu_i$, $|X_i-\mu_i|\leq M$ almost surely, and $v=\sum_{i=1}^n \mathrm{Var}(X_i)$, then for every $t>0$:

$$\Pr\!\left[\sum_{i=1}^n (X_i-\mu_i) \geq t\right] \leq \exp\!\left(-\frac{t^2/2}{v + Mt/3}\right).$$

A two-sided version follows by applying the same bound to $-S_n$ and using a union bound:

$$\Pr[|S_n|\geq t] \leq 2\exp\!\left(-\frac{t^2/2}{v + Mt/3}\right).$$

Intuition

The denominator has two terms. The variance term vv controls ordinary fluctuations near the mean. The bounded-jump term Mt/3Mt/3 prevents the bound from pretending the tail is Gaussian forever. Far enough from the mean, a single large-but-bounded jump becomes the limiting obstruction.
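
As a quick numerical sanity check (not from the source; the Bernoulli summands and parameter choices below are illustrative assumptions), the bound can be evaluated directly and compared against simulation and against a range-only Hoeffding bound:

```python
import numpy as np

def bernstein_tail(t, v, M):
    """One-sided Bernstein bound: Pr[S_n >= t] <= exp(-(t^2/2) / (v + M*t/3))."""
    return np.exp(-(t**2 / 2) / (v + M * t / 3))

def hoeffding_tail(t, n):
    """One-sided Hoeffding bound for a sum of n variables in [0, 1]: exp(-2 t^2 / n)."""
    return np.exp(-2 * t**2 / n)

# Illustrative setup: n sparse Bernoulli(p) summands, centered around their mean.
rng = np.random.default_rng(0)
n, p, t = 10_000, 0.01, 30.0
v = n * p * (1 - p)   # variance proxy: sum of the individual variances
M = 1.0               # |X_i - mu_i| <= 1 for X_i in [0, 1]

sums = rng.binomial(n, p, size=200_000) - n * p   # centered sums S_n
print(f"empirical Pr[S_n >= {t}]: {(sums >= t).mean():.4f}")
print(f"Bernstein bound:          {bernstein_tail(t, v, M):.4f}")
print(f"Hoeffding bound:          {hoeffding_tail(t, n):.4f}")
```

In this low-variance setting the Bernstein bound is orders of magnitude below the range-only Hoeffding bound, while both sit above the simulated frequency.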

Proof Sketch

Use the Chernoff method. For $\lambda>0$:

$$\Pr[S_n\geq t] \leq e^{-\lambda t}\,\mathbb{E}[e^{\lambda S_n}].$$

Independence factors the moment generating function into a product. A Bernstein MGF lemma bounds each centered bounded summand:

$$\mathbb{E}[e^{\lambda(X_i-\mu_i)}] \leq \exp\!\left(\frac{\lambda^2 \mathrm{Var}(X_i)}{2(1-\lambda M/3)}\right)$$

for $0<\lambda<3/M$. Multiplying over $i$ gives:

$$\mathbb{E}[e^{\lambda S_n}] \leq \exp\!\left(\frac{\lambda^2 v}{2(1-\lambda M/3)}\right).$$

Optimizing the resulting exponent over λ\lambda gives the displayed Bernstein bound.
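
For completeness, one convenient choice is $\lambda = t/(v + Mt/3)$, which lies in $(0, 3/M)$ whenever $v > 0$ and makes $1 - \lambda M/3 = v/(v + Mt/3)$, so the exponent becomes

$$-\lambda t + \frac{\lambda^2 v}{2(1-\lambda M/3)} = -\frac{t^2}{v + Mt/3} + \frac{t^2/2}{v + Mt/3} = -\frac{t^2/2}{v + Mt/3}.$$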

Failure Mode

The bounded-deviation assumption is doing real work. If the variables are unbounded but merely have finite variance, Bernstein does not apply. Use Chebyshev, a sub-exponential condition, or a truncation argument instead.

Corollary

Bernstein Bound for a Sample Mean

Statement

If $X_1,\ldots,X_n$ are i.i.d., $\mathbb{E}[X_i]=\mu$, $|X_i-\mu|\leq M$ almost surely, and $\mathrm{Var}(X_i)=\sigma^2$, then:

$$\Pr[|\bar{X}_n-\mu|\geq \epsilon] \leq 2\exp\!\left(-\frac{n\epsilon^2/2}{\sigma^2 + M\epsilon/3}\right).$$
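
This follows from the two-sided form of the theorem with $t = n\epsilon$ and $v = n\sigma^2$, since $|\bar{X}_n - \mu| \geq \epsilon$ exactly when $|S_n| \geq n\epsilon$:

$$\Pr[|S_n| \geq n\epsilon] \leq 2\exp\!\left(-\frac{(n\epsilon)^2/2}{n\sigma^2 + Mn\epsilon/3}\right) = 2\exp\!\left(-\frac{n\epsilon^2/2}{\sigma^2 + M\epsilon/3}\right).$$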

Why It Matters

This is the form used in statistics and learning theory. It says the sample mean concentrates at a rate controlled by the true variance for small $\epsilon$, while still respecting the bounded range for larger deviations.

Bernstein vs. Hoeffding

For $X_i\in[0,1]$, Hoeffding gives:

$$\Pr[|\bar{X}_n-\mu|\geq \epsilon] \leq 2e^{-2n\epsilon^2}.$$

Bernstein gives:

$$\Pr[|\bar{X}_n-\mu|\geq \epsilon] \leq 2\exp\!\left(-\frac{n\epsilon^2/2}{\sigma^2+\epsilon/3}\right).$$

If $\sigma^2$ is near $1/4$, the two bounds are comparable up to constants. If $\sigma^2\ll 1$, Bernstein can be much tighter.

Example

Suppose $X_i\in[0,1]$, $\mathbb{E}[X_i]=0.01$, and $\mathrm{Var}(X_i)\approx 0.01$. To bound $\Pr[|\bar{X}_n-0.01|\geq 0.01]$, Hoeffding uses only the range and behaves like $\exp(-0.0002n)$. Bernstein uses the variance and behaves like $\exp(-0.00375n)$ under the common two-sided sample-mean form. The exponent is about 19 times larger, so the required sample size is far smaller.
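
A minimal check of these per-sample rates (a sketch using the two-sided bounds above; the variable names are illustrative):

```python
eps, sigma2, M = 0.01, 0.01, 1.0   # deviation, variance, bound on |X_i - mu|

hoeffding_rate = 2 * eps**2                               # from 2*exp(-2*n*eps^2)
bernstein_rate = (eps**2 / 2) / (sigma2 + M * eps / 3)    # from 2*exp(-(n*eps^2/2) / (sigma^2 + M*eps/3))

print(f"Hoeffding rate per sample: {hoeffding_rate:.5f}")   # 0.00020
print(f"Bernstein rate per sample: {bernstein_rate:.5f}")   # 0.00375
print(f"ratio: {bernstein_rate / hoeffding_rate:.1f}x")     # ~18.8x, i.e. about 19
```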

Numeric Comparison: Sample Sizes Across Variance Regimes

How much does Bernstein actually buy you over Hoeffding? Suppose $X_i \in [0, 1]$ i.i.d., target deviation $\epsilon = 0.05$, confidence $1 - \delta = 0.95$. The minimum sample size to certify $\Pr[|\bar{X}_n - \mu| \ge \epsilon] \le \delta$ via each bound:

| $\sigma^2$ | Hoeffding $n$ ($\epsilon^{-2}\log(2/\delta)/2$) | Bernstein $n$ ($2(\sigma^2 + \epsilon/3)\,\epsilon^{-2}\log(2/\delta)$) | Ratio |
| --- | --- | --- | --- |
| $0.25$ (Bernoulli at $\mu = 0.5$) | 738 | 787 | $0.94\times$ |
| $0.10$ | 738 | 345 | $2.14\times$ |
| $0.01$ (sparse loss, $\mu \approx 0.01$) | 738 | 79 | $9.3\times$ |
| $0.001$ (very sparse, $\mu \approx 0.001$) | 738 | 53 | $13.9\times$ |

Three observations:

  1. Hoeffding ignores variance. The $738$ figure is constant across rows because Hoeffding only uses the range $[0, 1]$.
  2. Bernstein scales with $\sigma^2$. For sparse losses (small mean, small variance), Bernstein needs a fraction of the samples Hoeffding does.
  3. The break-even is around $\sigma^2 = 0.25$. For Bernoulli$(0.5)$, the variance equals the worst-case range variance, so Bernstein's edge collapses; the $M\epsilon/3$ term even makes it slightly looser. This is the regime where Hoeffding is essentially optimal.

The takeaway: Bernstein is the right default when losses are bounded and usually small. In learning theory this is standard for cross-entropy under low-noise data, in PAC-Bayes for stochastic predictors with low empirical risk, and in variance-reduced Monte Carlo estimators.
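
One way to compute such sample-size requirements is to invert the two bounds directly; a minimal sketch (assuming the two-sided sample-mean forms above with $M = 1$; tighter or alternative Bernstein variants would shift the exact figures):

```python
import math

def n_hoeffding(eps, delta):
    """Smallest n with 2*exp(-2*n*eps^2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

def n_bernstein(eps, delta, sigma2, M=1.0):
    """Smallest n with 2*exp(-(n*eps^2/2) / (sigma2 + M*eps/3)) <= delta."""
    return math.ceil(2 * (sigma2 + M * eps / 3) * math.log(2 / delta) / eps**2)

eps, delta = 0.05, 0.05
for sigma2 in [0.25, 0.10, 0.01, 0.001]:
    nh, nb = n_hoeffding(eps, delta), n_bernstein(eps, delta, sigma2)
    print(f"sigma^2 = {sigma2:<6}  Hoeffding n = {nh}  Bernstein n = {nb}  ratio = {nh / nb:.2f}x")
```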

The Empirical Bernstein Inequality

The classical Bernstein bound requires knowing the true variance $\sigma^2$. In practice you only have the empirical variance $\hat{\sigma}^2_n = \frac{1}{n-1} \sum_i (X_i - \bar{X}_n)^2$. The empirical Bernstein inequality (Maurer-Pontil 2009) lets you plug in $\hat{\sigma}^2_n$ at the cost of a small extra term.

Theorem

Empirical Bernstein Inequality (Maurer-Pontil 2009)

Statement

For i.i.d. $X_i \in [0, M]$, with probability at least $1 - \delta$:

$$|\bar{X}_n - \mu| \le \sqrt{\frac{2 \hat{\sigma}^2_n \log(2/\delta)}{n}} + \frac{7 M \log(2/\delta)}{3(n - 1)}.$$

The first term is Bernstein's variance-sensitive contribution with $\sigma^2$ replaced by the empirical variance. The second term is the unavoidable price of estimating the variance from data.

Intuition

Two random objects ($\bar{X}_n$ and $\hat{\sigma}^2_n$) need to concentrate simultaneously. The tail bound for $\hat{\sigma}^2_n$ contributes the $1/(n-1)$ correction. For large $n$, the $\sqrt{\hat{\sigma}^2_n / n}$ term dominates and the bound matches the population-variance Bernstein bound up to constants.

Why It Matters

This is what people use in practice: confidence intervals from a single sample without needing prior knowledge of σ2\sigma^2. It powers data-dependent generalization bounds (variance-sensitive PAC-Bayes), Monte Carlo confidence intervals, and adaptive bandit algorithms (UCB-V).
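
A minimal sketch of the resulting confidence interval (the function name and the Bernoulli sample used to exercise it are illustrative, not from the source):

```python
import math
import numpy as np

def empirical_bernstein_radius(x, delta, M):
    """Maurer-Pontil half-width: holds with prob. >= 1 - delta for i.i.d. x in [0, M]."""
    n = len(x)
    var_hat = np.var(x, ddof=1)        # empirical variance with 1/(n-1) normalization
    log_term = math.log(2 / delta)
    return math.sqrt(2 * var_hat * log_term / n) + 7 * M * log_term / (3 * (n - 1))

# Illustrative use: a sparse loss in [0, 1].
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.01, size=5_000).astype(float)
radius = empirical_bernstein_radius(x, delta=0.05, M=1.0)
print(f"mean = {x.mean():.4f} +/- {radius:.4f}  (95% empirical-Bernstein interval)")
```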

Failure Mode

The constant $7$ (from Maurer-Pontil) is not sharp. Refinements (Audibert-Munos-Szepesvári 2009) tighten it. Also: the bound is data-dependent only in the variance term; the range $M$ still appears explicitly. For unbounded data with finite variance, see self-normalized inequalities (de la Peña, Lai, Shao 2009).

Common Confusions

Watch Out

Bernstein is not always smaller than Hoeffding

Bernstein is variance-sensitive, not magic. Constants matter. At small sample sizes or when the variance is not much smaller than the worst-case range variance, Hoeffding can be just as useful.

Watch Out

The variance must be controlled

The bound needs a variance proxy. If you estimate variance from the same sample, that estimate needs its own concentration argument. Plugging in an empirical variance without accounting for its error is a different theorem.

Watch Out

Bounded range and sub-exponential are related but not identical

The scalar Bernstein inequality above assumes bounded centered deviations. The sub-exponential Bernstein bound replaces boundedness with an MGF condition that holds near zero. The two statements have similar two-regime shapes, but their assumptions are not the same.

Exercises

Exercise · Core

Problem

Let $X_i\in[0,1]$ be i.i.d. with $\mathrm{Var}(X_i)=0.01$. Use the sample-mean Bernstein bound to upper-bound $\Pr[|\bar{X}_n-\mu|\geq 0.05]$.

Exercise · Advanced

Problem

Explain why Bernstein is often the right bound for finite-class generalization when the loss is bounded and usually near zero.


References

Canonical:

  • Bernstein, S. N. (1924). On a modification of Chebyshev's inequality and on the error in Laplace's formula. Collected Works (in Russian), IV (1964), 71-79. The original derivation of variance-sensitive concentration inequalities.
  • Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press. Chapters 2-3 give the modern Chernoff-method derivation and the comparison with Bennett's inequality.
  • Massart, P. (2007). Concentration Inequalities and Model Selection. Springer (Saint-Flour Lecture Notes 33). Chapter 2 covers Bernstein in the context of empirical-process bounds.

Current:

  • Vershynin, R. (2018). High-Dimensional Probability. Cambridge University Press. Chapter 2 develops Bernstein for sub-exponential variables with sharp constants.
  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Chapter 2 connects Bernstein to MGF-based concentration and to matrix concentration.
  • Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning (2nd ed.). MIT Press. Chapters 2-3 use Bernstein for variance-sensitive PAC bounds.

Empirical Bernstein:

  • Maurer, A., & Pontil, M. (2009). Empirical Bernstein bounds and sample-variance penalization. COLT 2009. The empirical-Bernstein theorem: replace $\sigma^2$ with $\hat{\sigma}^2_n$ at the cost of a $1/(n-1)$ correction.
  • Audibert, J.-Y., Munos, R., & Szepesvári, C. (2009). Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19), 1876-1902. Tighter constants in the empirical-Bernstein bound; introduces UCB-V.


Last reviewed: May 4, 2026
