
Statistical Estimation

Law of Large Numbers

The weak and strong laws of large numbers: the sample mean converges to the population mean. Kolmogorov's conditions, the rate of convergence from the CLT, and why LLN justifies using empirical risk as a proxy for population risk.


Why This Matters

The law of large numbers is the most fundamental result in all of statistics. It says: if you average enough i.i.d. observations, the average converges to the expected value. This single fact justifies:

  • Empirical risk minimization: the training loss $\frac{1}{n}\sum_i \ell(h(x_i), y_i)$ converges to the population risk $\mathbb{E}[\ell(h(X), Y)]$ as $n \to \infty$. Without the LLN, there is no reason to believe that minimizing training loss has anything to do with minimizing true risk.

  • Consistency of estimators: the sample mean $\bar{X}_n$ is a consistent estimator of the population mean $\mu = \mathbb{E}[X]$. More generally, maximum likelihood estimators are consistent under regularity conditions, and the proof ultimately relies on the LLN.

  • Monte Carlo methods: estimating $\mathbb{E}[f(X)]$ by $\frac{1}{n}\sum_i f(X_i)$ works because of the LLN. Every MCMC method, every stochastic gradient estimator, and every simulation study rests on this.

If you do not understand the LLN, you do not understand why averaging works.

Want to see that separation visually? The Sampling and Limit Laws Lab shows one long running average and many reruns of the sample mean side by side, so you can see what the LLN claims and what it does not claim.

Quick Version

  • Weak LLN: the sample mean gets close to the population mean, in probability.
  • Strong LLN: the sample mean settles almost surely, along almost every path.
  • Main assumption: a finite mean is enough for i.i.d. variables.
  • What it does not say: individual draws become less random.
[Figure: three independent sequences of running sample means. As n grows the trajectories collapse toward μ, regardless of early fluctuations; a Cauchy sequence shows the pathology with no finite mean.]

Mental Model

Flip a fair coin 10 times: you might get 7 heads (70%). Flip it 1,000 times: you will almost certainly get close to 50%. Flip it a million times: the fraction of heads will be within 0.1% of 50% with overwhelming probability.

The LLN formalizes this: the sample average $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ converges to $\mu = \mathbb{E}[X_i]$ as $n \to \infty$. The question is: what kind of convergence?
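The coin-flip picture is easy to check numerically. A minimal sketch (the function name, seed, and sample sizes are my own choices): track the running sample mean of fair-coin flips and watch it settle near 0.5.

```python
import random

def running_means(n_flips, seed=0):
    """Running sample means X̄_1, ..., X̄_n of fair-coin flips (heads = 1)."""
    rng = random.Random(seed)
    total = 0
    means = []
    for n in range(1, n_flips + 1):
        total += rng.random() < 0.5  # one Bernoulli(1/2) draw
        means.append(total / n)
    return means

means = running_means(100_000)
# Early means can be far from 0.5; late means hug it.
print(means[9], means[999], means[-1])
```

The early entries fluctuate wildly; by n = 100,000 the mean is within a fraction of a percent of 0.5.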

Modes of Convergence

Definition

Convergence in Probability

A sequence $X_n$ converges in probability to $X$ if for every $\epsilon > 0$:

$$\lim_{n \to \infty} \Pr[|X_n - X| > \epsilon] = 0$$

This says: the probability of $X_n$ being far from $X$ vanishes, but it does not rule out occasional deviations. There might be rare "bad" events where $X_n$ is far from $X$, as long as these events become increasingly unlikely.

Definition

Almost Sure Convergence

A sequence $X_n$ converges almost surely (a.s.) to $X$ if and only if:

$$\Pr\!\left[\lim_{n \to \infty} X_n = X\right] = 1$$

Equivalently: $\Pr[\{\omega : X_n(\omega) \to X(\omega)\}] = 1$.

This is strictly stronger than convergence in probability. Almost sure convergence says: for almost every outcome $\omega$, the sequence $X_1(\omega), X_2(\omega), \ldots$ converges to $X(\omega)$. Not just "unlikely to be far away," but "actually converges along each path."

The relationship: Almost sure convergence implies convergence in probability, but not vice versa. The distinction matters in practice: convergence in probability allows for occasional large deviations that become rare; almost sure convergence says the sequence eventually settles down for each outcome.

Main Theorems

Theorem

Weak Law of Large Numbers (WLLN)

Statement

If $X_1, X_2, \ldots$ are i.i.d. random variables with $\mathbb{E}|X_i| < \infty$ and $\mathbb{E}[X_i] = \mu$, then the sample mean converges in probability to $\mu$:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu$$

That is, for every $\epsilon > 0$: $\lim_{n \to \infty} \Pr[|\bar{X}_n - \mu| > \epsilon] = 0$.

Intuition

Averaging reduces fluctuations. Each $X_i$ fluctuates around $\mu$, but when you average many of them, the positive and negative deviations tend to cancel. The more you average, the less the sample mean fluctuates. Eventually, the probability of any fixed-size deviation vanishes.

Proof Sketch

(Simple proof assuming finite variance): If $\text{Var}(X_i) = \sigma^2 < \infty$, then $\text{Var}(\bar{X}_n) = \sigma^2/n$. By Chebyshev's inequality:

$$\Pr[|\bar{X}_n - \mu| > \epsilon] \leq \frac{\text{Var}(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0$$

(General proof with only finite mean): Use truncation. Define $Y_i = X_i \mathbf{1}\{|X_i| \leq n\}$. Show that: (1) $\Pr[X_i \neq Y_i] \to 0$ by dominated convergence, (2) $\text{Var}(Y_i)/n \to 0$, and (3) $\mathbb{E}[Y_i] \to \mu$. Apply Chebyshev to the truncated variables.
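The Chebyshev step can be sanity-checked by simulation. A sketch under my own choices (Uniform(0,1) samples, so $\mu = 1/2$ and $\sigma^2 = 1/12$; the helper name is hypothetical): the empirical deviation probability should sit below the $\sigma^2/(n\epsilon^2)$ bound.

```python
import random

def deviation_probability(n, eps, trials=2000, seed=1):
    """Empirical Pr[|X̄_n - μ| > eps] for Uniform(0,1) samples (μ = 1/2, σ² = 1/12)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.random() for _ in range(n)) / n
        hits += abs(xbar - 0.5) > eps
    return hits / trials

n, eps = 200, 0.05
empirical = deviation_probability(n, eps)
chebyshev = (1 / 12) / (n * eps ** 2)  # Var(X̄_n)/ε² with Var(X_i) = 1/12
print(empirical, "<=", chebyshev)
```

Chebyshev is deliberately worst-case: the empirical probability here is far smaller than the bound, which is exactly why the CLT gives sharper rates later.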

Why It Matters

The WLLN is the justification for using sample statistics as estimates of population quantities. When you compute a sample mean, a sample variance, or an empirical risk, you are relying on the WLLN to guarantee that these quantities are close to their population counterparts for large $n$.

Crucially, the WLLN only requires a finite mean, not a finite variance. The Cauchy distribution has no finite mean, and the LLN genuinely fails: the sample mean of i.i.d. Cauchy variables does not converge. For i.i.d. sequences, a finite mean is sufficient for the WLLN, and Kolmogorov's necessary-and-sufficient criterion shows that the WLLN with the natural centering $\mathbb{E}[X]$ holds precisely when $\mathbb{E}[X]$ exists. A weaker WLLN with truncated centering can still hold when $\mathbb{E}|X| = \infty$, provided $n\,\mathbb{P}(|X| > n) \to 0$; this is Feller's classical refinement, and it is why "necessary and sufficient" must be stated together with the choice of centering.

Failure Mode

The WLLN fails when $\mathbb{E}[|X|] = \infty$. The canonical example is the Cauchy distribution: the sample mean of $n$ i.i.d. Cauchy variables has the same Cauchy distribution for every $n$. Averaging does not help because the tails are too heavy. This is why concentration inequalities (which give non-asymptotic bounds) always require moment conditions.
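The pathology is easy to reproduce (an illustrative sketch; the function name and seed are mine, and a standard Cauchy draw is generated as $\tan(\pi(U - 1/2))$ for $U$ uniform): the running mean never settles, no matter how long you run.

```python
import math
import random

def cauchy_running_mean(n, seed=2):
    """Running sample means of standard Cauchy draws."""
    rng = random.Random(seed)
    total = 0.0
    means = []
    for k in range(1, n + 1):
        total += math.tan(math.pi * (rng.random() - 0.5))  # standard Cauchy draw
        means.append(total / k)
    return means

# X̄_n is itself standard Cauchy for every n, so even very late means keep jumping.
m = cauchy_running_mean(100_000)
print(m[99], m[9_999], m[99_999])
```

Contrast with the Bernoulli sketch earlier: there the trajectory flattens; here a single huge draw can yank the running mean far away at any point.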

Theorem

Strong Law of Large Numbers (SLLN)

Statement

If $X_1, X_2, \ldots$ are i.i.d. with $\mathbb{E}|X_i| < \infty$ and $\mathbb{E}[X_i] = \mu$, then:

$$\bar{X}_n \xrightarrow{\text{a.s.}} \mu$$

That is, $\Pr\!\left[\lim_{n \to \infty} \bar{X}_n = \mu\right] = 1$.

Intuition

The strong law says something more powerful than the weak law: not just that large deviations become unlikely, but that the sample mean actually converges for almost every sequence of outcomes. If you ran the experiment once and watched $\bar{X}_1, \bar{X}_2, \bar{X}_3, \ldots$, the sequence would converge to $\mu$ (with probability 1).

Proof Sketch

(Proof assuming finite fourth moment, for intuition): Compute $\mathbb{E}[(\bar{X}_n - \mu)^4]$. After expanding and using independence, $\mathbb{E}[(\bar{X}_n - \mu)^4] = O(1/n^2)$, so $\sum_n \mathbb{E}[(\bar{X}_n - \mu)^4] < \infty$. By Markov's inequality, $\sum_n \Pr[|\bar{X}_n - \mu| > \epsilon] < \infty$ for every $\epsilon > 0$. By the first Borel-Cantelli lemma, $\Pr[|\bar{X}_n - \mu| > \epsilon \text{ infinitely often}] = 0$. This gives almost sure convergence.

(General proof with only finite mean): Uses Kolmogorov's truncation technique and the Kolmogorov three-series theorem. The key idea is to truncate, apply the SLLN for bounded variables (using Borel-Cantelli), and show the truncation error vanishes.

Why It Matters

The SLLN justifies Monte Carlo simulation. When you estimate $\mathbb{E}[f(X)]$ by running a simulation and averaging $f(X_1), \ldots, f(X_n)$, you want to know that the estimate converges to the true value as you run longer. The SLLN guarantees this: with probability 1, your simulation will eventually give the right answer.

For ML: the SLLN implies that the empirical distribution converges to the true distribution (in a pointwise sense), which is the starting point for uniform convergence arguments that control the generalization gap.

Failure Mode

Like the WLLN, the SLLN requires $\mathbb{E}[|X|] < \infty$. An important subtlety: the SLLN can fail for independent but non-identically distributed variables even when every mean is finite. For such sequences, Kolmogorov's criterion requires $\sum_n \text{Var}(X_n)/n^2 < \infty$.

Rate of Convergence

The LLN tells you that $\bar{X}_n \to \mu$, but it does not tell you how fast. The rate of convergence comes from the central limit theorem:

$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$$

This says the fluctuations of $\bar{X}_n$ around $\mu$ are of order $\sigma/\sqrt{n}$. The $1/\sqrt{n}$ rate is universal (for distributions with finite variance) and explains why:

  • Halving the error requires quadrupling the sample size
  • Monte Carlo estimates converge slowly: 4x more computation for 2x more accuracy
  • Concentration inequalities give bounds of order $\sigma/\sqrt{n}$

The CLT goes beyond the LLN by characterizing the shape of the fluctuations (Gaussian), not just their decay.
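The "quadruple the samples, halve the error" rule can be observed directly. A sketch under my own choices (Bernoulli(1/2) draws, mean absolute deviation as the fluctuation measure, hypothetical helper name):

```python
import random

def mean_abs_error(n, trials=3000, seed=4):
    """Average |X̄_n - 1/2| over many replications of n Bernoulli(1/2) draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        total += abs(heads / n - 0.5)
    return total / trials

e_100, e_400 = mean_abs_error(100), mean_abs_error(400)
print(e_100 / e_400)  # ratio near 2, matching the 1/sqrt(n) scaling
```

Going from n = 100 to n = 400 shrinks the typical error by a factor close to $\sqrt{4} = 2$.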

Kolmogorov's Conditions

For independent (but not necessarily identically distributed) random variables $X_1, X_2, \ldots$ with $\mathbb{E}[X_i] = \mu_i$, the SLLN $\frac{1}{n}\sum_i (X_i - \mu_i) \xrightarrow{\text{a.s.}} 0$ holds under Kolmogorov's condition:

$$\sum_{n=1}^\infty \frac{\text{Var}(X_n)}{n^2} < \infty$$

This is satisfied, for instance, if the variances are uniformly bounded. The $1/n^2$ denominator comes from the Kronecker lemma combined with the Kolmogorov three-series theorem.
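A small numeric illustration of the condition (the helper name is mine): with uniformly bounded variances the series converges, while $\text{Var}(X_n) = n$ turns the partial sums into the divergent harmonic series.

```python
def kolmogorov_partial_sum(var, n_terms):
    """Partial sum of Var(X_n) / n^2 for n = 1..n_terms."""
    return sum(var(n) / n ** 2 for n in range(1, n_terms + 1))

# Bounded variances: partial sums stabilize near π²/6 ≈ 1.6449 (for Var = 1).
print(kolmogorov_partial_sum(lambda n: 1.0, 10 ** 6))
# Var(X_n) = n: partial sums are harmonic numbers and keep growing without bound.
print(kolmogorov_partial_sum(lambda n: n, 10 ** 3),
      kolmogorov_partial_sum(lambda n: n, 10 ** 6))
```

The first series has essentially stopped moving by a million terms; the second gains roughly $\ln 10 \approx 2.3$ per extra decade of terms.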

Canonical Examples

Example

Coin flips

Let $X_i \sim \text{Bern}(1/2)$. Then $\bar{X}_n$ is the fraction of heads in $n$ flips. The LLN says $\bar{X}_n \to 1/2$. Chebyshev gives the rate: $\Pr[|\bar{X}_n - 1/2| > \epsilon] \leq \frac{1}{4n\epsilon^2}$. For $\epsilon = 0.01$: $n \geq 250{,}000$ suffices for the probability to be at most $0.01$.

The CLT gives a tighter bound: $\Pr[|\bar{X}_n - 1/2| > \epsilon] \approx 2\Phi(-2\epsilon\sqrt{n})$, where $\Phi$ is the standard normal CDF.
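Both bounds for this example can be computed directly (helper names are mine; $\Phi$ is evaluated through `math.erfc`, using $2\Phi(z) = \text{erfc}(-z/\sqrt{2})$ to avoid cancellation):

```python
import math

def chebyshev_bound(n, eps):
    """Chebyshev: Pr[|X̄_n - 1/2| > eps] ≤ 1 / (4 n eps²) for Bernoulli(1/2)."""
    return 1 / (4 * n * eps ** 2)

def clt_approx(n, eps):
    """CLT approximation 2Φ(-2 eps √n), computed as erfc(2 eps √n / √2)."""
    z = -2 * eps * math.sqrt(n)
    return math.erfc(-z / math.sqrt(2))

n, eps = 250_000, 0.01
print(chebyshev_bound(n, eps))  # ≈ 0.01, the worst-case Chebyshev guarantee
print(clt_approx(n, eps))       # astronomically smaller under the CLT
```

At $n = 250{,}000$ Chebyshev guarantees probability at most 0.01, while the CLT approximation is around $10^{-23}$: far fewer flips would already suffice in practice.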

Example

Empirical risk as LLN

The empirical risk $\hat{R}_n(h) = \frac{1}{n}\sum_i \ell(h(x_i), y_i)$ is the sample mean of the losses $\ell(h(X_i), Y_i)$, which are i.i.d. with mean $R(h) = \mathbb{E}[\ell(h(X), Y)]$. By the SLLN:

$$\hat{R}_n(h) \xrightarrow{\text{a.s.}} R(h)$$

for each fixed $h$. This is the pointwise convergence of empirical risk to population risk. The challenge of learning theory is to make this convergence uniform over $h \in \mathcal{H}$, which requires concentration inequalities and complexity measures (VC dimension, Rademacher complexity).
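A sketch on a synthetic problem of my own devising (threshold classifier, Gaussian inputs, labels flipped by Gaussian noise, 0-1 loss): for a fixed $h$ the empirical risk is a plain sample mean, so it settles as $n$ grows.

```python
import random

def empirical_risk(h, n, seed=6):
    """Empirical 0-1 risk of classifier h on n synthetic (x, y) pairs."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if x + rng.gauss(0.0, 1.0) > 0 else 0  # label = sign of x plus noise
        errors += h(x) != y
    return errors / n

h = lambda x: 1 if x > 0 else 0  # fixed hypothesis; its true risk here works out to 1/4
print(empirical_risk(h, 1_000), empirical_risk(h, 100_000))
```

The larger sample gives an estimate much closer to the population risk; making this hold simultaneously for every $h$ in a class is the uniform-convergence problem mentioned above.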

Common Confusions

Watch Out

Convergence in probability is NOT almost sure convergence

Consider: let $X_n = 1$ with probability $1/n$ and $X_n = 0$ otherwise (independently). Then $X_n \xrightarrow{P} 0$, since $\Pr[|X_n| > \epsilon] = 1/n \to 0$ for $\epsilon < 1$. But $\sum_n \Pr[X_n = 1] = \sum_n 1/n = \infty$, so by the second Borel-Cantelli lemma (the events are independent), $X_n = 1$ infinitely often with probability 1. Thus $X_n$ does not converge to $0$ almost surely.

In this example, large deviations keep happening, just less and less frequently. Convergence in probability allows this; almost sure convergence does not.
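This behavior shows up plainly in simulation (the block-counting scheme and names are my own): ones keep arriving in every decade of $n$, at a rate of roughly $\ln 10 \approx 2.3$ per decade, even though $\Pr[X_n = 1] \to 0$.

```python
import random

def ones_per_decade(n_max, seed=7):
    """Simulate X_n = 1 w.p. 1/n (independent); count ones per decade of n."""
    rng = random.Random(seed)
    counts = {}
    for n in range(1, n_max + 1):
        if rng.random() < 1.0 / n:
            decade = len(str(n)) - 1  # X_n with n in [10^d, 10^(d+1)) goes to bucket d
            counts[decade] = counts.get(decade, 0) + 1
    return counts

# Expected ones per full decade: sum of 1/n over the decade ≈ ln(10) ≈ 2.3,
# so the ones thin out per index but never stop arriving.
print(ones_per_decade(10 ** 6))
```

Every decade bucket tends to hold a couple of ones: exactly the "large deviations keep happening, just less frequently" picture.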

Watch Out

The LLN requires finite mean, not finite variance

The simplest proof of the WLLN uses Chebyshev's inequality and requires finite variance. But the WLLN holds with only a finite mean (the proof uses truncation). The SLLN also holds with only a finite mean. Finite variance gives you the rate of convergence (via the CLT), but convergence itself needs only a finite mean.

Conversely, if $\mathbb{E}[|X|] = \infty$, the LLN fails. The sample mean does not converge to any fixed value.

Watch Out

The LLN is about the sample mean, not individual observations

A common misstatement: "by the law of large numbers, extreme values become rare." This is wrong. Individual observations $X_i$ always have the same distribution, no matter how large $n$ is. It is the average $\bar{X}_n$ that converges. Extreme values keep occurring at the same rate; they just get diluted by the average.

Summary

  • WLLN: $\bar{X}_n \xrightarrow{P} \mu$ (convergence in probability); requires only $\mathbb{E}[|X|] < \infty$
  • SLLN: $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$ (almost sure convergence); also requires only $\mathbb{E}[|X|] < \infty$
  • The SLLN is strictly stronger than the WLLN
  • Rate of convergence: fluctuations are $O(\sigma/\sqrt{n})$ (from the CLT)
  • LLN justifies: empirical risk as proxy for population risk, consistency of estimators, Monte Carlo methods
  • LLN fails for distributions with infinite mean (e.g., Cauchy)

Exercises

ExerciseCore

Problem

Simulate $n$ i.i.d. coin flips ($X_i \sim \text{Bern}(0.5)$) and plot $\bar{X}_n$ as a function of $n$ for $n = 1, 2, \ldots, 10{,}000$. Run 10 independent simulations on the same plot. Verify visually that the sample mean converges to $0.5$ and that the fluctuations decrease as $1/\sqrt{n}$.

ExerciseAdvanced

Problem

Give an example of a sequence of independent random variables $X_n$ with $\mathbb{E}[X_n] = 0$ for all $n$ such that $\frac{1}{n}\sum_{i=1}^n X_i$ does not converge to $0$ almost surely. Why does this not contradict the SLLN?

References

Canonical:

  • Durrett, Probability: Theory and Examples (5th ed., 2019), Sections 2.2-2.4
  • Billingsley, Probability and Measure (3rd ed., 1995), Sections 6, 22

Current:

  • Vershynin, High-Dimensional Probability (2018), Chapter 0 (motivation)
  • Wainwright, High-Dimensional Statistics (2019), Section 2.1
  • Kallenberg, Foundations of Modern Probability (3rd ed., 2021), Chapter 5 (strong laws, Kolmogorov's three-series theorem)

Next Topics

Building on the law of large numbers:

  • Central limit theorem: the rate and shape of convergence, $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$
  • Empirical risk minimization: the LLN in action for learning theory
  • Concentration inequalities: non-asymptotic versions of the LLN

Last reviewed: April 23, 2026
