
Statistical Estimation

Law of Large Numbers

The weak and strong laws of large numbers: the sample mean converges to the population mean. Kolmogorov's conditions, the rate of convergence from the CLT, and why LLN justifies using empirical risk as a proxy for population risk.


Why This Matters

The law of large numbers is the most fundamental result in all of statistics. It says: if you average enough i.i.d. observations, the average converges to the expected value. This single fact justifies:

  • Empirical risk minimization: the training loss $\frac{1}{n}\sum_i \ell(h(x_i), y_i)$ converges to the population risk $\mathbb{E}[\ell(h(X), Y)]$ as $n \to \infty$. Without the LLN, there is no reason to believe that minimizing training loss has anything to do with minimizing true risk.

  • Consistency of estimators: the sample mean $\bar{X}_n$ is a consistent estimator of the population mean $\mu = \mathbb{E}[X]$. More generally, maximum likelihood estimators are consistent under regularity conditions, and the proof ultimately relies on the LLN.

  • Monte Carlo methods: estimating $\mathbb{E}[f(X)]$ by $\frac{1}{n}\sum_i f(X_i)$ works because of the LLN. Every MCMC method, every stochastic gradient estimator, and every simulation study rests on this.

If you do not understand the LLN, you do not understand why averaging works.

Want to see that separation visually? The Sampling and Limit Laws Lab shows one long running average and many reruns of the sample mean side by side, so you can see what the LLN claims and what it does not claim.

Quick Version

  • Weak LLN: the sample mean gets close to the population mean, in probability.
  • Strong LLN: the sample mean settles almost surely, along almost every path.
  • Main assumption: a finite mean is enough for i.i.d. variables.
  • What it does not say: individual draws become less random.
[Figure: three independent sequences of running sample means. As n grows the trajectories collapse toward μ, regardless of early fluctuations; a Cauchy sequence shows the pathology with no finite mean.]

Mental Model

Flip a fair coin 10 times: you might get 7 heads (70%). Flip it 1,000 times: you will almost certainly get close to 50%. Flip it a million times: the fraction of heads will be within 0.1% of 50% with overwhelming probability.

The LLN formalizes this: the sample average $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ converges to $\mu = \mathbb{E}[X_i]$ as $n \to \infty$. The question is: what kind of convergence?
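The coin-flip picture is easy to check numerically. A minimal sketch (the function name, seed, and sample sizes are my own choices): track the running sample mean of fair-coin flips and watch it settle near 0.5.

```python
import random

def running_means(n_flips, seed=0):
    """Running sample means X̄_1, ..., X̄_n of fair-coin flips (heads = 1)."""
    rng = random.Random(seed)
    total = 0
    means = []
    for n in range(1, n_flips + 1):
        total += rng.random() < 0.5  # one Bernoulli(1/2) draw
        means.append(total / n)
    return means

means = running_means(100_000)
# Early means can be far from 0.5; late means hug it.
print(means[9], means[999], means[-1])
```

The early entries fluctuate wildly; by n = 100,000 the mean is within a fraction of a percent of 0.5.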

Modes of Convergence

Definition

Convergence in Probability

A sequence $X_n$ converges in probability to $X$ if for every $\epsilon > 0$:

$$\lim_{n \to \infty} \Pr[|X_n - X| > \epsilon] = 0$$

This says: the probability of $X_n$ being far from $X$ vanishes, but it does not rule out occasional deviations. There might be rare "bad" events where $X_n$ is far from $X$, as long as these events become increasingly unlikely.

Definition

Almost Sure Convergence

A sequence $X_n$ converges almost surely (a.s.) to $X$ if and only if:

$$\Pr\!\left[\lim_{n \to \infty} X_n = X\right] = 1$$

Equivalently: $\Pr[\{\omega : X_n(\omega) \to X(\omega)\}] = 1$.

This is strictly stronger than convergence in probability. Almost sure convergence says: for almost every outcome $\omega$, the sequence $X_1(\omega), X_2(\omega), \ldots$ converges to $X(\omega)$. Not just "unlikely to be far away," but "actually converges along each path."

The relationship: Almost sure convergence implies convergence in probability, but not vice versa. The distinction matters in practice: convergence in probability allows for occasional large deviations that become rare; almost sure convergence says the sequence eventually settles down for each outcome.

Main Theorems

Theorem

Weak Law of Large Numbers (WLLN)

Statement

If $X_1, X_2, \ldots$ are i.i.d. random variables with $\mathbb{E}|X_i| < \infty$ and $\mathbb{E}[X_i] = \mu$, then the sample mean converges in probability to $\mu$:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu$$

That is, for every $\epsilon > 0$: $\lim_{n \to \infty} \Pr[|\bar{X}_n - \mu| > \epsilon] = 0$.

Intuition

Averaging reduces fluctuations. Each $X_i$ fluctuates around $\mu$, but when you average many of them, the positive and negative deviations tend to cancel. The more you average, the less the sample mean fluctuates. Eventually, the probability of any fixed-size deviation vanishes.

Proof Sketch

(Simple proof assuming finite variance): If $\text{Var}(X_i) = \sigma^2 < \infty$, then $\text{Var}(\bar{X}_n) = \sigma^2/n$. By Chebyshev's inequality:

$$\Pr[|\bar{X}_n - \mu| > \epsilon] \leq \frac{\text{Var}(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0$$

(General proof with only finite mean): Use truncation. Define $Y_i = X_i \mathbf{1}\{|X_i| \leq n\}$. Show that: (1) $\Pr[X_i \neq Y_i] \to 0$ by dominated convergence, (2) $\text{Var}(Y_i)/n \to 0$, and (3) $\mathbb{E}[Y_i] \to \mu$. Apply Chebyshev to the truncated variables.
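The Chebyshev step can be sanity-checked by simulation. A sketch under my own choices (Uniform(0,1) samples, so $\mu = 1/2$ and $\sigma^2 = 1/12$; the helper name is hypothetical): the empirical deviation probability should sit below the $\sigma^2/(n\epsilon^2)$ bound.

```python
import random

def deviation_probability(n, eps, trials=2000, seed=1):
    """Empirical Pr[|X̄_n - μ| > eps] for Uniform(0,1) samples (μ = 1/2, σ² = 1/12)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.random() for _ in range(n)) / n
        hits += abs(xbar - 0.5) > eps
    return hits / trials

n, eps = 200, 0.05
empirical = deviation_probability(n, eps)
chebyshev = (1 / 12) / (n * eps ** 2)  # Var(X̄_n)/ε² with Var(X_i) = 1/12
print(empirical, "<=", chebyshev)
```

Chebyshev is deliberately worst-case: the empirical probability here is far smaller than the bound, which is exactly why the CLT gives sharper rates later.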

Why It Matters

The WLLN is the justification for using sample statistics as estimates of population quantities. When you compute a sample mean, a sample variance, or an empirical risk, you are relying on the WLLN to guarantee that these quantities are close to their population counterparts for large $n$.

Crucially, the WLLN only requires a finite mean, not a finite variance. The Cauchy distribution has no finite mean, and the LLN genuinely fails: the sample mean of i.i.d. Cauchy variables does not converge. For i.i.d. sequences, a finite mean is sufficient for the WLLN, and Kolmogorov's necessary-and-sufficient criterion shows that the WLLN with the natural centering $\mathbb{E}[X]$ holds precisely when $\mathbb{E}[X]$ exists. A weaker WLLN with truncated centering can still hold when $\mathbb{E}|X| = \infty$, provided $n\,\mathbb{P}(|X| > n) \to 0$; this is Feller's classical refinement, and it is why "necessary and sufficient" must be stated together with the choice of centering.

Failure Mode

The WLLN fails when $\mathbb{E}[|X|] = \infty$. The canonical example is the Cauchy distribution: the sample mean of $n$ i.i.d. Cauchy variables has the same Cauchy distribution for every $n$. Averaging does not help because the tails are too heavy. This is why concentration inequalities (which give non-asymptotic bounds) always require moment conditions.
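The pathology is easy to reproduce (an illustrative sketch; the function name and seed are mine, and a standard Cauchy draw is generated as $\tan(\pi(U - 1/2))$ for $U$ uniform): the running mean never settles, no matter how long you run.

```python
import math
import random

def cauchy_running_mean(n, seed=2):
    """Running sample means of standard Cauchy draws."""
    rng = random.Random(seed)
    total = 0.0
    means = []
    for k in range(1, n + 1):
        total += math.tan(math.pi * (rng.random() - 0.5))  # standard Cauchy draw
        means.append(total / k)
    return means

# X̄_n is itself standard Cauchy for every n, so even very late means keep jumping.
m = cauchy_running_mean(100_000)
print(m[99], m[9_999], m[99_999])
```

Contrast with the Bernoulli sketch earlier: there the trajectory flattens; here a single huge draw can yank the running mean far away at any point.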

Theorem

Strong Law of Large Numbers (SLLN)

Statement

If $X_1, X_2, \ldots$ are i.i.d. with $\mathbb{E}|X_i| < \infty$ and $\mathbb{E}[X_i] = \mu$, then:

$$\bar{X}_n \xrightarrow{\text{a.s.}} \mu$$

That is, $\Pr\!\left[\lim_{n \to \infty} \bar{X}_n = \mu\right] = 1$.

Intuition

The strong law says something more powerful than the weak law: not just that large deviations become unlikely, but that the sample mean actually converges for almost every sequence of outcomes. If you ran the experiment once and watched $\bar{X}_1, \bar{X}_2, \bar{X}_3, \ldots$, the sequence would converge to $\mu$ (with probability 1).

Proof Sketch

(Proof assuming finite fourth moment, for intuition): Compute $\mathbb{E}[(\bar{X}_n - \mu)^4]$. After expanding and using independence, $\mathbb{E}[(\bar{X}_n - \mu)^4] = O(1/n^2)$, so $\sum_n \mathbb{E}[(\bar{X}_n - \mu)^4] < \infty$. By Markov's inequality, $\sum_n \Pr[|\bar{X}_n - \mu| > \epsilon] < \infty$ for every $\epsilon > 0$. By the first Borel-Cantelli lemma, $\Pr[|\bar{X}_n - \mu| > \epsilon \text{ infinitely often}] = 0$. This gives almost sure convergence.

(General proof with only finite mean): Uses Kolmogorov's truncation technique and the Kolmogorov three-series theorem. The key idea is to truncate, apply the SLLN for bounded variables (using Borel-Cantelli), and show the truncation error vanishes.

Why It Matters

The SLLN justifies Monte Carlo simulation. When you estimate $\mathbb{E}[f(X)]$ by running a simulation and averaging $f(X_1), \ldots, f(X_n)$, you want to know that the estimate converges to the true value as you run longer. The SLLN guarantees this: with probability 1, your simulation will eventually give the right answer.

For ML: the SLLN implies that the empirical distribution converges to the true distribution (in a pointwise sense), which is the starting point for uniform convergence arguments that control the generalization gap.

Failure Mode

Like the WLLN, the SLLN requires $\mathbb{E}[|X|] < \infty$. An important subtlety: the SLLN can fail for independent but non-identically distributed variables even when every mean is finite. For such sequences, Kolmogorov's criterion requires $\sum_n \text{Var}(X_n)/n^2 < \infty$.

Rate of Convergence

The LLN tells you that $\bar{X}_n \to \mu$, but it does not tell you how fast. The rate of convergence comes from the central limit theorem:

$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$$

This says the fluctuations of $\bar{X}_n$ around $\mu$ are of order $\sigma/\sqrt{n}$. The $1/\sqrt{n}$ rate is universal (for distributions with finite variance) and explains why:

  • Halving the error requires quadrupling the sample size
  • Monte Carlo estimates converge slowly: 4x more computation for 2x more accuracy
  • Concentration inequalities give bounds of order $\sigma/\sqrt{n}$

The CLT goes beyond the LLN by characterizing the shape of the fluctuations (Gaussian), not just their decay.
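The "quadruple the samples, halve the error" rule can be observed directly. A sketch under my own choices (Bernoulli(1/2) draws, mean absolute deviation as the fluctuation measure, hypothetical helper name):

```python
import random

def mean_abs_error(n, trials=3000, seed=4):
    """Average |X̄_n - 1/2| over many replications of n Bernoulli(1/2) draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        total += abs(heads / n - 0.5)
    return total / trials

e_100, e_400 = mean_abs_error(100), mean_abs_error(400)
print(e_100 / e_400)  # ratio near 2, matching the 1/sqrt(n) scaling
```

Going from n = 100 to n = 400 shrinks the typical error by a factor close to $\sqrt{4} = 2$.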

Kolmogorov's Conditions

For independent (but not necessarily identically distributed) random variables $X_1, X_2, \ldots$ with $\mathbb{E}[X_i] = \mu_i$, the SLLN $\frac{1}{n}\sum_i (X_i - \mu_i) \xrightarrow{\text{a.s.}} 0$ holds under Kolmogorov's condition:

$$\sum_{n=1}^\infty \frac{\text{Var}(X_n)}{n^2} < \infty$$

This is satisfied, for instance, if the variances are uniformly bounded. The $1/n^2$ denominator comes from the Kronecker lemma combined with the Kolmogorov three-series theorem.
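A small numeric illustration of the condition (the helper name is mine): with uniformly bounded variances the series converges, while $\text{Var}(X_n) = n$ turns the partial sums into the divergent harmonic series.

```python
def kolmogorov_partial_sum(var, n_terms):
    """Partial sum of Var(X_n) / n^2 for n = 1..n_terms."""
    return sum(var(n) / n ** 2 for n in range(1, n_terms + 1))

# Bounded variances: partial sums stabilize near π²/6 ≈ 1.6449 (for Var = 1).
print(kolmogorov_partial_sum(lambda n: 1.0, 10 ** 6))
# Var(X_n) = n: partial sums are harmonic numbers and keep growing without bound.
print(kolmogorov_partial_sum(lambda n: n, 10 ** 3),
      kolmogorov_partial_sum(lambda n: n, 10 ** 6))
```

The first series has essentially stopped moving by a million terms; the second gains roughly $\ln 10 \approx 2.3$ per extra decade of terms.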

Canonical Examples

Example

Coin flips

Let $X_i \sim \text{Bern}(1/2)$. Then $\bar{X}_n$ is the fraction of heads in $n$ flips. The LLN says $\bar{X}_n \to 1/2$. Chebyshev gives the rate: $\Pr[|\bar{X}_n - 1/2| > \epsilon] \leq \frac{1}{4n\epsilon^2}$. For $\epsilon = 0.01$: $n \geq 250{,}000$ suffices for the probability to be at most $0.01$.

The CLT gives a tighter bound: $\Pr[|\bar{X}_n - 1/2| > \epsilon] \approx 2\Phi(-2\epsilon\sqrt{n})$, where $\Phi$ is the standard normal CDF.
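Both bounds for this example can be computed directly (helper names are mine; $\Phi$ is evaluated through `math.erfc`, using $2\Phi(z) = \text{erfc}(-z/\sqrt{2})$ to avoid cancellation):

```python
import math

def chebyshev_bound(n, eps):
    """Chebyshev: Pr[|X̄_n - 1/2| > eps] ≤ 1 / (4 n eps²) for Bernoulli(1/2)."""
    return 1 / (4 * n * eps ** 2)

def clt_approx(n, eps):
    """CLT approximation 2Φ(-2 eps √n), computed as erfc(2 eps √n / √2)."""
    z = -2 * eps * math.sqrt(n)
    return math.erfc(-z / math.sqrt(2))

n, eps = 250_000, 0.01
print(chebyshev_bound(n, eps))  # ≈ 0.01, the worst-case Chebyshev guarantee
print(clt_approx(n, eps))       # astronomically smaller under the CLT
```

At $n = 250{,}000$ Chebyshev guarantees probability at most 0.01, while the CLT approximation is around $10^{-23}$: far fewer flips would already suffice in practice.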

Example

Empirical risk as LLN

The empirical risk $\hat{R}_n(h) = \frac{1}{n}\sum_i \ell(h(x_i), y_i)$ is the sample mean of the losses $\ell(h(X_i), Y_i)$, which are i.i.d. with mean $R(h) = \mathbb{E}[\ell(h(X), Y)]$. By the SLLN:

$$\hat{R}_n(h) \xrightarrow{\text{a.s.}} R(h)$$

for each fixed $h$. This is the pointwise convergence of empirical risk to population risk. The challenge of learning theory is to make this convergence uniform over $h \in \mathcal{H}$, which requires concentration inequalities and complexity measures (VC dimension, Rademacher complexity).
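A sketch on a synthetic problem of my own devising (threshold classifier, Gaussian inputs, labels flipped by Gaussian noise, 0-1 loss): for a fixed $h$ the empirical risk is a plain sample mean, so it settles as $n$ grows.

```python
import random

def empirical_risk(h, n, seed=6):
    """Empirical 0-1 risk of classifier h on n synthetic (x, y) pairs."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if x + rng.gauss(0.0, 1.0) > 0 else 0  # label = sign of x plus noise
        errors += h(x) != y
    return errors / n

h = lambda x: 1 if x > 0 else 0  # fixed hypothesis; its true risk here works out to 1/4
print(empirical_risk(h, 1_000), empirical_risk(h, 100_000))
```

The larger sample gives an estimate much closer to the population risk; making this hold simultaneously for every $h$ in a class is the uniform-convergence problem mentioned above.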

Common Confusions

Watch Out

Convergence in probability is NOT almost sure convergence

Consider: let $X_n = 1$ with probability $1/n$ and $X_n = 0$ otherwise (independently). Then $X_n \xrightarrow{P} 0$, since $\Pr[|X_n| > \epsilon] = 1/n \to 0$ for $\epsilon < 1$. But $\sum_n \Pr[X_n = 1] = \sum_n 1/n = \infty$, so by the second Borel-Cantelli lemma (the events are independent), $X_n = 1$ infinitely often with probability 1. Thus $X_n$ does not converge to $0$ almost surely.

In this example, large deviations keep happening, just less and less frequently. Convergence in probability allows this; almost sure convergence does not.
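This behavior shows up plainly in simulation (the block-counting scheme and names are my own): ones keep arriving in every decade of $n$, at a rate of roughly $\ln 10 \approx 2.3$ per decade, even though $\Pr[X_n = 1] \to 0$.

```python
import random

def ones_per_decade(n_max, seed=7):
    """Simulate X_n = 1 w.p. 1/n (independent); count ones per decade of n."""
    rng = random.Random(seed)
    counts = {}
    for n in range(1, n_max + 1):
        if rng.random() < 1.0 / n:
            decade = len(str(n)) - 1  # X_n with n in [10^d, 10^(d+1)) goes to bucket d
            counts[decade] = counts.get(decade, 0) + 1
    return counts

# Expected ones per full decade: sum of 1/n over the decade ≈ ln(10) ≈ 2.3,
# so the ones thin out per index but never stop arriving.
print(ones_per_decade(10 ** 6))
```

Every decade bucket tends to hold a couple of ones: exactly the "large deviations keep happening, just less frequently" picture.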

Watch Out

The LLN requires finite mean, not finite variance

The simplest proof of the WLLN uses Chebyshev's inequality and requires finite variance. But the WLLN holds with only a finite mean (the proof uses truncation). The SLLN also holds with only a finite mean. Finite variance gives you the rate of convergence (via the CLT), but convergence itself needs only a finite mean.

Conversely, if $\mathbb{E}[|X|] = \infty$, the LLN fails. The sample mean does not converge to any fixed value.

Watch Out

The LLN is about the sample mean, not individual observations

A common misstatement: "by the law of large numbers, extreme values become rare." This is wrong. Individual observations $X_i$ always have the same distribution, no matter how large $n$ is. It is the average $\bar{X}_n$ that converges. Extreme values keep occurring at the same rate; they just get diluted by the average.

Summary

  • WLLN: $\bar{X}_n \xrightarrow{P} \mu$ (convergence in probability); requires only $\mathbb{E}[|X|] < \infty$
  • SLLN: $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$ (almost sure convergence); also requires only $\mathbb{E}[|X|] < \infty$
  • The SLLN is strictly stronger than the WLLN
  • Rate of convergence: fluctuations are $O(\sigma/\sqrt{n})$ (from the CLT)
  • LLN justifies: empirical risk as proxy for population risk, consistency of estimators, Monte Carlo methods
  • LLN fails for distributions with infinite mean (e.g., Cauchy)

Exercises

ExerciseCore

Problem

Simulate $n$ i.i.d. coin flips ($X_i \sim \text{Bern}(0.5)$) and plot $\bar{X}_n$ as a function of $n$ for $n = 1, 2, \ldots, 10{,}000$. Run 10 independent simulations on the same plot. Verify visually that the sample mean converges to $0.5$ and that the fluctuations decrease as $1/\sqrt{n}$.

ExerciseAdvanced

Problem

Give an example of a sequence of independent random variables $X_n$ with $\mathbb{E}[X_n] = 0$ for all $n$ such that $\frac{1}{n}\sum_{i=1}^n X_i$ does not converge to $0$ almost surely. Why does this not contradict the SLLN?

References

Canonical:

  • Durrett, Probability: Theory and Examples (5th ed., 2019), Sections 2.2-2.4
  • Billingsley, Probability and Measure (3rd ed., 1995), Sections 6, 22

Current:

  • Vershynin, High-Dimensional Probability (2018), Chapter 0 (motivation)
  • Wainwright, High-Dimensional Statistics (2019), Section 2.1
  • Kallenberg, Foundations of Modern Probability (3rd ed., 2021), Chapter 5 (strong laws, Kolmogorov's three-series theorem)

Next Topics

Building on the law of large numbers:

  • Central limit theorem: the rate and shape of convergence, $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$
  • Empirical risk minimization: the LLN in action for learning theory
  • Concentration inequalities: non-asymptotic versions of the LLN

Last reviewed: April 23, 2026
