
Statistical Estimation

Sufficient Statistics and Exponential Families

Sufficient statistics compress data without losing information about the parameter. The Neyman-Fisher factorization theorem, exponential families, completeness, and Rao-Blackwell improvement of estimators.

Core · Tier 2 · Stable · Supporting · ~60 min

Why This Matters

Every time you compute a sample mean and sample variance from Gaussian data, you are using sufficient statistics without realizing it. When the variance is known, the sample mean captures all the information the data has about the population mean and you can throw away the original data points without loss of information about μ\mu. When the variance is unknown, the minimal sufficient statistic for (μ,σ2)(\mu, \sigma^2) is the pair (Xˉ,S2)(\bar{X}, S^2) — the sample mean alone is no longer sufficient, because the spread of the data carries information about how precisely Xˉ\bar{X} pins down μ\mu.

Sufficient statistics tell you when data compression is lossless for inference. Exponential families are the class of distributions where sufficient statistics take a particularly clean form. These two ideas together explain why so many classical estimators have the structure they do, and they underlie the theoretical guarantees for MLE in parametric models.
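A small numerical illustration of this losslessness (a Python sketch, assuming known σ = 1): two samples with the same sample mean produce log-likelihood curves for μ that differ only by an additive constant not involving μ, so every likelihood-based conclusion about μ is identical for the two samples.

```python
import numpy as np

def gauss_loglik(x, mu, sigma=1.0):
    """Log-likelihood of an i.i.d. Gaussian sample with known sigma."""
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum((x - mu)**2) / (2 * sigma**2)

# Two different raw samples sharing the same sample mean (the sufficient statistic).
x1 = np.array([1.8, 2.0, 2.2, 2.0])
x2 = np.array([1.9, 2.1, 2.0, 2.0])
assert np.isclose(x1.mean(), x2.mean())

# The log-likelihood curves differ only by a constant that does not involve mu.
mus = np.linspace(0.0, 4.0, 9)
diffs = [gauss_loglik(x1, m) - gauss_loglik(x2, m) for m in mus]
print(np.allclose(diffs, diffs[0]))  # True
```

The constant offset is exactly the h(x) factor from the factorization theorem below: it rescales the likelihood but cannot move its shape or its maximizer.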

Interactive figure: Lossless Statistic Compression

A sufficient statistic collapses many raw datasets onto the same inference-relevant summary. Everything the likelihood still cares about flows through that statistic; the rest is nuisance detail. In the Gaussian-mean family, for example, the raw samples (1.8, 2.0, 2.2, 2.0) and (1.9, 2.1, 2.0, 2.0) both map to the shared sample mean 2.0. Distinct raw samples land on the same statistic, so the likelihood sees them as inferentially equivalent.

Factorization test: if the parameter touches the data only through T(X), then conditioning on the full sample adds no new information.

Exponential-family payoff: in canonical exponential families, the sufficient statistic is built directly into the exponent. That is why MLE, conjugacy, and moment identities all become algebraically clean.

Rao-Blackwell intuition: once a statistic is sufficient, averaging any rough estimator against that statistic removes needless noise without losing signal about the parameter.

Mental Model

You observe nn data points and want to estimate θ\theta. A sufficient statistic T(X)T(X) is a function of the data that captures everything the data can tell you about θ\theta. Given T(X)T(X), the conditional distribution of the data does not depend on θ\theta. So T(X)T(X) is a lossless summary for the purpose of inference.

The factorization theorem gives a simple test: the statistic T(X)T(X) is sufficient if and only if the joint density factors into a piece that depends on θ\theta only through TT and a piece that does not depend on θ\theta at all.

Formal Setup and Notation

Let X=(X1,,Xn)X = (X_1, \ldots, X_n) be i.i.d. from p(xθ)p(x | \theta) where θΘ\theta \in \Theta.

Definition

Sufficient Statistic

A statistic T(X)T(X) is sufficient for θ\theta if and only if the conditional distribution of XX given T(X)T(X) does not depend on θ\theta:

p(XT(X)=t,θ)=p(XT(X)=t)for all θp(X | T(X) = t, \theta) = p(X | T(X) = t) \quad \text{for all } \theta

Equivalently, T(X)T(X) captures all the information in XX about θ\theta. Once you know T(X)T(X), the remaining randomness in XX is pure noise with respect to θ\theta.

Definition

Minimal Sufficient Statistic

A sufficient statistic TT is minimal sufficient if and only if it is a function of every other sufficient statistic. That is, for any other sufficient statistic UU, there exists a function gg such that T=g(U)T = g(U). A minimal sufficient statistic achieves the maximum data reduction possible without losing information about θ\theta.

Main Theorems

Theorem

Neyman-Fisher Factorization Theorem

Statement

A statistic T(X)T(X) is sufficient for θ\theta if and only if the joint density (or pmf) can be factored as:

p(x1,,xnθ)=g(T(x),θ)h(x)p(x_1, \ldots, x_n | \theta) = g(T(x), \theta) \cdot h(x)

where gg depends on the data only through T(x)T(x), and hh depends on the data but not on θ\theta.

Intuition

The factorization says the likelihood splits into two parts. The part that depends on θ\theta sees the data only through TT. The part that depends on the full data does not care about θ\theta. So for the purpose of learning about θ\theta, TT is all you need.

Proof Sketch

(Sufficiency implies factorization): Write p(xθ)=p(xT(x),θ)p(T(x)θ)p(x | \theta) = p(x | T(x), \theta) \cdot p(T(x) | \theta). Since TT is sufficient, p(xT(x),θ)=p(xT(x))=h(x)p(x | T(x), \theta) = p(x | T(x)) = h(x). Set g(T(x),θ)=p(T(x)θ)g(T(x), \theta) = p(T(x) | \theta).

(Factorization implies sufficiency): If p(xθ)=g(T(x),θ)h(x)p(x | \theta) = g(T(x), \theta) \cdot h(x), then p(xT(x)=t,θ)=p(xθ)/p(T(x)=tθ)p(x | T(x) = t, \theta) = p(x | \theta) / p(T(x) = t | \theta). The numerator is g(t,θ)h(x)g(t, \theta) h(x) and the denominator is g(t,θ)x:T(x)=th(x)g(t, \theta) \sum_{x': T(x')=t} h(x'). These cancel, giving h(x)/x:T(x)=th(x)h(x) / \sum_{x': T(x')=t} h(x'), which does not depend on θ\theta.

Why It Matters

The factorization theorem is the practical workhorse for finding sufficient statistics. You write down the likelihood, identify what functions of the data appear in the θ\theta-dependent part, and those functions form a sufficient statistic. For exponential families, this immediately identifies the natural sufficient statistics.

Failure Mode

The factorization must hold for ALL values of θ\theta simultaneously. A common mistake is to find a factorization that works for one specific θ\theta value but not all. Also, the factorization depends on the support of the distribution: if the support depends on θ\theta (e.g., Uniform(0,θ)(0, \theta)), be careful with indicator functions.
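The Uniform(0, θ) case can be checked numerically. In this sketch the indicator is kept inside the θ-dependent factor, so two samples that share the same maximum (and sample size) have identical likelihood functions, confirming that the sample maximum is sufficient:

```python
import numpy as np

def uniform_lik(x, theta):
    """Likelihood of an i.i.d. Uniform(0, theta) sample.

    The indicator 1[max x_i <= theta] must stay inside the
    theta-dependent factor g(T, theta): the support depends on theta.
    """
    x = np.asarray(x)
    if np.any(x < 0):
        return 0.0
    return theta ** (-len(x)) * float(x.max() <= theta)

# Two samples with the same maximum (and n) have identical likelihood
# functions, so T = max_i X_i is sufficient for theta.
xa = [0.2, 0.7, 0.9]
xb = [0.9, 0.1, 0.4]
thetas = np.linspace(0.5, 2.0, 7)
print(all(uniform_lik(xa, t) == uniform_lik(xb, t) for t in thetas))  # True
```

Note that for θ below the sample maximum the likelihood is exactly zero, which is why the MLE sits at the boundary θ̂ = max X_i rather than at a stationary point.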

Exponential Families

Definition

Exponential Family

A parametric family is an exponential family if and only if the density can be written as:

p(xθ)=h(x)exp ⁣(η(θ)T(x)A(θ))p(x | \theta) = h(x) \exp\!\left(\eta(\theta)^\top T(x) - A(\theta)\right)

where:

  • T(x)RkT(x) \in \mathbb{R}^k is the sufficient statistic
  • η(θ)Rk\eta(\theta) \in \mathbb{R}^k is the natural parameter
  • A(θ)A(\theta) is the log-partition function (ensures normalization)
  • h(x)0h(x) \geq 0 is the base measure

When the parameterization uses η\eta directly (i.e., η\eta is the free parameter), the family is in canonical form: p(xη)=h(x)exp(ηT(x)A(η))p(x | \eta) = h(x) \exp(\eta^\top T(x) - A(\eta)).

Most distributions you encounter are exponential families: Gaussian, Bernoulli, Poisson, Exponential, Gamma, Beta, Multinomial, and Wishart. Notable exceptions: the Cauchy distribution, mixture models, and the Uniform(0,θ)(0, \theta) distribution.

Key properties of exponential families:

  1. Sufficient statistics: T(X)T(X) is always sufficient (by factorization)
  2. MLE is unique when it exists: the log-likelihood is concave in η\eta (strictly concave when the family is minimal and of full rank), so there are no local optima. Existence can fail at the boundary of the natural parameter space. Canonical failure cases: all-success or all-failure Bernoulli samples (MLE for η=logit(p)\eta = \text{logit}(p) is ±\pm\infty), all-zero Poisson samples (η=logλ=\eta = \log\lambda = -\infty), and separated data in logistic regression. Existence typically requires the observed sufficient statistic to lie in the interior of the convex hull of its support
  3. Moment-generating properties: E[T(X)]=ηA(η)\mathbb{E}[T(X)] = \nabla_\eta A(\eta) and Cov(T(X))=η2A(η)\text{Cov}(T(X)) = \nabla^2_\eta A(\eta). The log-partition function generates all the moments of TT
  4. Conjugate priors: every exponential family has a natural conjugate prior of the form π(ητ,n0)exp(τηn0A(η))\pi(\eta \mid \tau, n_0) \propto \exp(\tau^\top \eta - n_0 A(\eta)), and posterior updates reduce to incrementing (τ,n0)(\tau, n_0) by the sufficient statistic and pseudo-count. The closed-form update is guaranteed; full tractability (computable normalizing constant, posterior moments, marginal likelihood, predictive density) follows for standard cases (Beta-Bernoulli, Normal-Normal, Gamma-Poisson, Dirichlet-Multinomial) but not automatically: the conjugate normalizing constant is itself an exponential-family integral and can lack closed form, and posterior expectations and predictive distributions may still require numerical integration even when the prior–posterior pair is conjugate
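The moment-generating property in item 3 can be verified concretely for the Poisson family, where A(η) = e^η with η = log λ, so both identities reduce to λ. This sketch compares finite-difference derivatives of A with the known mean and variance, and with a simulated sample:

```python
import numpy as np

# For the Poisson family, A(eta) = exp(eta) with eta = log(lambda), so the
# identities E[T] = A'(eta) and Var(T) = A''(eta) both reduce to lambda.
lam = 3.5
eta = np.log(lam)
A = np.exp  # Poisson log-partition function

eps = 1e-5
dA = (A(eta + eps) - A(eta - eps)) / (2 * eps)              # finite-diff A'(eta)
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2   # finite-diff A''(eta)

samples = np.random.default_rng(0).poisson(lam, 200_000)
print(np.isclose(dA, lam), np.isclose(d2A, lam, atol=1e-4))
print(abs(samples.mean() - dA) < 0.05, abs(samples.var() - d2A) < 0.05)
```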
Definition

Log-Partition Function

The log-partition function ensures normalization:

A(η)=logh(x)exp(ηT(x))dxA(\eta) = \log \int h(x) \exp(\eta^\top T(x)) \, dx

It is always convex in η\eta (because it is a log of an integral of exponentials). Its first derivative gives the expected sufficient statistic: A(η)=Eη[T(X)]\nabla A(\eta) = \mathbb{E}_\eta[T(X)]. Its second derivative gives the variance: 2A(η)=Covη(T(X))\nabla^2 A(\eta) = \text{Cov}_\eta(T(X)), which is also the Fisher information in the natural parameterization: I(η)=2A(η)I(\eta) = \nabla^2 A(\eta).

Dual parameterization. Because AA is convex, the map μ=A(η)\mu = \nabla A(\eta) takes the natural parameter to the mean parameter μ=Eη[T(X)]\mu = \mathbb{E}_\eta[T(X)]. This map is a bijection onto the interior of the marginal polytope (for minimal families), and its inverse is η=A(μ)\eta = \nabla A^*(\mu), where A(μ)=supη{ημA(η)}A^*(\mu) = \sup_\eta \{\eta^\top \mu - A(\eta)\} is the Legendre-Fenchel conjugate of AA. The pair (A,A)(A, A^*) generates the dual geometry studied in information geometry.
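For the Bernoulli family these dual maps can be written down explicitly: A(η) = log(1 + e^η), the mean map is the sigmoid, and A*(μ) is the negative entropy. The sketch below checks that ∇A and ∇A* invert each other and that the Fenchel equality A(η) + A*(μ) = ημ holds at the optimum:

```python
import numpy as np

# Bernoulli family in canonical form: A(eta) = log(1 + e^eta).
# Mean map: mu = grad A(eta) = sigmoid(eta); inverse map: eta = logit(mu),
# which is grad A*(mu) for A*(mu) = mu log mu + (1 - mu) log(1 - mu).
def A(eta): return np.log1p(np.exp(eta))
def grad_A(eta): return 1.0 / (1.0 + np.exp(-eta))       # mean parameter mu
def A_star(mu): return mu * np.log(mu) + (1 - mu) * np.log(1 - mu)
def grad_A_star(mu): return np.log(mu / (1 - mu))        # natural parameter eta

eta = 0.8
mu = grad_A(eta)
print(np.isclose(grad_A_star(mu), eta))           # the gradient maps invert each other
print(np.isclose(A(eta) + A_star(mu), eta * mu))  # Fenchel equality at the optimum
```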

Conjugate Priors

For a canonical exponential-family likelihood p(xη)=h(x)exp(ηT(x)A(η))p(x | \eta) = h(x) \exp(\eta^\top T(x) - A(\eta)), the natural conjugate prior on the natural parameter η\eta has the form:

π(ητ,n0)exp ⁣(τηn0A(η))\pi(\eta | \tau, n_0) \propto \exp\!\left(\tau^\top \eta - n_0 A(\eta)\right)

with hyperparameters τRk\tau \in \mathbb{R}^k and n0>0n_0 > 0. The hyperparameter n0n_0 acts as a pseudo-count and τ\tau as a pseudo-sufficient-statistic. After observing xx with sufficient statistic T(x)T(x), the posterior is in the same family with updated hyperparameters:

ττ+T(x),n0n0+1.\tau \mapsto \tau + T(x), \qquad n_0 \mapsto n_0 + 1.

For nn i.i.d. observations, the updates accumulate: ττ+iT(xi)\tau \mapsto \tau + \sum_i T(x_i) and n0n0+nn_0 \mapsto n_0 + n.
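The accumulation above can be sketched generically; `conjugate_update` below is an illustrative helper (not a library function), shown with the Bernoulli case, where the usual change of variables identifies the canonical hyperparameters with a Beta prior via α = τ and β = n0 − τ:

```python
# A minimal sketch of the generic conjugate update: the posterior
# hyperparameters are running sums of sufficient statistics and counts.
# `conjugate_update` and `suff_stat` are illustrative names, not a library API.
def conjugate_update(tau, n0, xs, suff_stat):
    """Fold i.i.d. observations into natural-conjugate hyperparameters."""
    for x in xs:
        tau = tau + suff_stat(x)
        n0 = n0 + 1
    return tau, n0

# Bernoulli: T(x) = x, so (tau, n0) track (pseudo-successes, pseudo-trials).
# Starting from a Beta(2, 3) prior, i.e. tau = alpha = 2 and n0 = alpha + beta = 5:
tau, n0 = conjugate_update(tau=2.0, n0=5.0, xs=[1, 0, 1, 1], suff_stat=lambda x: x)
alpha, beta = tau, n0 - tau   # recover Beta(alpha + s, beta + n - s) = Beta(5, 4)
print(alpha, beta)  # 5.0 4.0
```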

Example

Three canonical conjugate pairs

Beta-Bernoulli. If XiBernoulli(θ)X_i \sim \text{Bernoulli}(\theta) with prior θBeta(α,β)\theta \sim \text{Beta}(\alpha, \beta), the posterior after observing s=ixis = \sum_i x_i successes in nn trials is θXBeta(α+s,β+ns)\theta | X \sim \text{Beta}(\alpha + s, \beta + n - s). Here τ\tau tracks successes and n0n_0 tracks total trials.

Normal-Normal (known variance). If XiN(μ,σ2)X_i \sim \mathcal{N}(\mu, \sigma^2) with known σ2\sigma^2 and prior μN(μ0,σ02)\mu \sim \mathcal{N}(\mu_0, \sigma_0^2), the posterior mean is a precision-weighted average:

μXN ⁣(σ02μ0+nσ2xˉσ02+nσ2,(σ02+nσ2)1).\mu | X \sim \mathcal{N}\!\left(\frac{\sigma_0^{-2}\mu_0 + n\sigma^{-2}\bar{x}}{\sigma_0^{-2} + n\sigma^{-2}}, \, (\sigma_0^{-2} + n\sigma^{-2})^{-1}\right).

Gamma-Poisson. If XiPoisson(λ)X_i \sim \text{Poisson}(\lambda) with prior λGamma(α,β)\lambda \sim \text{Gamma}(\alpha, \beta) (shape-rate), the posterior is λXGamma(α+ixi,β+n)\lambda | X \sim \text{Gamma}(\alpha + \sum_i x_i, \beta + n).

The posterior predictive, marginal likelihood, and Bayes factor all reduce to closed-form functions of AA, which is what makes conjugate updating analytically clean.

KL Divergence as a Bregman Divergence

The KL divergence between two members of the same exponential family has a clean closed form in terms of the log-partition function:

KL(pη1pη2)=A(η2)A(η1)A(η1)(η2η1).\mathrm{KL}(p_{\eta_1} \,\|\, p_{\eta_2}) = A(\eta_2) - A(\eta_1) - \nabla A(\eta_1)^\top (\eta_2 - \eta_1).

This is exactly the Bregman divergence generated by the convex function AA, evaluated at (η2,η1)(\eta_2, \eta_1). It measures the gap between A(η2)A(\eta_2) and its first-order Taylor approximation at η1\eta_1, so it is non-negative and zero only when η1=η2\eta_1 = \eta_2. In the dual (mean) parameterization, the same KL equals the Bregman divergence generated by AA^* with arguments swapped. This dual Bregman structure is the starting point for information geometry and explains why natural-gradient updates, moment-matching projections, and variational bounds take the forms they do.
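The formula is easy to verify numerically for two Poisson distributions, where A(η) = e^η and the direct KL sum over the pmf converges quickly:

```python
import numpy as np

def poisson_pmf(lam, xmax):
    """pmf values p(0), ..., p(xmax - 1), built iteratively to avoid factorials."""
    out = [np.exp(-lam)]
    for x in range(1, xmax):
        out.append(out[-1] * lam / x)
    return np.array(out)

l1, l2 = 2.0, 5.0
eta1, eta2 = np.log(l1), np.log(l2)

# Bregman form: A(eta2) - A(eta1) - A'(eta1) (eta2 - eta1), with A = exp.
bregman = np.exp(eta2) - np.exp(eta1) - np.exp(eta1) * (eta2 - eta1)

# Direct KL, truncating the sum once the tail mass is negligible.
p, q = poisson_pmf(l1, 60), poisson_pmf(l2, 60)
direct = np.sum(p * np.log(p / q))

print(np.isclose(bregman, direct))  # True
```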

Completeness

Definition

Complete Statistic

A sufficient statistic TT is complete if and only if for any function gg:

Eθ[g(T)]=0 for all θ    g(T)=0 a.s.\mathbb{E}_\theta[g(T)] = 0 \text{ for all } \theta \implies g(T) = 0 \text{ a.s.}

Completeness means there is no non-trivial function of TT that has mean zero for all θ\theta. In a minimal (full-rank) kk-parameter exponential family whose natural parameter space contains a non-empty open set in Rk\mathbb{R}^k, the natural sufficient statistic is complete. The full-rank / open-set condition is essential: curved exponential families (e.g. N(μ,μ2)\mathcal{N}(\mu, \mu^2)) have natural parameters constrained to a lower-dimensional manifold and their natural sufficient statistic is not complete, so Lehmann-Scheffé does not directly apply.

Completeness matters because it guarantees uniqueness: if TT is complete and sufficient, then any unbiased estimator based on TT is the unique best unbiased estimator (UMVUE). This connects to the Rao-Blackwell theorem below.

Rao-Blackwell Theorem

Theorem

Rao-Blackwell Theorem

Statement

Let UU be any unbiased estimator of τ(θ)\tau(\theta) and let TT be a sufficient statistic. Define:

U~=E[UT]\tilde{U} = \mathbb{E}[U | T]

Then U~\tilde{U} is:

  1. A function of TT alone (not of the full data)
  2. Unbiased for τ(θ)\tau(\theta)
  3. At least as good as UU: Varθ(U~)Varθ(U)\text{Var}_\theta(\tilde{U}) \leq \text{Var}_\theta(U) for all θ\theta, with equality only if UU is already a function of TT.

Intuition

Conditioning on a sufficient statistic can only help (or not hurt) estimation. The sufficient statistic contains all the information about θ\theta. Any remaining randomness in UU beyond what TT captures is pure noise. Conditioning on TT averages out this noise, reducing variance while preserving unbiasedness.

Proof Sketch

Unbiasedness: E[U~]=E[E[UT]]=E[U]=τ(θ)\mathbb{E}[\tilde{U}] = \mathbb{E}[\mathbb{E}[U|T]] = \mathbb{E}[U] = \tau(\theta) by the tower property.

Variance reduction: By the law of total variance: Var(U)=E[Var(UT)]+Var(E[UT])=E[Var(UT)]+Var(U~)\text{Var}(U) = \mathbb{E}[\text{Var}(U|T)] + \text{Var}(\mathbb{E}[U|T]) = \mathbb{E}[\text{Var}(U|T)] + \text{Var}(\tilde{U}).

Since E[Var(UT)]0\mathbb{E}[\text{Var}(U|T)] \geq 0, we get Var(U)Var(U~)\text{Var}(U) \geq \text{Var}(\tilde{U}).

Why It Matters

Rao-Blackwell says: never ignore a sufficient statistic. If you have any unbiased estimator, you can improve it (or at least not hurt it) by conditioning on a sufficient statistic. Combined with completeness, this gives the Lehmann-Scheffé theorem: if TT is complete and sufficient, then E[UT]\mathbb{E}[U|T] is the unique minimum-variance unbiased estimator (UMVUE).
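A classic simulation makes the variance reduction concrete. To estimate τ(λ) = P(X = 0) = e^{-λ} from Poisson data, the crude unbiased estimator U = 1[X₁ = 0] is Rao-Blackwellized to (1 − 1/n)^T, because X₁ given T = t is Binomial(t, 1/n):

```python
import numpy as np

# Rao-Blackwell in action: estimate P(X = 0) = exp(-lambda) from a Poisson sample.
# Crude unbiased estimator: U = 1[X_1 = 0]. Conditioning on the sufficient
# statistic T = sum_i X_i gives E[U | T] = (1 - 1/n)^T.
rng = np.random.default_rng(0)
lam, n, reps = 1.5, 10, 50_000

X = rng.poisson(lam, size=(reps, n))
U = (X[:, 0] == 0).astype(float)        # crude estimator, one per replication
U_rb = (1 - 1 / n) ** X.sum(axis=1)     # Rao-Blackwellized estimator

true_val = np.exp(-lam)
print(abs(U.mean() - true_val) < 0.01, abs(U_rb.mean() - true_val) < 0.01)
print(U_rb.var() < U.var())  # both unbiased, but the variance drops sharply
```

Both estimators center on e^{-1.5} ≈ 0.223, but the conditioned estimator's variance is an order of magnitude smaller, exactly as the law of total variance predicts.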

Failure Mode

Rao-Blackwell improves unbiased estimators, but unbiasedness itself is not always desirable. Biased estimators (like the James-Stein estimator or ridge regression) can have lower MSE. The Rao-Blackwell theorem operates within the class of unbiased estimators and cannot compare across that boundary.

Canonical Examples

Example

Sufficient statistic for Gaussian mean

Let X1,,XnN(μ,σ2)X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2) with known σ2\sigma^2. The joint density is:

p(xμ)=(2πσ2)n/2exp ⁣(12σ2i(xiμ)2)p(x|\mu) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_i(x_i - \mu)^2\right)

Expanding the square: i(xiμ)2=ixi22μixi+nμ2\sum_i(x_i - \mu)^2 = \sum_i x_i^2 - 2\mu \sum_i x_i + n\mu^2.

By factorization: g(T,μ)=exp(12σ2(2μnxˉ+nμ2))g(T, \mu) = \exp(-\frac{1}{2\sigma^2}(-2\mu n\bar{x} + n\mu^2)) where T=Xˉ=1niXiT = \bar{X} = \frac{1}{n}\sum_i X_i. The sample mean is sufficient for μ\mu. This is an exponential family with natural parameter η=μ/σ2\eta = \mu/\sigma^2 and sufficient statistic T=ixiT = \sum_i x_i.

Example

Exponential family form of the Poisson distribution

p(xλ)=λxeλx!=1x!exp(xlogλλ)p(x | \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!} \exp(x \log\lambda - \lambda).

This is an exponential family with T(x)=xT(x) = x, η=logλ\eta = \log\lambda, A(η)=eη=λA(\eta) = e^\eta = \lambda, and h(x)=1/x!h(x) = 1/x!. For nn i.i.d. observations, T=iXiT = \sum_i X_i is sufficient for λ\lambda.
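The decomposition can be checked term by term in a few lines:

```python
import math

# Verify the exponential-family decomposition of the Poisson pmf:
# p(x | lambda) = h(x) * exp(eta * T(x) - A(eta)) with
# T(x) = x, eta = log(lambda), A(eta) = e^eta, h(x) = 1/x!.
lam = 4.2
eta = math.log(lam)

for x in range(15):
    direct = lam**x * math.exp(-lam) / math.factorial(x)
    expfam = (1 / math.factorial(x)) * math.exp(eta * x - math.exp(eta))
    assert math.isclose(direct, expfam)
print("decomposition matches for x = 0..14")
```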

Common Confusions

Watch Out

Sufficient does not mean minimal sufficient

The entire data vector X=(X1,,Xn)X = (X_1, \ldots, X_n) is always trivially sufficient (the identity is a sufficient statistic). The interesting question is how much you can compress. Minimal sufficiency gives the maximum compression. For a minimal (full-rank) kk-parameter exponential family, the minimal sufficient statistic is kk-dimensional regardless of sample size nn. In curved (non-full-rank) exponential families the parameter manifold has dimension less than kk, so the natural sufficient statistic is still kk-dimensional but it is no longer minimal: the MSS is typically of lower dimension and need not be complete (e.g. N(μ,μ2)\mathcal{N}(\mu, \mu^2)).

Watch Out

Not all distributions are exponential families

Mixture distributions are not exponential families (the sufficient statistic dimension grows with nn). The Cauchy distribution is not an exponential family. The Uniform(0,θ)(0, \theta) distribution is not one either (because the support depends on θ\theta). When you are outside exponential families, the clean theory of sufficient statistics and conjugate priors does not apply as neatly.

Summary

  • A statistic T(X)T(X) is sufficient if and only if the conditional distribution of XX given TT does not depend on θ\theta
  • Factorization theorem: p(xθ)=g(T(x),θ)h(x)p(x|\theta) = g(T(x), \theta) \cdot h(x) characterizes sufficiency
  • Exponential families: p(xθ)=h(x)exp(η(θ)T(x)A(θ))p(x|\theta) = h(x) \exp(\eta(\theta)^\top T(x) - A(\theta))
  • The log-partition function A(η)A(\eta) generates moments of TT: E[T]=A\mathbb{E}[T] = \nabla A, Cov(T)=2A\text{Cov}(T) = \nabla^2 A
  • Completeness + sufficiency gives uniqueness of UMVUE
  • Rao-Blackwell: condition on a sufficient statistic to improve any unbiased estimator

Exercises

ExerciseCore

Problem

Find the sufficient statistic for θ\theta in the Bernoulli model: X1,,XnBernoulli(θ)X_1, \ldots, X_n \sim \text{Bernoulli}(\theta). Write the joint pmf in exponential family form and identify the natural parameter, sufficient statistic, and log-partition function.

ExerciseAdvanced

Problem

Let X1,,XnUniform(0,θ)X_1, \ldots, X_n \sim \text{Uniform}(0, \theta). Show that T=X(n)=maxiXiT = X_{(n)} = \max_i X_i is sufficient for θ\theta but this is not an exponential family. Why does this matter for the MLE?

ExerciseResearch

Problem

Prove that in a kk-parameter exponential family where the natural parameter space contains an open set, the natural sufficient statistic T(X)=i=1nT(Xi)T(X) = \sum_{i=1}^n T(X_i) is complete. Why does this, combined with Rao-Blackwell, imply that any unbiased estimator based on TT is UMVUE?

References

Canonical:

  • Casella & Berger, Statistical Inference (2nd ed., 2002), Chapters 6-7
  • Lehmann & Casella, Theory of Point Estimation (2nd ed., 1998), Chapters 1-4
  • Keener, Theoretical Statistics (2010), Chapters 3-4

Current:

  • Wasserman, All of Statistics (2004), Chapter 9
  • Wainwright & Jordan, "Graphical Models, Exponential Families, and Variational Inference" (2008)
  • van der Vaart, Asymptotic Statistics (1998), Chapters 2-8

Next Topics

Building on sufficient statistics and exponential families:

  • Fisher information: the curvature of the log-likelihood, directly related to the log-partition function in exponential families
  • Hypothesis testing for ML: using sufficient statistics to construct optimal tests
  • EM algorithm: exploiting exponential family structure for latent variable models

Last reviewed: April 26, 2026
