
Statistical Estimation

Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency

MLE: pick the parameter that maximizes the likelihood of observed data. Score function, Bartlett identities, regularity conditions, consistency, asymptotic normality, Wilks' theorem, Cramér-Rao efficiency, exponential families, QMLE under misspecification, and the bridge to deep-learning negative log-likelihood training.

Core · Tier 1 · Stable · Core spine · ~90 min

Why This Matters

Maximum likelihood estimation is the workhorse of parametric inference, and almost every modern training objective is a likelihood in disguise. Train a logistic regression by minimizing cross-entropy loss and you are running MLE on a Bernoulli/categorical model. Fit a Gaussian mixture model and you are running MLE via EM. Train a language model by minimizing perplexity and you are running MLE on a categorical sequence model. Even modern generative models (variational autoencoders, normalizing flows, autoregressive transformers) train against the negative log-likelihood (NLL) or a tractable bound on it.

Figure (five-panel infographic on MLE): setup with i.i.d. data and the likelihood as a product of densities; log-likelihood algebra (products become sums, a monotone transformation preserves the argmax); geometric intuition of the likelihood peaked at the MLE; a worked example for the normal mean (the MLE is the sample mean); and why MLE matters (consistency, asymptotic normality, invariance under reparameterization), with the warning that the likelihood is not a probability distribution over θ unless a prior is introduced.
MLE picks the parameter value under which the observed data look least surprising. It is consistent and asymptotically normal under regularity conditions, and invariant under one-to-one reparameterizations.

Five questions decide whether the MLE is the right tool for a given problem:

  1. Does it converge to the truth? Consistency: $\hat\theta_n \xrightarrow{P} \theta_0$ under identifiability and standard regularity (Cramér 1946; Wald 1949).
  2. How accurate is it? Asymptotic normality: $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal N(0, I(\theta_0)^{-1})$ under Cramér-type regularity.
  3. Can any unbiased estimator do better? Cramér-Rao: no regular unbiased estimator has smaller variance than $1/(n I(\theta_0))$. The MLE attains this bound asymptotically.
  4. What if the model is wrong? White's QMLE: the MLE converges to the least-false parameter $\theta^* = \arg\min_\theta D_{\mathrm{KL}}(P_0 \| p_\theta)$, with sandwich variance $V^{-1} A V^{-\top}$ replacing $I^{-1}$.
  5. What can we test? Wilks' theorem: $-2 \log \Lambda_n \xrightarrow{d} \chi^2_q$ under $H_0$ (with non-central $\chi^2$ under contiguous alternatives), so the likelihood-ratio statistic has a calibrated null distribution.

Each answer requires the information identity $I(\theta) = -\mathbb E_\theta[\nabla^2 \log p(X;\theta)] = \mathbb E_\theta[s(\theta;X) s(\theta;X)^\top]$, which fails the moment regularity does, and recognizing those failures is what separates a working statistician from a textbook user.

Likelihood gives one geometric object and three equally important readings

Maximum likelihood picks the parameter where the log-likelihood peaks. Three readings of the same object:

  • Optimization target. The MLE is the parameter value where the observed data has the largest joint density under the model.
  • Why logs matter. The log transform turns products into sums, making numerical optimization and asymptotic analysis possible.
  • ML translation. Cross-entropy, token negative log-likelihood, and many probabilistic deep-learning losses are MLE objectives in disguise.

Mental Model

You observe $x_1, \ldots, x_n$ from $p(x \mid \theta_0)$ with unknown $\theta_0 \in \Theta \subseteq \mathbb R^d$. The MLE asks: which $\theta$ assigns the largest joint density to what you actually saw? Two equivalent framings:

  • Pick the parameter under which your data is least surprising. Each datum contributes a log-likelihood $\log p(x_i; \theta)$; the MLE maximizes the sum.
  • Minimize empirical cross-entropy against the model. $\hat\theta_n = \arg\min_\theta -\frac{1}{n}\sum_{i=1}^n \log p_\theta(x_i)$. This form is what most ML loss functions are: minimizing categorical cross-entropy or negative log-likelihood is exactly MLE. For discrete models with shared support, this also equals the information projection $\arg\min_\theta D_{\mathrm{KL}}(\hat P_n \| p_\theta)$ in Csiszár's sense, since the empirical entropy is $\theta$-independent. For continuous-density models the empirical $\hat P_n$ is a discrete mixture of point masses, so $D_{\mathrm{KL}}(\hat P_n \| p_\theta) = \infty$ in the strict measure-theoretic sense; the KL framing then holds only as a formal/heuristic identity, while empirical cross-entropy minimization holds rigorously.

The log transform converts products into sums, which buys two things at once: numerical stability (no underflow on long sequences) and access to the law of large numbers and central limit theorem (sums of i.i.d. terms are exactly what asymptotic statistics is designed to handle). This single change turns finite-sample maximum-density into an asymptotic theory of estimators.
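
A minimal numerical sketch of the cross-entropy reading, assuming NumPy is available (variable names are illustrative): minimizing the average negative log-likelihood of a Bernoulli model over a grid recovers the sample mean, which is the closed-form MLE.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=200)           # Bernoulli(0.3) sample

def avg_nll(theta, x):
    """Empirical cross-entropy (average negative log-likelihood) of a Bernoulli model."""
    return -np.mean(x * np.log(theta) + (1 - x) * np.log(1 - theta))

grid = np.linspace(1e-3, 1 - 1e-3, 999)
theta_hat = grid[np.argmin([avg_nll(t, x) for t in grid])]
print(theta_hat, x.mean())                   # the argmin of the NLL matches the sample mean
```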

Formal Setup and Notation

Let $X_1, \ldots, X_n$ be i.i.d. from $p(x; \theta_0)$ where $\theta_0 \in \Theta \subseteq \mathbb R^d$. The parametric family $\mathcal P = \{p(\cdot ; \theta) : \theta \in \Theta\}$ is the statistical model. The dominating measure is $\mu$ (Lebesgue for continuous data, counting measure for discrete data); $p$ is a density with respect to $\mu$.

Definition

Likelihood Function

The likelihood function is

$$L_n(\theta) = \prod_{i=1}^n p(X_i; \theta).$$

It is a function of $\theta$ with the data fixed. Despite looking like a density on the data space, $L_n(\theta)$ is not a probability distribution over $\theta$ (it does not integrate to one over $\Theta$). Bayesians multiply by a prior $\pi(\theta)$ and renormalize to obtain a posterior; frequentists treat $L_n$ as the input to estimation and testing.

Definition

Log-Likelihood

The log-likelihood is

$$\ell_n(\theta) = \sum_{i=1}^n \log p(X_i; \theta).$$

The log transform turns products into sums, which is required for the law of large numbers, the central limit theorem, and floating-point stability on long product chains.

Definition

Maximum Likelihood Estimator

The MLE is any measurable maximizer

$$\hat\theta_n \in \arg\max_{\theta \in \Theta} \ell_n(\theta).$$

When $\Theta$ is open and $\ell_n$ is differentiable, $\hat\theta_n$ solves the score equation $\sum_{i=1}^n s(\theta; X_i) = 0$ where $s(\theta; x) = \nabla_\theta \log p(x; \theta)$. The MLE may not exist (infinite or non-attained supremum), may not be unique (multimodal $\ell_n$), and may sit on the boundary of $\Theta$. Existence and uniqueness require structural assumptions (compactness of $\Theta$, concavity of $\ell_n$, identifiability, or all three).

Score Function and Fisher Information

Definition

Score Function

The score function is the gradient of the log-density:

$$s(\theta; x) = \nabla_\theta \log p(x; \theta) = \frac{\nabla_\theta p(x; \theta)}{p(x; \theta)}.$$

Under regularity (interchange of differentiation and integration), the score has mean zero at the true parameter:

$$\mathbb E_{\theta_0}[s(\theta_0; X)] = \int \nabla_\theta p(x; \theta_0) \, d\mu(x) = \nabla_\theta \int p(x; \theta_0) \, d\mu(x) = \nabla_\theta 1 = 0.$$

The MLE satisfies $\sum_i s(\hat\theta_n; X_i) = 0$, the empirical analog of $\mathbb E[s(\theta_0; X)] = 0$.

Definition

Fisher Information

The (per-observation, expected) Fisher information at $\theta$ is

$$I(\theta) = \mathbb E_\theta[s(\theta; X) \, s(\theta; X)^\top] = \mathrm{Cov}_\theta(s(\theta; X)),$$

a $d \times d$ positive semidefinite matrix. Under second-order regularity (twice differentiability of $\log p$ and a second-order interchange) it equals the negative expected Hessian:

$$I(\theta) = -\mathbb E_\theta[\nabla^2_\theta \log p(X; \theta)].$$

For $n$ i.i.d. observations the total information is $n I(\theta)$. Fisher information measures how sharply the log-likelihood concentrates around the truth: large $I(\theta)$ means a steep peak (easy to estimate), small $I(\theta)$ means a flat surface (hard to estimate).

Definition

Observed Fisher Information

The (data-dependent) observed Fisher information is

$$J_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \nabla^2_\theta \log p(X_i; \theta).$$

By the law of large numbers $J_n(\theta) \xrightarrow{P} I(\theta)$. In practice one reports $J_n(\hat\theta_n)^{-1}/n$ as a finite-sample variance estimate; under misspecification $J_n(\hat\theta_n)$ does not equal the population $I(\theta)$ and the sandwich form is needed (see QMLE below).
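
A small numerical check, assuming NumPy (names are illustrative): for a Poisson model the observed information at the MLE, computed as a finite-difference second derivative of the average negative log-likelihood, matches the expected information $I(\lambda) = 1/\lambda$ evaluated at $\hat\lambda$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(4.0, size=5000)
lam_hat = x.mean()                           # Poisson MLE

def avg_negloglik(lam):
    # average NLL, dropping the lambda-independent log(x!) term
    return -np.mean(x * np.log(lam) - lam)

# observed information: numerical second derivative of the average NLL at the MLE
h = 1e-3
J = (avg_negloglik(lam_hat + h) - 2 * avg_negloglik(lam_hat) + avg_negloglik(lam_hat - h)) / h**2
print(J, 1.0 / lam_hat)                      # observed ~ expected info I(lambda) = 1/lambda
print(1.0 / (J * x.size), lam_hat / x.size)  # Jn^{-1}/n vs the asymptotic variance lambda/n
```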

The Information Identity

The two formulas $I(\theta) = \mathbb E[ss^\top]$ and $I(\theta) = -\mathbb E[\nabla^2 \log p]$ are not separate definitions; they are equal under regularity, and the equality is the single most-used identity in likelihood theory. Bartlett (1953) generalized this to a family of cumulant identities obtained by repeated differentiation of $\int p = 1$.

Theorem

Bartlett Identities (Information Equality)

Statement

Under regularity, the score has mean zero and Fisher information equals the negative expected Hessian:

$$\mathbb E_\theta[s(\theta; X)] = 0, \qquad \mathrm{Cov}_\theta(s(\theta; X)) = I(\theta) = -\mathbb E_\theta[\nabla^2_\theta \log p(X; \theta)].$$

Equivalently, $\mathbb E_\theta[\nabla_\theta s(\theta; X) + s(\theta; X) s(\theta; X)^\top] = 0$, the second Bartlett identity.

Intuition

Differentiate $1 = \int p(x; \theta) \, d\mu(x)$ once: $0 = \int \nabla_\theta p = \int s \cdot p \, d\mu = \mathbb E[s]$. Differentiate again:

$$0 = \int \nabla^2_\theta p \, d\mu = \int (\nabla_\theta s + s s^\top) p \, d\mu = \mathbb E[\nabla_\theta s] + \mathbb E[s s^\top].$$

Since $\mathbb E[\nabla_\theta s] = \mathbb E[\nabla^2_\theta \log p]$, this rearranges to $-\mathbb E[\nabla^2 \log p] = \mathbb E[s s^\top] = I(\theta)$. The identity is bookkeeping that turns one expression for $I$ into the other; the content is that you can interchange differentiation and integration.

Proof Sketch

Step 1 (first identity). Differentiate $\int p(x; \theta) \, d\mu(x) = 1$ with respect to $\theta$. Pass the derivative inside the integral (assumption). Use $\nabla p = s \cdot p$. Conclude $\mathbb E[s] = 0$.

Step 2 (second identity). Differentiate again. By the same interchange,

$$0 = \int \nabla^2 p \, d\mu = \int \big( \nabla s + s s^\top \big) p \, d\mu.$$

A cleaner path is to differentiate the score directly. From $s = \nabla \log p = \nabla p / p$, differentiate:

$$\nabla s = \frac{\nabla^2 p}{p} - \frac{(\nabla p)(\nabla p)^\top}{p^2} = \frac{\nabla^2 p}{p} - s s^\top.$$

Take expectation under $\theta$, using $\mathbb E[\nabla^2 p / p] = \int \nabla^2 p \, d\mu = 0$, which follows from applying the interchange a second time:

$$\mathbb E[\nabla s] = -\mathbb E[s s^\top] = -I(\theta).$$

Since $\nabla s = \nabla^2 \log p$, this gives $\mathbb E[\nabla^2 \log p] = -I(\theta)$, equivalently $I(\theta) = -\mathbb E[\nabla^2 \log p]$.
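
A quick Monte Carlo sanity check of both identities, assuming NumPy (the Poisson choice is illustrative): the sample mean of the score is near zero, and the score variance matches the mean of the negative second derivative of the log-density.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 3.0
x = rng.poisson(lam, size=200_000)

score = x / lam - 1.0                     # d/dlam log p(x; lam) for Poisson
neg_hess = x / lam**2                     # -d^2/dlam^2 log p(x; lam)

print(score.mean())                       # ~ 0                (first Bartlett identity)
print(score.var(), neg_hess.mean())       # both ~ 1/lam = 0.333 (second Bartlett identity)
```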

Why It Matters

The information identity is the linchpin of three subsequent results:

  1. Asymptotic normality of the MLE uses $-\mathbb E[\nabla^2 \log p] = I(\theta)$ to identify the variance.
  2. Cramér-Rao uses $\mathrm{Cov}(s) = I(\theta)$ in the Cauchy-Schwarz step.
  3. Wilks' theorem uses both forms when expanding $\log L(\hat\theta) - \log L(\theta_0)$ to second order.

When the identity fails (parameter on a boundary, support depending on $\theta$, non-smooth $\log p$), every classical asymptotic for the MLE fails along with it, and you need M-estimator machinery (sandwich variance, cube-root rates, mixture-$\chi^2$ limits) instead.

Failure Mode

The interchange step requires more than two derivatives existing. Standard counterexamples: (1) Uniform$(0, \theta)$ with support $[0, \theta]$ depends on $\theta$, so $\nabla \int p = 0$ but $\int \nabla p \neq 0$; the score is not even well-defined and the MLE is $X_{(n)}$ with rate $n$, not $\sqrt n$. (2) Models with a discontinuity in $p$ at a $\theta$-dependent threshold (change-point models) violate the same interchange; the MLE has a non-Gaussian limit at rate $n$. (3) Mixture models on the boundary ($\pi = 0$) make $I(\theta)$ singular and the second identity meaningless.

Regularity Conditions

The asymptotic theorems below require the parametric family to be regular. The canonical Cramér-type list (van der Vaart 1998 Ch. 5; Lehmann-Casella 1998 Ch. 6):

  1. Identifiability: $\theta_1 \neq \theta_2 \Rightarrow p(\cdot; \theta_1) \neq p(\cdot; \theta_2)$ as measures.
  2. Common support: $\{x : p(x; \theta) > 0\}$ does not depend on $\theta$.
  3. Interior: $\theta_0 \in \mathrm{int}(\Theta)$.
  4. Smoothness: $\log p(x; \theta)$ is three times continuously differentiable in $\theta$, with third derivatives uniformly bounded by an integrable envelope $M(x)$ in a neighborhood of $\theta_0$.
  5. Interchange: differentiation under the integral sign is valid through second order (justified by dominated convergence with the same envelope).
  6. Information: $I(\theta_0)$ is finite and positive definite.

Le Cam (1970) replaced conditions 4-5 with the much weaker differentiability in quadratic mean (DQM): there exists a score-like vector $\dot \ell_{\theta_0}$ such that $\sqrt{p_{\theta_0 + h}} - \sqrt{p_{\theta_0}} - \tfrac 12 h^\top \dot \ell_{\theta_0} \sqrt{p_{\theta_0}} = o(\|h\|)$ in $L^2(\mu)$. DQM holds for all common parametric families and yields the same asymptotic normality without requiring three derivatives or an envelope. See asymptotic-statistics for the LAN treatment.

Main Theorems

Theorem

Consistency of the MLE (Wald)

Statement

Under Wald's conditions, the MLE is consistent:

$$\hat\theta_n \xrightarrow{P} \theta_0 \quad \text{as } n \to \infty.$$

If $\Theta$ is open (rather than compact), the same conclusion holds when $\ell_n$ is concave and $\theta_0$ is the unique maximizer of $\theta \mapsto \mathbb E_{\theta_0}[\log p(X; \theta)]$.

Intuition

The empirical log-likelihood per observation $\bar\ell_n(\theta) = n^{-1} \sum_i \log p(X_i; \theta)$ is a sample average of i.i.d. terms. By the law of large numbers, $\bar \ell_n(\theta) \to L(\theta) := \mathbb E_{\theta_0}[\log p(X; \theta)]$ pointwise. The Kullback-Leibler inequality gives

$$L(\theta_0) - L(\theta) = D_{\mathrm{KL}}(p_{\theta_0} \| p_\theta) \geq 0,$$

with equality only when $p_{\theta_0} = p_\theta$, i.e. (by identifiability) $\theta = \theta_0$. So $\theta_0$ uniquely maximizes the limit objective. Under uniform convergence (Wald's compactness + envelope, or concavity), the maximizer of $\bar\ell_n$ converges to the maximizer of $L$, which is $\theta_0$.

Proof Sketch

Step 1 (LLN). For each $\theta$, $\bar\ell_n(\theta) \xrightarrow{a.s.} L(\theta)$ by the strong law.

Step 2 (uniform convergence). Compactness + envelope yields $\sup_{\theta \in \Theta} | \bar\ell_n(\theta) - L(\theta) | \xrightarrow{a.s.} 0$ (uniform LLN; van der Vaart Lemma 19.36 or Newey-McFadden Lemma 2.4).

Step 3 (KL identifies the max). $L(\theta_0) - L(\theta) = D_{\mathrm{KL}}(p_{\theta_0} \| p_\theta) > 0$ for $\theta \neq \theta_0$ by Gibbs' inequality and identifiability.

Step 4 (argmax convergence). A standard argmax-continuity argument (van der Vaart Theorem 5.7) gives $\hat\theta_n \xrightarrow{P} \theta_0$.

Why It Matters

Consistency is the minimum bar for any estimator and the prerequisite for asymptotic normality (which is local around $\theta_0$). The proof structure has three parts: LLN gives pointwise convergence; uniformity comes from compactness or concavity; KL identifies the maximizer. It is the template every modern consistency proof reuses (M-estimators, GMM, EM, deep learning's empirical loss).
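
A minimal simulation sketch of consistency, assuming NumPy (the exponential model and constants are illustrative): the MLE $\hat\lambda = 1/\bar X$ drifts toward the true rate as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
lam0 = 2.0
for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.exponential(1.0 / lam0, size=n)    # NumPy parameterizes by the scale 1/lambda
    print(n, 1.0 / x.mean())                   # MLE 1/x-bar approaches lam0 = 2.0
```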

Failure Mode

(1) Non-identifiability. Mixture models with label-switching: $(\pi, \mu_1, \mu_2)$ and $(1-\pi, \mu_2, \mu_1)$ give the same density. The KL gap collapses on the orbit, and the MLE is not consistent for any single labeling. (2) Boundary blow-up. In Gaussian mixtures, sending one component's variance to zero with its mean equal to a data point makes $L_n \to \infty$. The MLE does not exist; constrained or penalized variants are needed. (3) Neyman-Scott. Parameter dimension growing with $n$ (one mean per pair of observations, one shared variance) gives an inconsistent variance MLE. Profile or modified likelihoods recover consistency. (4) Misspecification. If the true distribution $P_0$ is not in the model family, the MLE converges to the least-false parameter $\arg\min_\theta D_{\mathrm{KL}}(P_0 \| p_\theta)$, not to any "true" $\theta_0$ (see QMLE below).

Theorem

Asymptotic Normality of the MLE

Statement

Under regularity and consistency, the MLE is asymptotically normal:

$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal N\!\left(0, I(\theta_0)^{-1}\right).$$

Equivalently, $\hat\theta_n \approx \mathcal N(\theta_0, I(\theta_0)^{-1}/n)$ for large $n$. The variance $I(\theta_0)^{-1}/n$ is the Cramér-Rao lower bound, attained by the MLE in the limit.

Intuition

Taylor-expand the score equation around $\theta_0$:

$$0 = \frac{1}{n} \sum_i s(\hat\theta_n; X_i) \approx \frac{1}{n} \sum_i s(\theta_0; X_i) + \left[\frac{1}{n} \sum_i \nabla s(\theta_0; X_i)\right] (\hat\theta_n - \theta_0).$$

The first term, scaled by $\sqrt n$, is a CLT for an i.i.d. mean-zero sum with covariance $I(\theta_0)$ (Bartlett identity I), so $n^{-1/2} \sum_i s(\theta_0; X_i) \xrightarrow{d} \mathcal N(0, I(\theta_0))$. The bracket converges in probability to $\mathbb E[\nabla s(\theta_0; X)] = -I(\theta_0)$ (Bartlett identity II + LLN). Solving,

$$\sqrt n (\hat\theta_n - \theta_0) \approx I(\theta_0)^{-1} \cdot n^{-1/2} \sum_i s(\theta_0; X_i) \xrightarrow{d} \mathcal N(0, I(\theta_0)^{-1} I(\theta_0) I(\theta_0)^{-1}) = \mathcal N(0, I(\theta_0)^{-1}).$$

Proof Sketch

Step 1 (Taylor). Mean-value-expand the score: $0 = \bar s_n(\hat\theta_n) = \bar s_n(\theta_0) + \overline{\nabla s}_n(\tilde\theta) (\hat\theta_n - \theta_0)$ for some $\tilde\theta$ between $\hat\theta_n$ and $\theta_0$.

Step 2 (CLT). $\sqrt n \cdot \bar s_n(\theta_0) = n^{-1/2} \sum_i s(\theta_0; X_i) \xrightarrow{d} \mathcal N(0, I(\theta_0))$ by the multivariate CLT (Bartlett identity I gives the variance).

Step 3 (LLN + consistency). $\overline{\nabla s}_n(\tilde\theta) \xrightarrow{P} \mathbb E[\nabla s(\theta_0; X)] = -I(\theta_0)$ by the uniform LLN plus continuity at $\theta_0$ and consistency of $\hat\theta_n$.

Step 4 (Slutsky). Combining steps 2 and 3,

$$\sqrt n (\hat\theta_n - \theta_0) = -\overline{\nabla s}_n(\tilde\theta)^{-1} \cdot \sqrt n \cdot \bar s_n(\theta_0) \xrightarrow{d} I(\theta_0)^{-1} \mathcal N(0, I(\theta_0)) = \mathcal N(0, I(\theta_0)^{-1}).$$

Why It Matters

Asymptotic normality is the engine behind every classical inferential procedure built on the MLE: Wald confidence intervals $\hat\theta_n \pm z_{1-\alpha/2} \sqrt{I(\hat\theta_n)^{-1}/n}$, Wald hypothesis tests, model comparison via AIC and BIC (which use the asymptotic distribution of the maximized log-likelihood). The functional form $\mathcal N(0, I^{-1})$ also plugs straight into the delta method for any smooth $g(\hat\theta_n)$, giving CIs for derived quantities (odds ratios, hazard ratios, predicted probabilities).
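
A simulation sketch of the normal approximation, assuming NumPy (constants are illustrative): for the exponential model $I(\lambda) = 1/\lambda^2$, so $\sqrt n(\hat\lambda_n - \lambda_0)$ should have variance near $\lambda_0^2$ and roughly 95% of draws should fall within $1.96\,\lambda_0$.

```python
import numpy as np

rng = np.random.default_rng(4)
lam0, n, reps = 2.0, 500, 20_000
x = rng.exponential(1.0 / lam0, size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)                    # exponential MLE in each replication
z = np.sqrt(n) * (lam_hat - lam0)

print(z.var(), lam0**2)                           # ~ I(lam0)^{-1} = lam0^2 = 4
print(np.mean(np.abs(z) <= 1.96 * lam0))          # ~ 0.95 under the normal approximation
```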

Failure Mode

Asymptotic normality fails in four canonical cases. (1) $\theta_0$ sits on the boundary of $\Theta$ (variance equal to zero, mixture proportion equal to zero): the limit becomes a projection of a Gaussian onto a tangent cone (Self-Liang 1987). (2) Fisher information is singular (flat likelihood, weak identification): the rate slows below $\sqrt n$ or the limit becomes non-Gaussian. (3) The model is misspecified: replace $I(\theta_0)^{-1}$ with the sandwich $V^{-1} A V^{-\top}$ (QMLE, below). (4) Finite-sample concerns dominate, especially in high dimensions where the $\sqrt n$ regime requires $n \gg d^2$ in many examples.

Theorem

Wilks' Theorem (LRT Asymptotic χ²)

Statement

Let $\Lambda_n = L_n(\hat\theta_n^{(0)}) / L_n(\hat\theta_n)$ be the likelihood ratio comparing the constrained MLE under $H_0$ to the unconstrained MLE. Under the null,

$$-2 \log \Lambda_n = 2 \big(\ell_n(\hat\theta_n) - \ell_n(\hat\theta_n^{(0)})\big) \xrightarrow{d} \chi^2_q,$$

where $q$ is the codimension of the null (number of restrictions). Under contiguous alternatives $\theta_n = \theta_0 + h/\sqrt n$, the limit is non-central $\chi^2_q(\delta)$ with non-centrality $\delta = h^\top I(\theta_0)_{\mathrm{rest}} h$, where $I(\theta_0)_{\mathrm{rest}}$ is the restricted block of Fisher information.

Intuition

Taylor-expand the log-likelihood around the unconstrained MLE:

$$\ell_n(\hat\theta_n) - \ell_n(\hat\theta_n^{(0)}) \approx \tfrac 12 (\hat\theta_n - \hat\theta_n^{(0)})^\top n I(\theta_0) (\hat\theta_n - \hat\theta_n^{(0)}) + o_P(1).$$

The vector $\sqrt n (\hat\theta_n - \hat\theta_n^{(0)})$ is asymptotically Gaussian, projected onto the orthogonal complement of the null tangent space (a $q$-dimensional subspace). Quadratic forms of $q$-dimensional Gaussians with the right covariance are exactly $\chi^2_q$.

Proof Sketch

Step 1. Under regularity, $\sqrt n (\hat\theta_n - \theta_0) \xrightarrow{d} Z \sim \mathcal N(0, I^{-1})$ and $\sqrt n (\hat\theta_n^{(0)} - \theta_0) \xrightarrow{d} P_{H_0} Z$ (projection onto the null tangent space in the Fisher metric).

Step 2. Second-order Taylor expansion of $\ell_n$ around $\hat\theta_n$ gives $2 \big(\ell_n(\hat\theta_n) - \ell_n(\hat\theta_n^{(0)})\big) = n (\hat\theta_n - \hat\theta_n^{(0)})^\top I(\theta_0) (\hat\theta_n - \hat\theta_n^{(0)}) + o_P(1)$ (the gradient at $\hat\theta_n$ is zero, the Hessian at $\theta_0$ converges to $-I(\theta_0)$ by Bartlett II).

Step 3. Substituting Step 1 into Step 2: the limit is $\| Z - P_{H_0} Z \|^2_{I}$, the squared Fisher norm of the projection of $Z$ onto the orthogonal complement of $H_0$. This equals $\| (I - P_{H_0}) Z \|^2_I = Z^\top (I - P_{H_0})^\top I (I - P_{H_0}) Z$. Substituting $\mathrm{Var}(Z) = I^{-1}$ and choosing coordinates so that the orthogonal complement is the first $q$ axes gives $\chi^2_q$.

Why It Matters

Wilks' theorem makes the likelihood-ratio test (LRT) practical: you can compute $-2 \log \Lambda_n$ and compare to a $\chi^2_q$ critical value without ever computing Fisher information by hand. It is the basis for nested-model comparison in regression (drop $q$ predictors and the LRT has a $\chi^2_q$ null), in GLMs (deviance differences), in mixed models, in survival analysis (Cox PH), and in modern model selection (the BIC penalty derives from the Wilks limit). The non-central $\chi^2$ result under contiguous alternatives gives the asymptotic power of the test; see the LRT vs Wald vs score equivalence in asymptotic-statistics.
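
A simulation sketch of the Wilks calibration, assuming NumPy and SciPy (the exponential null and constants are illustrative): under $H_0: \lambda = \lambda_0$ the statistic $-2\log\Lambda_n$ should exceed the $\chi^2_1$ 95% quantile about 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
lam0, n, reps = 2.0, 200, 20_000

def loglik(lam, x):
    return x.size * np.log(lam) - lam * x.sum()

lrt = np.empty(reps)
for r in range(reps):
    x = rng.exponential(1.0 / lam0, size=n)
    lam_hat = 1.0 / x.mean()
    lrt[r] = 2.0 * (loglik(lam_hat, x) - loglik(lam0, x))    # -2 log Lambda_n under H0

print(np.mean(lrt > stats.chi2.ppf(0.95, df=1)))             # ~ 0.05, as Wilks predicts
```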

Failure Mode

Wilks' theorem fails on the same boundaries that asymptotic normality does. Self-Liang (1987) shows that on a parameter-space boundary the limit is a mixture of $\chi^2$ distributions (e.g., for testing $\sigma^2 = 0$ in a mixed model the limit is $\tfrac 12 \chi^2_0 + \tfrac 12 \chi^2_1$). Under non-identifiability (testing whether a mixture component is present, the famous "two-component vs. one-component" test) the limit is non-standard and depends on a nuisance parameter under the alternative (Davies 1977). Don't apply Wilks blindly when the null is on a boundary or weakly identified.

Theorem

Cramér-Rao Lower Bound

Statement

Let $T = T(X_1, \ldots, X_n)$ be unbiased for $\theta \in \mathbb R$. Then

$$\mathrm{Var}_\theta(T) \geq \frac{1}{n I(\theta)}.$$

In the multivariate case ($\theta \in \mathbb R^d$), $\mathrm{Cov}_\theta(T) \succeq (n I(\theta))^{-1}$ in the Loewner order. An estimator attaining the bound is called efficient. The MLE achieves the bound asymptotically (the previous theorem).

Intuition

Differentiating $\mathbb E_\theta[T] = \theta$ in $\theta$ gives $\mathrm{Cov}(T, s_n) = 1$ where $s_n = \sum_i s(\theta; X_i)$ is the total score. Cauchy-Schwarz gives $1 = |\mathrm{Cov}(T, s_n)|^2 \leq \mathrm{Var}(T) \cdot \mathrm{Var}(s_n) = \mathrm{Var}(T) \cdot n I(\theta)$. The information sets the price tag on precision: each unit of Fisher information buys a $1/I(\theta)$ unit of variance reduction.

Proof Sketch

Step 1. From $\mathbb E_\theta[T] = \theta$, differentiate under the integral: $1 = \int T \, \nabla_\theta p \, d\mu = \int T \cdot s \cdot p \, d\mu = \mathbb E[T \cdot s_n]$ for $n=1$; the i.i.d. case generalizes to $\mathbb E[T \cdot s_n^{(\mathrm{tot})}] = 1$.

Step 2. Since $\mathbb E[s] = 0$, $\mathrm{Cov}(T, s_n^{(\mathrm{tot})}) = 1$.

Step 3. Cauchy-Schwarz: $1 = |\mathrm{Cov}|^2 \leq \mathrm{Var}(T) \cdot \mathrm{Var}(s_n^{(\mathrm{tot})}) = \mathrm{Var}(T) \cdot n I(\theta)$.

Step 4. Rearrange: $\mathrm{Var}(T) \geq 1 / (n I(\theta))$.

Why It Matters

Cramér-Rao is the universal benchmark for unbiased estimators. The MLE attains it asymptotically (third theorem of this section), making MLE the asymptotically minimum-variance unbiased estimator in regular models. It also exposes a deep duality: minimum variance is the inverse of Fisher information, and Fisher information is the local Riemannian metric on the parameter manifold (Rao 1945; see information geometry). The bound's existence is what makes "no unbiased estimator can outperform the MLE" a precise statement rather than folklore.

Failure Mode

Cramér-Rao binds unbiased estimators only. Biased estimators can have strictly lower mean-squared error: $\mathrm{MSE} = \mathrm{Var} + \mathrm{Bias}^2$, and a small bias can buy a large variance reduction. The James-Stein estimator is the canonical example (see Confusion below). Modern regularization (ridge, LASSO, weight decay, early stopping) trades a controlled bias for a large variance reduction and routinely beats the MLE in mean-squared error. In high dimensions the MLE is inadmissible and shrinkage is the rule, not the exception.

Theorem

Quasi-MLE under Misspecification (White)

Statement

Suppose $P_0 \neq p_\theta$ for any $\theta$ (the model is misspecified). Define

$$V(\theta^*) = -\mathbb E_{P_0}[\nabla^2_\theta \log p_{\theta^*}(X)], \qquad A(\theta^*) = \mathbb E_{P_0}[s(\theta^*; X) \, s(\theta^*; X)^\top].$$

Then the MLE converges to the least-false parameter $\theta^*$ and is asymptotically normal with the sandwich variance:

$$\sqrt n (\hat\theta_n - \theta^*) \xrightarrow{d} \mathcal N(0, V(\theta^*)^{-1} A(\theta^*) V(\theta^*)^{-\top}).$$

Under correct specification, the second Bartlett identity gives $V = A = I(\theta_0)$ and the sandwich collapses to $I(\theta_0)^{-1}$.

Intuition

Without correct specification, the score still has mean zero at the least-false parameter (because $\theta^*$ minimizes KL, hence sets the gradient of $\mathbb E_{P_0}[\log p_\theta]$ to zero), but the score covariance and the Hessian are no longer linked by the information identity. The Taylor argument that gave $I^{-1}$ in the well-specified case now gives the more general M-estimator sandwich $V^{-1} A V^{-\top}$. Reporting $I^{-1}$ (or, equivalently, $J_n^{-1}$) when the model is wrong gives anti-conservative confidence intervals that miss their nominal coverage.

Proof Sketch

Step 1. Mean-value expand $\bar s_n(\hat\theta_n) = 0$ around $\theta^*$: $0 = \bar s_n(\theta^*) + \overline{\nabla s}_n(\tilde\theta)(\hat\theta_n - \theta^*)$.

Step 2. Under $P_0$: $\sqrt n \cdot \bar s_n(\theta^*) \xrightarrow{d} \mathcal N(0, A(\theta^*))$ (CLT with the $P_0$-covariance, which is not equal to the model-implied Fisher information).

Step 3. $\overline{\nabla s}_n(\tilde\theta) \xrightarrow{P} -V(\theta^*)$ (LLN under $P_0$ + consistency).

Step 4. Slutsky: $\sqrt n (\hat\theta_n - \theta^*) \xrightarrow{d} V(\theta^*)^{-1} \mathcal N(0, A(\theta^*)) = \mathcal N(0, V^{-1} A V^{-\top})$.

Why It Matters

Real models are always at least slightly wrong. Heteroskedastic linear regression fit by OLS, logistic regression with the wrong link, mixture models with the wrong number of components, neural networks fit to data not generated by them: all are quasi-MLE. The sandwich variance gives the correct asymptotic covariance, and robust standard errors (Huber-White) are the implementation of this in econometric practice. Reporting $I^{-1}$-based standard errors when the model is misspecified is the most common subtle error in applied likelihood inference.
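
A minimal sketch of the sandwich (Huber-White, HC0-style) computation, assuming NumPy (the data-generating process is illustrative): a Gaussian-likelihood OLS fit to heteroskedastic data, with naive and sandwich standard errors side by side.

```python
import numpy as np

# Gaussian-likelihood (OLS) fit to heteroskedastic data: a quasi-MLE.
rng = np.random.default_rng(6)
n = 2_000
x = rng.uniform(0, 2, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(0, 0.2 + x, size=n)       # error sd grows with x

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

XtX_inv = np.linalg.inv(X.T @ X)
sigma2_hat = resid @ resid / n
naive_cov = sigma2_hat * XtX_inv                          # assumes homoskedasticity (I^{-1}-style)

meat = (X * resid[:, None]).T @ (X * resid[:, None])      # A: outer product of per-observation scores
sandwich_cov = XtX_inv @ meat @ XtX_inv                   # V^{-1} A V^{-T}

print(np.sqrt(np.diag(naive_cov)))
print(np.sqrt(np.diag(sandwich_cov)))                     # noticeably larger SE on the slope
```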

Failure Mode

Even QMLE requires identifiability of the least-false parameter $\theta^*$ and finite second moments of the score under $P_0$. Without uniqueness of $\theta^*$ (e.g., the KL surface has multiple equal-depth minima) the MLE has no well-defined limit. Without finite moments (heavy-tailed data and a Gaussian model) the sandwich blows up. Wilks' theorem also fails under misspecification: $-2 \log \Lambda_n$ converges to a weighted sum of $\chi^2_1$ variables (Vuong 1989) rather than a clean $\chi^2_q$, so naive LRTs over-reject.

MLE in Exponential Families

Exponential families are where MLE has the most structure: the score is linear in sufficient statistics, the Fisher information is the variance of those statistics, and the MLE coincides with method-of-moments on the canonical parameter.

A $k$-parameter exponential family in canonical form has density

$$p(x; \eta) = h(x) \exp\!\big( \eta^\top T(x) - A(\eta) \big),$$

where $T(x) \in \mathbb R^k$ is the sufficient statistic, $\eta \in \mathbb R^k$ is the natural parameter, $A(\eta) = \log \int h(x) \exp(\eta^\top T(x)) \, dx$ is the log-partition function, and $h(x) \geq 0$ is the base measure.

Key identities. $\nabla A(\eta) = \mathbb E_\eta[T(X)] = \mu(\eta)$ (mean of the sufficient statistic). $\nabla^2 A(\eta) = \mathrm{Cov}_\eta(T(X)) = I(\eta)$ (Fisher information equals the covariance of the sufficient statistic, with no Bartlett identity required; it follows from convexity of $A$). Score: $s(\eta; x) = T(x) - \mu(\eta)$.

MLE. The score equation is $\sum_i T(X_i) = n \mu(\hat\eta_n)$, i.e. $\bar T_n = \mu(\hat\eta_n)$. The MLE matches the empirical mean of the sufficient statistic to its model-implied mean. This is the moment-matching property and explains why for canonical-form exponential families MoM and MLE coincide.

Existence and uniqueness. The MLE exists if $\bar T_n$ is in the interior of the convex hull of $\{T(x) : h(x) > 0\}$ (Barndorff-Nielsen 1978). It is unique because $A$ is strictly convex, hence $\mu = \nabla A$ is one-to-one.

Examples reduce to bookkeeping. Bernoulli: $T(x) = x$, $\mu(\eta) = e^\eta/(1+e^\eta) = p$, score equation $\bar X = \hat p$. Poisson: $T(x) = x$, $\mu(\eta) = e^\eta = \lambda$, $\bar X = \hat\lambda$. Multivariate Gaussian with known $\Sigma$: $T(x) = \Sigma^{-1} x$, MLE $\hat\mu = \bar X$. Linear exponential families (GLMs with canonical link) inherit the same structure: the MLE solves $X^\top y = X^\top \mu(X \hat\beta)$.
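
A small sketch of moment matching in a canonical family, assuming NumPy (the Poisson case is illustrative): solving $\bar T_n = \mu(\hat\eta_n)$ by Newton, where the Newton step and the Fisher-scoring step coincide because the Hessian of $A$ is the Fisher information.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.poisson(4.0, size=1_000)
T_bar = x.mean()                       # sufficient statistic T(x) = x

eta = 0.0                              # natural parameter, lambda = exp(eta)
for _ in range(20):
    mu = np.exp(eta)                   # grad A(eta) = E_eta[T]
    fisher = np.exp(eta)               # hess A(eta) = Var_eta(T) = I(eta)
    eta -= (mu - T_bar) / fisher       # Newton step = Fisher-scoring step here

print(eta, np.log(T_bar))              # they agree: lambda_hat = exp(eta) = x-bar
```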

MLE as Empirical Cross-Entropy Minimization

A perspective that pays off in modern ML:

$$\hat\theta_n = \arg\min_\theta \frac{1}{n}\sum_{i=1}^n \big[ -\log p(X_i; \theta) \big],$$

i.e. the MLE minimizes the empirical cross-entropy of the data against the model. For discrete models whose support is consistent with the data, this is also Csiszár's information projection of the empirical distribution onto the model, $\hat\theta_n = \arg\min_\theta D_{\mathrm{KL}}(\hat P_n \| p_\theta)$, since the cross-entropy differs from this KL divergence only by the $\theta$-independent empirical entropy. For continuous-density models the strict measure-theoretic KL is infinite (the empirical $\hat P_n$ has mass at $n$ points, while $p_\theta$ is absolutely continuous with respect to Lebesgue), so the "KL projection" reading is heuristic; the cross-entropy reading is the one that holds rigorously, and it is also what cross-entropy training in practice minimizes. Three corollaries:

  • Cross-entropy training is MLE. Minimizing categorical cross-entropy on classification labels is exactly MLE for a categorical model with class probabilities $\sigma(\mathrm{logits})$. Every softmax classifier is an MLE.
  • Perplexity training is MLE. Language model perplexity is $\exp(\mathrm{NLL}/n)$, so minimizing perplexity equals minimizing NLL equals running MLE on a categorical sequence model. The same holds for autoregressive image and audio models.
  • ELBO and variational MLE. Variational autoencoders cannot compute the marginal likelihood, so they maximize the evidence lower bound (ELBO), which is a tractable surrogate for MLE. Diffusion models do the same with a multi-step variational bound (the $L_t$ losses sum to an upper bound on NLL). When the variational gap is small, you are running approximate MLE.

The MLE is not a niche statistical procedure; it is the asymptotic objective that every probabilistic deep learning method tries to optimize, exactly or approximately.
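
A tiny numerical check that the two readings are the same number, assuming NumPy (the logits and labels are synthetic stand-ins): the average categorical cross-entropy against one-hot targets equals the average negative log-likelihood of the labels under the softmax model.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 6, 4
logits = rng.normal(size=(n, k))                    # stand-in for a classifier's outputs
labels = rng.integers(0, k, size=n)

log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))   # log-softmax
nll = -log_probs[np.arange(n), labels].mean()       # average negative log-likelihood

one_hot = np.eye(k)[labels]                          # categorical cross-entropy, same number
cross_entropy = -(one_hot * log_probs).sum(axis=1).mean()
print(nll, cross_entropy)
```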

Worked Examples

Example

MLE for Gaussian mean (variance known)

$X_1, \ldots, X_n \sim \mathcal N(\mu, \sigma^2)$ with $\sigma^2$ known. Log-likelihood: $\ell(\mu) = -\tfrac n 2 \log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2} \sum_i (X_i - \mu)^2$. Score: $s(\mu) = \tfrac{1}{\sigma^2} \sum_i (X_i - \mu)$. Setting it to zero: $\hat\mu = \bar X$. Fisher information: $I(\mu) = 1/\sigma^2$. Asymptotic variance: $\sigma^2/n$. The result holds exactly at every $n$, not only asymptotically: $\bar X$ is the uniformly minimum-variance unbiased estimator (UMVUE) and attains Cramér-Rao for all sample sizes.

Example

MLE for Gaussian variance

$X_1, \ldots, X_n \sim \mathcal N(\mu, \sigma^2)$, both unknown. The MLE is $\hat\mu = \bar X$, $\hat\sigma^2 = n^{-1} \sum_i (X_i - \bar X)^2$. $\hat\sigma^2$ is biased downward: $\mathbb E[\hat\sigma^2] = \tfrac{n-1}{n} \sigma^2$. The unbiased estimator divides by $n - 1$ and is a routine bias-correction. Fisher information is block-diagonal: $I(\mu, \sigma^2) = \mathrm{diag}(1/\sigma^2, 1/(2\sigma^4))$, so the asymptotic variance of $\hat\sigma^2$ is $2\sigma^4/n$. The MLE is biased but consistent and asymptotically efficient.

Example

MLE for Bernoulli with degenerate sample

$X_1, \ldots, X_{10} \sim \mathrm{Bernoulli}(p)$ with $\sum_i X_i = 0$. $\ell(p) = 10 \log(1 - p)$, maximized at $\hat p = 0$. This is on the boundary of $\Theta = [0, 1]$, so asymptotic normality fails. The Wald CI $\hat p \pm z \sqrt{\hat p (1-\hat p)/n}$ collapses to a single point, which has exactly zero coverage. The fix is either Bayesian (a Beta(1, 1) prior gives posterior Beta(1, 11), mean $1/12$) or a Wilson / Agresti-Coull / score-based CI that does not degenerate on the boundary. This is the canonical demonstration that MLE alone is insufficient near boundaries.
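
A short numeric comparison of the three intervals from this example, assuming NumPy and SciPy (the Wilson half-width uses the sample proportion, as in the standard score-interval formula):

```python
import numpy as np
from scipy import stats

n, k, z = 10, 0, 1.96
p_hat = k / n

wald = (p_hat - z * np.sqrt(p_hat * (1 - p_hat) / n),
        p_hat + z * np.sqrt(p_hat * (1 - p_hat) / n))           # degenerates to (0.0, 0.0)

center = (k + z**2 / 2) / (n + z**2)                             # Wilson score interval
half = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
wilson = (max(0.0, center - half), center + half)

posterior = stats.beta(1 + k, 1 + n - k)                         # Beta(1,1) prior -> Beta(1, 11)
print(wald, wilson, posterior.mean(), posterior.interval(0.95))
```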

Example

MLE for a misspecified exponential

$X_i \sim \mathrm{Gamma}(\alpha, \lambda)$ with $\alpha = 2$ (so the data is not exponential), but you fit the exponential family $p(x; \lambda) = \lambda e^{-\lambda x}$. The exponential MLE is $\hat\lambda = 1/\bar X$. Under $P_0 = \mathrm{Gamma}(2, \lambda_0)$: $\mathbb E[X] = 2/\lambda_0$, so $\hat\lambda \to \lambda_0/2$, the least-false parameter (it minimizes KL from Gamma$(2, \lambda_0)$ to Exponential$(\lambda)$). The exponential model's Fisher information at $\theta = \lambda_0/2$ is $I = 1/\lambda^2 = 4/\lambda_0^2$, but the true score variance under $P_0$ is $A = \mathrm{Var}_{P_0}(s) = 4/\lambda_0^2 \cdot (1/2) = 2/\lambda_0^2$ (because the variance of $X$ under Gamma$(2, \lambda_0)$ is $2/\lambda_0^2$, not $4/\lambda_0^2$). The naive variance $I^{-1}/n = \lambda_0^2/(4n)$ is a factor of 2 too large; the sandwich gives $V^{-1} A V^{-\top} = \lambda_0^2/(8n)$. Reporting $I^{-1}$-based CIs would over-cover by 50%.
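
A Monte Carlo sketch of this example, assuming NumPy (constants are illustrative): the exponential-model MLE drifts to $\lambda_0/2$, and the variance of $\sqrt n(\hat\lambda - \lambda^*)$ matches the sandwich value $\lambda_0^2/8$ rather than the naive $\lambda_0^2/4$.

```python
import numpy as np

rng = np.random.default_rng(9)
lam0, n, reps = 1.0, 2_000, 5_000
x = rng.gamma(shape=2.0, scale=1.0 / lam0, size=(reps, n))     # true data: Gamma(2, lam0)
lam_hat = 1.0 / x.mean(axis=1)                                  # exponential-model MLE

print(lam_hat.mean(), lam0 / 2)                                 # least-false parameter lam0/2
mc_var = n * lam_hat.var()                                      # var of sqrt(n)(lam_hat - lam*)
print(mc_var, lam0**2 / 8, lam0**2 / 4)                         # matches sandwich, not I^{-1}
```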

Example

MLE = MoM in canonical exponential families

For a canonical-form exponential family with sufficient statistic $T$, the score equation is $\bar T_n = \mu(\hat\eta_n)$. The MLE is obtained by matching the empirical first moment of the sufficient statistic to the model-implied first moment. So MLE coincides with the method of moments on the natural parameter, a fact that explains why GLMs with canonical links have closed-form gradients and why their Newton step is identical to the iteratively reweighted least-squares (IRLS) update.

Common Confusions

Watch Out

The likelihood is not a probability distribution over θ

$L_n(\theta)$ is a function of $\theta$ with the data fixed. It does not integrate to one and is not a posterior. The Bayesian construction multiplies $L_n$ by a prior and renormalizes; without a prior, "the probability of $\theta$" is not even well-defined under the frequentist worldview that produced MLE. The MLE is the mode of $L_n$, not the mean, median, or mode of any posterior; those depend on the prior.

Watch Out

MLE is biased in finite samples and that is a separate question from asymptotic efficiency

"Asymptotically efficient" means the asymptotic variance hits the Cramér-Rao bound. It does not mean the finite-sample MLE is unbiased or even minimum-MSE. The Gaussian variance MLE is biased; ridge regression beats OLS in MSE; the James-Stein estimator beats the multivariate Gaussian MLE in MSE for d3d \geq 3. Asymptotic efficiency and finite-sample optimality are different criteria, and the textbook tradition that conflates them is misleading. See empirical-risk-minimization and bias-variance tradeoff for the modern treatment.

Watch Out

James-Stein: MLE can be inadmissible in d ≥ 3 dimensions

Let $X \sim \mathcal N(\theta, I_d)$ with $d \geq 3$. The MLE is $\hat\theta_{\mathrm{MLE}} = X$. The James-Stein estimator $\hat\theta_{\mathrm{JS}} = (1 - (d-2)/\| X \|^2) X$ shrinks $X$ toward zero and has strictly smaller MSE for every $\theta$. This does not contradict Cramér-Rao because James-Stein is biased. The lesson: in high dimensions, the unbiased MLE is inadmissible, and shrinkage/regularization is not a kludge but a structural improvement. This is the theoretical justification for ridge, LASSO, weight decay, dropout, and every other regularizer in modern ML.
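
A Monte Carlo sketch of the risk comparison, assuming NumPy (the choice of $\theta$ and $d$ is illustrative): the total squared-error risk of the James-Stein estimator falls below the MLE's risk of $d$.

```python
import numpy as np

rng = np.random.default_rng(10)
d, reps = 10, 50_000
theta = rng.normal(0, 1, size=d)                     # any fixed theta works
X = theta + rng.normal(size=(reps, d))               # X ~ N(theta, I_d)

mse_mle = ((X - theta) ** 2).sum(axis=1).mean()      # risk of the MLE is d = 10
shrink = 1 - (d - 2) / (X ** 2).sum(axis=1, keepdims=True)
mse_js = ((shrink * X - theta) ** 2).sum(axis=1).mean()
print(mse_mle, mse_js)                               # James-Stein risk is strictly smaller
```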

Watch Out

Consistency requires correct specification (or you get the least-false parameter)

If $P_0 \notin \{p_\theta\}$, the MLE converges to $\theta^* = \arg\min_\theta D_{\mathrm{KL}}(P_0 \| p_\theta)$, not to any "true" $\theta_0$ (which does not exist). This is still useful (you get the closest model in KL), but the asymptotic variance is the sandwich $V^{-1} A V^{-\top}$, not $I^{-1}$, and $-2 \log \Lambda_n$ does not have a clean $\chi^2$ distribution (Vuong 1989). Reporting $I^{-1}$-based confidence intervals on a misspecified model is anti-conservative and one of the most common subtle errors in applied likelihood work.

Watch Out

MLE existence and uniqueness require structural conditions

The MLE may not exist (e.g., a Gaussian mixture with one component collapsing onto a single data point gives infinite likelihood). It may not be unique (multimodal $\ell_n$, label switching in mixtures, overparameterized neural networks). Existence and uniqueness require concavity of $\ell_n$ (canonical exponential families), compactness of $\Theta$, or identifiability, conditions that often need to be imposed or verified explicitly rather than assumed.

Watch Out

Reparameterization invariance is a feature of the MLE, not a bug

If $\eta = g(\theta)$ for a smooth bijection $g$, the MLE of $\eta$ is $g(\hat\theta_{\mathrm{MLE}})$. This invariance is what makes MLE-based inference clean across reparameterizations (canonical vs. mean parameter in exponential families, log-odds vs. probability in logistic regression). Bayesian posteriors are not invariant in the same way (the maximum a posteriori estimate transforms by the Jacobian), so the MAP and MLE disagree under nonlinear reparameterization even with a flat prior. This is a well-known wart of MAP estimation that does not afflict MLE.

Optional Extensions: Non-Regular MLE — Where Asymptotic Theory Breaks

Standard MLE asymptotics assume the regularity conditions stated above (smooth log-density on a fixed support, interior parameter, identifiable, finite Fisher information, etc.). When any of these fails, the standard theorems do not apply — and the failure modes are not pathological edge cases. Three are common enough in practice that confusing them with the regular case produces real bugs.

1. Support depends on the parameter — Uniform $[0, \theta]$

The textbook example. Let $X_1, \ldots, X_n \sim \text{Uniform}[0, \theta]$ with $\theta > 0$ unknown. The likelihood is

$$L_n(\theta) = \prod_{i=1}^n \frac{1}{\theta} \mathbf 1[0 \le X_i \le \theta] = \theta^{-n}\, \mathbf 1[\theta \ge \max_i X_i],$$

a piecewise function: zero for $\theta < X_{(n)}$ (the maximum), then $\theta^{-n}$ decreasing in $\theta$ for $\theta \ge X_{(n)}$. The MLE is $\hat\theta_n = X_{(n)} = \max_i X_i$.

Why standard theory fails. The support $\{x : 0 \le x \le \theta\}$ depends on $\theta$, so you cannot differentiate under the integral sign in the score-equation derivation. The Fisher information diverges (informally, the likelihood has a discontinuous edge at $\theta = X_{(n)}$). The asymptotic-normality theorem assumes a smooth log-likelihood with finite Fisher information; both fail here.

What actually happens. A direct calculation: $\mathbb P(X_{(n)} \le t) = (t/\theta)^n$ for $t \in [0, \theta]$, so the gap $\theta - X_{(n)}$ has CDF $1 - ((\theta - t)/\theta)^n$ on $[0, \theta]$. Letting $n(\theta - X_{(n)}) = T_n$,

$$\mathbb P(T_n \le s) = 1 - \left(1 - \frac{s}{n\theta}\right)^n \xrightarrow{n \to \infty} 1 - e^{-s/\theta},$$

so $T_n \xrightarrow{d} \text{Exp}(1/\theta)$. The MLE converges at rate $1/n$ (not $1/\sqrt n$) to an exponential limit (not Gaussian). The bias is $\theta - \mathbb E[\hat\theta_n] = \theta/(n+1)$ (since $\mathbb E[\hat\theta_n] = n\theta/(n+1)$), an order-$n^{-1}$ term that the bias-corrected estimator $((n+1)/n) X_{(n)}$ removes exactly. Standard $1/\sqrt n$ confidence intervals are wildly wrong; you need the exponential limit law.
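
A simulation sketch of the non-Gaussian limit, assuming NumPy (constants are illustrative): $n(\theta - X_{(n)})$ has mean close to $\theta$ and survival probability $e^{-s/\theta}$, exactly as the exponential limit predicts.

```python
import numpy as np

rng = np.random.default_rng(11)
theta, n, reps = 1.0, 1_000, 10_000
x = rng.uniform(0, theta, size=(reps, n))
t = n * (theta - x.max(axis=1))                       # n (theta - X_(n))

print(t.mean(), theta)                                # exponential limit with mean theta
print(np.mean(t > 1.0), np.exp(-1.0 / theta))         # survival function matches e^{-s/theta}
```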

Pattern. Whenever the log-density has a discontinuity at a parameter-dependent boundary, expect a non-Gaussian, faster-than-$\sqrt n$ rate. This pattern recurs in change-point detection, threshold autoregressions, and any model where the parameter pins down the support.

2. Boundary parameter — variance components in mixed models

Setup: $y_i = x_i^\top \beta + b_i + \varepsilon_i$ with $b_i \sim \mathcal N(0, \sigma_b^2)$ and $\varepsilon_i \sim \mathcal N(0, \sigma^2)$ independent. The variance component $\sigma_b^2 \ge 0$ is non-negative; the parameter space has a boundary at $\sigma_b^2 = 0$.

Why standard theory fails. Asymptotic normality of the MLE assumes the true parameter is in the interior of $\Theta$. When $\sigma_b^2 = 0$ (no random-intercept variation), the true parameter sits on the boundary; the score is constrained to a half-space and the limiting distribution is no longer centered Gaussian.

What actually happens (Self & Liang 1987). Under $H_0: \sigma_b^2 = 0$, the MLE $\hat\sigma^2_b$ has a mixture distribution: with probability $1/2$ it equals zero exactly (the boundary), and with probability $1/2$ it has a half-normal density above zero. The likelihood-ratio statistic $-2\log\Lambda_n$ converges to a 50:50 mixture of $\chi^2_0$ (point mass at zero) and $\chi^2_1$, not to $\chi^2_1$. The naive $\chi^2_1$ test is over-conservative and rejects too rarely.

Practical implication. Any null hypothesis that puts a variance component, mixture weight, or other non-negative parameter at the boundary triggers this. Software that reports $\chi^2_1$-based p-values for "is the random intercept significant" is conservative; halving the reported p-value is the standard adjustment. The same correction applies in lme4 / nlme variance-component tests when the alternative places one variance at the boundary.

3. Identifiability fails — testing one mixture component vs two

Setup: $X_1, \ldots, X_n \sim p\, \mathcal N(\mu_1, 1) + (1 - p)\, \mathcal N(\mu_2, 1)$. Test $H_0: p = 1$ (one component, the second is irrelevant) vs $H_1: p \in [0, 1]$.

Why standard theory fails. Under $H_0$, the parameter $\mu_2$ is not identified: with $p = 1$, the data has no information about $\mu_2$, so the likelihood is flat in that coordinate. Wilks' theorem assumes identifiability of all parameters under both hypotheses; here it fails because the nuisance parameter $\mu_2$ is identified only when $p \ne 1$.

What actually happens (Davies 1977, 1987). The LRT statistic $-2 \log \Lambda_n$ does not converge to $\chi^2_q$. The limit is the supremum of a Gaussian process indexed by $\mu_2$:

$$-2 \log \Lambda_n \xrightarrow{d} \sup_{\mu_2 \in \mathcal M} Z(\mu_2)^2,$$

where $Z(\cdot)$ is a mean-zero Gaussian process whose covariance depends on the design and on the assumed distribution of $\mu_2$ under the alternative. This limit has heavier tails than $\chi^2$, and naive $\chi^2$-based p-values are anti-conservative (reject too often). Davies' upper-bound formula gives a usable correction; in practice BIC and the parametric bootstrap are common alternatives, particularly for selecting the number of components in finite mixture models.

Pattern. Whenever a parameter is only identified under the alternative, Wilks fails and Davies' Gaussian-process supremum is the right limit. This pattern recurs in regime-switching models, hidden-state count selection in HMMs, and any "is this extra structure needed" test.

Why the failure cases compound

The three failures interact. Mixture-component selection is both a non-identifiability problem (Davies) and a boundary-parameter problem (the mixture weight $p$ sits at the boundary $p = 1$ under $H_0$). The combined limit is the supremum of a half-normal Gaussian process, with constants that have to be computed numerically. Self & Liang (1987) and Lin (2002) work out cases. The general rule: when in doubt, run a parametric bootstrap; the asymptotic theory is too fragile to trust in every nonstandard setting.

For nuisance-parameter elimination via profile likelihood, see Davison (2003, ch. 4-5); the profile-likelihood ratio inherits the regularity properties of the underlying full-likelihood theory under interior parameters but breaks the same way at boundaries and under non-identification.

Exercises

Exercise (Core)

Problem

Compute the MLE and asymptotic variance for the Exponential model: $X_1, \ldots, X_n \sim \mathrm{Exp}(\lambda)$, density $p(x; \lambda) = \lambda e^{-\lambda x}$ on $x > 0$. Then derive the asymptotic distribution of $\log \hat\lambda_n$ via the delta method and use it to construct a confidence interval that respects $\lambda > 0$.

Exercise (Core)

Problem

Show that for $X_1, \ldots, X_{10} \sim \mathrm{Bernoulli}(p)$ with $\sum X_i = 0$, the MLE is $\hat p = 0$ and the symmetric Wald CI degenerates. Compute the Bayesian posterior under a Beta(1, 1) prior, give the posterior mean, and compare to the Wilson score CI $\hat p_W \pm z \sqrt{\hat p_W (1 - \hat p_W)/n + z^2/(4n^2)} / (1 + z^2/n)$ where $\hat p_W = (k + z^2/2)/(n + z^2)$ with $z = 1.96$.

Exercise (Advanced)

Problem

Prove the Cramér-Rao bound in the multivariate case: if $T(X) \in \mathbb R^d$ is unbiased for $\theta \in \mathbb R^d$, then $\mathrm{Cov}_\theta(T) \succeq I(\theta)^{-1}$ (single-observation case; the i.i.d. version uses $n I(\theta)$). Use the matrix Cauchy-Schwarz: for any random vectors $U, V$ with $\mathrm{Cov}(V)$ invertible, $\mathrm{Cov}(U) \succeq \mathrm{Cov}(U, V) \mathrm{Cov}(V)^{-1} \mathrm{Cov}(U, V)^\top$.

Exercise (Advanced)

Problem

Logistic regression has log-likelihood $\ell(\beta) = \sum_i [y_i x_i^\top \beta - \log(1 + e^{x_i^\top \beta})]$ for $y_i \in \{0, 1\}$ and $x_i \in \mathbb R^d$. (a) Derive the score and Fisher information. (b) Show that $\ell$ is concave. (c) Show that the Newton step is identical to one step of iteratively reweighted least squares (IRLS) with weights $w_i = \pi_i(1 - \pi_i)$ where $\pi_i = \sigma(x_i^\top \beta)$. (d) When does the MLE not exist? (Hint: think about complete linear separation.)

Exercise (Research)

Problem

Wilks' theorem under misspecification. Suppose the true distribution is $P_0$ and you fit two nested model families $\mathcal P_0 \subset \mathcal P_1$, both potentially misspecified. Let $\theta^*_j$ be the least-false parameter in $\mathcal P_j$. Show that under a contiguous hypothesis $\theta^*_0 = \theta^*_1$ (so the projections coincide even though neither family contains $P_0$), the asymptotic distribution of $-2 \log \Lambda_n$ is no longer $\chi^2_q$ but a weighted sum of independent $\chi^2_1$ variables with weights given by the eigenvalues of $A V^{-1}$ (Vuong 1989). Derive the weights for the case where $\mathcal P_1$ is a Gaussian regression with $q$ extra covariates and the data is heteroskedastic.

Numerical Optimization in Practice

The MLE rarely has a closed form outside of exponential families. Standard methods:

  • Newton-Raphson. $\theta_{t+1} = \theta_t - (\nabla^2 \ell_n)^{-1} \nabla \ell_n$. Quadratic convergence near the optimum but requires the Hessian; can diverge if started far from the truth or near a saddle.
  • Fisher scoring. Replace the observed Hessian by the expected (Fisher) information: $\theta_{t+1} = \theta_t + I(\theta_t)^{-1} \nabla \ell_n$. Often more numerically stable than Newton because $I(\theta)$ is positive definite by construction; identical to Newton in canonical-form exponential families (a logistic-regression sketch follows this list).
  • EM. When the likelihood involves latent variables (mixtures, hidden Markov models, factor analysis), the EM algorithm iterates an expectation step and a maximization step; each iteration is guaranteed to increase $\ell_n$ but converges only to a local optimum.
  • Stochastic gradient descent. For large datasets and complex models (deep neural networks), full-batch Newton is infeasible. SGD on the per-sample negative log-likelihood is the dominant approach, with adaptive variants (Adam, AdamW) handling ill-conditioning. The connection to the asymptotic theory above is loose: SGD does not converge to the MLE in any clean asymptotic sense without infinite passes, and modern overparameterized models live in the regime where the MLE is not unique (every zero-loss solution is an MLE).
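
A minimal sketch of Newton/Fisher scoring for logistic regression, assuming NumPy (the simulated design and coefficients are illustrative); because the logit link is canonical, the observed and expected information coincide and each iteration is one IRLS step.

```python
import numpy as np

rng = np.random.default_rng(12)
n, d = 1_000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta_true = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

beta = np.zeros(d)
for _ in range(25):
    pi = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (y - pi)                       # gradient of the log-likelihood
    W = pi * (1 - pi)                            # IRLS weights
    fisher = X.T @ (X * W[:, None])              # expected = observed information here
    step = np.linalg.solve(fisher, score)
    beta = beta + step
    if np.linalg.norm(step) < 1e-10:
        break

print(beta)                                      # close to beta_true for moderate n
```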

Summary

  • Definition. $\hat\theta_n = \arg\max_\theta \sum_i \log p(X_i; \theta)$, the parameter that maximizes the joint density of observed data.
  • Score. $s(\theta; x) = \nabla_\theta \log p(x; \theta)$, mean-zero at $\theta_0$.
  • Bartlett identities. $\mathbb E[s] = 0$ and $\mathrm{Cov}(s) = I(\theta) = -\mathbb E[\nabla^2 \log p]$; the engine behind every classical MLE asymptotic.
  • Consistency. $\hat\theta_n \xrightarrow{P} \theta_0$ under identifiability + Wald's compactness or concavity.
  • Asymptotic normality. $\sqrt n (\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal N(0, I^{-1})$ under regularity (or DQM).
  • Wilks. $-2 \log \Lambda_n \xrightarrow{d} \chi^2_q$ under $H_0$, non-central $\chi^2_q$ under contiguous alternatives.
  • Cramér-Rao. $\mathrm{Var}(T) \geq 1/(n I(\theta))$ for unbiased $T$; MLE attains this asymptotically.
  • QMLE. Under misspecification $\hat\theta_n \to \theta^* = \arg\min D_{\mathrm{KL}}(P_0 \| p_\theta)$, with sandwich variance $V^{-1} A V^{-\top}$.
  • Exponential families. Score is $T(x) - \mu(\eta)$; MLE = MoM on the canonical parameter; Newton = Fisher scoring = IRLS.
  • Modern ML. Cross-entropy training, perplexity training, ELBO, diffusion NLL: all are MLE in disguise.
  • Failure modes. Boundary parameters, non-identifiability, Neyman-Scott, infinite-dimensional families, complete separation in logistic regression.

References

Canonical:

  • van der Vaart, Asymptotic Statistics (Cambridge UP, 1998). Chapter 5 (M- and Z-estimators), Chapter 7 (LAN), Chapter 16 (LRT and Wilks). The single most-used reference for modern likelihood asymptotics.
  • Lehmann & Casella, Theory of Point Estimation, 2nd ed. (Springer, 1998). Chapter 6 collects regularity conditions, the Cramér-Rao bound, and the MLE asymptotics in classical form.
  • Casella & Berger, Statistical Inference, 2nd ed. (Duxbury, 2002). Chapters 7 and 10 give the standard graduate-textbook treatment with worked examples.
  • Cramér, Mathematical Methods of Statistics (Princeton UP, 1946). The original detailed treatment of MLE asymptotics; still readable and clarifying on the regularity conditions.
  • Wald, Note on the consistency of the maximum likelihood estimate (Annals of Math. Stat., 1949). The classical compactness-based consistency proof.
  • Le Cam, Asymptotic Methods in Statistical Decision Theory (Springer, 1986). The deepest treatment of LAN, contiguity, and Hájek's convolution theorem; the modern foundation for "MLE is asymptotically optimal".
  • Bartlett, Approximate confidence intervals II (Biometrika, 1953). The Bartlett identities, derived as cumulant expansions of the log-likelihood.

Current:

  • Wasserman, All of Statistics (Springer, 2004). Chapters 9-10 give a compact summary; useful for first-pass exposure.
  • Keener, Theoretical Statistics: Topics for a Core Course (Springer, 2010). Chapters 7-9 build MLE asymptotics with a stronger focus on examples than van der Vaart.
  • Murphy, Probabilistic Machine Learning: An Introduction (MIT Press, 2022). Chapter 4 covers MLE in the deep-learning context: cross-entropy, NLL, regularization, and the connection to ERM.
  • White, Estimation, Inference, and Specification Analysis (Cambridge UP, 1994). The book-length treatment of QMLE and sandwich-variance inference for econometrics.
  • Newey & McFadden, Large Sample Estimation and Hypothesis Testing (Handbook of Econometrics IV, 1994). Modern reformulation of MLE as a special case of M-estimation; explicit treatment of misspecification and IV.
  • Davison, Statistical Models (Cambridge UP, 2003). Chapters 4-5 cover likelihood theory at the level needed for applied work; particularly good on profile and conditional likelihoods.

Critique:

  • White, Maximum Likelihood Estimation of Misspecified Models (Econometrica, 1982). The foundational paper on QMLE: shows the MLE converges to the least-false parameter and derives the sandwich variance.
  • Vuong, Likelihood ratio tests for model selection and non-nested hypotheses (Econometrica, 1989). Shows that Wilks' theorem fails under misspecification and constructs a corrected LRT whose limit is a weighted $\chi^2$ mixture.
  • Self & Liang, Asymptotic properties of MLE and LRT under nonstandard conditions (JASA, 1987). The boundary-parameter case; the LRT limit becomes a mixture of $\chi^2$ distributions.
  • Hodges (1951), discussed in van der Vaart Theorem 8.1. The super-efficient estimator that beats the MLE on a measure-zero set, motivating the regular qualifier in Hájek's convolution theorem.
  • Davies, Hypothesis testing when a nuisance parameter is present only under the alternative (Biometrika, 1977 and 1987). Mixture-component testing where Wilks fails because the nuisance parameter is unidentified under the null.
  • Stein, Inadmissibility of the usual estimator for the mean of a multivariate normal distribution (Berkeley Symposium, 1956); James & Stein, Estimation with quadratic loss (1961). The original demonstration that MLE is inadmissible in $d \geq 3$; the foundational result for modern shrinkage and regularization theory.


Last reviewed: May 4, 2026
