
Statistical Estimation

Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency

MLE: pick the parameter that maximizes the likelihood of observed data. Score function, Bartlett identities, regularity conditions, consistency, asymptotic normality, Wilks' theorem, Cramér-Rao efficiency, exponential families, QMLE under misspecification, and the bridge to deep-learning negative log-likelihood training.

Core · Tier 1 · Stable · Core spine · ~90 min

Why This Matters

Maximum likelihood estimation is the workhorse of parametric inference, and almost every modern training objective is a likelihood in disguise. Train a logistic regression by minimizing cross-entropy loss and you are running MLE on a Bernoulli/categorical model. Fit a Gaussian mixture model and you are running MLE via EM. Train a language model by minimizing perplexity and you are running MLE on a categorical sequence model. Even modern generative models (variational autoencoders, normalizing flows, autoregressive transformers) train against the negative log-likelihood (NLL) or a tractable bound on it.

Figure (five-panel infographic on MLE): setup with i.i.d. data and the likelihood as a product of densities; log-likelihood algebra (products become sums, a monotone transformation preserves the argmax); geometric intuition of the likelihood peaked at the MLE; a worked example for the normal mean (the MLE is the sample mean); and why MLE matters (consistency, asymptotic normality, invariance under reparameterization), with the warning that the likelihood is not a probability distribution over θ unless a prior is introduced.
MLE picks the parameter value under which the observed data look least surprising. It is consistent and asymptotically normal under regularity conditions, and invariant under one-to-one reparameterizations.

Five questions decide whether the MLE is the right tool for a given problem:

  1. Does it converge to the truth? Consistency: $\hat\theta_n \xrightarrow{P} \theta_0$ under identifiability and standard regularity (Cramér 1946; Wald 1949).
  2. How accurate is it? Asymptotic normality: $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal N(0, I(\theta_0)^{-1})$ under Cramér-type regularity.
  3. Can any unbiased estimator do better? Cramér-Rao: no regular unbiased estimator has smaller variance than $1/(n I(\theta_0))$. The MLE attains this bound asymptotically.
  4. What if the model is wrong? White's QMLE: the MLE converges to the least-false parameter $\theta^* = \arg\min_\theta D_{\mathrm{KL}}(P_0 \| p_\theta)$, with sandwich variance $V^{-1} A V^{-\top}$ replacing $I^{-1}$.
  5. What can we test? Wilks' theorem: $-2 \log \Lambda_n \xrightarrow{d} \chi^2_q$ under $H_0$ (with non-central $\chi^2$ under contiguous alternatives), so the likelihood-ratio statistic has a calibrated null distribution.

Each answer requires the information identity $I(\theta) = -\mathbb E_\theta[\nabla^2 \log p(X;\theta)] = \mathbb E_\theta[s(\theta;X) s(\theta;X)^\top]$, which fails the moment regularity does, and recognizing those failures is what separates a working statistician from a textbook user.

Likelihood gives one geometric object and three equally important readings

Maximum likelihood picks the parameter where the log-likelihood peaks. Three readings of the same object:

  • Optimization target. The MLE is the parameter value where the observed data has the largest joint density under the model.
  • Why logs matter. The log transform turns products into sums, making numerical optimization and asymptotic analysis possible.
  • ML translation. Cross-entropy, token negative log-likelihood, and many probabilistic deep-learning losses are MLE objectives in disguise.

Mental Model

You observe $x_1, \ldots, x_n$ from $p(x \mid \theta_0)$ with unknown $\theta_0 \in \Theta \subseteq \mathbb R^d$. The MLE asks: which $\theta$ assigns the largest joint density to what you actually saw? Two equivalent framings:

  • Pick the parameter under which your data is least surprising. Each datum contributes a log-likelihood $\log p(x_i; \theta)$; the MLE maximizes the sum.
  • Minimize empirical cross-entropy against the model. $\hat\theta_n = \arg\min_\theta -\frac{1}{n}\sum_{i=1}^n \log p_\theta(x_i)$. This form is what most ML loss functions are: minimizing categorical cross-entropy or negative log-likelihood is exactly MLE. For discrete models with shared support, this also equals the information projection $\arg\min_\theta D_{\mathrm{KL}}(\hat P_n \| p_\theta)$ in Csiszár's sense, since the empirical entropy is $\theta$-independent. For continuous-density models the empirical $\hat P_n$ is a discrete mixture of point masses, so $D_{\mathrm{KL}}(\hat P_n \| p_\theta) = \infty$ in the strict measure-theoretic sense; the KL framing then holds only as a formal/heuristic identity, while empirical cross-entropy minimization holds rigorously.

The log transform converts products into sums, which buys two things at once: numerical stability (no underflow on long sequences) and access to the law of large numbers and central limit theorem (sums of i.i.d. terms are exactly what asymptotic statistics is designed to handle). This single change turns finite-sample maximum-density into an asymptotic theory of estimators.
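
A minimal numerical sketch of the cross-entropy reading, assuming NumPy is available (variable names are illustrative): minimizing the average negative log-likelihood of a Bernoulli model over a grid recovers the sample mean, which is the closed-form MLE.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=200)           # Bernoulli(0.3) sample

def avg_nll(theta, x):
    """Empirical cross-entropy (average negative log-likelihood) of a Bernoulli model."""
    return -np.mean(x * np.log(theta) + (1 - x) * np.log(1 - theta))

grid = np.linspace(1e-3, 1 - 1e-3, 999)
theta_hat = grid[np.argmin([avg_nll(t, x) for t in grid])]
print(theta_hat, x.mean())                   # the argmin of the NLL matches the sample mean
```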

Formal Setup and Notation

Let $X_1, \ldots, X_n$ be i.i.d. from $p(x; \theta_0)$ where $\theta_0 \in \Theta \subseteq \mathbb R^d$. The parametric family $\mathcal P = \{p(\cdot ; \theta) : \theta \in \Theta\}$ is the statistical model. The dominating measure is $\mu$ (Lebesgue for continuous data, counting measure for discrete data); $p$ is a density with respect to $\mu$.

Definition

Likelihood Function

The likelihood function is

$$L_n(\theta) = \prod_{i=1}^n p(X_i; \theta).$$

It is a function of $\theta$ with the data fixed. Despite looking like a density on the data space, $L_n(\theta)$ is not a probability distribution over $\theta$ (it does not integrate to one over $\Theta$). Bayesians multiply by a prior $\pi(\theta)$ and renormalize to obtain a posterior; frequentists treat $L_n$ as the input to estimation and testing.

Definition

Log-Likelihood

The log-likelihood is

$$\ell_n(\theta) = \sum_{i=1}^n \log p(X_i; \theta).$$

The log transform turns products into sums, which is required for the law of large numbers, the central limit theorem, and floating-point stability on long product chains.

Definition

Maximum Likelihood Estimator

The MLE is any measurable maximizer

$$\hat\theta_n \in \arg\max_{\theta \in \Theta} \ell_n(\theta).$$

When $\Theta$ is open and $\ell_n$ is differentiable, $\hat\theta_n$ solves the score equation $\sum_{i=1}^n s(\theta; X_i) = 0$ where $s(\theta; x) = \nabla_\theta \log p(x; \theta)$. The MLE may not exist (infinite or non-attained supremum), may not be unique (multimodal $\ell_n$), and may sit on the boundary of $\Theta$. Existence and uniqueness require structural assumptions (compactness of $\Theta$, concavity of $\ell_n$, identifiability, or all three).

Score Function and Fisher Information

Definition

Score Function

The score function is the gradient of the log-density:

$$s(\theta; x) = \nabla_\theta \log p(x; \theta) = \frac{\nabla_\theta p(x; \theta)}{p(x; \theta)}.$$

Under regularity (interchange of differentiation and integration), the score has mean zero at the true parameter:

$$\mathbb E_{\theta_0}[s(\theta_0; X)] = \int \nabla_\theta p(x; \theta_0) \, d\mu(x) = \nabla_\theta \int p(x; \theta_0) \, d\mu(x) = \nabla_\theta 1 = 0.$$

The MLE satisfies $\sum_i s(\hat\theta_n; X_i) = 0$, the empirical analog of $\mathbb E[s(\theta_0; X)] = 0$.

Definition

Fisher Information

The (per-observation, expected) Fisher information at $\theta$ is

$$I(\theta) = \mathbb E_\theta[s(\theta; X) \, s(\theta; X)^\top] = \mathrm{Cov}_\theta(s(\theta; X)),$$

a $d \times d$ positive semidefinite matrix. Under second-order regularity (twice differentiability of $\log p$ and a second-order interchange) it equals the negative expected Hessian:

$$I(\theta) = -\mathbb E_\theta[\nabla^2_\theta \log p(X; \theta)].$$

For $n$ i.i.d. observations the total information is $n I(\theta)$. Fisher information measures how sharply the log-likelihood concentrates around the truth: large $I(\theta)$ means a steep peak (easy to estimate), small $I(\theta)$ means a flat surface (hard to estimate).

Definition

Observed Fisher Information

The (data-dependent) observed Fisher information is

$$J_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \nabla^2_\theta \log p(X_i; \theta).$$

By the law of large numbers $J_n(\theta) \xrightarrow{P} I(\theta)$. In practice one reports $J_n(\hat\theta_n)^{-1}/n$ as a finite-sample variance estimate; under misspecification $J_n(\hat\theta_n)$ does not equal the population $I(\theta)$ and the sandwich form is needed (see QMLE below).
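
A small numerical check, assuming NumPy (names are illustrative): for a Poisson model the observed information at the MLE, computed as a finite-difference second derivative of the average negative log-likelihood, matches the expected information $I(\lambda) = 1/\lambda$ evaluated at $\hat\lambda$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(4.0, size=5000)
lam_hat = x.mean()                           # Poisson MLE

def avg_negloglik(lam):
    # average NLL, dropping the lambda-independent log(x!) term
    return -np.mean(x * np.log(lam) - lam)

# observed information: numerical second derivative of the average NLL at the MLE
h = 1e-3
J = (avg_negloglik(lam_hat + h) - 2 * avg_negloglik(lam_hat) + avg_negloglik(lam_hat - h)) / h**2
print(J, 1.0 / lam_hat)                      # observed ~ expected info I(lambda) = 1/lambda
print(1.0 / (J * x.size), lam_hat / x.size)  # Jn^{-1}/n vs the asymptotic variance lambda/n
```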

The Information Identity

The two formulas $I(\theta) = \mathbb E[ss^\top]$ and $I(\theta) = -\mathbb E[\nabla^2 \log p]$ are not separate definitions; they are equal under regularity, and the equality is the single most-used identity in likelihood theory. Bartlett (1953) generalized this to a family of cumulant identities obtained by repeated differentiation of $\int p = 1$.

Theorem

Bartlett Identities (Information Equality)

Statement

Under regularity, the score has mean zero and Fisher information equals the negative expected Hessian:

$$\mathbb E_\theta[s(\theta; X)] = 0, \qquad \mathrm{Cov}_\theta(s(\theta; X)) = I(\theta) = -\mathbb E_\theta[\nabla^2_\theta \log p(X; \theta)].$$

Equivalently, $\mathbb E_\theta[\nabla_\theta s(\theta; X) + s(\theta; X) s(\theta; X)^\top] = 0$, the second Bartlett identity.

Intuition

Differentiate $1 = \int p(x; \theta) \, d\mu(x)$ once: $0 = \int \nabla_\theta p = \int s \cdot p \, d\mu = \mathbb E[s]$. Differentiate again:

$$0 = \int \nabla^2_\theta p \, d\mu = \int (\nabla_\theta s + s s^\top) p \, d\mu = \mathbb E[\nabla_\theta s] + \mathbb E[s s^\top].$$

Since $\mathbb E[\nabla_\theta s] = \mathbb E[\nabla^2_\theta \log p]$, this rearranges to $-\mathbb E[\nabla^2 \log p] = \mathbb E[s s^\top] = I(\theta)$. The identity is bookkeeping that turns one expression for $I$ into the other; the content is that you can interchange differentiation and integration.

Proof Sketch

Step 1 (first identity). Differentiate $\int p(x; \theta) \, d\mu(x) = 1$ with respect to $\theta$. Pass the derivative inside the integral (assumption). Use $\nabla p = s \cdot p$. Conclude $\mathbb E[s] = 0$.

Step 2 (second identity). Differentiate again. By the same interchange,

$$0 = \int \nabla^2 p \, d\mu = \int \big( \nabla s + s s^\top \big) p \, d\mu.$$

A cleaner path is to differentiate the score directly. From $s = \nabla \log p = \nabla p / p$, differentiate:

$$\nabla s = \frac{\nabla^2 p}{p} - \frac{(\nabla p)(\nabla p)^\top}{p^2} = \frac{\nabla^2 p}{p} - s s^\top.$$

Take expectation under $\theta$, using $\mathbb E[\nabla^2 p / p] = \int \nabla^2 p \, d\mu = 0$, which follows from applying the interchange a second time:

$$\mathbb E[\nabla s] = -\mathbb E[s s^\top] = -I(\theta).$$

Since $\nabla s = \nabla^2 \log p$, this gives $\mathbb E[\nabla^2 \log p] = -I(\theta)$, equivalently $I(\theta) = -\mathbb E[\nabla^2 \log p]$.
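
A quick Monte Carlo sanity check of both identities, assuming NumPy (the Poisson choice is illustrative): the sample mean of the score is near zero, and the score variance matches the mean of the negative second derivative of the log-density.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 3.0
x = rng.poisson(lam, size=200_000)

score = x / lam - 1.0                     # d/dlam log p(x; lam) for Poisson
neg_hess = x / lam**2                     # -d^2/dlam^2 log p(x; lam)

print(score.mean())                       # ~ 0                (first Bartlett identity)
print(score.var(), neg_hess.mean())       # both ~ 1/lam = 0.333 (second Bartlett identity)
```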

Why It Matters

The information identity is the linchpin of three subsequent results:

  1. Asymptotic normality of the MLE uses $-\mathbb E[\nabla^2 \log p] = I(\theta)$ to identify the variance.
  2. Cramér-Rao uses $\mathrm{Cov}(s) = I(\theta)$ in the Cauchy-Schwarz step.
  3. Wilks' theorem uses both forms when expanding $\log L(\hat\theta) - \log L(\theta_0)$ to second order.

When the identity fails (parameter on a boundary, support depending on $\theta$, non-smooth $\log p$), every classical asymptotic for the MLE fails along with it, and you need M-estimator machinery (sandwich variance, cube-root rates, mixture-$\chi^2$ limits) instead.

Failure Mode

The interchange step requires more than two derivatives existing. Standard counterexamples: (1) Uniform$(0, \theta)$ with support $[0, \theta]$ depends on $\theta$, so $\nabla \int p = 0$ but $\int \nabla p \neq 0$; the score is not even well-defined and the MLE is $X_{(n)}$ with rate $n$, not $\sqrt n$. (2) Models with a discontinuity in $p$ at a $\theta$-dependent threshold (change-point models) violate the same interchange; the MLE has a non-Gaussian limit at rate $n$. (3) Mixture models on the boundary ($\pi = 0$) make $I(\theta)$ singular and the second identity meaningless.

Regularity Conditions

The asymptotic theorems below require the parametric family to be regular. The canonical Cramér-type list (van der Vaart 1998 Ch. 5; Lehmann-Casella 1998 Ch. 6):

  1. Identifiability: $\theta_1 \neq \theta_2 \Rightarrow p(\cdot; \theta_1) \neq p(\cdot; \theta_2)$ as measures.
  2. Common support: $\{x : p(x; \theta) > 0\}$ does not depend on $\theta$.
  3. Interior: $\theta_0 \in \mathrm{int}(\Theta)$.
  4. Smoothness: $\log p(x; \theta)$ is three times continuously differentiable in $\theta$, with third derivatives uniformly bounded by an integrable envelope $M(x)$ in a neighborhood of $\theta_0$.
  5. Interchange: differentiation under the integral sign is valid through second order (justified by dominated convergence with the same envelope).
  6. Information: $I(\theta_0)$ is finite and positive definite.

Le Cam (1970) replaced conditions 4-5 with the much weaker differentiability in quadratic mean (DQM): there exists a score-like vector $\dot \ell_{\theta_0}$ such that $\sqrt{p_{\theta_0 + h}} - \sqrt{p_{\theta_0}} - \tfrac 12 h^\top \dot \ell_{\theta_0} \sqrt{p_{\theta_0}} = o(\|h\|)$ in $L^2(\mu)$. DQM holds for all common parametric families and yields the same asymptotic normality without requiring three derivatives or an envelope. See asymptotic-statistics for the LAN treatment.

Main Theorems

Theorem

Consistency of the MLE (Wald)

Statement

Under Wald's conditions, the MLE is consistent:

$$\hat\theta_n \xrightarrow{P} \theta_0 \quad \text{as } n \to \infty.$$

If $\Theta$ is open (rather than compact), the same conclusion holds when $\ell_n$ is concave and $\theta_0$ is the unique maximizer of $\theta \mapsto \mathbb E_{\theta_0}[\log p(X; \theta)]$.

Intuition

The empirical log-likelihood per observation $\bar\ell_n(\theta) = n^{-1} \sum_i \log p(X_i; \theta)$ is a sample average of i.i.d. terms. By the law of large numbers, $\bar \ell_n(\theta) \to L(\theta) := \mathbb E_{\theta_0}[\log p(X; \theta)]$ pointwise. The Kullback-Leibler inequality gives

$$L(\theta_0) - L(\theta) = D_{\mathrm{KL}}(p_{\theta_0} \| p_\theta) \geq 0,$$

with equality only when $p_{\theta_0} = p_\theta$, i.e. (by identifiability) $\theta = \theta_0$. So $\theta_0$ uniquely maximizes the limit objective. Under uniform convergence (Wald's compactness + envelope, or concavity), the maximizer of $\bar\ell_n$ converges to the maximizer of $L$, which is $\theta_0$.

Proof Sketch

Step 1 (LLN). For each $\theta$, $\bar\ell_n(\theta) \xrightarrow{a.s.} L(\theta)$ by the strong law.

Step 2 (uniform convergence). Compactness + envelope yields $\sup_{\theta \in \Theta} | \bar\ell_n(\theta) - L(\theta) | \xrightarrow{a.s.} 0$ (uniform LLN; van der Vaart Lemma 19.36 or Newey-McFadden Lemma 2.4).

Step 3 (KL identifies the max). $L(\theta_0) - L(\theta) = D_{\mathrm{KL}}(p_{\theta_0} \| p_\theta) > 0$ for $\theta \neq \theta_0$ by Gibbs' inequality and identifiability.

Step 4 (argmax convergence). A standard argmax-continuity argument (van der Vaart Theorem 5.7) gives $\hat\theta_n \xrightarrow{P} \theta_0$.

Why It Matters

Consistency is the minimum bar for any estimator and the prerequisite for asymptotic normality (which is local around $\theta_0$). The proof structure has three parts: LLN gives pointwise convergence; uniformity comes from compactness or concavity; KL identifies the maximizer. It is the template every modern consistency proof reuses (M-estimators, GMM, EM, deep learning's empirical loss).
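
A minimal simulation sketch of consistency, assuming NumPy (the exponential model and constants are illustrative): the MLE $\hat\lambda = 1/\bar X$ drifts toward the true rate as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
lam0 = 2.0
for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.exponential(1.0 / lam0, size=n)    # NumPy parameterizes by the scale 1/lambda
    print(n, 1.0 / x.mean())                   # MLE 1/x-bar approaches lam0 = 2.0
```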

Failure Mode

(1) Non-identifiability. Mixture models with label-switching: $(\pi, \mu_1, \mu_2)$ and $(1-\pi, \mu_2, \mu_1)$ give the same density. The KL gap collapses on the orbit, and the MLE is not consistent for any single labeling. (2) Boundary blow-up. In Gaussian mixtures, sending one component's variance to zero with its mean equal to a data point makes $L_n \to \infty$. The MLE does not exist; constrained or penalized variants are needed. (3) Neyman-Scott. Parameter dimension growing with $n$ (one mean per pair of observations, one shared variance) gives an inconsistent variance MLE. Profile or modified likelihoods recover consistency. (4) Misspecification. If the true distribution $P_0$ is not in the model family, the MLE converges to the least-false parameter $\arg\min_\theta D_{\mathrm{KL}}(P_0 \| p_\theta)$, not to any "true" $\theta_0$ (see QMLE below).

Theorem

Asymptotic Normality of the MLE

Statement

Under regularity and consistency, the MLE is asymptotically normal:

$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal N\!\left(0, I(\theta_0)^{-1}\right).$$

Equivalently, $\hat\theta_n \approx \mathcal N(\theta_0, I(\theta_0)^{-1}/n)$ for large $n$. The variance $I(\theta_0)^{-1}/n$ is the Cramér-Rao lower bound, attained by the MLE in the limit.

Intuition

Taylor-expand the score equation around $\theta_0$:

$$0 = \frac{1}{n} \sum_i s(\hat\theta_n; X_i) \approx \frac{1}{n} \sum_i s(\theta_0; X_i) + \left[\frac{1}{n} \sum_i \nabla s(\theta_0; X_i)\right] (\hat\theta_n - \theta_0).$$

The first term, scaled by $\sqrt n$, is a CLT for an i.i.d. mean-zero sum with covariance $I(\theta_0)$ (Bartlett identity I), so $n^{-1/2} \sum_i s(\theta_0; X_i) \xrightarrow{d} \mathcal N(0, I(\theta_0))$. The bracket converges in probability to $\mathbb E[\nabla s(\theta_0; X)] = -I(\theta_0)$ (Bartlett identity II + LLN). Solving,

$$\sqrt n (\hat\theta_n - \theta_0) \approx I(\theta_0)^{-1} \cdot n^{-1/2} \sum_i s(\theta_0; X_i) \xrightarrow{d} \mathcal N(0, I(\theta_0)^{-1} I(\theta_0) I(\theta_0)^{-1}) = \mathcal N(0, I(\theta_0)^{-1}).$$

Proof Sketch

Step 1 (Taylor). Mean-value-expand the score: $0 = \bar s_n(\hat\theta_n) = \bar s_n(\theta_0) + \overline{\nabla s}_n(\tilde\theta) (\hat\theta_n - \theta_0)$ for some $\tilde\theta$ between $\hat\theta_n$ and $\theta_0$.

Step 2 (CLT). $\sqrt n \cdot \bar s_n(\theta_0) = n^{-1/2} \sum_i s(\theta_0; X_i) \xrightarrow{d} \mathcal N(0, I(\theta_0))$ by the multivariate CLT (Bartlett identity I gives the variance).

Step 3 (LLN + consistency). $\overline{\nabla s}_n(\tilde\theta) \xrightarrow{P} \mathbb E[\nabla s(\theta_0; X)] = -I(\theta_0)$ by the uniform LLN plus continuity at $\theta_0$ and consistency of $\hat\theta_n$.

Step 4 (Slutsky). Combining steps 2 and 3,

$$\sqrt n (\hat\theta_n - \theta_0) = -\overline{\nabla s}_n(\tilde\theta)^{-1} \cdot \sqrt n \cdot \bar s_n(\theta_0) \xrightarrow{d} I(\theta_0)^{-1} \mathcal N(0, I(\theta_0)) = \mathcal N(0, I(\theta_0)^{-1}).$$

Why It Matters

Asymptotic normality is the engine behind every classical inferential procedure built on the MLE: Wald confidence intervals $\hat\theta_n \pm z_{1-\alpha/2} \sqrt{I(\hat\theta_n)^{-1}/n}$, Wald hypothesis tests, model comparison via AIC and BIC (which use the asymptotic distribution of the maximized log-likelihood). The functional form $\mathcal N(0, I^{-1})$ also plugs straight into the delta method for any smooth $g(\hat\theta_n)$, giving CIs for derived quantities (odds ratios, hazard ratios, predicted probabilities).
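
A simulation sketch of the normal approximation, assuming NumPy (constants are illustrative): for the exponential model $I(\lambda) = 1/\lambda^2$, so $\sqrt n(\hat\lambda_n - \lambda_0)$ should have variance near $\lambda_0^2$ and roughly 95% of draws should fall within $1.96\,\lambda_0$.

```python
import numpy as np

rng = np.random.default_rng(4)
lam0, n, reps = 2.0, 500, 20_000
x = rng.exponential(1.0 / lam0, size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)                    # exponential MLE in each replication
z = np.sqrt(n) * (lam_hat - lam0)

print(z.var(), lam0**2)                           # ~ I(lam0)^{-1} = lam0^2 = 4
print(np.mean(np.abs(z) <= 1.96 * lam0))          # ~ 0.95 under the normal approximation
```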

Failure Mode

Asymptotic normality fails in four canonical cases. (1) $\theta_0$ sits on the boundary of $\Theta$ (variance equal to zero, mixture proportion equal to zero): the limit becomes a projection of a Gaussian onto a tangent cone (Self-Liang 1987). (2) Fisher information is singular (flat likelihood, weak identification): the rate slows below $\sqrt n$ or the limit becomes non-Gaussian. (3) The model is misspecified: replace $I(\theta_0)^{-1}$ with the sandwich $V^{-1} A V^{-\top}$ (QMLE, below). (4) Finite-sample concerns dominate, especially in high dimensions where the $\sqrt n$ regime requires $n \gg d^2$ in many examples.

Theorem

Wilks' Theorem (LRT Asymptotic χ²)

Statement

Let $\Lambda_n = L_n(\hat\theta_n^{(0)}) / L_n(\hat\theta_n)$ be the likelihood ratio comparing the constrained MLE under $H_0$ to the unconstrained MLE. Under the null,

$$-2 \log \Lambda_n = 2 \big(\ell_n(\hat\theta_n) - \ell_n(\hat\theta_n^{(0)})\big) \xrightarrow{d} \chi^2_q,$$

where $q$ is the codimension of the null (number of restrictions). Under contiguous alternatives $\theta_n = \theta_0 + h/\sqrt n$, the limit is non-central $\chi^2_q(\delta)$ with non-centrality $\delta = h^\top I(\theta_0)_{\mathrm{rest}} h$, where $I(\theta_0)_{\mathrm{rest}}$ is the restricted block of Fisher information.

Intuition

Taylor-expand the log-likelihood around the unconstrained MLE:

$$\ell_n(\hat\theta_n) - \ell_n(\hat\theta_n^{(0)}) \approx \tfrac 12 (\hat\theta_n - \hat\theta_n^{(0)})^\top n I(\theta_0) (\hat\theta_n - \hat\theta_n^{(0)}) + o_P(1).$$

The vector $\sqrt n (\hat\theta_n - \hat\theta_n^{(0)})$ is asymptotically Gaussian, projected onto the orthogonal complement of the null tangent space (a $q$-dimensional subspace). Quadratic forms of $q$-dimensional Gaussians with the right covariance are exactly $\chi^2_q$.

Proof Sketch

Step 1. Under regularity, $\sqrt n (\hat\theta_n - \theta_0) \xrightarrow{d} Z \sim \mathcal N(0, I^{-1})$ and $\sqrt n (\hat\theta_n^{(0)} - \theta_0) \xrightarrow{d} P_{H_0} Z$ (projection onto the null tangent space in the Fisher metric).

Step 2. Second-order Taylor expansion of $\ell_n$ around $\hat\theta_n$ gives $2 \big(\ell_n(\hat\theta_n) - \ell_n(\hat\theta_n^{(0)})\big) = n (\hat\theta_n - \hat\theta_n^{(0)})^\top I(\theta_0) (\hat\theta_n - \hat\theta_n^{(0)}) + o_P(1)$ (the gradient at $\hat\theta_n$ is zero, the Hessian at $\theta_0$ converges to $-I(\theta_0)$ by Bartlett II).

Step 3. Substituting Step 1 into Step 2: the limit is $\| Z - P_{H_0} Z \|^2_{I}$, the squared Fisher norm of the projection of $Z$ onto the orthogonal complement of $H_0$. This equals $\| (I - P_{H_0}) Z \|^2_I = Z^\top (I - P_{H_0})^\top I (I - P_{H_0}) Z$. Substituting $\mathrm{Var}(Z) = I^{-1}$ and choosing coordinates so that the orthogonal complement is the first $q$ axes gives $\chi^2_q$.

Why It Matters

Wilks' theorem makes the likelihood-ratio test (LRT) practical: you can compute $-2 \log \Lambda_n$ and compare to a $\chi^2_q$ critical value without ever computing Fisher information by hand. It is the basis for nested-model comparison in regression (drop $q$ predictors and the LRT has a $\chi^2_q$ null), in GLMs (deviance differences), in mixed models, in survival analysis (Cox PH), and in modern model selection (the BIC penalty derives from the Wilks limit). The non-central $\chi^2$ result under contiguous alternatives gives the asymptotic power of the test; see the LRT vs Wald vs score equivalence in asymptotic-statistics.
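
A simulation sketch of the Wilks calibration, assuming NumPy and SciPy (the exponential null and constants are illustrative): under $H_0: \lambda = \lambda_0$ the statistic $-2\log\Lambda_n$ should exceed the $\chi^2_1$ 95% quantile about 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
lam0, n, reps = 2.0, 200, 20_000

def loglik(lam, x):
    return x.size * np.log(lam) - lam * x.sum()

lrt = np.empty(reps)
for r in range(reps):
    x = rng.exponential(1.0 / lam0, size=n)
    lam_hat = 1.0 / x.mean()
    lrt[r] = 2.0 * (loglik(lam_hat, x) - loglik(lam0, x))    # -2 log Lambda_n under H0

print(np.mean(lrt > stats.chi2.ppf(0.95, df=1)))             # ~ 0.05, as Wilks predicts
```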

Failure Mode

Wilks' theorem fails on the same boundaries that asymptotic normality does. Self-Liang (1987) shows that on a parameter-space boundary the limit is a mixture of $\chi^2$ distributions (e.g., for testing $\sigma^2 = 0$ in a mixed model the limit is $\tfrac 12 \chi^2_0 + \tfrac 12 \chi^2_1$). Under non-identifiability (testing whether a mixture component is present, the famous "two-component vs. one-component" test) the limit is non-standard and depends on a nuisance parameter under the alternative (Davies 1977). Don't apply Wilks blindly when the null is on a boundary or weakly identified.

Theorem

Cramér-Rao Lower Bound

Statement

Let $T = T(X_1, \ldots, X_n)$ be unbiased for $\theta \in \mathbb R$. Then

$$\mathrm{Var}_\theta(T) \geq \frac{1}{n I(\theta)}.$$

In the multivariate case ($\theta \in \mathbb R^d$), $\mathrm{Cov}_\theta(T) \succeq (n I(\theta))^{-1}$ in the Loewner order. An estimator attaining the bound is called efficient. The MLE achieves the bound asymptotically (the previous theorem).

Intuition

Differentiating $\mathbb E_\theta[T] = \theta$ in $\theta$ gives $\mathrm{Cov}(T, s_n) = 1$ where $s_n = \sum_i s(\theta; X_i)$ is the total score. Cauchy-Schwarz gives $1 = |\mathrm{Cov}(T, s_n)|^2 \leq \mathrm{Var}(T) \cdot \mathrm{Var}(s_n) = \mathrm{Var}(T) \cdot n I(\theta)$. The information sets the price tag on precision: each unit of Fisher information buys a $1/I(\theta)$ unit of variance reduction.

Proof Sketch

Step 1. From $\mathbb E_\theta[T] = \theta$, differentiate under the integral: $1 = \int T \, \nabla_\theta p \, d\mu = \int T \cdot s \cdot p \, d\mu = \mathbb E[T \cdot s_n]$ for $n=1$; the i.i.d. case generalizes to $\mathbb E[T \cdot s_n^{(\mathrm{tot})}] = 1$.

Step 2. Since $\mathbb E[s] = 0$, $\mathrm{Cov}(T, s_n^{(\mathrm{tot})}) = 1$.

Step 3. Cauchy-Schwarz: $1 = |\mathrm{Cov}|^2 \leq \mathrm{Var}(T) \cdot \mathrm{Var}(s_n^{(\mathrm{tot})}) = \mathrm{Var}(T) \cdot n I(\theta)$.

Step 4. Rearrange: $\mathrm{Var}(T) \geq 1 / (n I(\theta))$.

Why It Matters

Cramér-Rao is the universal benchmark for unbiased estimators. The MLE attains it asymptotically (third theorem of this section), making MLE the asymptotically minimum-variance unbiased estimator in regular models. It also exposes a deep duality: minimum variance is the inverse of Fisher information, and Fisher information is the local Riemannian metric on the parameter manifold (Rao 1945; see information geometry). The bound's existence is what makes "no unbiased estimator can outperform the MLE" a precise statement rather than folklore.

Failure Mode

Cramér-Rao binds unbiased estimators only. Biased estimators can have strictly lower mean-squared error: $\mathrm{MSE} = \mathrm{Var} + \mathrm{Bias}^2$, and a small bias can buy a large variance reduction. The James-Stein estimator is the canonical example (see Confusion below). Modern regularization (ridge, LASSO, weight decay, early stopping) trades a controlled bias for a large variance reduction and routinely beats the MLE in mean-squared error. In high dimensions the MLE is inadmissible and shrinkage is the rule, not the exception.

Theorem

Quasi-MLE under Misspecification (White)

Statement

Suppose $P_0 \neq p_\theta$ for any $\theta$ (the model is misspecified). Define

$$V(\theta^*) = -\mathbb E_{P_0}[\nabla^2_\theta \log p_{\theta^*}(X)], \qquad A(\theta^*) = \mathbb E_{P_0}[s(\theta^*; X) \, s(\theta^*; X)^\top].$$

Then the MLE converges to the least-false parameter $\theta^*$ and is asymptotically normal with the sandwich variance:

$$\sqrt n (\hat\theta_n - \theta^*) \xrightarrow{d} \mathcal N(0, V(\theta^*)^{-1} A(\theta^*) V(\theta^*)^{-\top}).$$

Under correct specification, the second Bartlett identity gives $V = A = I(\theta_0)$ and the sandwich collapses to $I(\theta_0)^{-1}$.

Intuition

Without correct specification, the score still has mean zero at the least-false parameter (because $\theta^*$ minimizes KL, hence sets the gradient of $\mathbb E_{P_0}[\log p_\theta]$ to zero), but the score covariance and the Hessian are no longer linked by the information identity. The Taylor argument that gave $I^{-1}$ in the well-specified case now gives the more general M-estimator sandwich $V^{-1} A V^{-\top}$. Reporting $I^{-1}$ (or, equivalently, $J_n^{-1}$) when the model is wrong gives anti-conservative confidence intervals that miss their nominal coverage.

Proof Sketch

Step 1. Mean-value expand $\bar s_n(\hat\theta_n) = 0$ around $\theta^*$: $0 = \bar s_n(\theta^*) + \overline{\nabla s}_n(\tilde\theta)(\hat\theta_n - \theta^*)$.

Step 2. Under $P_0$: $\sqrt n \cdot \bar s_n(\theta^*) \xrightarrow{d} \mathcal N(0, A(\theta^*))$ (CLT with the $P_0$-covariance, which is not equal to the model-implied Fisher information).

Step 3. $\overline{\nabla s}_n(\tilde\theta) \xrightarrow{P} -V(\theta^*)$ (LLN under $P_0$ + consistency).

Step 4. Slutsky: $\sqrt n (\hat\theta_n - \theta^*) \xrightarrow{d} V(\theta^*)^{-1} \mathcal N(0, A(\theta^*)) = \mathcal N(0, V^{-1} A V^{-\top})$.

Why It Matters

Real models are always at least slightly wrong. Heteroskedastic linear regression fit by OLS, logistic regression with the wrong link, mixture models with the wrong number of components, neural networks fit to data not generated by them: all are quasi-MLE. The sandwich variance gives the correct asymptotic covariance, and robust standard errors (Huber-White) are the implementation of this in econometric practice. Reporting $I^{-1}$-based standard errors when the model is misspecified is the most common subtle error in applied likelihood inference.
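
A minimal sketch of the sandwich (Huber-White, HC0-style) computation, assuming NumPy (the data-generating process is illustrative): a Gaussian-likelihood OLS fit to heteroskedastic data, with naive and sandwich standard errors side by side.

```python
import numpy as np

# Gaussian-likelihood (OLS) fit to heteroskedastic data: a quasi-MLE.
rng = np.random.default_rng(6)
n = 2_000
x = rng.uniform(0, 2, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(0, 0.2 + x, size=n)       # error sd grows with x

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

XtX_inv = np.linalg.inv(X.T @ X)
sigma2_hat = resid @ resid / n
naive_cov = sigma2_hat * XtX_inv                          # assumes homoskedasticity (I^{-1}-style)

meat = (X * resid[:, None]).T @ (X * resid[:, None])      # A: outer product of per-observation scores
sandwich_cov = XtX_inv @ meat @ XtX_inv                   # V^{-1} A V^{-T}

print(np.sqrt(np.diag(naive_cov)))
print(np.sqrt(np.diag(sandwich_cov)))                     # noticeably larger SE on the slope
```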

Failure Mode

Even QMLE requires identifiability of the least-false parameter $\theta^*$ and finite second moments of the score under $P_0$. Without uniqueness of $\theta^*$ (e.g., the KL surface has multiple equal-depth minima) the MLE has no well-defined limit. Without finite moments (heavy-tailed data and a Gaussian model) the sandwich blows up. Wilks' theorem also fails under misspecification: $-2 \log \Lambda_n$ converges to a weighted sum of $\chi^2_1$ variables (Vuong 1989) rather than a clean $\chi^2_q$, so naive LRTs over-reject.

MLE in Exponential Families

Exponential families are where MLE has the most structure: the score is linear in sufficient statistics, the Fisher information is the variance of those statistics, and the MLE coincides with method-of-moments on the canonical parameter.

A $k$-parameter exponential family in canonical form has density

$$p(x; \eta) = h(x) \exp\!\big( \eta^\top T(x) - A(\eta) \big),$$

where $T(x) \in \mathbb R^k$ is the sufficient statistic, $\eta \in \mathbb R^k$ is the natural parameter, $A(\eta) = \log \int h(x) \exp(\eta^\top T(x)) \, dx$ is the log-partition function, and $h(x) \geq 0$ is the base measure.

Key identities. $\nabla A(\eta) = \mathbb E_\eta[T(X)] = \mu(\eta)$ (mean of the sufficient statistic). $\nabla^2 A(\eta) = \mathrm{Cov}_\eta(T(X)) = I(\eta)$ (Fisher information equals the covariance of the sufficient statistic, with no Bartlett identity required; it follows from convexity of $A$). Score: $s(\eta; x) = T(x) - \mu(\eta)$.

MLE. The score equation is $\sum_i T(X_i) = n \mu(\hat\eta_n)$, i.e. $\bar T_n = \mu(\hat\eta_n)$. The MLE matches the empirical mean of the sufficient statistic to its model-implied mean. This is the moment-matching property and explains why for canonical-form exponential families MoM and MLE coincide.

Existence and uniqueness. The MLE exists if $\bar T_n$ is in the interior of the convex hull of $\{T(x) : h(x) > 0\}$ (Barndorff-Nielsen 1978). It is unique because $A$ is strictly convex, hence $\mu = \nabla A$ is one-to-one.

Examples reduce to bookkeeping. Bernoulli: $T(x) = x$, $\mu(\eta) = e^\eta/(1+e^\eta) = p$, score equation $\bar X = \hat p$. Poisson: $T(x) = x$, $\mu(\eta) = e^\eta = \lambda$, $\bar X = \hat\lambda$. Multivariate Gaussian with known $\Sigma$: $T(x) = \Sigma^{-1} x$, MLE $\hat\mu = \bar X$. Linear exponential families (GLMs with canonical link) inherit the same structure: the MLE solves $X^\top y = X^\top \mu(X \hat\beta)$.
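
A small sketch of moment matching in a canonical family, assuming NumPy (the Poisson case is illustrative): solving $\bar T_n = \mu(\hat\eta_n)$ by Newton, where the Newton step and the Fisher-scoring step coincide because the Hessian of $A$ is the Fisher information.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.poisson(4.0, size=1_000)
T_bar = x.mean()                       # sufficient statistic T(x) = x

eta = 0.0                              # natural parameter, lambda = exp(eta)
for _ in range(20):
    mu = np.exp(eta)                   # grad A(eta) = E_eta[T]
    fisher = np.exp(eta)               # hess A(eta) = Var_eta(T) = I(eta)
    eta -= (mu - T_bar) / fisher       # Newton step = Fisher-scoring step here

print(eta, np.log(T_bar))              # they agree: lambda_hat = exp(eta) = x-bar
```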

MLE as Empirical Cross-Entropy Minimization

A perspective that pays off in modern ML:

$$\hat\theta_n = \arg\min_\theta \frac{1}{n}\sum_{i=1}^n \big[ -\log p(X_i; \theta) \big],$$

i.e. the MLE minimizes the empirical cross-entropy of the data against the model. For discrete models whose support is consistent with the data, this is also Csiszár's information projection of the empirical distribution onto the model, $\hat\theta_n = \arg\min_\theta D_{\mathrm{KL}}(\hat P_n \| p_\theta)$, since the cross-entropy differs from this KL divergence only by the $\theta$-independent empirical entropy. For continuous-density models the strict measure-theoretic KL is infinite (the empirical $\hat P_n$ has mass at $n$ points, while $p_\theta$ is absolutely continuous with respect to Lebesgue), so the "KL projection" reading is heuristic; the cross-entropy reading is the one that holds rigorously, and it is also what cross-entropy training in practice minimizes. Three corollaries:

  • Cross-entropy training is MLE. Minimizing categorical cross-entropy on classification labels is exactly MLE for a categorical model with class probabilities $\sigma(\mathrm{logits})$. Every softmax classifier is an MLE.
  • Perplexity training is MLE. Language model perplexity is $\exp(\mathrm{NLL}/n)$, so minimizing perplexity equals minimizing NLL equals running MLE on a categorical sequence model. The same holds for autoregressive image and audio models.
  • ELBO and variational MLE. Variational autoencoders cannot compute the marginal likelihood, so they maximize the evidence lower bound (ELBO), which is a tractable surrogate for MLE. Diffusion models do the same with a multi-step variational bound (the $L_t$ losses sum to an upper bound on NLL). When the variational gap is small, you are running approximate MLE.

The MLE is not a niche statistical procedure; it is the asymptotic objective that every probabilistic deep learning method tries to optimize, exactly or approximately.
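
A tiny numerical check that the two readings are the same number, assuming NumPy (the logits and labels are synthetic stand-ins): the average categorical cross-entropy against one-hot targets equals the average negative log-likelihood of the labels under the softmax model.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 6, 4
logits = rng.normal(size=(n, k))                    # stand-in for a classifier's outputs
labels = rng.integers(0, k, size=n)

log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))   # log-softmax
nll = -log_probs[np.arange(n), labels].mean()       # average negative log-likelihood

one_hot = np.eye(k)[labels]                          # categorical cross-entropy, same number
cross_entropy = -(one_hot * log_probs).sum(axis=1).mean()
print(nll, cross_entropy)
```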

Worked Examples

Example

MLE for Gaussian mean (variance known)

$X_1, \ldots, X_n \sim \mathcal N(\mu, \sigma^2)$ with $\sigma^2$ known. Log-likelihood: $\ell(\mu) = -\tfrac n 2 \log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2} \sum_i (X_i - \mu)^2$. Score: $s(\mu) = \tfrac{1}{\sigma^2} \sum_i (X_i - \mu)$. Setting it to zero: $\hat\mu = \bar X$. Fisher information: $I(\mu) = 1/\sigma^2$. Asymptotic variance: $\sigma^2/n$. The result holds exactly at every $n$, not only asymptotically: $\bar X$ is the uniformly minimum-variance unbiased estimator (UMVUE) and attains Cramér-Rao for all sample sizes.

Example

MLE for Gaussian variance

$X_1, \ldots, X_n \sim \mathcal N(\mu, \sigma^2)$, both unknown. The MLE is $\hat\mu = \bar X$, $\hat\sigma^2 = n^{-1} \sum_i (X_i - \bar X)^2$. $\hat\sigma^2$ is biased downward: $\mathbb E[\hat\sigma^2] = \tfrac{n-1}{n} \sigma^2$. The unbiased estimator divides by $n - 1$ and is a routine bias-correction. Fisher information is block-diagonal: $I(\mu, \sigma^2) = \mathrm{diag}(1/\sigma^2, 1/(2\sigma^4))$, so the asymptotic variance of $\hat\sigma^2$ is $2\sigma^4/n$. The MLE is biased but consistent and asymptotically efficient.

Example

MLE for Bernoulli with degenerate sample

$X_1, \ldots, X_{10} \sim \mathrm{Bernoulli}(p)$ with $\sum_i X_i = 0$. $\ell(p) = 10 \log(1 - p)$, maximized at $\hat p = 0$. This is on the boundary of $\Theta = [0, 1]$, so asymptotic normality fails. The Wald CI $\hat p \pm z \sqrt{\hat p (1-\hat p)/n}$ collapses to a single point, which has exactly zero coverage. The fix is either Bayesian (a Beta(1, 1) prior gives posterior Beta(1, 11), mean $1/12$) or a Wilson / Agresti-Coull / score-based CI that does not degenerate on the boundary. This is the canonical demonstration that MLE alone is insufficient near boundaries.
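
A short numeric comparison of the three intervals from this example, assuming NumPy and SciPy (the Wilson half-width uses the sample proportion, as in the standard score-interval formula):

```python
import numpy as np
from scipy import stats

n, k, z = 10, 0, 1.96
p_hat = k / n

wald = (p_hat - z * np.sqrt(p_hat * (1 - p_hat) / n),
        p_hat + z * np.sqrt(p_hat * (1 - p_hat) / n))           # degenerates to (0.0, 0.0)

center = (k + z**2 / 2) / (n + z**2)                             # Wilson score interval
half = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
wilson = (max(0.0, center - half), center + half)

posterior = stats.beta(1 + k, 1 + n - k)                         # Beta(1,1) prior -> Beta(1, 11)
print(wald, wilson, posterior.mean(), posterior.interval(0.95))
```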

Example

MLE for a misspecified exponential

$X_i \sim \mathrm{Gamma}(\alpha, \lambda)$ with $\alpha = 2$ (so the data is not exponential), but you fit the exponential family $p(x; \lambda) = \lambda e^{-\lambda x}$. The exponential MLE is $\hat\lambda = 1/\bar X$. Under $P_0 = \mathrm{Gamma}(2, \lambda_0)$: $\mathbb E[X] = 2/\lambda_0$, so $\hat\lambda \to \lambda_0/2$, the least-false parameter (it minimizes KL from Gamma$(2, \lambda_0)$ to Exponential$(\lambda)$). The exponential model's Fisher information at $\theta = \lambda_0/2$ is $I = 1/\lambda^2 = 4/\lambda_0^2$, but the true score variance under $P_0$ is $A = \mathrm{Var}_{P_0}(s) = 4/\lambda_0^2 \cdot (1/2) = 2/\lambda_0^2$ (because the variance of $X$ under Gamma$(2, \lambda_0)$ is $2/\lambda_0^2$, not $4/\lambda_0^2$). The naive variance $I^{-1}/n = \lambda_0^2/(4n)$ is a factor of 2 too large; the sandwich gives $V^{-1} A V^{-\top} = \lambda_0^2/(8n)$. Reporting $I^{-1}$-based CIs would over-cover by 50%.
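
A Monte Carlo sketch of this example, assuming NumPy (constants are illustrative): the exponential-model MLE drifts to $\lambda_0/2$, and the variance of $\sqrt n(\hat\lambda - \lambda^*)$ matches the sandwich value $\lambda_0^2/8$ rather than the naive $\lambda_0^2/4$.

```python
import numpy as np

rng = np.random.default_rng(9)
lam0, n, reps = 1.0, 2_000, 5_000
x = rng.gamma(shape=2.0, scale=1.0 / lam0, size=(reps, n))     # true data: Gamma(2, lam0)
lam_hat = 1.0 / x.mean(axis=1)                                  # exponential-model MLE

print(lam_hat.mean(), lam0 / 2)                                 # least-false parameter lam0/2
mc_var = n * lam_hat.var()                                      # var of sqrt(n)(lam_hat - lam*)
print(mc_var, lam0**2 / 8, lam0**2 / 4)                         # matches sandwich, not I^{-1}
```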

Example

MLE = MoM in canonical exponential families

For a canonical-form exponential family with sufficient statistic $T$, the score equation is $\bar T_n = \mu(\hat\eta_n)$. The MLE is obtained by matching the empirical first moment of the sufficient statistic to the model-implied first moment. So MLE coincides with the method of moments on the natural parameter, a fact that explains why GLMs with canonical links have closed-form gradients and why their Newton step is identical to the iteratively reweighted least-squares (IRLS) update.

Common Confusions

Watch Out

The likelihood is not a probability distribution over θ

$L_n(\theta)$ is a function of $\theta$ with the data fixed. It does not integrate to one and is not a posterior. The Bayesian construction multiplies $L_n$ by a prior and renormalizes; without a prior, "the probability of $\theta$" is not even well-defined under the frequentist worldview that produced MLE. The MLE is the mode of $L_n$, not the mean, median, or mode of any posterior; those depend on the prior.

Watch Out

MLE is biased in finite samples and that is a separate question from asymptotic efficiency

"Asymptotically efficient" means the asymptotic variance hits the Cramér-Rao bound. It does not mean the finite-sample MLE is unbiased or even minimum-MSE. The Gaussian variance MLE is biased; ridge regression beats OLS in MSE; the James-Stein estimator beats the multivariate Gaussian MLE in MSE for d3d \geq 3. Asymptotic efficiency and finite-sample optimality are different criteria, and the textbook tradition that conflates them is misleading. See empirical-risk-minimization and bias-variance tradeoff for the modern treatment.

Watch Out

James-Stein: MLE can be inadmissible in d ≥ 3 dimensions

Let $X \sim \mathcal N(\theta, I_d)$ with $d \geq 3$. The MLE is $\hat\theta_{\mathrm{MLE}} = X$. The James-Stein estimator $\hat\theta_{\mathrm{JS}} = (1 - (d-2)/\| X \|^2) X$ shrinks $X$ toward zero and has strictly smaller MSE for every $\theta$. This does not contradict Cramér-Rao because James-Stein is biased. The lesson: in high dimensions, the unbiased MLE is inadmissible, and shrinkage/regularization is not a kludge but a structural improvement. This is the theoretical justification for ridge, LASSO, weight decay, dropout, and every other regularizer in modern ML.
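
A Monte Carlo sketch of the risk comparison, assuming NumPy (the choice of $\theta$ and $d$ is illustrative): the total squared-error risk of the James-Stein estimator falls below the MLE's risk of $d$.

```python
import numpy as np

rng = np.random.default_rng(10)
d, reps = 10, 50_000
theta = rng.normal(0, 1, size=d)                     # any fixed theta works
X = theta + rng.normal(size=(reps, d))               # X ~ N(theta, I_d)

mse_mle = ((X - theta) ** 2).sum(axis=1).mean()      # risk of the MLE is d = 10
shrink = 1 - (d - 2) / (X ** 2).sum(axis=1, keepdims=True)
mse_js = ((shrink * X - theta) ** 2).sum(axis=1).mean()
print(mse_mle, mse_js)                               # James-Stein risk is strictly smaller
```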

Watch Out

Consistency requires correct specification (or you get the least-false parameter)

If $P_0 \notin \{p_\theta\}$, the MLE converges to $\theta^* = \arg\min_\theta D_{\mathrm{KL}}(P_0 \| p_\theta)$, not to any "true" $\theta_0$ (which does not exist). This is still useful (you get the closest model in KL), but the asymptotic variance is the sandwich $V^{-1} A V^{-\top}$, not $I^{-1}$, and $-2 \log \Lambda_n$ does not have a clean $\chi^2$ distribution (Vuong 1989). Reporting $I^{-1}$-based confidence intervals on a misspecified model is anti-conservative and one of the most common subtle errors in applied likelihood work.

Watch Out

MLE existence and uniqueness require structural conditions

The MLE may not exist (e.g., a Gaussian mixture with one component collapsing onto a single data point gives infinite likelihood). It may not be unique (multimodal $\ell_n$, label switching in mixtures, overparameterized neural networks). Existence and uniqueness require concavity of $\ell_n$ (canonical exponential families), compactness of $\Theta$, or identifiability, conditions that often need to be imposed or verified explicitly rather than assumed.

Watch Out

Reparameterization invariance is a feature of the MLE, not a bug

If $\eta = g(\theta)$ for a smooth bijection $g$, the MLE of $\eta$ is $g(\hat\theta_{\mathrm{MLE}})$. This invariance is what makes MLE-based inference clean across reparameterizations (canonical vs. mean parameter in exponential families, log-odds vs. probability in logistic regression). Bayesian posteriors are not invariant in the same way (the maximum a posteriori estimate transforms by the Jacobian), so the MAP and MLE disagree under nonlinear reparameterization even with a flat prior. This is a well-known wart of MAP estimation that does not afflict MLE.

Optional Extensions: Non-Regular MLE — Where Asymptotic Theory Breaks

Standard MLE asymptotics assume the regularity conditions stated above (smooth log-density on a fixed support, interior parameter, identifiable, finite Fisher information, etc.). When any of these fails, the standard theorems do not apply — and the failure modes are not pathological edge cases. Three are common enough in practice that confusing them with the regular case produces real bugs.

1. Support depends on the parameter — Uniform $[0, \theta]$

The textbook example. Let $X_1, \ldots, X_n \sim \text{Uniform}[0, \theta]$ with $\theta > 0$ unknown. The likelihood is

$$L_n(\theta) = \prod_{i=1}^n \frac{1}{\theta} \mathbf 1[0 \le X_i \le \theta] = \theta^{-n}\, \mathbf 1[\theta \ge \max_i X_i],$$

a piecewise function: zero for $\theta < X_{(n)}$ (the maximum), then $\theta^{-n}$ decreasing in $\theta$ for $\theta \ge X_{(n)}$. The MLE is $\hat\theta_n = X_{(n)} = \max_i X_i$.

Why standard theory fails. The support $\{x : 0 \le x \le \theta\}$ depends on $\theta$, so you cannot differentiate under the integral sign in the score-equation derivation. The Fisher information diverges (informally, the likelihood has a discontinuous edge at $\theta = X_{(n)}$). The asymptotic-normality theorem assumes a smooth log-likelihood with finite Fisher information; both fail here.

What actually happens. A direct calculation: $\mathbb P(X_{(n)} \le t) = (t/\theta)^n$ for $t \in [0, \theta]$, so the gap $\theta - X_{(n)}$ has CDF $1 - ((\theta - t)/\theta)^n$ on $[0, \theta]$. Letting $n(\theta - X_{(n)}) = T_n$,

$$\mathbb P(T_n \le s) = 1 - \left(1 - \frac{s}{n\theta}\right)^n \xrightarrow{n \to \infty} 1 - e^{-s/\theta},$$

so $T_n \xrightarrow{d} \text{Exp}(1/\theta)$. The MLE converges at rate $1/n$ (not $1/\sqrt n$) to an exponential limit (not Gaussian). The bias is $\theta - \mathbb E[\hat\theta_n] = \theta/(n+1)$ (since $\mathbb E[\hat\theta_n] = n\theta/(n+1)$), an order-$n^{-1}$ term that the bias-corrected estimator $((n+1)/n) X_{(n)}$ removes exactly. Standard $1/\sqrt n$ confidence intervals are wildly wrong; you need the exponential limit law.
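
A simulation sketch of the non-Gaussian limit, assuming NumPy (constants are illustrative): $n(\theta - X_{(n)})$ has mean close to $\theta$ and survival probability $e^{-s/\theta}$, exactly as the exponential limit predicts.

```python
import numpy as np

rng = np.random.default_rng(11)
theta, n, reps = 1.0, 1_000, 10_000
x = rng.uniform(0, theta, size=(reps, n))
t = n * (theta - x.max(axis=1))                       # n (theta - X_(n))

print(t.mean(), theta)                                # exponential limit with mean theta
print(np.mean(t > 1.0), np.exp(-1.0 / theta))         # survival function matches e^{-s/theta}
```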

Pattern. Whenever the log-density has a discontinuity at a parameter-dependent boundary, expect a non-Gaussian, faster-than-$\sqrt n$ rate. This pattern recurs in change-point detection, threshold autoregressions, and any model where the parameter pins down the support.

2. Boundary parameter — variance components in mixed models

Setup: $y_i = x_i^\top \beta + b_i + \varepsilon_i$ with $b_i \sim \mathcal N(0, \sigma_b^2)$ and $\varepsilon_i \sim \mathcal N(0, \sigma^2)$ independent. The variance component $\sigma_b^2 \ge 0$ is non-negative; the parameter space has a boundary at $\sigma_b^2 = 0$.

Why standard theory fails. Asymptotic normality of the MLE assumes the true parameter is in the interior of $\Theta$. When $\sigma_b^2 = 0$ (no random-intercept variation), the true parameter sits on the boundary; the score is constrained to a half-space and the limiting distribution is no longer centered Gaussian.

What actually happens (Self & Liang 1987). Under $H_0: \sigma_b^2 = 0$, the MLE $\hat\sigma^2_b$ has a mixture distribution: with probability $1/2$ it equals zero exactly (the boundary), and with probability $1/2$ it has a half-normal density above zero. The likelihood-ratio statistic $-2\log\Lambda_n$ converges to a 50:50 mixture of $\chi^2_0$ (point mass at zero) and $\chi^2_1$, not to $\chi^2_1$. The naive $\chi^2_1$ test is over-conservative and rejects too rarely.

Practical implication. Any null hypothesis that puts a variance component, mixture weight, or other non-negative parameter at the boundary triggers this. Software that reports $\chi^2_1$-based p-values for "is the random intercept significant" is conservative; halving the reported p-value is the standard adjustment. The same correction applies in lme4 / nlme variance-component tests when the alternative places one variance at the boundary.

3. Identifiability fails — testing one mixture component vs two

Setup: $X_1, \ldots, X_n \sim p\, \mathcal N(\mu_1, 1) + (1 - p)\, \mathcal N(\mu_2, 1)$. Test $H_0: p = 1$ (one component, the second is irrelevant) vs $H_1: p \in [0, 1]$.

Why standard theory fails. Under $H_0$, the parameter $\mu_2$ is not identified: with $p = 1$, the data has no information about $\mu_2$, so the likelihood is flat in that coordinate. Wilks' theorem assumes identifiability of all parameters under both hypotheses; here it fails because the nuisance parameter $\mu_2$ is identified only when $p \ne 1$.

What actually happens (Davies 1977, 1987). The LRT statistic $-2 \log \Lambda_n$ does not converge to $\chi^2_q$. The limit is the supremum of a Gaussian process indexed by $\mu_2$:

$$-2 \log \Lambda_n \xrightarrow{d} \sup_{\mu_2 \in \mathcal M} Z(\mu_2)^2,$$

where $Z(\cdot)$ is a mean-zero Gaussian process whose covariance depends on the design and on the assumed distribution of $\mu_2$ under the alternative. This limit has heavier tails than $\chi^2$, and naive $\chi^2$-based p-values are anti-conservative (reject too often). Davies' upper-bound formula gives a usable correction; in practice BIC and the parametric bootstrap are common alternatives, particularly for selecting the number of components in finite mixture models.

Pattern. Whenever a parameter is only identified under the alternative, Wilks fails and Davies' Gaussian-process supremum is the right limit. This pattern recurs in regime-switching models, hidden-state count selection in HMMs, and any "is this extra structure needed" test.

Why the failure cases compound

The three failures interact. Mixture-component selection is both a non-identifiability problem (Davies) and a boundary-parameter problem (the mixture weight $p$ sits at the boundary $p = 1$ under $H_0$). The combined limit is the supremum of a half-normal Gaussian process, with constants that have to be computed numerically. Self & Liang (1987) and Lin (2002) work out cases. The general rule: when in doubt, run a parametric bootstrap; the asymptotic theory is too fragile to trust in every nonstandard setting.

For nuisance-parameter elimination via profile likelihood, see Davison (2003, ch. 4-5); the profile-likelihood ratio inherits the regularity properties of the underlying full-likelihood theory under interior parameters but breaks the same way at boundaries and under non-identification.

Exercises

Exercise (Core)

Problem

Compute the MLE and asymptotic variance for the Exponential model: $X_1, \ldots, X_n \sim \mathrm{Exp}(\lambda)$, density $p(x; \lambda) = \lambda e^{-\lambda x}$ on $x > 0$. Then derive the asymptotic distribution of $\log \hat\lambda_n$ via the delta method and use it to construct a confidence interval that respects $\lambda > 0$.

Exercise (Core)

Problem

Show that for $X_1, \ldots, X_{10} \sim \mathrm{Bernoulli}(p)$ with $\sum X_i = 0$, the MLE is $\hat p = 0$ and the symmetric Wald CI degenerates. Compute the Bayesian posterior under a Beta(1, 1) prior, give the posterior mean, and compare to the Wilson score CI $\hat p_W \pm z \sqrt{\hat p_W (1 - \hat p_W)/n + z^2/(4n^2)} / (1 + z^2/n)$ where $\hat p_W = (k + z^2/2)/(n + z^2)$ with $z = 1.96$.

Exercise (Advanced)

Problem

Prove the Cramér-Rao bound in the multivariate case: if $T(X) \in \mathbb R^d$ is unbiased for $\theta \in \mathbb R^d$, then $\mathrm{Cov}_\theta(T) \succeq I(\theta)^{-1}$ (single-observation case; the i.i.d. version uses $n I(\theta)$). Use the matrix Cauchy-Schwarz: for any random vectors $U, V$ with $\mathrm{Cov}(V)$ invertible, $\mathrm{Cov}(U) \succeq \mathrm{Cov}(U, V) \mathrm{Cov}(V)^{-1} \mathrm{Cov}(U, V)^\top$.

Exercise (Advanced)

Problem

Logistic regression has log-likelihood $\ell(\beta) = \sum_i [y_i x_i^\top \beta - \log(1 + e^{x_i^\top \beta})]$ for $y_i \in \{0, 1\}$ and $x_i \in \mathbb R^d$. (a) Derive the score and Fisher information. (b) Show that $\ell$ is concave. (c) Show that the Newton step is identical to one step of iteratively reweighted least squares (IRLS) with weights $w_i = \pi_i(1 - \pi_i)$ where $\pi_i = \sigma(x_i^\top \beta)$. (d) When does the MLE not exist? (Hint: think about complete linear separation.)

Exercise (Research)

Problem

Wilks' theorem under misspecification. Suppose the true distribution is $P_0$ and you fit two nested model families $\mathcal P_0 \subset \mathcal P_1$, both potentially misspecified. Let $\theta^*_j$ be the least-false parameter in $\mathcal P_j$. Show that under a contiguous hypothesis $\theta^*_0 = \theta^*_1$ (so the projections coincide even though neither family contains $P_0$), the asymptotic distribution of $-2 \log \Lambda_n$ is no longer $\chi^2_q$ but a weighted sum of independent $\chi^2_1$ variables with weights given by the eigenvalues of $A V^{-1}$ (Vuong 1989). Derive the weights for the case where $\mathcal P_1$ is a Gaussian regression with $q$ extra covariates and the data is heteroskedastic.

Numerical Optimization in Practice

The MLE rarely has a closed form outside of exponential families. Standard methods:

  • Newton-Raphson. $\theta_{t+1} = \theta_t - (\nabla^2 \ell_n)^{-1} \nabla \ell_n$. Quadratic convergence near the optimum but requires the Hessian; can diverge if started far from the truth or near a saddle.
  • Fisher scoring. Replace the observed Hessian by the expected (Fisher) information: $\theta_{t+1} = \theta_t + I(\theta_t)^{-1} \nabla \ell_n$. Often more numerically stable than Newton because $I(\theta)$ is positive definite by construction; identical to Newton in canonical-form exponential families (a logistic-regression sketch follows this list).
  • EM. When the likelihood involves latent variables (mixtures, hidden Markov models, factor analysis), the EM algorithm iterates an expectation step and a maximization step; each iteration is guaranteed to increase $\ell_n$ but converges only to a local optimum.
  • Stochastic gradient descent. For large datasets and complex models (deep neural networks), full-batch Newton is infeasible. SGD on the per-sample negative log-likelihood is the dominant approach, with adaptive variants (Adam, AdamW) handling ill-conditioning. The connection to the asymptotic theory above is loose: SGD does not converge to the MLE in any clean asymptotic sense without infinite passes, and modern overparameterized models live in the regime where the MLE is not unique (every zero-loss solution is an MLE).
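
A minimal sketch of Newton/Fisher scoring for logistic regression, assuming NumPy (the simulated design and coefficients are illustrative); because the logit link is canonical, the observed and expected information coincide and each iteration is one IRLS step.

```python
import numpy as np

rng = np.random.default_rng(12)
n, d = 1_000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta_true = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

beta = np.zeros(d)
for _ in range(25):
    pi = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (y - pi)                       # gradient of the log-likelihood
    W = pi * (1 - pi)                            # IRLS weights
    fisher = X.T @ (X * W[:, None])              # expected = observed information here
    step = np.linalg.solve(fisher, score)
    beta = beta + step
    if np.linalg.norm(step) < 1e-10:
        break

print(beta)                                      # close to beta_true for moderate n
```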

Summary

  • Definition. $\hat\theta_n = \arg\max_\theta \sum_i \log p(X_i; \theta)$, the parameter that maximizes the joint density of observed data.
  • Score. $s(\theta; x) = \nabla_\theta \log p(x; \theta)$, mean-zero at $\theta_0$.
  • Bartlett identities. $\mathbb E[s] = 0$ and $\mathrm{Cov}(s) = I(\theta) = -\mathbb E[\nabla^2 \log p]$; the engine behind every classical MLE asymptotic.
  • Consistency. $\hat\theta_n \xrightarrow{P} \theta_0$ under identifiability + Wald's compactness or concavity.
  • Asymptotic normality. $\sqrt n (\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal N(0, I^{-1})$ under regularity (or DQM).
  • Wilks. $-2 \log \Lambda_n \xrightarrow{d} \chi^2_q$ under $H_0$, non-central $\chi^2_q$ under contiguous alternatives.
  • Cramér-Rao. $\mathrm{Var}(T) \geq 1/(n I(\theta))$ for unbiased $T$; MLE attains this asymptotically.
  • QMLE. Under misspecification $\hat\theta_n \to \theta^* = \arg\min D_{\mathrm{KL}}(P_0 \| p_\theta)$, with sandwich variance $V^{-1} A V^{-\top}$.
  • Exponential families. Score is $T(x) - \mu(\eta)$; MLE = MoM on the canonical parameter; Newton = Fisher scoring = IRLS.
  • Modern ML. Cross-entropy training, perplexity training, ELBO, diffusion NLL: all are MLE in disguise.
  • Failure modes. Boundary parameters, non-identifiability, Neyman-Scott, infinite-dimensional families, complete separation in logistic regression.

References

Canonical:

  • van der Vaart, Asymptotic Statistics (Cambridge UP, 1998). Chapter 5 (M- and Z-estimators), Chapter 7 (LAN), Chapter 16 (LRT and Wilks). The single most-used reference for modern likelihood asymptotics.
  • Lehmann & Casella, Theory of Point Estimation, 2nd ed. (Springer, 1998). Chapter 6 collects regularity conditions, the Cramér-Rao bound, and the MLE asymptotics in classical form.
  • Casella & Berger, Statistical Inference, 2nd ed. (Duxbury, 2002). Chapters 7 and 10 give the standard graduate-textbook treatment with worked examples.
  • Cramér, Mathematical Methods of Statistics (Princeton UP, 1946). The original detailed treatment of MLE asymptotics; still readable and clarifying on the regularity conditions.
  • Wald, Note on the consistency of the maximum likelihood estimate (Annals of Math. Stat., 1949). The classical compactness-based consistency proof.
  • Le Cam, Asymptotic Methods in Statistical Decision Theory (Springer, 1986). The deepest treatment of LAN, contiguity, and Hájek's convolution theorem; the modern foundation for "MLE is asymptotically optimal".
  • Bartlett, Approximate confidence intervals II (Biometrika, 1953). The Bartlett identities, derived as cumulant expansions of the log-likelihood.

Current:

  • Wasserman, All of Statistics (Springer, 2004). Chapters 9-10 give a compact summary; useful for first-pass exposure.
  • Keener, Theoretical Statistics: Topics for a Core Course (Springer, 2010). Chapters 7-9 build MLE asymptotics with a stronger focus on examples than van der Vaart.
  • Murphy, Probabilistic Machine Learning: An Introduction (MIT Press, 2022). Chapter 4 covers MLE in the deep-learning context: cross-entropy, NLL, regularization, and the connection to ERM.
  • White, Estimation, Inference, and Specification Analysis (Cambridge UP, 1994). The book-length treatment of QMLE and sandwich-variance inference for econometrics.
  • Newey & McFadden, Large Sample Estimation and Hypothesis Testing (Handbook of Econometrics IV, 1994). Modern reformulation of MLE as a special case of M-estimation; explicit treatment of misspecification and IV.
  • Davison, Statistical Models (Cambridge UP, 2003). Chapters 4-5 cover likelihood theory at the level needed for applied work; particularly good on profile and conditional likelihoods.

Critique:

  • White, Maximum Likelihood Estimation of Misspecified Models (Econometrica, 1982). The foundational paper on QMLE: shows the MLE converges to the least-false parameter and derives the sandwich variance.
  • Vuong, Likelihood ratio tests for model selection and non-nested hypotheses (Econometrica, 1989). Shows that Wilks' theorem fails under misspecification and constructs a corrected LRT whose limit is a weighted $\chi^2$ mixture.
  • Self & Liang, Asymptotic properties of MLE and LRT under nonstandard conditions (JASA, 1987). The boundary-parameter case; the LRT limit becomes a mixture of $\chi^2$ distributions.
  • Hodges (1951), discussed in van der Vaart Theorem 8.1. The super-efficient estimator that beats the MLE on a measure-zero set, motivating the regular qualifier in Hájek's convolution theorem.
  • Davies, Hypothesis testing when a nuisance parameter is present only under the alternative (Biometrika, 1977 and 1987). Mixture-component testing where Wilks fails because the nuisance parameter is unidentified under the null.
  • Stein, Inadmissibility of the usual estimator for the mean of a multivariate normal distribution (Berkeley Symposium, 1956); James & Stein, Estimation with quadratic loss (1961). The original demonstration that MLE is inadmissible in $d \geq 3$; the foundational result for modern shrinkage and regularization theory.


Last reviewed: May 4, 2026
