
Statistical Foundations

Neyman-Pearson and Hypothesis Testing Theory

The likelihood ratio test is the most powerful test for simple hypotheses (Neyman-Pearson lemma), UMP tests extend this to one-sided composites, and the power function characterizes a test's behavior across the parameter space.

Core · Tier 2 · Stable · Supporting · ~50 min

Why This Matters

Hypothesis testing is the formal framework for making binary decisions from data: is the drug effective? Is the model better than the baseline? The Neyman-Pearson lemma answers a precise optimization question: among all tests with Type I error at most $\alpha$, which test has the highest probability of correctly rejecting a false null hypothesis?

The answer is the likelihood ratio test. This result is the foundation for understanding power analysis, sample size calculations, and the deep connection between hypothesis testing and binary classification.

Formal Setup

Definition

Hypothesis Test

A hypothesis test for $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ is a function $\phi: \mathcal{X} \to [0, 1]$ where $\phi(x)$ is the probability of rejecting $H_0$ given data $x$. The size (Type I error rate) is $\alpha(\phi) = \mathbb{E}_{\theta_0}[\phi(X)]$. The power against $\theta_1$ is $\beta(\phi) = \mathbb{E}_{\theta_1}[\phi(X)]$.

A test $\phi$ has level $\alpha$ if $\mathbb{E}_{\theta_0}[\phi(X)] \leq \alpha$.

Definition

Power Function

The power function of a test $\phi$ is:

$$\beta_\phi(\theta) = \mathbb{E}_\theta[\phi(X)] = P_\theta(\text{reject } H_0)$$

For a good test, $\beta_\phi(\theta_0) \leq \alpha$ (controlled Type I error) and $\beta_\phi(\theta)$ is large for $\theta$ far from $\theta_0$ (high power against alternatives).
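The power function of a one-sided $z$-test can be evaluated directly. The sketch below is an illustration with parameters of my own choosing ($n = 25$, $\sigma = 2$, matching the canonical example later in this page), not part of the source:

```python
from math import sqrt
from statistics import NormalDist

# Power function of the one-sided z-test of H0: mu <= 0 vs H1: mu > 0,
# which rejects when sqrt(n) * Xbar / sigma > z_{1-alpha}.
def power(mu, n=25, sigma=2.0, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha)            # critical value z_{1-alpha}
    # Under mu, sqrt(n) * Xbar / sigma ~ N(sqrt(n) * mu / sigma, 1).
    return 1 - NormalDist().cdf(z - sqrt(n) * mu / sigma)

print(round(power(0.0), 3))   # size at the null boundary: 0.05
print(round(power(1.0), 3))   # power far from the null: ≈ 0.804
```

The curve equals $\alpha$ at $\theta_0$ and climbs toward 1 as $\mu$ moves away from the null, exactly the shape the definition asks for.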

Main Theorems

Lemma

Neyman-Pearson Lemma

Statement

For testing $H_0: \theta = \theta_0$ vs $H_1: \theta = \theta_1$ at level $\alpha$, the test that rejects when the likelihood ratio exceeds a threshold:

$$\phi^*(x) = \begin{cases} 1 & \text{if } \frac{f_1(x)}{f_0(x)} > k \\ \gamma & \text{if } \frac{f_1(x)}{f_0(x)} = k \\ 0 & \text{if } \frac{f_1(x)}{f_0(x)} < k \end{cases}$$

where $k$ and $\gamma$ are chosen so that $\mathbb{E}_{\theta_0}[\phi^*(X)] = \alpha$, is the most powerful level-$\alpha$ test. That is, for any other test $\phi$ with $\mathbb{E}_{\theta_0}[\phi(X)] \leq \alpha$:

$$\mathbb{E}_{\theta_1}[\phi^*(X)] \geq \mathbb{E}_{\theta_1}[\phi(X)]$$

Intuition

The likelihood ratio $f_1(x)/f_0(x)$ measures how much more likely the data $x$ is under $H_1$ than under $H_0$. Rejecting when this ratio is large is the optimal strategy for distinguishing the two hypotheses. Data points where $H_1$ is much more likely than $H_0$ provide the strongest evidence against $H_0$, so the test allocates its rejection budget to these points first.
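A small Monte Carlo check makes this concrete (an illustration with a setup of my choosing, not from the source): for $X_1, \ldots, X_{10} \sim \mathcal{N}(\theta, 1)$ with $H_0: \theta = 0$ vs $H_1: \theta = 1$, the likelihood ratio is monotone in $\bar{X}$, so the Neyman-Pearson test rejects for large $\bar{X}$; a rival test that looks only at $X_1$ also has level $0.05$ but wastes information:

```python
import random
from statistics import NormalDist, mean

random.seed(0)
n, alpha = 10, 0.05
z = NormalDist().inv_cdf(1 - alpha)       # both tests have size alpha under H0

def trial(theta):
    x = [random.gauss(theta, 1) for _ in range(n)]
    np_reject = mean(x) * n ** 0.5 > z    # Neyman-Pearson (likelihood ratio) test
    naive_reject = x[0] > z               # valid level-alpha test, but wasteful
    return np_reject, naive_reject

results = [trial(1.0) for _ in range(20000)]
np_power = sum(r[0] for r in results) / len(results)
naive_power = sum(r[1] for r in results) / len(results)
print(np_power > naive_power)  # True: the NP test has strictly higher power
```

With these constants the simulated powers land near $0.93$ and $0.26$: same Type I error budget, very different ability to detect the alternative.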

Proof Sketch

Let $\phi$ be any level-$\alpha$ test. We need to show $\mathbb{E}_1[\phi^*] \geq \mathbb{E}_1[\phi]$. Write $\mathbb{E}_1[\phi^* - \phi] = \int (\phi^* - \phi)(f_1 - k f_0)\,d\mu + k \int (\phi^* - \phi) f_0\,d\mu$. By construction of $\phi^*$: when $f_1 > k f_0$, $\phi^* = 1 \geq \phi$; when $f_1 < k f_0$, $\phi^* = 0 \leq \phi$. So $(\phi^* - \phi)(f_1 - k f_0) \geq 0$ everywhere. The second integral equals $k(\alpha - \mathbb{E}_0[\phi]) \geq 0$ since $\phi$ has level at most $\alpha$. Both terms are nonnegative, so $\mathbb{E}_1[\phi^*] \geq \mathbb{E}_1[\phi]$.

Why It Matters

The Neyman-Pearson lemma is one of the cleanest optimality results in statistics. It says the optimal test statistic is the likelihood ratio, period. Every commonly used test (t-test, z-test, chi-squared test) can be understood as a likelihood ratio test for a specific distributional assumption.

Failure Mode

The lemma applies only to simple hypotheses (point null vs. point alternative). For composite hypotheses (e.g., $H_0: \theta \leq \theta_0$), the most powerful test depends on which $\theta_1$ you want power against, and a uniformly most powerful test may not exist.

Uniformly Most Powerful Tests

Theorem

UMP Tests via Monotone Likelihood Ratio

Statement

If the family $\{f_\theta\}$ has a monotone likelihood ratio in $T(X)$ (i.e., $f_{\theta_1}(x)/f_{\theta_0}(x)$ is nondecreasing in $T(x)$ for $\theta_1 > \theta_0$), then for testing $H_0: \theta \leq \theta_0$ versus $H_1: \theta > \theta_0$, the test that rejects for large $T(X)$:

$$\phi^*(x) = \begin{cases} 1 & \text{if } T(x) > c \\ \gamma & \text{if } T(x) = c \\ 0 & \text{if } T(x) < c \end{cases}$$

where $c$ and $\gamma$ give size $\alpha$, is uniformly most powerful (UMP). It has the highest power against every $\theta_1 > \theta_0$ simultaneously.

Intuition

When the likelihood ratio is monotone in $T(X)$, the Neyman-Pearson test for any specific $\theta_1 > \theta_0$ always rejects for large $T(X)$. Since the test does not depend on which $\theta_1$ we target, it is simultaneously most powerful against all alternatives on one side. Exponential families always have a monotone likelihood ratio in their natural sufficient statistic.
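The exponential-family remark is easy to verify numerically. A sketch with a family and constants of my own choosing: for Poisson($\theta$), the likelihood ratio for $\theta_1 > \theta_0$ is $(\theta_1/\theta_0)^x\, e^{-(\theta_1 - \theta_0)}$, strictly increasing in the sufficient statistic $x$:

```python
from math import exp, factorial

# Poisson(theta) pmf; theta values below are illustrative choices.
def pmf(x, theta):
    return theta ** x * exp(-theta) / factorial(x)

# The ratio f_{theta1}(x) / f_{theta0}(x) for theta1 > theta0 should be
# increasing in x: monotone likelihood ratio in the sufficient statistic.
ratios = [pmf(x, 2.0) / pmf(x, 1.0) for x in range(6)]
is_mlr = all(a < b for a, b in zip(ratios, ratios[1:]))
print(is_mlr)  # True
```

Each step multiplies the ratio by $\theta_1/\theta_0 = 2$, which is exactly the geometric growth the closed form predicts.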

Proof Sketch

For any $\theta_1 > \theta_0$, the Neyman-Pearson test rejects when $f_{\theta_1}/f_{\theta_0} > k$. By the monotone likelihood ratio property, this is equivalent to $T(x) > c(\theta_1)$. But the size constraint $\mathbb{E}_{\theta_0}[\phi(X)] = \alpha$ determines $c$ uniquely, so $c(\theta_1) = c$ for all $\theta_1$, and the same test is most powerful against every $\theta_1 > \theta_0$. The MLR property also makes the power function nondecreasing in $\theta$, so the test's size over the entire composite null $\theta \leq \theta_0$ is attained at $\theta_0$ and equals $\alpha$.

Why It Matters

UMP tests exist only in restricted settings (one-parameter families with one-sided alternatives). For two-sided alternatives or multiparameter families, UMP tests typically do not exist, and one must settle for locally most powerful or likelihood ratio tests.

Failure Mode

For two-sided alternatives ($H_1: \theta \neq \theta_0$), no UMP test exists in general. The Neyman-Pearson test for $\theta_1 > \theta_0$ differs from the test for $\theta_1 < \theta_0$. Common practice uses the two-sided likelihood ratio test, which is not UMP but is unbiased.

Generalized Likelihood Ratio Test

The Neyman-Pearson lemma assumes simple hypotheses. For composite hypotheses $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta$, the generalized likelihood ratio test (GLRT) replaces the two point likelihoods with suprema over each hypothesis.

Definition

Generalized Likelihood Ratio

The GLRT statistic for $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta$ is:

$$\Lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid x)}{\sup_{\theta \in \Theta} L(\theta \mid x)}$$

The test rejects $H_0$ when $\Lambda(x) < c$ for some threshold $c$. Equivalently, it rejects when $-2 \log \Lambda(x)$ is large.

The numerator is the best fit under the null constraint. The denominator is the best fit under the full model. A small ratio means the null constraint costs a lot of likelihood, which is evidence against $H_0$.
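A minimal sketch of the statistic, assuming a Gaussian mean with known variance (my example, not the source's): the restricted fit pins $\mu = \mu_0$, the unrestricted fit uses the MLE $\bar{x}$, and the Gaussian normalizing constants cancel in the ratio:

```python
from statistics import mean

# GLRT for H0: mu = mu0 in N(mu, sigma^2) with sigma known (illustrative).
# log L(mu) = const - sum((x_i - mu)^2) / (2 sigma^2), so -2 log Lambda is
# the difference of the two residual sums of squares over sigma^2, which
# simplifies to n * (xbar - mu0)^2 / sigma^2.
def neg2_log_lambda(x, mu0=0.0, sigma=1.0):
    xbar = mean(x)                                # unrestricted MLE
    ss0 = sum((xi - mu0) ** 2 for xi in x)        # best fit under the null
    ss1 = sum((xi - xbar) ** 2 for xi in x)       # best fit under the full model
    return (ss0 - ss1) / sigma ** 2

x = [0.3, -0.1, 0.8, 0.2, -0.4]                   # hypothetical data
print(round(neg2_log_lambda(x), 4))               # n * xbar^2 = 0.128 here
```

The null constraint costs only a small amount of likelihood for this sample, so there is little evidence against $H_0$.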

Theorem

Wilks' Theorem

Statement

Under $H_0$ and standard regularity conditions, the GLRT statistic satisfies:

$$-2 \log \Lambda(X) \xrightarrow{d} \chi^2_k$$

as $n \to \infty$, where $k = \dim(\Theta) - \dim(\Theta_0)$ is the number of restrictions imposed by the null.

Intuition

Near the true parameter, the log-likelihood is approximately quadratic by a Taylor expansion, with Hessian given by the Fisher information. Maximizing over the restricted parameter space rather than the full space sacrifices a quadratic form in a $k$-dimensional Gaussian score, and that quadratic form is chi-squared with $k$ degrees of freedom.

Why It Matters

Wilks' theorem gives an asymptotic null distribution for likelihood ratio tests without requiring a closed-form finite-sample distribution. Every regression $F$-test, every goodness-of-fit test, and every nested model comparison in practice uses a chi-squared calibration derived from this result.
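The chi-squared calibration can be checked by simulation (an illustrative setup of my choosing): for a Gaussian mean with known unit variance, $-2\log\Lambda = n\bar{X}^2$, so under $H_0$ about 5% of draws should exceed the $\chi^2_1$ 0.95 quantile, $3.841$:

```python
import random
from statistics import mean

random.seed(1)
n, reps, crit = 20, 20000, 3.841     # crit = chi^2_1 quantile at 0.95

def stat():
    # -2 log Lambda for H0: mu = 0 in N(mu, 1) reduces to n * Xbar^2.
    xbar = mean(random.gauss(0.0, 1.0) for _ in range(n))
    return n * xbar ** 2

rejection_rate = sum(stat() > crit for _ in range(reps)) / reps
print(rejection_rate)  # close to the nominal 0.05
```

For this particular model the $\chi^2_1$ law is exact at every $n$, so the simulated rate matches the nominal level even without large samples; Wilks' theorem says the same calibration holds asymptotically in general regular models.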

Failure Mode

Regularity fails when the true parameter lies on the boundary of $\Theta_0$ (e.g., testing whether a variance component equals zero), when the model is not identified under the null, or when parameters are unbounded. In these cases the limiting distribution is a mixture of chi-squared distributions or something non-standard.

Asymptotic Equivalence of the Three Classical Tests

Three test statistics are commonly used for the same composite null $H_0: \theta \in \Theta_0$ in a regular parametric model:

  • Likelihood ratio (Wilks): $-2 \log \Lambda(X)$
  • Wald: $(\hat\theta - \theta_0)^\top I(\hat\theta)(\hat\theta - \theta_0)$
  • Score (Rao): $U(\theta_0)^\top I(\theta_0)^{-1} U(\theta_0)$, where $U$ is the score function

Under $H_0$ and regularity conditions, all three converge in distribution to $\chi^2_k$ with $k$ equal to the codimension of $\Theta_0$. They differ in finite samples: the Wald test uses the unrestricted MLE, the score test uses only the restricted estimator, and the LRT uses both. Under local alternatives $\theta_n = \theta_0 + h/\sqrt{n}$, the three statistics have the same noncentral $\chi^2_k(\lambda)$ limit with noncentrality $\lambda = h^\top I(\theta_0)\, h$, but they can disagree sharply in the non-local regime and under misspecification. See also asymptotic relative efficiency for how these comparisons extend to general test sequences.
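The finite-sample agreement is easy to see in a one-parameter example (the data values below are hypothetical): for a Bernoulli proportion with $H_0: p = p_0$, there is $k = 1$ restriction, and each statistic is referred to $\chi^2_1$:

```python
from math import log

# LR, Wald, and score statistics for H0: p = p0 with X_i ~ Bernoulli(p).
def classical_tests(successes, n, p0):
    ph = successes / n                               # unrestricted MLE
    lr = 2 * (successes * log(ph / p0)
              + (n - successes) * log((1 - ph) / (1 - p0)))
    wald = n * (ph - p0) ** 2 / (ph * (1 - ph))      # information at the MLE
    score = n * (ph - p0) ** 2 / (p0 * (1 - p0))     # information at the null
    return lr, wald, score

lr, wald, score = classical_tests(successes=58, n=100, p0=0.5)
print(round(lr, 3), round(wald, 3), round(score, 3))  # all near 2.6
```

At $n = 100$ the three values differ only in the second decimal, exactly the first-order agreement the asymptotics predict; the gaps widen for small $n$ or for $\hat p$ far from $p_0$.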

Sample Size and Power Calculation

Before running an experiment, one fixes the Type I error $\alpha$ and the minimum effect $\delta$ worth detecting, then asks: what sample size $n$ is needed to achieve power $1 - \beta$?

For a one-sided $z$-test of $H_0: \mu = \mu_0$ versus $H_1: \mu = \mu_1$ with known variance $\sigma^2$ and effect size $\delta = \mu_1 - \mu_0 > 0$, the power of the Neyman-Pearson test at sample size $n$ is:

$$1 - \beta = \Phi\left(\frac{\delta \sqrt{n}}{\sigma} - z_{1-\alpha}\right)$$

Solving for $n$ gives:

$$n \approx \left(\frac{(z_{1-\alpha} + z_{1-\beta})\, \sigma}{\delta}\right)^2$$

This is the operational formula used in power analysis. Three readings:

  • Required $n$ scales as $1/\delta^2$. Halving the detectable effect quadruples the sample size.
  • Required $n$ scales with $\sigma^2$. Variance reduction (blocking, covariates, paired designs) pays off quadratically.
  • The factor $(z_{1-\alpha} + z_{1-\beta})^2$ is about $6.18$ for one-sided $\alpha = 0.05$ and $1 - \beta = 0.80$, and about $8.56$ for one-sided $\alpha = 0.05$ and $1 - \beta = 0.90$. Moving from 80% to 90% power costs roughly 39% more observations.

For a two-sided test, replace $z_{1-\alpha}$ with $z_{1-\alpha/2}$.

Example

Clinical trial sizing

A trial compares a new drug to placebo on a continuous outcome with $\sigma = 10$. The minimum clinically meaningful effect is $\delta = 2$. Targeting $\alpha = 0.05$ (one-sided) and $1 - \beta = 0.90$:

$$n \approx \left(\frac{(1.645 + 1.282) \cdot 10}{2}\right)^2 \approx 214$$

per arm. Shrinking the detectable effect to $\delta = 1$ would require about $857$ per arm.
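The sizing arithmetic above can be reproduced with exact normal quantiles (a sketch; the tiny discrepancies versus the text come from its rounded quantiles):

```python
from statistics import NormalDist

# Sample size per arm for a one-sided z-test: n = ((z_a + z_b) * sigma / delta)^2.
def n_required(delta, sigma, alpha=0.05, power=0.90):
    nd = NormalDist()
    z_a, z_b = nd.inv_cdf(1 - alpha), nd.inv_cdf(power)
    return ((z_a + z_b) * sigma / delta) ** 2

print(round(n_required(delta=2, sigma=10), 1))  # ≈ 214 per arm
print(round(n_required(delta=1, sigma=10), 1))  # ≈ 856, i.e. 857 after rounding up
```

In practice the result is rounded up to the next integer, since power is monotone in $n$.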

Connection to Binary Classification

Hypothesis testing and binary classification solve the same problem: given an observation $x$, decide between two classes. The Neyman-Pearson lemma says the optimal decision boundary is a level set of the likelihood ratio $f_1(x)/f_0(x)$. This is equivalent to the Bayes-optimal classifier when the class priors are adjusted to match the significance level.

Specifically: the ROC curve of the likelihood ratio classifier dominates the ROC curve of any other classifier. Every point on the ROC curve corresponds to a Neyman-Pearson test at a different level $\alpha$.
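The dominance can be computed analytically in a toy setup of my choosing: classifying $\mathcal{N}(0,1)$ (null) against $\mathcal{N}(2,1)$ (alternative), the likelihood ratio is increasing in $x$, so the Neyman-Pearson rule thresholds $x$; a rival rule thresholding $|x|$ wastes rejection budget on large negative $x$ and sits below the NP ROC at every operating point:

```python
from statistics import NormalDist

nd0, nd1 = NormalDist(), NormalDist(mu=2)   # class-0 and class-1 densities

def tpr_np(fpr):
    # NP rule: reject for x > t, with t chosen so P0(X > t) = fpr.
    t = nd0.inv_cdf(1 - fpr)
    return 1 - nd1.cdf(t)

def tpr_abs(fpr):
    # Rival rule: reject for |x| > t, with t chosen so P0(|X| > t) = fpr.
    t = nd0.inv_cdf(1 - fpr / 2)
    return nd1.cdf(-t) + 1 - nd1.cdf(t)

for fpr in (0.01, 0.05, 0.2):
    print(fpr, round(tpr_np(fpr), 3), round(tpr_abs(fpr), 3))
```

At every false positive rate the NP column is larger (for example, roughly $0.64$ versus $0.52$ at a 5% false positive rate), tracing out the dominating ROC curve point by point.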

Common Confusions

Watch Out

Power is not 1 minus the p-value

The p-value is a random variable computed from data. Power is a fixed property of the test design, computed before seeing data. Power is $P_{\theta_1}(\text{reject } H_0)$ for a specific alternative $\theta_1$. The p-value is $P_{\theta_0}(T \geq T_{\text{obs}})$. They measure different things.
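One way to see the distinction (an illustrative simulation, not from the source): draw many datasets from a fixed alternative, compute the random p-value each time, and observe that the fraction of p-values falling below $\alpha$ recovers the fixed design power:

```python
import random
from statistics import NormalDist, mean

random.seed(2)
nd = NormalDist()
n, sigma, mu1, reps, alpha = 25, 2.0, 1.0, 10000, 0.05

def p_value(mu):
    # One-sided z-test p-value: a new random variable for every dataset.
    xbar = mean(random.gauss(mu, sigma) for _ in range(n))
    return 1 - nd.cdf(xbar * n ** 0.5 / sigma)

# Power is the fixed probability that this random p-value lands below alpha.
power_hat = sum(p_value(mu1) < alpha for _ in range(reps)) / reps
print(round(power_hat, 2))   # close to the design power of about 0.80
```

Each simulated p-value is different; only their rejection frequency under the alternative is the fixed quantity called power.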

Watch Out

A test can be most powerful and still have low power

The Neyman-Pearson lemma says the likelihood ratio test is the best among all level-$\alpha$ tests. It does not say the power is high. If the sample size is small or $\theta_1$ is close to $\theta_0$, even the most powerful test may have low power. "Most powerful" is a relative statement, not an absolute one.

Canonical Examples

Example

Testing a Gaussian mean

Let $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$. Test $H_0: \mu = 0$ vs $H_1: \mu = 1$. The likelihood ratio is $\exp(n\bar{X}/\sigma^2 - n/(2\sigma^2))$, which is monotone increasing in $\bar{X}$. The Neyman-Pearson test rejects when $\bar{X} > \sigma z_{1-\alpha} / \sqrt{n}$, where $z_{1-\alpha}$ is the standard normal $(1-\alpha)$-quantile. Power at $\mu = 1$ is $\Phi(\sqrt{n}/\sigma - z_{1-\alpha})$. For $n = 25$ and $\sigma = 2$, power is $\Phi(2.5 - 1.645) = \Phi(0.855) \approx 0.80$.
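The final computation in the example can be checked directly (a sketch using Python's standard library):

```python
from statistics import NormalDist

# Power of the NP test at mu = 1 with n = 25, sigma = 2, alpha = 0.05:
# Phi(sqrt(n)/sigma - z_{1-alpha}).
nd = NormalDist()
power = nd.cdf(25 ** 0.5 / 2 - nd.inv_cdf(0.95))
print(round(power, 3))  # ≈ 0.804, the "about 0.80" in the example
```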

Summary

  • The Neyman-Pearson lemma: the likelihood ratio test is the most powerful test for simple hypotheses
  • UMP tests exist for one-sided alternatives in exponential families via the monotone likelihood ratio
  • The power function $\beta_\phi(\theta)$ characterizes a test across the entire parameter space
  • The ROC curve of the likelihood ratio classifier dominates the ROC curves of all other classifiers
  • UMP tests do not exist for two-sided alternatives in general

Exercises

ExerciseCore

Problem

Let $X \sim \text{Bernoulli}(p)$ with a single observation. For testing $H_0: p = 0.5$ vs $H_1: p = 0.8$ at level $\alpha = 0.5$, write down the Neyman-Pearson test and compute its power.

ExerciseAdvanced

Problem

Prove that for testing $H_0: \mu = 0$ vs $H_1: \mu \neq 0$ with $X \sim \mathcal{N}(\mu, 1)$ (single observation), no UMP level-$\alpha$ test exists.

References

Canonical:

  • Lehmann & Romano, Testing Statistical Hypotheses (3rd ed., 2005), Chapters 3-4
  • Casella & Berger, Statistical Inference (2nd ed., 2002), Chapter 8

Current:

  • Wasserman, All of Statistics (2004), Chapter 10
  • van der Vaart, Asymptotic Statistics (1998), Chapters 14-15 (LAN, asymptotic testing, LRT, Wilks, Wald, Rao)
  • Keener, Theoretical Statistics (2010), Chapters 3-8


Last reviewed: April 26, 2026
