Statistical Foundations
Neyman-Pearson and Hypothesis Testing Theory
The likelihood ratio test is the most powerful test for simple hypotheses (Neyman-Pearson lemma), UMP tests extend this to one-sided composites, and the power function characterizes a test's behavior across the parameter space.
Why This Matters
Hypothesis testing is the formal framework for making binary decisions from data: is the drug effective? Is the model better than the baseline? The Neyman-Pearson lemma answers a precise optimization question: among all tests with Type I error at most $\alpha$, which test has the highest probability of correctly rejecting a false null hypothesis?
The answer is the likelihood ratio test. This result is the foundation for understanding power analysis, sample size calculations, and the deep connection between hypothesis testing and binary classification.
Formal Setup
Hypothesis Test
A hypothesis test of $H_0$ versus $H_1$ is a function $\phi: \mathcal{X} \to [0, 1]$, where $\phi(x)$ is the probability of rejecting $H_0$ given data $x$. The size (Type I error rate) is $\sup_{\theta \in \Theta_0} \mathbb{E}_\theta[\phi(X)]$. The power against an alternative $\theta_1$ is $\mathbb{E}_{\theta_1}[\phi(X)]$.
A test has level $\alpha$ if its size is at most $\alpha$.
Power Function
The power function of a test $\phi$ is:

$$\beta(\theta) = \mathbb{E}_\theta[\phi(X)] = \mathbb{P}_\theta(\text{reject } H_0).$$

For a good test, $\beta(\theta_0) \le \alpha$ (controlled Type I error) and $\beta(\theta)$ is large for $\theta$ far from $\theta_0$ (high power against alternatives).
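The power function of the one-sided Gaussian test has a closed form, so it can be tabulated directly. A minimal sketch, assuming a $z$-test with known $\sigma$ and illustrative values $n = 25$, $\alpha = 0.05$ (the function name `power` is ours):

```python
from statistics import NormalDist

N = NormalDist()  # standard normal: cdf and inv_cdf
alpha, n, sigma, theta0 = 0.05, 25, 1.0, 0.0
z_a = N.inv_cdf(1 - alpha)

def power(theta):
    """beta(theta) for the test rejecting when sqrt(n)*(xbar - theta0)/sigma > z_alpha."""
    shift = (theta - theta0) * n ** 0.5 / sigma
    return 1 - N.cdf(z_a - shift)

for theta in [0.0, 0.2, 0.4, 0.6]:
    print(f"beta({theta:.1f}) = {power(theta):.3f}")
```

At $\theta = \theta_0$ the power equals the size $\alpha$; it then climbs toward 1 as $\theta$ moves away from the null.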
Main Theorems
Neyman-Pearson Lemma
Statement
For testing $H_0: \theta = \theta_0$ vs $H_1: \theta = \theta_1$ at level $\alpha$, the test that rejects when the likelihood ratio exceeds a threshold:

$$\phi(x) = \begin{cases} 1 & \text{if } \Lambda(x) = \dfrac{p_{\theta_1}(x)}{p_{\theta_0}(x)} > k \\ \gamma & \text{if } \Lambda(x) = k \\ 0 & \text{if } \Lambda(x) < k \end{cases}$$

where $k$ and $\gamma$ are chosen so that $\mathbb{E}_{\theta_0}[\phi(X)] = \alpha$, is the most powerful level-$\alpha$ test. That is, for any other test $\phi'$ with $\mathbb{E}_{\theta_0}[\phi'(X)] \le \alpha$:

$$\mathbb{E}_{\theta_1}[\phi(X)] \ge \mathbb{E}_{\theta_1}[\phi'(X)].$$
Intuition
The likelihood ratio $\Lambda(x) = p_{\theta_1}(x)/p_{\theta_0}(x)$ measures how much more likely the data is under $H_1$ than under $H_0$. Rejecting when this ratio is large is the optimal strategy for distinguishing the two hypotheses. Data points $x$ where $p_{\theta_1}(x)$ is much larger than $p_{\theta_0}(x)$ provide the strongest evidence against $H_0$, so the test allocates its rejection budget to these points first.
Proof Sketch
Let $\phi'$ be any level-$\alpha$ test. We need to show $\mathbb{E}_{\theta_1}[\phi(X)] \ge \mathbb{E}_{\theta_1}[\phi'(X)]$. Write

$$\mathbb{E}_{\theta_1}[\phi - \phi'] = \int (\phi - \phi')(p_{\theta_1} - k\,p_{\theta_0})\,dx + k \int (\phi - \phi')\,p_{\theta_0}\,dx.$$

By construction of $\phi$: when $p_{\theta_1} > k\,p_{\theta_0}$, $\phi = 1 \ge \phi'$; when $p_{\theta_1} < k\,p_{\theta_0}$, $\phi = 0 \le \phi'$. So the first integrand $(\phi - \phi')(p_{\theta_1} - k\,p_{\theta_0})$ is nonnegative everywhere. The second integral is $k(\alpha - \mathbb{E}_{\theta_0}[\phi'(X)]) \ge 0$ since $\phi'$ has level $\alpha$. Both terms are nonnegative, so $\mathbb{E}_{\theta_1}[\phi(X)] \ge \mathbb{E}_{\theta_1}[\phi'(X)]$.
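A quick numerical illustration of the lemma, assuming the pair $\mathcal{N}(0,1)$ vs $\mathcal{N}(1,1)$ with a single observation: the one-sided threshold test is the likelihood ratio test here, and a competing level-$\alpha$ test (a two-sided rule of the same size) has strictly lower power.

```python
from statistics import NormalDist

N = NormalDist()
alpha = 0.05
mu1 = 1.0  # simple alternative N(1, 1) against the null N(0, 1)

# The likelihood ratio exp(mu1*x - mu1**2/2) is increasing in x, so the
# Neyman-Pearson test rejects when x > z_alpha.
z_a = N.inv_cdf(1 - alpha)
power_np = 1 - N.cdf(z_a - mu1)

# A competing level-alpha test: reject when |x| > z_{alpha/2}. Same size,
# but it wastes rejection probability on the left tail.
z_a2 = N.inv_cdf(1 - alpha / 2)
power_other = (1 - N.cdf(z_a2 - mu1)) + N.cdf(-z_a2 - mu1)

print(f"NP test power:        {power_np:.4f}")
print(f"two-sided test power: {power_other:.4f}")
```

Both tests have exact size $\alpha = 0.05$ under the null; the lemma guarantees the threshold test cannot be beaten, and here it is not.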
Why It Matters
The Neyman-Pearson lemma is one of the cleanest optimality results in statistics. It says the optimal test statistic is the likelihood ratio, period. Every commonly used test (t-test, z-test, chi-squared test) can be understood as a likelihood ratio test for a specific distributional assumption.
Failure Mode
The lemma applies only to simple hypotheses (point null vs. point alternative). For composite hypotheses (e.g., $H_1: \theta > \theta_0$ or $H_1: \theta \ne \theta_0$), the most powerful test can depend on which alternative $\theta_1$ you want power against, and a uniformly most powerful test may not exist.
Uniformly Most Powerful Tests
UMP Tests via Monotone Likelihood Ratio
Statement
If the family $\{p_\theta\}$ has a monotone likelihood ratio in a statistic $T(x)$ (i.e., $p_{\theta_2}(x)/p_{\theta_1}(x)$ is nondecreasing in $T(x)$ for $\theta_1 < \theta_2$), then for testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$, the test that rejects for large $T(x)$:

$$\phi(x) = \begin{cases} 1 & \text{if } T(x) > c \\ \gamma & \text{if } T(x) = c \\ 0 & \text{if } T(x) < c \end{cases}$$

where $c$ and $\gamma$ give size $\alpha$, is uniformly most powerful (UMP). It has the highest power against every $\theta > \theta_0$ simultaneously.
Intuition
When the likelihood ratio is monotone in $T(x)$, the Neyman-Pearson test for any specific alternative $\theta_1 > \theta_0$ always rejects for large $T(x)$. Since the test does not depend on which $\theta_1$ we target, it is simultaneously most powerful against all alternatives on one side. Exponential families always have monotone likelihood ratio in their natural sufficient statistic.
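A small check of the monotone-likelihood-ratio property for the Gaussian location family (a sketch with $n = 1$, $\sigma = 1$): for every $\theta_1 > \theta_0 = 0$ the likelihood ratio is increasing in $x$, so every Neyman-Pearson test rejects on a region of the same form $\{x > c\}$.

```python
import math

def lr(x, theta0, theta1):
    """Likelihood ratio p_{theta1}(x) / p_{theta0}(x) for N(theta, 1)."""
    return math.exp(-0.5 * (x - theta1) ** 2 + 0.5 * (x - theta0) ** 2)

xs = [i / 10 for i in range(-30, 31)]  # grid of x values from -3 to 3
for theta1 in [0.5, 1.0, 2.0]:
    vals = [lr(x, 0.0, theta1) for x in xs]
    # strictly increasing in x for every theta1 > 0
    assert all(a < b for a, b in zip(vals, vals[1:]))
print("LR is increasing in x for every theta1 > 0: same rejection region")
```

This is exactly why the one-sided $z$-test is UMP: the rejection region never changes as the target alternative moves.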
Proof Sketch
For any $\theta_1 > \theta_0$, the Neyman-Pearson test rejects when $p_{\theta_1}(x)/p_{\theta_0}(x) > k_{\theta_1}$. By the monotone likelihood ratio property, this is equivalent to $T(x) > c_{\theta_1}$. But the size constraint $\mathbb{E}_{\theta_0}[\phi(X)] = \alpha$ determines the cutoff uniquely, so $c_{\theta_1} = c$ for all $\theta_1 > \theta_0$. The same test is most powerful for every $\theta_1$.
Why It Matters
UMP tests exist only in restricted settings (one-parameter families with one-sided alternatives). For two-sided alternatives or multiparameter families, UMP tests typically do not exist, and one must settle for locally most powerful or likelihood ratio tests.
Failure Mode
For two-sided alternatives ($H_1: \theta \ne \theta_0$), no UMP test exists in general. The Neyman-Pearson test for $\theta_1 > \theta_0$ differs from the test for $\theta_1 < \theta_0$. Common practice uses the two-sided likelihood ratio test, which is not UMP but is unbiased.
Generalized Likelihood Ratio Test
The Neyman-Pearson lemma assumes simple hypotheses. For composite hypotheses $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta \setminus \Theta_0$, the generalized likelihood ratio test (GLRT) replaces the two point likelihoods with suprema over each hypothesis.
Generalized Likelihood Ratio
The GLRT statistic for $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta \setminus \Theta_0$ is:

$$\lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta; x)}{\sup_{\theta \in \Theta} L(\theta; x)}.$$

The test rejects $H_0$ when $\lambda(x) \le c$ for some threshold $c$. Equivalently, it rejects when $-2 \log \lambda(x)$ is large.
The numerator is the best fit under the null constraint. The denominator is the best fit under the full model. A small ratio means the null constraint costs a lot of likelihood, which is evidence against $H_0$.
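As a concrete instance, the ratio can be computed by maximizing the log-likelihood under each hypothesis. A sketch for an exponential sample (the data values are made up for illustration), testing $H_0: \text{rate} = 1$ against a free rate:

```python
import math

# Made-up exponential sample; testing H0: rate = 1 against a free rate > 0.
x = [0.5, 1.7, 0.3, 2.2, 0.9, 1.1, 0.4, 1.6]
n, xbar = len(x), sum(x) / len(x)

def loglik(rate):
    # log-likelihood of an i.i.d. Exponential(rate) sample
    return n * math.log(rate) - rate * n * xbar

rate0 = 1.0          # restricted: the single value allowed under H0
rate_hat = 1 / xbar  # unrestricted MLE

lam = math.exp(loglik(rate0) - loglik(rate_hat))  # GLRT ratio, in (0, 1]
stat = -2 * math.log(lam)
print(f"lambda = {lam:.4f}, -2 log lambda = {stat:.4f}")
```

Here $\lambda$ is close to 1: the null fits nearly as well as the full model, so there is little evidence against $H_0$.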
Wilks' Theorem
Statement
Under $H_0$ and standard regularity, the GLRT statistic satisfies:

$$-2 \log \lambda(X_1, \dots, X_n) \xrightarrow{d} \chi^2_r$$

as $n \to \infty$, where $r$ is the number of restrictions imposed by the null.
Intuition
Near the true parameter, the log-likelihood is approximately quadratic by a Taylor expansion, with Hessian proportional to the Fisher information. The maximum over the restricted subspace differs from the unrestricted maximum by a quadratic form in an $r$-dimensional component of the Gaussian-limit score. That quadratic form is chi-squared with $r$ degrees of freedom.
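The chi-squared limit can be checked by simulation. A sketch assuming the Gaussian-mean GLRT, where $-2\log\lambda = n\bar{X}^2$ exactly and $r = 1$: under $H_0$ the rejection rate at the $\chi^2_1$ critical value should sit near the nominal 5%.

```python
import random
from statistics import NormalDist

random.seed(0)
N = NormalDist()
n, reps = 50, 20000

# For X_i ~ N(mu, 1), testing H0: mu = 0 against a free mu, the GLRT
# statistic reduces to -2 log lambda = n * xbar^2 (one restriction, r = 1).
# chi^2_1 is the square of a standard normal, so its 0.95-quantile is z_{0.975}^2.
crit = N.inv_cdf(0.975) ** 2  # ~ 3.84

rejections = 0
for _ in range(reps):
    xbar = sum(random.gauss(0, 1) for _ in range(n)) / n
    rejections += n * xbar ** 2 > crit

rate = rejections / reps
print(f"empirical size: {rate:.3f} (nominal 0.05)")
```

Because $\sqrt{n}\,\bar{X}$ is exactly standard normal here, the agreement is exact up to Monte Carlo noise; in less tidy models it holds only asymptotically.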
Why It Matters
Wilks' theorem gives an asymptotic null distribution for likelihood ratio tests without requiring a closed-form finite-sample distribution. Every regression $F$-test, every goodness-of-fit test, and every nested model comparison in practice uses a chi-squared calibration derived from this result.
Failure Mode
Regularity fails when the true parameter lies on the boundary of the parameter space (e.g., testing that a variance component equals zero), when the model is not identified under the null, or when parameters are unbounded. In these cases the limiting distribution is a mixture of chi-squared distributions or something non-standard.
Asymptotic Equivalence of the Three Classical Tests
Three test statistics are commonly used for the same composite null in a regular parametric model:
- Likelihood ratio (Wilks): $T_{\mathrm{LR}} = 2\,[\ell(\hat\theta) - \ell(\hat\theta_0)]$, where $\hat\theta$ is the unrestricted and $\hat\theta_0$ the restricted MLE
- Wald: $T_{\mathrm{W}} = n\,(\hat\theta - \theta_0)^\top I(\hat\theta)\,(\hat\theta - \theta_0)$ (shown for the point null $H_0: \theta = \theta_0$)
- Score (Rao): $T_{\mathrm{S}} = \tfrac{1}{n}\, S(\hat\theta_0)^\top I(\hat\theta_0)^{-1} S(\hat\theta_0)$, where $S(\theta) = \nabla_\theta \ell(\theta)$ is the score

Under $H_0$ and regularity, all three converge in distribution to $\chi^2_r$ with $r$ equal to the codimension of $\Theta_0$. They differ at finite samples: Wald uses the unrestricted MLE, the score test uses only the restricted estimator, and the LRT uses both. Under local alternatives $\theta_n = \theta_0 + h/\sqrt{n}$, the three statistics have the same noncentral $\chi^2_r$ limit with noncentrality $h^\top I(\theta_0)\, h$, but can disagree sharply in the non-local regime and under misspecification. See also asymptotic relative efficiency for how these comparisons extend to general test sequences.
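The three statistics can be compared on a Bernoulli example (a sketch with made-up counts, testing $H_0: p = 0.5$): all three land close together, and all are referred to the same $\chi^2_1$ calibration.

```python
import math

# Bernoulli(p): n trials, k successes (made-up counts); test H0: p = 0.5 (r = 1).
n, k, p0 = 100, 58, 0.5
phat = k / n  # unrestricted MLE

def loglik(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

T_lr = 2 * (loglik(phat) - loglik(p0))                # likelihood ratio
T_wald = n * (phat - p0) ** 2 / (phat * (1 - phat))   # Fisher info at the MLE
T_score = n * (phat - p0) ** 2 / (p0 * (1 - p0))      # Fisher info at the null

print(f"LR = {T_lr:.3f}, Wald = {T_wald:.3f}, Score = {T_score:.3f}")
# All three are compared to the chi^2_1 critical value ~3.84 at alpha = 0.05.
```

The only difference between Wald and score here is where the Fisher information is evaluated; the LR statistic sits between behaviors because it uses both fits.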
Sample Size and Power Calculation
Before running an experiment, one fixes the Type I error $\alpha$ and the minimum effect $\delta$ worth detecting, then asks: what sample size $n$ is needed to achieve power $1 - \beta$?
For a one-sided $z$-test of $H_0: \mu = \mu_0$ versus $H_1: \mu = \mu_0 + \delta$ with known variance $\sigma^2$ and effect size $\delta > 0$, the power of the Neyman-Pearson test at sample size $n$ is:

$$1 - \beta = \Phi\!\left(\frac{\delta\sqrt{n}}{\sigma} - z_\alpha\right).$$

Setting this equal to the target power and solving for $n$ gives:

$$n = \frac{(z_\alpha + z_\beta)^2\,\sigma^2}{\delta^2},$$

where $z_\alpha = \Phi^{-1}(1 - \alpha)$ and $z_\beta = \Phi^{-1}(1 - \beta)$.
This is the operational formula used in power analysis. Three readings:
- Required $n$ scales as $1/\delta^2$. Halving the detectable effect quadruples the sample size.
- Required $n$ scales with $\sigma^2$. Variance reduction (blocking, covariates, paired designs) pays off quadratically.
- The factor $(z_\alpha + z_\beta)^2$ is about $6.2$ for $\alpha = 0.05$ and power $0.80$, and about $8.6$ for $\alpha = 0.05$ and power $0.90$. Moving from 80% to 90% power costs roughly 39% more observations.
For a two-sided test, replace $z_\alpha$ with $z_{\alpha/2}$.
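The formula translates directly into a small calculator; a minimal sketch (the function name and defaults are ours, not from any library):

```python
import math
from statistics import NormalDist

N = NormalDist()

def sample_size(delta, sigma, alpha=0.05, power=0.80, two_sided=False):
    """Smallest n with n >= (z_alpha + z_beta)^2 * sigma^2 / delta^2."""
    z_a = N.inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_b = N.inv_cdf(power)
    return math.ceil((z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

print(sample_size(delta=0.5, sigma=1.0))              # one-sided, 80% power
print(sample_size(delta=0.5, sigma=1.0, power=0.90))  # 90% power costs more
print(sample_size(delta=0.25, sigma=1.0))             # halving delta ~quadruples n
```

Rounding up with `ceil` guarantees the achieved power is at least the target.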
Clinical trial sizing
Suppose a trial compares a new drug to placebo on a continuous outcome with standard deviation $\sigma = 10$, and the minimum clinically meaningful effect is $\delta = 5$ (illustrative numbers). Targeting $\alpha = 0.05$ (one-sided) and power $1 - \beta = 0.80$:

$$n = \frac{(1.645 + 0.842)^2 \cdot 10^2}{5^2} \approx 24.7 \;\Rightarrow\; 25 \text{ per arm.}$$

Shrinking the detectable effect to $\delta = 2.5$ would require about $99$ per arm.
Connection to Binary Classification
Hypothesis testing and binary classification solve the same problem: given an observation $x$, decide between two classes. The Neyman-Pearson lemma says the optimal decision boundary is a level set of the likelihood ratio $p_1(x)/p_0(x)$. This is equivalent to the Bayes-optimal classifier when the class priors are adjusted to match the significance level.
Specifically: the ROC curve of the likelihood ratio classifier dominates the ROC curve of any other classifier. Every point on the ROC curve corresponds to a Neyman-Pearson test at a different level $\alpha$.
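The correspondence can be traced numerically. A sketch for the pair $\mathcal{N}(0,1)$ vs $\mathcal{N}(1,1)$: sweeping the threshold of the likelihood-ratio classifier traces the ROC curve point by point (each point is a Neyman-Pearson test at some level $\alpha$), and the area under it matches the closed form $\Phi(d/\sqrt{2})$ known for this Gaussian pair.

```python
from statistics import NormalDist

N = NormalDist()
d = 1.0  # class-conditionals N(0,1) and N(d,1); the LR is monotone in x

# Each threshold c gives one Neyman-Pearson test: FPR = alpha, TPR = power.
pts = [(1 - N.cdf(c), 1 - N.cdf(c - d)) for c in [i / 100 - 5 for i in range(1001)]]
pts.sort()  # ascending in FPR

# Trapezoid-rule AUC; theory gives AUC = Phi(d / sqrt(2)) for this pair.
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
print(f"AUC ~ {auc:.4f}, theory {N.cdf(d / 2 ** 0.5):.4f}")
```

Any other classifier's ROC point at the same FPR sits on or below this curve, by the lemma.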
Common Confusions
Power is not 1 minus the p-value
The p-value is a random variable computed from data. Power is a fixed property of the test design, computed before seeing data. Power is $\mathbb{P}_{\theta_1}(\text{reject } H_0)$ for a specific alternative $\theta_1$. The p-value is $\mathbb{P}_{\theta_0}(\text{data at least as extreme as observed})$. They measure different things.
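The distinction shows up cleanly in simulation. A sketch using a one-observation $z$-test with $H_0: \theta = 0$ and alternative $\theta_1 = 1$: the p-value varies from dataset to dataset, while power is a single number fixed by the design; the fraction of p-values falling below $\alpha$ estimates that power.

```python
import random
from statistics import NormalDist

random.seed(1)
N = NormalDist()

# One-sided z-test with a single observation X ~ N(theta, 1), H0: theta = 0.
alpha, theta1 = 0.05, 1.0
power = 1 - N.cdf(N.inv_cdf(1 - alpha) - theta1)  # fixed, known in advance

# p-value 1 - Phi(X) is random: a new one for every simulated dataset.
pvals = [1 - N.cdf(random.gauss(theta1, 1)) for _ in range(20000)]
reject_rate = sum(p < alpha for p in pvals) / len(pvals)
print(f"power = {power:.3f}, observed rejection rate = {reject_rate:.3f}")
```

Under the null the same p-values would be uniform on $[0, 1]$; under the alternative they pile up near 0 at exactly the rate the power predicts.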
A test can be most powerful and still have low power
The Neyman-Pearson lemma says the likelihood ratio test is the best among all level-$\alpha$ tests. It does not say the power is high. If the sample size is small or $\theta_1$ is close to $\theta_0$, even the most powerful test may have low power. "Most powerful" is a relative statement, not an absolute one.
Canonical Examples
Testing a Gaussian mean
Let $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$. Test $H_0: \mu = \mu_0$ vs $H_1: \mu = \mu_1 > \mu_0$. The likelihood ratio is $\exp\!\big(n(\mu_1 - \mu_0)\bar{X}/\sigma^2 - n(\mu_1^2 - \mu_0^2)/(2\sigma^2)\big)$, which is monotone increasing in $\bar{X}$. The Neyman-Pearson test rejects when $\bar{X} > \mu_0 + z_\alpha\,\sigma/\sqrt{n}$, where $z_\alpha = \Phi^{-1}(1 - \alpha)$ is the standard normal quantile. Power at $\mu_1$ is $\Phi\!\big(\sqrt{n}(\mu_1 - \mu_0)/\sigma - z_\alpha\big)$. For $\sqrt{n}(\mu_1 - \mu_0)/\sigma = 2.5$ and $\alpha = 0.05$, power is $\Phi(2.5 - 1.645) \approx 0.80$.
Summary
- The Neyman-Pearson lemma: the likelihood ratio test is the most powerful test for simple hypotheses
- UMP tests exist for one-sided alternatives in exponential families via the monotone likelihood ratio
- The power function characterizes a test across the entire parameter space
- The ROC curve of the likelihood ratio classifier dominates all other classifiers
- UMP tests do not exist for two-sided alternatives in general
Exercises
Problem
Let $X \sim \mathcal{N}(\theta, 1)$ with a single observation. For testing $H_0: \theta = 0$ versus $H_1: \theta = \theta_1$ (for a fixed $\theta_1 > 0$) at level $\alpha$, write down the Neyman-Pearson test and compute its power.
Problem
Prove that for testing $H_0: \theta = 0$ vs $H_1: \theta \ne 0$ with $X \sim \mathcal{N}(\theta, 1)$ (single observation), no UMP level-$\alpha$ test exists.
References
Canonical:
- Lehmann & Romano, Testing Statistical Hypotheses (3rd ed., 2005), Chapters 3-4
- Casella & Berger, Statistical Inference (2nd ed., 2002), Chapter 8
Current:
- Wasserman, All of Statistics (2004), Chapter 10
- van der Vaart, Asymptotic Statistics (1998), Chapters 14-15 (LAN, asymptotic testing, LRT, Wilks, Wald, Rao)
- Keener, Theoretical Statistics (2010), Chapters 3-8
Next Topics
- Hypothesis testing for ML: multiple testing, A/B testing, and model comparison
- Bootstrap methods: nonparametric alternatives to parametric tests
Last reviewed: April 26, 2026