Statistical Foundations
Neyman-Pearson and Hypothesis Testing Theory
The likelihood ratio test is the most powerful test for simple hypotheses (Neyman-Pearson lemma), UMP tests extend this to one-sided composites, and the power function characterizes a test's behavior across the parameter space.
Why This Matters
Hypothesis testing is the formal framework for making binary decisions from data: is the drug effective? Is the model better than the baseline? The Neyman-Pearson lemma answers a precise optimization question: among all tests with Type I error at most $\alpha$, which test has the highest probability of correctly rejecting a false null hypothesis?
The answer is the likelihood ratio test. This result is the foundation for understanding power analysis, sample size calculations, and the deep connection between hypothesis testing and binary classification.
Formal Setup
Hypothesis Test
A hypothesis test of $H_0$ versus $H_1$ is a function $\phi: \mathcal{X} \to [0, 1]$, where $\phi(x)$ is the probability of rejecting $H_0$ given data $x$. The size (Type I error rate) is $\sup_{\theta \in \Theta_0} \mathbb{E}_\theta[\phi(X)]$. The power against an alternative $\theta_1$ is $\mathbb{E}_{\theta_1}[\phi(X)]$.
A test has level $\alpha$ if its size is at most $\alpha$.
Power Function
The power function of a test $\phi$ is:

$$\beta(\theta) = \mathbb{E}_\theta[\phi(X)] = \mathbb{P}_\theta(\text{reject } H_0).$$

For a good test, $\beta(\theta_0) \le \alpha$ (controlled Type I error) and $\beta(\theta)$ is large for $\theta$ far from $\theta_0$ (high power against alternatives).
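The power function of the one-sided Gaussian test has a closed form, so it can be tabulated directly. A minimal sketch, assuming a $z$-test with known $\sigma$ and illustrative values $n = 25$, $\alpha = 0.05$ (the function name `power` is ours):

```python
from statistics import NormalDist

N = NormalDist()  # standard normal: cdf and inv_cdf
alpha, n, sigma, theta0 = 0.05, 25, 1.0, 0.0
z_a = N.inv_cdf(1 - alpha)

def power(theta):
    """beta(theta) for the test rejecting when sqrt(n)*(xbar - theta0)/sigma > z_alpha."""
    shift = (theta - theta0) * n ** 0.5 / sigma
    return 1 - N.cdf(z_a - shift)

for theta in [0.0, 0.2, 0.4, 0.6]:
    print(f"beta({theta:.1f}) = {power(theta):.3f}")
```

At $\theta = \theta_0$ the power equals the size $\alpha$; it then climbs toward 1 as $\theta$ moves away from the null.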
Main Theorems
Neyman-Pearson Lemma
Statement
For testing $H_0: \theta = \theta_0$ vs $H_1: \theta = \theta_1$ at level $\alpha$, the test that rejects when the likelihood ratio exceeds a threshold:

$$\phi(x) = \begin{cases} 1 & \text{if } \Lambda(x) = \dfrac{p_{\theta_1}(x)}{p_{\theta_0}(x)} > k \\ \gamma & \text{if } \Lambda(x) = k \\ 0 & \text{if } \Lambda(x) < k \end{cases}$$

where $k$ and $\gamma$ are chosen so that $\mathbb{E}_{\theta_0}[\phi(X)] = \alpha$, is the most powerful level-$\alpha$ test. That is, for any other test $\phi'$ with $\mathbb{E}_{\theta_0}[\phi'(X)] \le \alpha$:

$$\mathbb{E}_{\theta_1}[\phi(X)] \ge \mathbb{E}_{\theta_1}[\phi'(X)].$$
Intuition
The likelihood ratio $\Lambda(x) = p_{\theta_1}(x)/p_{\theta_0}(x)$ measures how much more likely the data is under $H_1$ than under $H_0$. Rejecting when this ratio is large is the optimal strategy for distinguishing the two hypotheses. Data points $x$ where $p_{\theta_1}(x)$ is much larger than $p_{\theta_0}(x)$ provide the strongest evidence against $H_0$, so the test allocates its rejection budget to these points first.
Proof Sketch
Let $\phi'$ be any level-$\alpha$ test. We need to show $\mathbb{E}_{\theta_1}[\phi(X)] \ge \mathbb{E}_{\theta_1}[\phi'(X)]$. Write

$$\mathbb{E}_{\theta_1}[\phi - \phi'] = \int (\phi - \phi')(p_{\theta_1} - k\,p_{\theta_0})\,dx + k \int (\phi - \phi')\,p_{\theta_0}\,dx.$$

By construction of $\phi$: when $p_{\theta_1} > k\,p_{\theta_0}$, $\phi = 1 \ge \phi'$; when $p_{\theta_1} < k\,p_{\theta_0}$, $\phi = 0 \le \phi'$. So the first integrand $(\phi - \phi')(p_{\theta_1} - k\,p_{\theta_0})$ is nonnegative everywhere. The second integral is $k(\alpha - \mathbb{E}_{\theta_0}[\phi'(X)]) \ge 0$ since $\phi'$ has level $\alpha$. Both terms are nonnegative, so $\mathbb{E}_{\theta_1}[\phi(X)] \ge \mathbb{E}_{\theta_1}[\phi'(X)]$.
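A quick numerical illustration of the lemma, assuming the pair $\mathcal{N}(0,1)$ vs $\mathcal{N}(1,1)$ with a single observation: the one-sided threshold test is the likelihood ratio test here, and a competing level-$\alpha$ test (a two-sided rule of the same size) has strictly lower power.

```python
from statistics import NormalDist

N = NormalDist()
alpha = 0.05
mu1 = 1.0  # simple alternative N(1, 1) against the null N(0, 1)

# The likelihood ratio exp(mu1*x - mu1**2/2) is increasing in x, so the
# Neyman-Pearson test rejects when x > z_alpha.
z_a = N.inv_cdf(1 - alpha)
power_np = 1 - N.cdf(z_a - mu1)

# A competing level-alpha test: reject when |x| > z_{alpha/2}. Same size,
# but it wastes rejection probability on the left tail.
z_a2 = N.inv_cdf(1 - alpha / 2)
power_other = (1 - N.cdf(z_a2 - mu1)) + N.cdf(-z_a2 - mu1)

print(f"NP test power:        {power_np:.4f}")
print(f"two-sided test power: {power_other:.4f}")
```

Both tests have exact size $\alpha = 0.05$ under the null; the lemma guarantees the threshold test cannot be beaten, and here it is not.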
Why It Matters
The Neyman-Pearson lemma is one of the cleanest optimality results in statistics. It says the optimal test statistic is the likelihood ratio, period. Every commonly used test (t-test, z-test, chi-squared test) can be understood as a likelihood ratio test for a specific distributional assumption.
Failure Mode
The lemma applies only to simple hypotheses (point null vs. point alternative). For composite hypotheses (e.g., $H_1: \theta > \theta_0$ or $H_1: \theta \ne \theta_0$), the most powerful test can depend on which alternative $\theta_1$ you want power against, and a uniformly most powerful test may not exist.
Uniformly Most Powerful Tests
UMP Tests via Monotone Likelihood Ratio
Statement
If the family $\{p_\theta\}$ has a monotone likelihood ratio in a statistic $T(x)$ (i.e., $p_{\theta_2}(x)/p_{\theta_1}(x)$ is nondecreasing in $T(x)$ for $\theta_1 < \theta_2$), then for testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$, the test that rejects for large $T(x)$:

$$\phi(x) = \begin{cases} 1 & \text{if } T(x) > c \\ \gamma & \text{if } T(x) = c \\ 0 & \text{if } T(x) < c \end{cases}$$

where $c$ and $\gamma$ give size $\alpha$, is uniformly most powerful (UMP). It has the highest power against every $\theta > \theta_0$ simultaneously.
Intuition
When the likelihood ratio is monotone in $T(x)$, the Neyman-Pearson test for any specific alternative $\theta_1 > \theta_0$ always rejects for large $T(x)$. Since the test does not depend on which $\theta_1$ we target, it is simultaneously most powerful against all alternatives on one side. Exponential families always have monotone likelihood ratio in their natural sufficient statistic.
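A small check of the monotone-likelihood-ratio property for the Gaussian location family (a sketch with $n = 1$, $\sigma = 1$): for every $\theta_1 > \theta_0 = 0$ the likelihood ratio is increasing in $x$, so every Neyman-Pearson test rejects on a region of the same form $\{x > c\}$.

```python
import math

def lr(x, theta0, theta1):
    """Likelihood ratio p_{theta1}(x) / p_{theta0}(x) for N(theta, 1)."""
    return math.exp(-0.5 * (x - theta1) ** 2 + 0.5 * (x - theta0) ** 2)

xs = [i / 10 for i in range(-30, 31)]  # grid of x values from -3 to 3
for theta1 in [0.5, 1.0, 2.0]:
    vals = [lr(x, 0.0, theta1) for x in xs]
    # strictly increasing in x for every theta1 > 0
    assert all(a < b for a, b in zip(vals, vals[1:]))
print("LR is increasing in x for every theta1 > 0: same rejection region")
```

This is exactly why the one-sided $z$-test is UMP: the rejection region never changes as the target alternative moves.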
Proof Sketch
For any $\theta_1 > \theta_0$, the Neyman-Pearson test rejects when $p_{\theta_1}(x)/p_{\theta_0}(x) > k_{\theta_1}$. By the monotone likelihood ratio property, this is equivalent to $T(x) > c_{\theta_1}$. But the size constraint $\mathbb{E}_{\theta_0}[\phi(X)] = \alpha$ determines the cutoff uniquely, so $c_{\theta_1} = c$ for all $\theta_1 > \theta_0$. The same test is most powerful for every $\theta_1$.
Why It Matters
UMP tests exist only in restricted settings (one-parameter families with one-sided alternatives). For two-sided alternatives or multiparameter families, UMP tests typically do not exist, and one must settle for locally most powerful or likelihood ratio tests.
Failure Mode
For two-sided alternatives ($H_1: \theta \ne \theta_0$), no UMP test exists in general. The Neyman-Pearson test for $\theta_1 > \theta_0$ differs from the test for $\theta_1 < \theta_0$. Common practice uses the two-sided likelihood ratio test, which is not UMP but is unbiased.
Generalized Likelihood Ratio Test
The Neyman-Pearson lemma assumes simple hypotheses. For composite hypotheses $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta \setminus \Theta_0$, the generalized likelihood ratio test (GLRT) replaces the two point likelihoods with suprema over each hypothesis.
Generalized Likelihood Ratio
The GLRT statistic for $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta \setminus \Theta_0$ is:

$$\lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta; x)}{\sup_{\theta \in \Theta} L(\theta; x)}.$$

The test rejects $H_0$ when $\lambda(x) \le c$ for some threshold $c$. Equivalently, it rejects when $-2 \log \lambda(x)$ is large.
The numerator is the best fit under the null constraint. The denominator is the best fit under the full model. A small ratio means the null constraint costs a lot of likelihood, which is evidence against $H_0$.
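As a concrete instance, the ratio can be computed by maximizing the log-likelihood under each hypothesis. A sketch for an exponential sample (the data values are made up for illustration), testing $H_0: \text{rate} = 1$ against a free rate:

```python
import math

# Made-up exponential sample; testing H0: rate = 1 against a free rate > 0.
x = [0.5, 1.7, 0.3, 2.2, 0.9, 1.1, 0.4, 1.6]
n, xbar = len(x), sum(x) / len(x)

def loglik(rate):
    # log-likelihood of an i.i.d. Exponential(rate) sample
    return n * math.log(rate) - rate * n * xbar

rate0 = 1.0          # restricted: the single value allowed under H0
rate_hat = 1 / xbar  # unrestricted MLE

lam = math.exp(loglik(rate0) - loglik(rate_hat))  # GLRT ratio, in (0, 1]
stat = -2 * math.log(lam)
print(f"lambda = {lam:.4f}, -2 log lambda = {stat:.4f}")
```

Here $\lambda$ is close to 1: the null fits nearly as well as the full model, so there is little evidence against $H_0$.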
Wilks' Theorem
Statement
Under $H_0$ and standard regularity, the GLRT statistic satisfies:

$$-2 \log \lambda(X_1, \dots, X_n) \xrightarrow{d} \chi^2_r$$

as $n \to \infty$, where $r$ is the number of restrictions imposed by the null.
Intuition
Near the true parameter, the log-likelihood is approximately quadratic by a Taylor expansion, with Hessian proportional to the Fisher information. The maximum over the restricted subspace differs from the unrestricted maximum by a quadratic form in an $r$-dimensional component of the Gaussian-limit score. That quadratic form is chi-squared with $r$ degrees of freedom.
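The chi-squared limit can be checked by simulation. A sketch assuming the Gaussian-mean GLRT, where $-2\log\lambda = n\bar{X}^2$ exactly and $r = 1$: under $H_0$ the rejection rate at the $\chi^2_1$ critical value should sit near the nominal 5%.

```python
import random
from statistics import NormalDist

random.seed(0)
N = NormalDist()
n, reps = 50, 20000

# For X_i ~ N(mu, 1), testing H0: mu = 0 against a free mu, the GLRT
# statistic reduces to -2 log lambda = n * xbar^2 (one restriction, r = 1).
# chi^2_1 is the square of a standard normal, so its 0.95-quantile is z_{0.975}^2.
crit = N.inv_cdf(0.975) ** 2  # ~ 3.84

rejections = 0
for _ in range(reps):
    xbar = sum(random.gauss(0, 1) for _ in range(n)) / n
    rejections += n * xbar ** 2 > crit

rate = rejections / reps
print(f"empirical size: {rate:.3f} (nominal 0.05)")
```

Because $\sqrt{n}\,\bar{X}$ is exactly standard normal here, the agreement is exact up to Monte Carlo noise; in less tidy models it holds only asymptotically.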
Why It Matters
Wilks' theorem gives an asymptotic null distribution for likelihood ratio tests without requiring a closed-form finite-sample distribution. Every regression $F$-test, every goodness-of-fit test, and every nested model comparison in practice uses a chi-squared calibration derived from this result.
Failure Mode
Regularity fails when the true parameter lies on the boundary of the parameter space (e.g., testing that a variance component equals zero), when the model is not identified under the null, or when parameters are unbounded. In these cases the limiting distribution is a mixture of chi-squared distributions or something non-standard.
Asymptotic Equivalence of the Three Classical Tests
Three test statistics are commonly used for the same composite null in a regular parametric model:
- Likelihood ratio (Wilks): $T_{\mathrm{LR}} = 2\,[\ell(\hat\theta) - \ell(\hat\theta_0)]$, where $\hat\theta$ is the unrestricted and $\hat\theta_0$ the restricted MLE
- Wald: $T_{\mathrm{W}} = n\,(\hat\theta - \theta_0)^\top I(\hat\theta)\,(\hat\theta - \theta_0)$ (shown for the point null $H_0: \theta = \theta_0$)
- Score (Rao): $T_{\mathrm{S}} = \tfrac{1}{n}\, S(\hat\theta_0)^\top I(\hat\theta_0)^{-1} S(\hat\theta_0)$, where $S(\theta) = \nabla_\theta \ell(\theta)$ is the score

Under $H_0$ and regularity, all three converge in distribution to $\chi^2_r$ with $r$ equal to the codimension of $\Theta_0$. They differ at finite samples: Wald uses the unrestricted MLE, the score test uses only the restricted estimator, and the LRT uses both. Under local alternatives $\theta_n = \theta_0 + h/\sqrt{n}$, the three statistics have the same noncentral $\chi^2_r$ limit with noncentrality $h^\top I(\theta_0)\, h$, but can disagree sharply in the non-local regime and under misspecification. See also asymptotic relative efficiency for how these comparisons extend to general test sequences.
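The three statistics can be compared on a Bernoulli example (a sketch with made-up counts, testing $H_0: p = 0.5$): all three land close together, and all are referred to the same $\chi^2_1$ calibration.

```python
import math

# Bernoulli(p): n trials, k successes (made-up counts); test H0: p = 0.5 (r = 1).
n, k, p0 = 100, 58, 0.5
phat = k / n  # unrestricted MLE

def loglik(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

T_lr = 2 * (loglik(phat) - loglik(p0))                # likelihood ratio
T_wald = n * (phat - p0) ** 2 / (phat * (1 - phat))   # Fisher info at the MLE
T_score = n * (phat - p0) ** 2 / (p0 * (1 - p0))      # Fisher info at the null

print(f"LR = {T_lr:.3f}, Wald = {T_wald:.3f}, Score = {T_score:.3f}")
# All three are compared to the chi^2_1 critical value ~3.84 at alpha = 0.05.
```

The only difference between Wald and score here is where the Fisher information is evaluated; the LR statistic sits between behaviors because it uses both fits.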
Sample Size and Power Calculation
Before running an experiment, one fixes the Type I error $\alpha$ and the minimum effect $\delta$ worth detecting, then asks: what sample size $n$ is needed to achieve power $1 - \beta$?
For a one-sided $z$-test of $H_0: \mu = \mu_0$ versus $H_1: \mu = \mu_0 + \delta$ with known variance $\sigma^2$ and effect size $\delta > 0$, the power of the Neyman-Pearson test at sample size $n$ is:

$$1 - \beta = \Phi\!\left(\frac{\delta\sqrt{n}}{\sigma} - z_\alpha\right).$$

Setting this equal to the target power and solving for $n$ gives:

$$n = \frac{(z_\alpha + z_\beta)^2\,\sigma^2}{\delta^2},$$

where $z_\alpha = \Phi^{-1}(1 - \alpha)$ and $z_\beta = \Phi^{-1}(1 - \beta)$.
This is the operational formula used in power analysis. Three readings:
- Required $n$ scales as $1/\delta^2$. Halving the detectable effect quadruples the sample size.
- Required $n$ scales with $\sigma^2$. Variance reduction (blocking, covariates, paired designs) pays off quadratically.
- The factor $(z_\alpha + z_\beta)^2$ is about $6.2$ for $\alpha = 0.05$ and power $0.80$, and about $8.6$ for $\alpha = 0.05$ and power $0.90$. Moving from 80% to 90% power costs roughly 39% more observations.
For a two-sided test, replace $z_\alpha$ with $z_{\alpha/2}$.
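The formula translates directly into a small calculator; a minimal sketch (the function name and defaults are ours, not from any library):

```python
import math
from statistics import NormalDist

N = NormalDist()

def sample_size(delta, sigma, alpha=0.05, power=0.80, two_sided=False):
    """Smallest n with n >= (z_alpha + z_beta)^2 * sigma^2 / delta^2."""
    z_a = N.inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_b = N.inv_cdf(power)
    return math.ceil((z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

print(sample_size(delta=0.5, sigma=1.0))              # one-sided, 80% power
print(sample_size(delta=0.5, sigma=1.0, power=0.90))  # 90% power costs more
print(sample_size(delta=0.25, sigma=1.0))             # halving delta ~quadruples n
```

Rounding up with `ceil` guarantees the achieved power is at least the target.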
Clinical trial sizing
Suppose a trial compares a new drug to placebo on a continuous outcome with standard deviation $\sigma = 10$, and the minimum clinically meaningful effect is $\delta = 5$ (illustrative numbers). Targeting $\alpha = 0.05$ (one-sided) and power $1 - \beta = 0.80$:

$$n = \frac{(1.645 + 0.842)^2 \cdot 10^2}{5^2} \approx 24.7 \;\Rightarrow\; 25 \text{ per arm.}$$

Shrinking the detectable effect to $\delta = 2.5$ would require about $99$ per arm.
Connection to Binary Classification
Hypothesis testing and binary classification solve the same problem: given an observation $x$, decide between two classes. The Neyman-Pearson lemma says the optimal decision boundary is a level set of the likelihood ratio $p_1(x)/p_0(x)$. This is equivalent to the Bayes-optimal classifier when the class priors are adjusted to match the significance level.
Specifically: the ROC curve of the likelihood ratio classifier dominates the ROC curve of any other classifier. Every point on the ROC curve corresponds to a Neyman-Pearson test at a different level $\alpha$.
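The correspondence can be traced numerically. A sketch for the pair $\mathcal{N}(0,1)$ vs $\mathcal{N}(1,1)$: sweeping the threshold of the likelihood-ratio classifier traces the ROC curve point by point (each point is a Neyman-Pearson test at some level $\alpha$), and the area under it matches the closed form $\Phi(d/\sqrt{2})$ known for this Gaussian pair.

```python
from statistics import NormalDist

N = NormalDist()
d = 1.0  # class-conditionals N(0,1) and N(d,1); the LR is monotone in x

# Each threshold c gives one Neyman-Pearson test: FPR = alpha, TPR = power.
pts = [(1 - N.cdf(c), 1 - N.cdf(c - d)) for c in [i / 100 - 5 for i in range(1001)]]
pts.sort()  # ascending in FPR

# Trapezoid-rule AUC; theory gives AUC = Phi(d / sqrt(2)) for this pair.
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
print(f"AUC ~ {auc:.4f}, theory {N.cdf(d / 2 ** 0.5):.4f}")
```

Any other classifier's ROC point at the same FPR sits on or below this curve, by the lemma.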
Common Confusions
Power is not 1 minus the p-value
The p-value is a random variable computed from data. Power is a fixed property of the test design, computed before seeing data. Power is $\mathbb{P}_{\theta_1}(\text{reject } H_0)$ for a specific alternative $\theta_1$. The p-value is $\mathbb{P}_{\theta_0}(\text{data at least as extreme as observed})$. They measure different things.
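The distinction shows up cleanly in simulation. A sketch using a one-observation $z$-test with $H_0: \theta = 0$ and alternative $\theta_1 = 1$: the p-value varies from dataset to dataset, while power is a single number fixed by the design; the fraction of p-values falling below $\alpha$ estimates that power.

```python
import random
from statistics import NormalDist

random.seed(1)
N = NormalDist()

# One-sided z-test with a single observation X ~ N(theta, 1), H0: theta = 0.
alpha, theta1 = 0.05, 1.0
power = 1 - N.cdf(N.inv_cdf(1 - alpha) - theta1)  # fixed, known in advance

# p-value 1 - Phi(X) is random: a new one for every simulated dataset.
pvals = [1 - N.cdf(random.gauss(theta1, 1)) for _ in range(20000)]
reject_rate = sum(p < alpha for p in pvals) / len(pvals)
print(f"power = {power:.3f}, observed rejection rate = {reject_rate:.3f}")
```

Under the null the same p-values would be uniform on $[0, 1]$; under the alternative they pile up near 0 at exactly the rate the power predicts.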
A test can be most powerful and still have low power
The Neyman-Pearson lemma says the likelihood ratio test is the best among all level-$\alpha$ tests. It does not say the power is high. If the sample size is small or $\theta_1$ is close to $\theta_0$, even the most powerful test may have low power. "Most powerful" is a relative statement, not an absolute one.
Canonical Examples
Testing a Gaussian mean
Let $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$. Test $H_0: \mu = \mu_0$ vs $H_1: \mu = \mu_1 > \mu_0$. The likelihood ratio is $\exp\!\big(n(\mu_1 - \mu_0)\bar{X}/\sigma^2 - n(\mu_1^2 - \mu_0^2)/(2\sigma^2)\big)$, which is monotone increasing in $\bar{X}$. The Neyman-Pearson test rejects when $\bar{X} > \mu_0 + z_\alpha\,\sigma/\sqrt{n}$, where $z_\alpha = \Phi^{-1}(1 - \alpha)$ is the standard normal quantile. Power at $\mu_1$ is $\Phi\!\big(\sqrt{n}(\mu_1 - \mu_0)/\sigma - z_\alpha\big)$. For $\sqrt{n}(\mu_1 - \mu_0)/\sigma = 2.5$ and $\alpha = 0.05$, power is $\Phi(2.5 - 1.645) \approx 0.80$.
Summary
- The Neyman-Pearson lemma: the likelihood ratio test is the most powerful test for simple hypotheses
- UMP tests exist for one-sided alternatives in exponential families via the monotone likelihood ratio
- The power function characterizes a test across the entire parameter space
- The ROC curve of the likelihood ratio classifier dominates all other classifiers
- UMP tests do not exist for two-sided alternatives in general
Exercises
Problem
Let $X \sim \mathcal{N}(\theta, 1)$ with a single observation. For testing $H_0: \theta = 0$ versus $H_1: \theta = \theta_1$ (for a fixed $\theta_1 > 0$) at level $\alpha$, write down the Neyman-Pearson test and compute its power.
Problem
Prove that for testing $H_0: \theta = 0$ vs $H_1: \theta \ne 0$ with $X \sim \mathcal{N}(\theta, 1)$ (single observation), no UMP level-$\alpha$ test exists.
References
Canonical:
- Lehmann & Romano, Testing Statistical Hypotheses (3rd ed., 2005), Chapters 3-4
- Casella & Berger, Statistical Inference (2nd ed., 2002), Chapter 8
Current:
- Wasserman, All of Statistics (2004), Chapter 10
- van der Vaart, Asymptotic Statistics (1998), Chapters 14-15 (LAN, asymptotic testing, LRT, Wilks, Wald, Rao)
- Keener, Theoretical Statistics (2010), Chapters 3-8
Next Topics
- Hypothesis testing for ML: multiple testing, A/B testing, and model comparison
- Bootstrap methods: nonparametric alternatives to parametric tests
Last reviewed: April 26, 2026