Methodology
Hypothesis Testing for ML
Null and alternative hypotheses, p-values, confidence intervals, error types, multiple comparisons, and proper statistical tests for comparing ML models.
Why This Matters
Machine learning papers routinely claim that model A outperforms model B. But how do you know the difference is real and not an artifact of the random train/test split, initialization, or data ordering? Hypothesis testing provides the formal framework for answering this question. Without it, you cannot distinguish signal from noise in experimental comparisons. The multiple comparisons problem is especially critical in ML, where researchers often compare many models, hyperparameter settings, or metrics simultaneously.
Mental Model
You have two models. You run them on a test set and model A scores 0.85 while model B scores 0.83. Is A better than B, or did you just get lucky? Hypothesis testing formalizes this question: assume the models are equally good (null hypothesis), compute how unlikely the observed difference is under that assumption (p-value), and reject the null if the data is sufficiently surprising.
The Neyman-Pearson Framework
Null Hypothesis
The null hypothesis $H_0$ is the default assumption you are trying to disprove. In ML model comparison:

$$H_0: \mu_A = \mu_B$$

where $\mu_A$ and $\mu_B$ are the true expected performances of models A and B. The null says there is no real difference between the models.
Alternative Hypothesis
The alternative hypothesis $H_1$ is the claim tested against $H_0$; rejecting $H_0$ is evidence in favor of $H_1$.
- Two-sided: $H_1: \mu_A \neq \mu_B$ (the models differ)
- One-sided: $H_1: \mu_A > \mu_B$ (model A is better)
In ML, two-sided tests are more conservative and generally preferred unless you have a strong prior reason to test only one direction.
p-Value
The p-value is the probability of observing a test statistic at least as extreme as the one computed from your data, assuming $H_0$ is true. The exact form depends on the alternative:
- One-sided ($H_1: \mu_A > \mu_B$): $p = P(T \geq t_{\mathrm{obs}} \mid H_0)$
- Two-sided ($H_1: \mu_A \neq \mu_B$): $p = P(|T| \geq |t_{\mathrm{obs}}| \mid H_0)$
Since two-sided tests are the default in ML model comparison, use the two-sided form unless you have committed in advance to a directional alternative. A small p-value means the observed data is unlikely under $H_0$, which is evidence against $H_0$.
The p-value is not the probability that $H_0$ is true. It is the probability of the data given $H_0$, not the probability of $H_0$ given the data.
Significance Level
The significance level $\alpha$ is the threshold for rejecting $H_0$: if $p \leq \alpha$, we reject $H_0$ and declare the result "statistically significant." The standard choice is $\alpha = 0.05$, meaning we accept a 5% chance of falsely rejecting a true null hypothesis.
p = 0.05 is a convention, not a theorem
The 0.05 threshold comes from Fisher (1925), who offered it as a rough guide, not a principled cutoff. The ASA's 2016 and 2019 statements on p-values warn against dichotomous threshold-based decisions. The replication crisis in psychology and biomedicine (Open Science Collaboration 2015; Ioannidis 2005) showed that a large fraction of "significant" findings at $p < 0.05$ fail to replicate. In ML practice: test across multiple seeds, report effect sizes and confidence intervals, and treat $p < 0.05$ on a single run as weak evidence.
Error Types
Type I and Type II Errors
Statement
There are two types of errors in hypothesis testing:
| | $H_0$ true | $H_0$ false |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Correct (power) |
| Do not reject $H_0$ | Correct | Type II error ($\beta$) |
- Type I error (false positive): rejecting $H_0$ when it is true. Probability $= \alpha$.
- Type II error (false negative): failing to reject $H_0$ when it is false. Probability $= \beta$.
- Power $= 1 - \beta$: probability of correctly rejecting a false $H_0$.
Intuition
Type I error is a false alarm: you claim the models are different when they are not. Type II error is a missed detection: the models truly differ but your test fails to detect it. Decreasing $\alpha$ (being more conservative) increases $\beta$ (more missed detections) for fixed sample size.
Proof Sketch
By definition, $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$, which is controlled directly by the choice of significance level. The power $1 - \beta$ depends on the true effect size $\delta = \mu_A - \mu_B$, the variance $\sigma^2$, and the sample size $n$; for a two-sided $z$-test,

$$1 - \beta \approx \Phi\!\left(\frac{|\delta|\sqrt{n}}{\sigma} - z_{1-\alpha/2}\right).$$
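The two-sided power can be computed numerically in a few lines of pure Python. This is a sketch with illustrative names (`power_two_sided_z`, `normal_cdf` are not library functions); it evaluates both rejection tails, of which the far tail is usually negligible:

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_sided_z(delta: float, sigma: float, n: int) -> float:
    """Approximate power of a two-sided z-test at alpha = 0.05.

    delta: true effect size (mu_A - mu_B)
    sigma: std of the per-example differences
    n:     number of paired observations
    """
    z_crit = 1.959964  # Phi^{-1}(0.975), hard-coded to stay stdlib-only
    shift = abs(delta) * math.sqrt(n) / sigma
    # Probability the test statistic lands outside [-z_crit, z_crit]
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)
```

With no true effect (`delta = 0`) the "power" collapses to the Type I rate $\alpha \approx 0.05$, which is a useful sanity check on any power calculation.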
Why It Matters
In ML, Type I errors lead to published claims of improvement that do not replicate. Type II errors lead to dismissing genuinely better methods. Both are costly. Understanding the tradeoff is essential for designing experiments with adequate power.
Failure Mode
Most ML experiments are underpowered: they use too few random seeds or train/test splits to detect real but small improvements. An underpowered test has high $\beta$, meaning it frequently fails to detect true differences.
Confidence Intervals
Confidence Interval
A confidence interval for a parameter $\theta$ is a random interval $[L(X), U(X)]$ whose coverage probability is at least $1 - \alpha$:

$$P_\theta\big(L(X) \leq \theta \leq U(X)\big) \geq 1 - \alpha$$

for every parameter value $\theta$ in the model. When equality holds, the interval has exact coverage.
For the difference in model performance $\delta = \mu_A - \mu_B$, a 95% CI provides a range of plausible values for the true difference. If the CI excludes zero, the difference is significant at $\alpha = 0.05$.
Confidence intervals are more informative than p-values alone because they communicate both statistical significance and effect size.
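A minimal sketch of the normal-approximation 95% CI for the mean per-example difference, in pure Python (the function name is illustrative, not from any library):

```python
import math
import statistics

def mean_diff_ci(diffs, z=1.96):
    """Normal-approximation 95% CI for the mean of per-example
    loss differences d_i = loss(A, x_i) - loss(B, x_i)."""
    n = len(diffs)
    d_bar = statistics.fmean(diffs)
    half = z * statistics.stdev(diffs) / math.sqrt(n)  # z * s_d / sqrt(n)
    return d_bar - half, d_bar + half
```

If the returned interval excludes zero, the corresponding two-sided test rejects at $\alpha = 0.05$; the interval's width additionally communicates how precisely the effect is estimated.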
The Multiple Comparisons Problem
When you test many hypotheses simultaneously, the probability of at least one false positive grows rapidly.
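For $m$ independent tests with all nulls true, the chance of at least one false positive is $1 - (1 - \alpha)^m$. A one-line sketch (illustrative name) makes the growth concrete:

```python
def familywise_error(m: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across m independent tests,
    assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m
```

At $m = 20$ this is already about 0.64: running twenty uncorrected tests at $\alpha = 0.05$ makes a spurious "discovery" more likely than not.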
Bonferroni Correction
Statement
If you perform $m$ simultaneous hypothesis tests and want to control the family-wise error rate (FWER) at level $\alpha$, reject each individual test only if $p_i \leq \alpha/m$.
This holds regardless of the dependence structure among the tests.
Intuition
If you flip a coin 20 times looking for "evidence of bias," you expect at least one streak just by chance. Bonferroni compensates by making each individual test more stringent: if you test 20 hypotheses at $\alpha = 0.05$, you need $p \leq 0.0025$ for any single one.
Proof Sketch
By the union bound: $P(\text{any false positive}) \leq \sum_{i=1}^{m} P(p_i \leq \alpha/m) \leq m \cdot \frac{\alpha}{m} = \alpha$.
Why It Matters
In ML, the multiple comparisons problem arises constantly: comparing many models, testing on multiple datasets, evaluating multiple metrics, running experiments with different hyperparameters. Without correction, the false positive rate can be far higher than the nominal $\alpha$.
Failure Mode
Bonferroni is very conservative: it controls the probability of any false positive, which can make it too strict when $m$ is large. Many true effects are missed. When you care about controlling the proportion of false positives rather than their existence, use FDR control instead.
Holm-Bonferroni (1979) is a uniformly more powerful alternative that still controls FWER at $\alpha$ without additional assumptions. Sort the p-values $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$ and reject $H_{(i)}$ iff $p_{(j)} \leq \frac{\alpha}{m - j + 1}$ for all $j \leq i$. The smallest p-value is tested against $\alpha/m$ (same as Bonferroni), but later ones use a larger threshold. Holm dominates Bonferroni: any rejection Bonferroni makes, Holm also makes. Prefer Holm whenever you would have used Bonferroni.
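The Holm step-down procedure is a few lines of code. A sketch (illustrative implementation, not from any library) that returns reject decisions aligned with the input order:

```python
def holm_reject(pvals, alpha=0.05):
    """Holm step-down procedure: controls FWER at alpha under
    arbitrary dependence. Returns a list of booleans (reject?)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Thresholds grow: alpha/m, alpha/(m-1), ..., alpha/1
        if pvals[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: once one fails, all larger p-values fail
    return reject
```

On `[0.001, 0.01, 0.02]` at $\alpha = 0.05$, Holm rejects all three, whereas plain Bonferroni ($p \leq 0.05/3 \approx 0.0167$) would reject only the first two.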
FDR Control: Benjamini-Hochberg
False Discovery Rate
The False Discovery Rate is the expected proportion of false positives among all rejected hypotheses:

$$\mathrm{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R, 1)}\right],$$

where $V$ is the number of false rejections and $R$ is the total number of rejections.
The Benjamini-Hochberg (BH) procedure controls FDR at level $\alpha$:
- Order the p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$
- Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m}\alpha$
- Reject the hypotheses corresponding to $p_{(1)}, \ldots, p_{(k)}$
BH is less conservative than Bonferroni and is preferred when testing many hypotheses (e.g., comparing models on 50 datasets).
BH controls FDR at level $\alpha$ under independence or positive regression dependence of the test statistics (Benjamini-Hochberg 1995; Benjamini-Yekutieli 2001). Under arbitrary dependence, use the Benjamini-Yekutieli correction, which replaces the threshold $\frac{k}{m}\alpha$ with $\frac{k}{m \, c(m)}\alpha$, where $c(m) = \sum_{i=1}^{m} \frac{1}{i}$. This is strictly more conservative but valid without structural assumptions on the joint distribution of p-values.
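The BH step-up procedure is also short to implement. A sketch (illustrative names) returning reject decisions in the original input order:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up procedure: controls FDR at alpha under
    independence or positive regression dependence."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Largest k with p_(k) <= (k/m) * alpha
    k_star = 0
    for k, idx in enumerate(order, start=1):
        if pvals[idx] <= (k / m) * alpha:
            k_star = k
    # Reject the k_star smallest p-values (step-up: everything below the cut)
    reject = [False] * m
    for k, idx in enumerate(order, start=1):
        if k <= k_star:
            reject[idx] = True
    return reject
```

Note the step-up structure: even a p-value that exceeds its own threshold is rejected if some larger p-value passes, which is why BH is less conservative than Holm or Bonferroni.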
Paired Tests for Model Comparison
When comparing two models on the same data, you should use paired tests that account for the correlation between the two models' errors on each example.
Paired t-Test for Model Comparison
Given $n$ test examples, let $d_i = \ell(A, x_i) - \ell(B, x_i)$ be the difference in loss between models A and B on example $i$. The paired t-test statistic is:

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad \bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad s_d^2 = \frac{1}{n-1}\sum_{i=1}^{n} (d_i - \bar{d})^2$$

Under $H_0: \mathbb{E}[d_i] = 0$, $t$ follows a Student $t$-distribution with $n - 1$ degrees of freedom exactly when the $d_i$ are i.i.d. Gaussian, and approximately for non-Gaussian $d_i$ by the CLT when $n$ is large. The $t$-distribution is robust to moderate departures from normality but breaks down for heavy-tailed or highly skewed $d_i$; in those cases use the Wilcoxon signed-rank test or a bootstrap.
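The statistic itself is a few lines of stdlib Python. This sketch (illustrative function name) returns the statistic and degrees of freedom; in practice `scipy.stats.ttest_rel` computes both the statistic and the p-value in one call:

```python
import math
import statistics

def paired_t(diffs):
    """Paired t statistic for per-example loss differences d_i.
    Returns (t, df); the two-sided p-value comes from the Student-t
    survival function, e.g. 2 * scipy.stats.t.sf(abs(t), df)."""
    n = len(diffs)
    d_bar = statistics.fmean(diffs)
    s_d = statistics.stdev(diffs)  # sample std with ddof=1, matching s_d above
    t = d_bar / (s_d / math.sqrt(n))
    return t, n - 1
```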
Do not reuse a single held-out test set with a standard paired t-test across resamples
The standard paired t-test assumes independent $d_i$. If you generate differences by repeated random train/test splits of a single dataset, the $d_i$ are positively correlated (the same examples appear in multiple test folds), the variance estimate is too small, and the nominal Type I error is inflated. Nadeau and Bengio (2003) proposed a corrected resampled t-test that replaces the naive variance factor $\frac{1}{k}$ with $\frac{1}{k} + \frac{n_2}{n_1}$, for $k$ resamples with $n_1$ training and $n_2$ test examples, to account for the overlap. Dietterich (1998) proposed an earlier alternative, the 5x2 cv paired t-test, which runs five 2-fold cross-validations and uses only one of the two folds per repetition to keep variance comparisons independent.
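The correction is a one-line change to the standard error. A sketch under the assumptions above (illustrative function name):

```python
import math
import statistics

def nadeau_bengio_t(diffs, n_train, n_test):
    """Corrected resampled t statistic (Nadeau & Bengio 2003).

    diffs: per-split performance differences from k random
    train/test resamples of one dataset. The naive variance factor
    1/k is replaced by (1/k + n_test/n_train) to compensate for
    the correlation induced by overlapping training sets.
    Compare the result to a Student-t with k - 1 degrees of freedom.
    """
    k = len(diffs)
    d_bar = statistics.fmean(diffs)
    var_d = statistics.variance(diffs)  # sample variance, ddof=1
    corrected_se = math.sqrt((1.0 / k + n_test / n_train) * var_d)
    return d_bar / corrected_se
```

Because the corrected standard error is strictly larger than the naive one, the corrected statistic is always smaller in magnitude, i.e., the test is deliberately harder to pass.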
Wilcoxon Signed-Rank Test
A non-parametric alternative to the paired t-test. It does not assume normality of the differences $d_i$. Instead, it ranks the absolute differences $|d_i|$ and compares the sum of ranks for positive vs. negative differences. Use it when the differences are not approximately normal (e.g., heavy-tailed or skewed distributions).
McNemar's Test
The appropriate paired test when the per-example outcome is binary (correct/incorrect). Build a contingency table of model A vs. model B outcomes on each test example and let $b$ be the count of examples where A is correct and B is wrong, and $c$ the count where A is wrong and B is correct. Under $H_0$ that both models have the same error rate, the discordant pairs $b$ and $c$ are exchangeable, and

$$\chi^2 = \frac{(b - c)^2}{b + c}$$

is asymptotically $\chi^2_1$. For small discordant counts, use the exact binomial form: $p = 2 \sum_{i=k}^{b+c} \binom{b+c}{i} \left(\tfrac{1}{2}\right)^{b+c}$ (capped at 1), where $k = \max(b, c)$. McNemar's test only uses discordant pairs. Concordant pairs (both right or both wrong) carry no signal about which model is better.
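The exact binomial form is easy to implement directly with the stdlib (illustrative sketch):

```python
import math

def mcnemar_exact(b: int, c: int) -> float:
    """Exact (binomial) McNemar test on discordant counts.
    b: A correct, B wrong; c: A wrong, B correct.
    Under H0 the flips are Binomial(b + c, 0.5); returns the
    two-sided p-value, capped at 1."""
    n_d = b + c
    k = max(b, c)
    tail = sum(math.comb(n_d, i) for i in range(k, n_d + 1)) * 0.5 ** n_d
    return min(1.0, 2.0 * tail)
```

For example, 8 discordant wins for A against 2 for B gives $p \approx 0.109$: suggestive, but not significant at $\alpha = 0.05$, despite the 4:1 ratio, because only 10 examples carry any signal.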
Resampling Tests: Permutation and Bootstrap
Two distinct resampling schemes are both used for hypothesis testing and are often confused. They answer different questions.
Permutation Test
Use a permutation test when $H_0$ asserts exchangeability of observations across groups (e.g., model A and model B perform identically on each example).
- Compute the observed test statistic $t_{\mathrm{obs}}$ (e.g., difference in accuracy).
- For each of $B$ iterations, permute the group labels (swap which model produced each prediction) and recompute the statistic $t^{(b)}$ on the relabeled data.
- The p-value is the fraction of permutation statistics at least as extreme: $p = \frac{1 + \#\{b : |t^{(b)}| \geq |t_{\mathrm{obs}}|\}}{1 + B}$.
Permutation tests are exact under the exchangeability null and require no distributional assumptions.
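For paired model comparisons, permuting which model produced each prediction amounts to flipping the sign of each per-example difference. A sketch (illustrative names, stdlib only):

```python
import random

def paired_permutation_test(a_scores, b_scores, n_perm=10_000, seed=0):
    """Paired sign-flip permutation test on the sum of differences.
    Under H0 (exchangeability within each pair), each d_i is
    equally likely to enter as +d_i or -d_i."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(a_scores, b_scores)]
    t_obs = abs(sum(diffs))
    count = 0
    for _ in range(n_perm):
        t_perm = abs(sum(d if rng.random() < 0.5 else -d for d in diffs))
        if t_perm >= t_obs:
            count += 1
    # Add-one form keeps the p-value strictly positive
    return (count + 1) / (n_perm + 1)
```

With identical score vectors every permuted statistic ties the observed one and the p-value is exactly 1; with a large, consistent difference the p-value collapses toward $1/(B+1)$.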
Bootstrap Test
Use a bootstrap test to approximate the sampling distribution of a statistic when you cannot derive it analytically and the sample is too small for asymptotic approximations.
- Compute the observed test statistic $t_{\mathrm{obs}}$.
- For each of $B$ iterations, resample the data with replacement (not permute) and compute $t^{(b)}$, typically after recentering under $H_0$.
- Construct confidence intervals from the empirical distribution of the $t^{(b)}$, or compute a p-value from the recentered statistics.
Bootstrap tests estimate a sampling distribution; permutation tests construct a null distribution by exchanging labels. They coincide asymptotically under some conditions but are not interchangeable in small samples.
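A percentile-bootstrap sketch for the mean per-example difference (illustrative name; the percentile method is one of several bootstrap CI constructions, and BCa intervals are often preferred for skewed statistics):

```python
import random
import statistics

def bootstrap_ci_mean_diff(diffs, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-example difference.
    If the interval excludes 0, the difference is significant at alpha."""
    rng = random.Random(seed)
    n = len(diffs)
    boot_means = sorted(
        statistics.fmean(rng.choices(diffs, k=n))  # resample WITH replacement
        for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

This doubles as the recommended reporting format: the interval communicates both the effect size and, via whether it covers zero, the significance decision.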
Common Confusions
p less than 0.05 does NOT mean 95% probability that H1 is true
This is the most common misinterpretation in all of statistics. The p-value is $P(\text{data} \mid H_0)$, not $P(H_1 \mid \text{data})$. To get the latter, you need Bayes' theorem and a prior on $H_0$. A p-value of 0.03 means: "if the models were truly equal, there is a 3% chance of seeing this large a difference." It does not mean "there is a 97% chance that model A is better."
Statistical significance is not practical significance
A difference can be statistically significant (small p-value) but practically meaningless (tiny effect size). With enough data, you can detect arbitrarily small differences. Always report effect sizes and confidence intervals alongside p-values. A 0.1% accuracy improvement that is "statistically significant" may not matter in practice.
Do not use unpaired tests for paired data
When comparing models on the same test set, the predictions are correlated (both models see the same examples). An unpaired t-test ignores this correlation and has lower power. Always use a paired test (paired t-test, Wilcoxon signed-rank, or bootstrap with pairing) when the data is paired.
Summary
- The p-value is $P(\text{data} \mid H_0)$, not $P(H_0 \mid \text{data})$
- Type I error (false positive) rate is controlled by $\alpha$; Type II (false negative) rate $\beta$ depends on power
- Multiple comparisons inflate false positives: use Bonferroni (strict) or Benjamini-Hochberg (less strict)
- Use paired tests (paired t-test, Wilcoxon, or McNemar for binary outcomes) when comparing models on the same data. For resampling-based estimates of generalization error, use the Nadeau-Bengio correction or Dietterich 5x2 cv t-test, not the naive paired t-test.
- Confidence intervals are more informative than p-values alone
- Bootstrap tests work when you cannot derive the test statistic distribution
Exercises
Problem
You compare 10 models on a test set and report p-values for each pairwise comparison. How many comparisons are there? If you use $\alpha = 0.05$ without correction, what is the approximate probability of at least one false positive (assuming all nulls are true)?
Problem
Design a proper statistical test to compare two classifiers (a fine-tuned BERT model and a logistic regression baseline) on a binary classification task with 1000 test examples. Specify: the null hypothesis, the test statistic, the type of test, and how you would compute the p-value.
References
Canonical:
- Fisher, Statistical Methods for Research Workers (1925), Ch. 4-5 - classical significance testing and the 0.05 convention
- Neyman and Pearson, "On the Problem of the Most Efficient Tests of Statistical Hypotheses" (1933)
- Demsar, "Statistical Comparisons of Classifiers over Multiple Datasets" (JMLR 2006) — Friedman + Nemenyi workflow for multi-dataset comparisons
- Benjamini and Hochberg, "Controlling the False Discovery Rate" (1995)
- Benjamini and Yekutieli, "The Control of the False Discovery Rate in Multiple Testing under Dependency" (Annals of Statistics 2001)
- Holm, "A Simple Sequentially Rejective Multiple Test Procedure" (Scandinavian J. Statistics 1979)
- Dietterich, "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms" (Neural Computation 1998) — 5x2 cv and McNemar's for ML
- Nadeau and Bengio, "Inference for the Generalization Error" (Machine Learning 2003) — corrected resampled t-test for cross-validation
Current:
- Wasserstein and Lazar, "The ASA Statement on p-Values" (The American Statistician, 2016)
- Wasserstein, Schirm, and Lazar, "Moving to a World Beyond p < 0.05" (The American Statistician, 2019)
- Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (MLSys 2021)
- Benavoli et al., "Time for a Change: A Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis" (JMLR 2017)
- Dror et al., "Deep Dominance: How to Properly Compare Deep Neural Models" (ACL 2019) — almost-stochastic dominance
Next Topics
- Statistical significance and multiple comparisons: deeper treatment of FDR, permutation tests, and replication
- Bootstrap methods: the general bootstrap framework for inference
Last reviewed: April 22, 2026