
Methodology

Statistical Significance and Multiple Comparisons

p-values, significance levels, confidence intervals, and the multiple comparisons problem. Bonferroni correction, Benjamini-Hochberg FDR control, and why these matter for model selection and benchmark evaluation.

Core · Tier 2 · Stable · Supporting · ~50 min

Why This Matters

You train five models, tune three hyperparameters each, evaluate on four metrics, and report the best combination. How many implicit hypothesis tests did you just run? Roughly $5 \times 3 \times 4 = 60$. Without correction, the probability of at least one spurious "significant" result exceeds 95%. This is the multiple comparisons problem, and it is the single largest source of false claims in ML evaluation.

Formal Setup

Definition

p-Value

The p-value for a test statistic $T$ is the probability, under the null, of observing a value at least as extreme as the observed $t_{\text{obs}}$. For a one-sided alternative (large $T$):

$$p = P(T \geq t_{\text{obs}} \mid H_0)$$

For a two-sided alternative:

$$p = P(|T| \geq |t_{\text{obs}}| \mid H_0)$$

A small $p$ means the data is unlikely under $H_0$. The p-value is not the probability that $H_0$ is true, and $1 - p$ is not the probability that the alternative is true.
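As a concrete check on the definition, here is a minimal stdlib-only sketch that computes one- and two-sided p-values, assuming the test statistic is approximately standard normal under $H_0$ (the function name is illustrative):

```python
import math

def z_test_p_values(z: float) -> tuple[float, float]:
    """One- and two-sided p-values for a test statistic z that is
    approximately N(0, 1) under H0, e.g. a performance difference
    divided by its standard error."""
    # P(Z >= z): upper tail via the complementary error function
    one_sided = 0.5 * math.erfc(z / math.sqrt(2))
    # P(|Z| >= |z|): double the upper tail at |z|
    two_sided = math.erfc(abs(z) / math.sqrt(2))
    return one_sided, two_sided

one_sided, two_sided = z_test_p_values(1.96)  # two_sided is ~0.05
```

The familiar $z = 1.96$ cutoff recovers the two-sided $p \approx 0.05$ boundary.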

Definition

Significance Level

The significance level $\alpha$ is the threshold for rejecting $H_0$: reject $H_0$ if and only if $p \leq \alpha$. The standard choice $\alpha = 0.05$ means you accept a 5% false positive rate for any single test.

Definition

Confidence Interval

A $(1 - \alpha)$ confidence interval for a parameter $\theta$ is a random interval $[L, U]$ such that $P(\theta \in [L, U]) \geq 1 - \alpha$ over repeated sampling. For the difference in model performance $\delta = \mu_A - \mu_B$:

$$\hat{\delta} \pm z_{\alpha/2} \cdot \text{SE}(\hat{\delta})$$

where $\text{SE}$ is the standard error. The interval excludes zero if and only if the difference is significant at level $\alpha$.
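A hedged sketch of this interval for two accuracies, assuming they were measured on independent test sets (for a shared test set, a paired analysis is more appropriate; the function name is illustrative):

```python
import math

def diff_ci(acc_a: float, acc_b: float, n_a: int, n_b: int,
            z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% CI for delta = mu_A - mu_B, treating
    each accuracy as a binomial proportion on an independent test set."""
    se = math.sqrt(acc_a * (1 - acc_a) / n_a + acc_b * (1 - acc_b) / n_b)
    delta = acc_a - acc_b
    return delta - z * se, delta + z * se

lo, hi = diff_ci(0.91, 0.89, 1000, 1000)
# The interval straddles 0: a 2-point gap on 1,000 examples each
# is not significant at alpha = 0.05.
```

Note how quickly significance depends on test-set size: the same 2-point gap on 2,000 examples per model would exclude zero (barely).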

The Multiple Comparisons Problem

When testing $m$ hypotheses simultaneously at level $\alpha$ each, the probability of at least one false positive grows rapidly.

Definition

Family-Wise Error Rate

The FWER is the probability of making at least one Type I error across all $m$ tests:

$$\text{FWER} = P(\text{at least one false rejection}) = 1 - (1 - \alpha)^m$$

when the tests are independent. For $m = 20$ and $\alpha = 0.05$: $\text{FWER} \approx 1 - 0.95^{20} \approx 0.64$.

Bonferroni Correction

Theorem

Bonferroni Correction

Statement

To control the family-wise error rate at level $\alpha$ across $m$ tests, reject hypothesis $i$ only if $p_i \leq \alpha / m$. Then:

$$\text{FWER} = P\left(\bigcup_{i \in \mathcal{H}_0} \{p_i \leq \alpha/m\}\right) \leq \sum_{i \in \mathcal{H}_0} P(p_i \leq \alpha/m) \leq m_0 \cdot \frac{\alpha}{m} \leq \alpha$$

where $m_0 \leq m$ is the number of true null hypotheses.

Intuition

If you test 20 hypotheses at $\alpha = 0.05$, Bonferroni requires $p \leq 0.0025$ for each individual test. You divide your error budget equally among all tests. The union bound guarantees this works regardless of whether the tests are correlated.

Proof Sketch

By the union bound: $P(\exists i \in \mathcal{H}_0: p_i \leq \alpha/m) \leq \sum_{i \in \mathcal{H}_0} P(p_i \leq \alpha/m) = m_0 \cdot (\alpha/m) \leq \alpha$. The key step is that for a true null, $p_i$ is uniform on $[0, 1]$, so $P(p_i \leq \alpha/m) = \alpha/m$.

Why It Matters

Bonferroni is the simplest multiple testing correction. It requires no assumptions about the dependence structure among tests. In ML, use it when comparing a small number of models (say, 5 or fewer) and you need a strong guarantee that no comparison is a false positive.

Failure Mode

Bonferroni is very conservative when $m$ is large. With $m = 1000$ tests, you need $p \leq 0.00005$ per test. Many true effects will be missed. When you care about controlling the proportion of false discoveries rather than their existence, use Benjamini-Hochberg instead. Even when you insist on FWER control, Holm's step-down procedure (Holm 1979) uniformly dominates Bonferroni: order the p-values $p_{(1)} \leq \cdots \leq p_{(m)}$ and reject $H_{(i)}$ while $p_{(i)} \leq \alpha/(m - i + 1)$, stopping at the first failure. Holm still controls FWER at level $\alpha$ and always rejects a (weak) superset of what Bonferroni rejects, so use it by default.
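A minimal sketch of both procedures (illustrative function names, not a library API). The Holm loop follows the step-down rule quoted above:

```python
def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Reject H_i iff p_i <= alpha / m. Controls FWER under any dependence."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down: compare the r-th smallest p-value to alpha / (m - r + 1)
    and stop at the first failure. Controls FWER; dominates Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):      # rank r-1 compares p_(r) to alpha/(m-r+1)
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                          # step-down: stop at first failure
    return reject
```

On `[0.001, 0.004, 0.012, 0.03]` at $\alpha = 0.05$, Bonferroni (threshold 0.0125) rejects three hypotheses while Holm's growing thresholds (0.0125, 0.0167, 0.025, 0.05) reject all four, illustrating the dominance.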

Benjamini-Hochberg FDR Control

Definition

False Discovery Rate

The False Discovery Rate is the expected proportion of false positives among all rejected hypotheses:

$$\text{FDR} = \mathbb{E}\left[\frac{V}{R \vee 1}\right]$$

where $V$ is the number of false rejections, $R$ is the total number of rejections, and $R \vee 1 = \max(R, 1)$ avoids division by zero.

Theorem

Benjamini-Hochberg Procedure

Statement

Order the p-values $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$. Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} q$. Reject all hypotheses $H_{(1)}, \ldots, H_{(k)}$. Then $\text{FDR} \leq q$.

Intuition

BH draws a line with slope $q/m$ through the ordered p-values. The largest p-value below this line determines the cutoff. Hypotheses with smaller p-values are rejected. The procedure is adaptive: if many p-values are small (many true effects), the threshold is more lenient than Bonferroni's.

Proof Sketch

The original proof by Benjamini and Hochberg (1995) proceeds by induction on the number of true nulls $m_0$. The key insight: under independence, the expected false discovery proportion at the $k$-th threshold is $m_0 \cdot kq / (m \cdot k) = m_0 q / m \leq q$. The proof extends to positive regression dependence (PRDS) by the work of Benjamini and Yekutieli (2001).

Why It Matters

BH is the standard correction for large-scale testing in ML. When comparing models across many datasets, metrics, or hyperparameter settings, BH controls the fraction of false discoveries rather than the probability of any false discovery. This is almost always the right notion for ML practitioners.

Failure Mode

BH requires independence or positive regression dependence (PRDS) among the p-values (Benjamini-Yekutieli 2001). If tests are negatively correlated (rare in practice but possible) or the dependence is unknown, FDR may exceed $q$. Under arbitrary dependence, use the Benjamini-Yekutieli correction, which replaces $q$ with $q / c(m)$ where $c(m) = \sum_{i=1}^m 1/i$ is the harmonic number. For large $m$, $c(m) \approx \ln m + \gamma$ (Euler-Mascheroni $\gamma \approx 0.577$), so the BY threshold is roughly $q / \ln m$, which costs a factor of about $\ln m$ in power. Do not apply the approximation for small $m$: at $m = 20$, $c(20) \approx 3.60$ while $\ln 20 \approx 3.00$.
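A sketch of the step-up rule with an optional Benjamini-Yekutieli factor for arbitrary dependence (illustrative names, not a library API; in practice, statsmodels' `multipletests` implements these procedures):

```python
def bh_reject(p_values: list[float], q: float = 0.05,
              by_correction: bool = False) -> list[bool]:
    """Benjamini-Hochberg step-up procedure (FDR <= q under
    independence/PRDS). With by_correction=True, divides q by the
    harmonic number c(m) = sum 1/i for arbitrary dependence (BY)."""
    m = len(p_values)
    c = sum(1 / i for i in range(1, m + 1)) if by_correction else 1.0
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up: the LARGEST k with p_(k) <= (k/m) * (q/c) wins
    k_star = 0
    for k, i in enumerate(order, start=1):
        if p_values[i] <= k * q / (m * c):
            k_star = k
    reject = [False] * m
    for i in order[:k_star]:
        reject[i] = True
    return reject
```

On `[0.001, 0.004, 0.012, 0.03]` at $q = 0.05$, BH rejects all four (thresholds 0.0125, 0.025, 0.0375, 0.05), while the BY variant, paying its $c(4) \approx 2.08$ penalty, rejects only three.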

Why This Matters for ML

Model Selection

When you compare $k$ models, you are implicitly running $\binom{k}{2}$ pairwise tests. With 10 models, that is 45 tests. Reporting the "best" model without correction inflates the false positive rate.

Hyperparameter Tuning

Each hyperparameter configuration evaluated on a validation set is an implicit hypothesis test. Random search over 100 configurations at $\alpha = 0.05$ yields an expected 5 false positives. Cross-validation helps but does not eliminate this problem.
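The selection effect can be demonstrated directly: simulate configurations whose true accuracy is identical, and the maximum validation score still "improves" on the truth. A sketch under stated assumptions (200 configurations, 1,000 validation examples, seeded stdlib RNG; all numbers illustrative):

```python
import random

random.seed(0)
true_acc, n_val, n_configs = 0.80, 1000, 200

# Every configuration has the SAME true accuracy; validation scores
# differ only through sampling noise (SE = sqrt(0.8*0.2/1000) ~ 0.013).
val_scores = [
    sum(random.random() < true_acc for _ in range(n_val)) / n_val
    for _ in range(n_configs)
]
best = max(val_scores)
# best typically lands around true_acc + SE * sqrt(2 ln 200), i.e. an
# illusory gain of roughly 3 points from selecting the max of 200 estimates.
```

Reporting `best` as the model's accuracy is exactly the uncorrected multiple-testing error; a held-out test evaluation of the single selected configuration removes the bias.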

Benchmark Comparisons

Papers comparing on multiple benchmarks (GLUE has 9 tasks, for example) should correct for multiple comparisons when claiming improvements "across the board."

Example

Bonferroni vs BH on benchmark evaluation

You compare your model against a baseline on 20 benchmark tasks. The sorted p-values are: 0.001, 0.003, 0.008, 0.012, 0.025, 0.04, 0.06, ... (remaining above 0.05).

Bonferroni ($\alpha = 0.05$): the threshold is $0.05/20 = 0.0025$. Only the first p-value (0.001) passes. You claim significance on 1 task.

BH ($q = 0.05$): check each $p_{(k)} \leq k \cdot 0.05/20$. The thresholds are 0.0025, 0.005, 0.0075, 0.01, 0.0125, ... Walking through in order: $k = 1$ passes ($0.001 \leq 0.0025$), $k = 2$ passes ($0.003 \leq 0.005$), $k = 3$ fails ($0.008 > 0.0075$). BH is a step-up procedure: find the largest $k^\star$ such that $p_{(k^\star)} \leq k^\star \cdot q/m$, then reject $H_{(1)}, \ldots, H_{(k^\star)}$. Here $k^\star = 2$, so BH rejects the first 2 hypotheses. You claim significance on 2 tasks.

Note: BH does not stop at the first failure. It scans all $k$ and takes the largest passing index. In this example the only passing indices are $k = 1, 2$, so the answer coincides with the first-failure index minus one. For other p-value sequences (e.g., a late-arriving small p-value), the largest-$k$ rule can reject more than the first-failure rule would.

BH discovers more true effects than Bonferroni while controlling the false discovery proportion.
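The walkthrough above can be verified numerically. The 13 unspecified p-values are padded with a placeholder of 0.5 (the text says only that they are above 0.05; any value above every BH threshold gives the same counts, and the variable names are illustrative):

```python
# First 7 p-values from the example; remaining 13 padded with 0.5.
p = [0.001, 0.003, 0.008, 0.012, 0.025, 0.04, 0.06] + [0.5] * 13
m, alpha = len(p), 0.05
ps = sorted(p)

bonf = sum(pi <= alpha / m for pi in p)   # Bonferroni threshold: 0.0025
bh = max((k for k in range(1, m + 1)      # largest k with p_(k) <= k*q/m
          if ps[k - 1] <= k * alpha / m), default=0)
print(bonf, bh)  # -> 1 2: one task survives Bonferroni, two survive BH
```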

Common Confusions

Watch Out

p-hacking is implicit multiple testing

If you try many analysis strategies (different features, different preprocessing, different splits) and report only the one that gives $p < 0.05$, you have performed many implicit tests without correction. This is p-hacking. The fix is to preregister your analysis plan or to apply multiple testing correction across all the analyses you tried.

Watch Out

Garden of forking paths: you do not have to run multiple tests to multiply test

Gelman and Loken (2013) make the sharper point: even researchers who run only a single analysis can commit de facto multiple testing, because the choice of analysis was contingent on the data. A covariate was dropped because it looked noisy. A split was chosen after a peek. An outlier rule was applied only when results were marginal. Each data-dependent choice defines a branch in the garden of forking paths, and the reported p-value ignores all the branches that did not get taken. Preregistration of the full analysis pipeline is the only clean fix.

Watch Out

Bonferroni and BH control different error rates

Bonferroni controls the probability of any false positive (FWER). BH controls the proportion of false positives (FDR). These are different quantities. FWER is appropriate when any false positive is costly (e.g., clinical trials). FDR is appropriate when you expect some false positives and want to control their rate (e.g., screening many benchmark tasks).

Watch Out

Post-hoc correction does not fix bad experimental design

If your experiment has data leakage or other methodological flaws, no multiple testing correction will save you. Corrections adjust p-value thresholds; they do not fix biased test statistics. Always ensure the individual tests are valid before applying corrections.

Summary

  • The p-value is $P(\text{data} \mid H_0)$, not $P(H_0 \mid \text{data})$
  • Testing $m$ hypotheses at level $\alpha$ each gives $\text{FWER} \approx 1 - (1-\alpha)^m$
  • Bonferroni: reject if $p_i \leq \alpha/m$. Controls FWER. Conservative for large $m$
  • BH: order the p-values, find the largest $k$ with $p_{(k)} \leq kq/m$. Controls FDR
  • ML model selection, hyperparameter tuning, and benchmark evaluation all involve implicit multiple testing
  • Confidence intervals convey both significance and effect size

Exercises

ExerciseCore

Problem

You evaluate a model on 8 benchmark datasets and obtain p-values (vs. baseline) of 0.004, 0.01, 0.02, 0.03, 0.06, 0.08, 0.12, 0.25. Apply both Bonferroni and BH at level 0.05. How many datasets show significant improvement under each?

ExerciseAdvanced

Problem

You run a random hyperparameter search with 200 configurations. The best configuration has validation accuracy 0.5% higher than the second best. You report this as your final result. Explain why this is problematic from a multiple comparisons perspective, and propose a fix.

References

Canonical:

  • Benjamini and Hochberg, "Controlling the False Discovery Rate" (1995)
  • Demsar, "Statistical Comparisons of Classifiers over Multiple Datasets" (JMLR 2006)

Current:

  • Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (MLSys 2021)
  • Recht et al., "Do ImageNet Classifiers Generalize to ImageNet?" (ICML 2019)
  • Holm, "A Simple Sequentially Rejective Multiple Test Procedure" (Scand. J. Stat. 1979)
  • Benjamini and Yekutieli, "The Control of the False Discovery Rate Under Dependency" (Annals of Statistics 2001)
  • Storey, "A Direct Approach to False Discovery Rates" (JRSS-B 2002, q-values)
  • Gelman and Loken, "The Garden of Forking Paths" (2013 working paper, later in Am. Scientist 2014)
  • Dror, Baumer, Shlomov, Reichart, "The Hitchhiker's Guide to Testing Statistical Significance in NLP" (ACL 2018)

Last reviewed: April 18, 2026
