
Methodology

Statistical Significance and Multiple Comparisons

p-values, significance levels, confidence intervals, and the multiple comparisons problem. Bonferroni correction, Benjamini-Hochberg FDR control, and why these matter for model selection and benchmark evaluation.

Core · Tier 2 · Stable · Supporting · ~50 min

Why This Matters

You train five models, tune three hyperparameters each, evaluate on four metrics, and report the best combination. How many implicit hypothesis tests did you just run? Roughly $5 \times 3 \times 4 = 60$. Without correction, the probability of at least one spurious "significant" result exceeds 95%. This is the multiple comparisons problem, and it is the single largest source of false claims in ML evaluation.

Formal Setup

Definition

p-Value

The p-value for a test statistic $T$ is the probability, under the null, of observing a value at least as extreme as the observed $t_{\text{obs}}$. For a one-sided alternative (large $T$):

$$p = P(T \geq t_{\text{obs}} \mid H_0)$$

For a two-sided alternative:

$$p = P(|T| \geq |t_{\text{obs}}| \mid H_0)$$

A small $p$ means the data is unlikely under $H_0$. The p-value is not the probability that $H_0$ is true, and $1 - p$ is not the probability that the alternative is true.
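As a concrete check on the definition, here is a minimal stdlib-only sketch that computes one- and two-sided p-values, assuming the test statistic is approximately standard normal under $H_0$ (the function name is illustrative):

```python
import math

def z_test_p_values(z: float) -> tuple[float, float]:
    """One- and two-sided p-values for a test statistic z that is
    approximately N(0, 1) under H0, e.g. a performance difference
    divided by its standard error."""
    # P(Z >= z): upper tail via the complementary error function
    one_sided = 0.5 * math.erfc(z / math.sqrt(2))
    # P(|Z| >= |z|): double the upper tail at |z|
    two_sided = math.erfc(abs(z) / math.sqrt(2))
    return one_sided, two_sided

one_sided, two_sided = z_test_p_values(1.96)  # two_sided is ~0.05
```

The familiar $z = 1.96$ cutoff recovers the two-sided $p \approx 0.05$ boundary.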

Definition

Significance Level

The significance level $\alpha$ is the threshold for rejecting $H_0$: reject $H_0$ if and only if $p \leq \alpha$. The standard choice $\alpha = 0.05$ means you accept a 5% false positive rate for any single test.

Definition

Confidence Interval

A $(1 - \alpha)$ confidence interval for a parameter $\theta$ is a random interval $[L, U]$ such that $P(\theta \in [L, U]) \geq 1 - \alpha$ over repeated sampling. For the difference in model performance $\delta = \mu_A - \mu_B$:

$$\hat{\delta} \pm z_{\alpha/2} \cdot \text{SE}(\hat{\delta})$$

where $\text{SE}$ is the standard error. The interval excludes zero if and only if the difference is significant at level $\alpha$.
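A hedged sketch of this interval for two accuracies, assuming they were measured on independent test sets (for a shared test set, a paired analysis is more appropriate; the function name is illustrative):

```python
import math

def diff_ci(acc_a: float, acc_b: float, n_a: int, n_b: int,
            z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% CI for delta = mu_A - mu_B, treating
    each accuracy as a binomial proportion on an independent test set."""
    se = math.sqrt(acc_a * (1 - acc_a) / n_a + acc_b * (1 - acc_b) / n_b)
    delta = acc_a - acc_b
    return delta - z * se, delta + z * se

lo, hi = diff_ci(0.91, 0.89, 1000, 1000)
# The interval straddles 0: a 2-point gap on 1,000 examples each
# is not significant at alpha = 0.05.
```

Note how quickly significance depends on test-set size: the same 2-point gap on 2,000 examples per model would exclude zero (barely).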

The Multiple Comparisons Problem

When testing $m$ hypotheses simultaneously at level $\alpha$ each, the probability of at least one false positive grows rapidly.

Definition

Family-Wise Error Rate

The FWER is the probability of making at least one Type I error across all $m$ tests:

$$\text{FWER} = P(\text{at least one false rejection}) = 1 - (1 - \alpha)^m$$

when the tests are independent. For $m = 20$ and $\alpha = 0.05$: $\text{FWER} \approx 1 - 0.95^{20} \approx 0.64$.

Bonferroni Correction

Theorem

Bonferroni Correction

Statement

To control the family-wise error rate at level $\alpha$ across $m$ tests, reject hypothesis $i$ only if $p_i \leq \alpha / m$. Then:

$$\text{FWER} = P\left(\bigcup_{i \in \mathcal{H}_0} \{p_i \leq \alpha/m\}\right) \leq \sum_{i \in \mathcal{H}_0} P(p_i \leq \alpha/m) \leq m_0 \cdot \frac{\alpha}{m} \leq \alpha$$

where $m_0 \leq m$ is the number of true null hypotheses.

Intuition

If you test 20 hypotheses at $\alpha = 0.05$, Bonferroni requires $p \leq 0.0025$ for each individual test. You divide your error budget equally among all tests. The union bound guarantees this works regardless of whether the tests are correlated.

Proof Sketch

By the union bound: $P(\exists i \in \mathcal{H}_0: p_i \leq \alpha/m) \leq \sum_{i \in \mathcal{H}_0} P(p_i \leq \alpha/m) = m_0 \cdot (\alpha/m) \leq \alpha$. The key step is that for a true null, $p_i$ is uniform on $[0, 1]$, so $P(p_i \leq \alpha/m) = \alpha/m$.

Why It Matters

Bonferroni is the simplest multiple testing correction. It requires no assumptions about the dependence structure among tests. In ML, use it when comparing a small number of models (say, 5 or fewer) and you need a strong guarantee that no comparison is a false positive.

Failure Mode

Bonferroni is very conservative when $m$ is large. With $m = 1000$ tests, you need $p \leq 0.00005$ per test. Many true effects will be missed. When you care about controlling the proportion of false discoveries rather than their existence, use Benjamini-Hochberg instead. Even when you insist on FWER control, Holm's step-down procedure (Holm 1979) uniformly dominates Bonferroni: order the p-values $p_{(1)} \leq \cdots \leq p_{(m)}$ and reject $H_{(i)}$ while $p_{(i)} \leq \alpha/(m - i + 1)$, stopping at the first failure. Holm still controls FWER at level $\alpha$ and always rejects a (weak) superset of what Bonferroni rejects, so use it by default.
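A minimal sketch of both procedures (illustrative function names, not a library API). The Holm loop follows the step-down rule quoted above:

```python
def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Reject H_i iff p_i <= alpha / m. Controls FWER under any dependence."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down: compare the r-th smallest p-value to alpha / (m - r + 1)
    and stop at the first failure. Controls FWER; dominates Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):      # rank r-1 compares p_(r) to alpha/(m-r+1)
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                          # step-down: stop at first failure
    return reject
```

On `[0.001, 0.004, 0.012, 0.03]` at $\alpha = 0.05$, Bonferroni (threshold 0.0125) rejects three hypotheses while Holm's growing thresholds (0.0125, 0.0167, 0.025, 0.05) reject all four, illustrating the dominance.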

Benjamini-Hochberg FDR Control

Definition

False Discovery Rate

The False Discovery Rate is the expected proportion of false positives among all rejected hypotheses:

$$\text{FDR} = \mathbb{E}\left[\frac{V}{R \vee 1}\right]$$

where $V$ is the number of false rejections, $R$ is the total number of rejections, and $R \vee 1 = \max(R, 1)$ avoids division by zero.

Theorem

Benjamini-Hochberg Procedure

Statement

Order the p-values $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$. Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} q$. Reject all hypotheses $H_{(1)}, \ldots, H_{(k)}$. Then $\text{FDR} \leq q$.

Intuition

BH draws a line with slope $q/m$ through the ordered p-values. The largest p-value below this line determines the cutoff. Hypotheses with smaller p-values are rejected. The procedure is adaptive: if many p-values are small (many true effects), the threshold is more lenient than Bonferroni's.

Proof Sketch

The original proof by Benjamini and Hochberg (1995) proceeds by induction on the number of true nulls $m_0$. The key insight: under independence, the expected false discovery proportion at the $k$-th threshold is $m_0 \cdot kq / (m \cdot k) = m_0 q / m \leq q$. The proof extends to positive regression dependence (PRDS) by the work of Benjamini and Yekutieli (2001).

Why It Matters

BH is the standard correction for large-scale testing in ML. When comparing models across many datasets, metrics, or hyperparameter settings, BH controls the fraction of false discoveries rather than the probability of any false discovery. This is almost always the right notion for ML practitioners.

Failure Mode

BH requires independence or positive regression dependence (PRDS) among the p-values (Benjamini-Yekutieli 2001). If tests are negatively correlated (rare in practice but possible) or the dependence is unknown, FDR may exceed $q$. Under arbitrary dependence, use the Benjamini-Yekutieli correction, which replaces $q$ with $q / c(m)$ where $c(m) = \sum_{i=1}^m 1/i$ is the harmonic number. For large $m$, $c(m) \approx \ln m + \gamma$ (Euler-Mascheroni $\gamma \approx 0.577$), so the BY threshold is roughly $q / \ln m$, which costs a factor of about $\ln m$ in power. Do not apply the approximation for small $m$: at $m = 20$, $c(20) \approx 3.60$ while $\ln 20 \approx 3.00$.
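A sketch of the step-up rule with an optional Benjamini-Yekutieli factor for arbitrary dependence (illustrative names, not a library API; in practice, statsmodels' `multipletests` implements these procedures):

```python
def bh_reject(p_values: list[float], q: float = 0.05,
              by_correction: bool = False) -> list[bool]:
    """Benjamini-Hochberg step-up procedure (FDR <= q under
    independence/PRDS). With by_correction=True, divides q by the
    harmonic number c(m) = sum 1/i for arbitrary dependence (BY)."""
    m = len(p_values)
    c = sum(1 / i for i in range(1, m + 1)) if by_correction else 1.0
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up: the LARGEST k with p_(k) <= (k/m) * (q/c) wins
    k_star = 0
    for k, i in enumerate(order, start=1):
        if p_values[i] <= k * q / (m * c):
            k_star = k
    reject = [False] * m
    for i in order[:k_star]:
        reject[i] = True
    return reject
```

On `[0.001, 0.004, 0.012, 0.03]` at $q = 0.05$, BH rejects all four (thresholds 0.0125, 0.025, 0.0375, 0.05), while the BY variant, paying its $c(4) \approx 2.08$ penalty, rejects only three.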

Why This Matters for ML

Model Selection

When you compare $k$ models, you are implicitly running $\binom{k}{2}$ pairwise tests. With 10 models, that is 45 tests. Reporting the "best" model without correction inflates the false positive rate.

Hyperparameter Tuning

Each hyperparameter configuration evaluated on a validation set is an implicit hypothesis test. Random search over 100 configurations at $\alpha = 0.05$ yields an expected 5 false positives. Cross-validation helps but does not eliminate this problem.
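The selection effect can be demonstrated directly: simulate configurations whose true accuracy is identical, and the maximum validation score still "improves" on the truth. A sketch under stated assumptions (200 configurations, 1,000 validation examples, seeded stdlib RNG; all numbers illustrative):

```python
import random

random.seed(0)
true_acc, n_val, n_configs = 0.80, 1000, 200

# Every configuration has the SAME true accuracy; validation scores
# differ only through sampling noise (SE = sqrt(0.8*0.2/1000) ~ 0.013).
val_scores = [
    sum(random.random() < true_acc for _ in range(n_val)) / n_val
    for _ in range(n_configs)
]
best = max(val_scores)
# best typically lands around true_acc + SE * sqrt(2 ln 200), i.e. an
# illusory gain of roughly 3 points from selecting the max of 200 estimates.
```

Reporting `best` as the model's accuracy is exactly the uncorrected multiple-testing error; a held-out test evaluation of the single selected configuration removes the bias.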

Benchmark Comparisons

Papers comparing on multiple benchmarks (GLUE has 9 tasks, for example) should correct for multiple comparisons when claiming improvements "across the board."

Example

Bonferroni vs BH on benchmark evaluation

You compare your model against a baseline on 20 benchmark tasks. The sorted p-values are: 0.001, 0.003, 0.008, 0.012, 0.025, 0.04, 0.06, ... (remaining above 0.05).

Bonferroni ($\alpha = 0.05$): the threshold is $0.05/20 = 0.0025$. Only the first p-value (0.001) passes. You claim significance on 1 task.

BH ($q = 0.05$): check each $p_{(k)} \leq k \cdot 0.05/20$. The thresholds are 0.0025, 0.005, 0.0075, 0.01, 0.0125, ... Walking through in order: $k = 1$ passes ($0.001 \leq 0.0025$), $k = 2$ passes ($0.003 \leq 0.005$), $k = 3$ fails ($0.008 > 0.0075$). BH is a step-up procedure: find the largest $k^\star$ such that $p_{(k^\star)} \leq k^\star \cdot q/m$, then reject $H_{(1)}, \ldots, H_{(k^\star)}$. Here $k^\star = 2$, so BH rejects the first 2 hypotheses. You claim significance on 2 tasks.

Note: BH does not stop at the first failure. It scans all $k$ and takes the largest passing index. In this example the only passing indices are $k = 1, 2$, so the answer coincides with the first-failure index minus one. For other p-value sequences (e.g., a late-arriving small p-value), the largest-$k$ rule can reject more than the first-failure rule would.

BH discovers more true effects than Bonferroni while controlling the false discovery proportion.
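The walkthrough above can be verified numerically. The 13 unspecified p-values are padded with a placeholder of 0.5 (the text says only that they are above 0.05; any value above every BH threshold gives the same counts, and the variable names are illustrative):

```python
# First 7 p-values from the example; remaining 13 padded with 0.5.
p = [0.001, 0.003, 0.008, 0.012, 0.025, 0.04, 0.06] + [0.5] * 13
m, alpha = len(p), 0.05
ps = sorted(p)

bonf = sum(pi <= alpha / m for pi in p)   # Bonferroni threshold: 0.0025
bh = max((k for k in range(1, m + 1)      # largest k with p_(k) <= k*q/m
          if ps[k - 1] <= k * alpha / m), default=0)
print(bonf, bh)  # -> 1 2: one task survives Bonferroni, two survive BH
```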

Common Confusions

Watch Out

p-hacking is implicit multiple testing

If you try many analysis strategies (different features, different preprocessing, different splits) and report only the one that gives $p < 0.05$, you have performed many implicit tests without correction. This is p-hacking. The fix is to preregister your analysis plan or to apply multiple testing correction across all the analyses you tried.

Watch Out

Garden of forking paths: you do not have to run multiple tests to multiply test

Gelman and Loken (2013) make the sharper point: even researchers who run only a single analysis can commit de facto multiple testing, because the choice of analysis was contingent on the data. A covariate was dropped because it looked noisy. A split was chosen after a peek. An outlier rule was applied only when results were marginal. Each data-dependent choice defines a branch in the garden of forking paths, and the reported p-value ignores all the branches that did not get taken. Preregistration of the full analysis pipeline is the only clean fix.

Watch Out

Bonferroni and BH control different error rates

Bonferroni controls the probability of any false positive (FWER). BH controls the proportion of false positives (FDR). These are different quantities. FWER is appropriate when any false positive is costly (e.g., clinical trials). FDR is appropriate when you expect some false positives and want to control their rate (e.g., screening many benchmark tasks).

Watch Out

Post-hoc correction does not fix bad experimental design

If your experiment has data leakage or other methodological flaws, no multiple testing correction will save you. Corrections adjust p-value thresholds; they do not fix biased test statistics. Always ensure the individual tests are valid before applying corrections.

Summary

  • The p-value is $P(\text{data} \mid H_0)$, not $P(H_0 \mid \text{data})$
  • Testing $m$ hypotheses at level $\alpha$ each gives $\text{FWER} \approx 1 - (1-\alpha)^m$
  • Bonferroni: reject if $p_i \leq \alpha/m$. Controls FWER. Conservative for large $m$
  • BH: order the p-values, find the largest $k$ with $p_{(k)} \leq kq/m$. Controls FDR
  • ML model selection, hyperparameter tuning, and benchmark evaluation all involve implicit multiple testing
  • Confidence intervals convey both significance and effect size

Exercises

ExerciseCore

Problem

You evaluate a model on 8 benchmark datasets and obtain p-values (vs. baseline) of 0.004, 0.01, 0.02, 0.03, 0.06, 0.08, 0.12, 0.25. Apply both Bonferroni and BH at level 0.05. How many datasets show significant improvement under each?

ExerciseAdvanced

Problem

You run a random hyperparameter search with 200 configurations. The best configuration has validation accuracy 0.5% higher than the second best. You report this as your final result. Explain why this is problematic from a multiple comparisons perspective, and propose a fix.

References

Canonical:

  • Benjamini and Hochberg, "Controlling the False Discovery Rate" (1995)
  • Demsar, "Statistical Comparisons of Classifiers over Multiple Datasets" (JMLR 2006)

Current:

  • Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (MLSys 2021)
  • Recht et al., "Do ImageNet Classifiers Generalize to ImageNet?" (ICML 2019)
  • Holm, "A Simple Sequentially Rejective Multiple Test Procedure" (Scand. J. Stat. 1979)
  • Benjamini and Yekutieli, "The Control of the False Discovery Rate Under Dependency" (Annals of Statistics 2001)
  • Storey, "A Direct Approach to False Discovery Rates" (JRSS-B 2002, q-values)
  • Gelman and Loken, "The Garden of Forking Paths" (2013 working paper, later in Am. Scientist 2014)
  • Dror, Baumer, Shlomov, Reichart, "The Hitchhiker's Guide to Testing Statistical Significance in NLP" (ACL 2018)

Last reviewed: April 18, 2026
