Multiple Testing Corrections

May 9, 2026 · 9 min read · By Mohammed Vasim
Tags: Statistics, Math, Data Science

You test 20 features for association with a target, each at α=0.05. All 20 null hypotheses are true: none of the features actually matter. On average you expect 20 × 0.05 = 1 false positive. But how likely is it that you get at least one? Assuming the tests are independent:

P(at least one false positive) = 1 − (1 − 0.05)^20 = 1 − 0.95^20 = 1 − 0.358 = 0.642

Over 64% chance of declaring at least one irrelevant feature "significant" — purely from random noise. This is the multiple testing problem. It silently corrupts any analysis that runs many tests on the same data.
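The 64% figure can be checked with a quick simulation. The sketch below is illustrative only: it assumes the 20 tests are independent and that each null p-value is uniform on [0, 1]. The function name `simulate_family`, the trial count, and the seed are arbitrary choices, not part of the original analysis.

```python
import random

def simulate_family(m=20, alpha=0.05, trials=20_000, seed=0):
    """Simulate `trials` families of m independent tests with every null true.

    Returns (share of families with at least one false positive,
             mean number of false positives per family).
    """
    rng = random.Random(seed)
    families_with_fp = 0
    total_fp = 0
    for _ in range(trials):
        # Under a true null, a valid p-value is uniform on [0, 1],
        # so each test is a false positive with probability alpha.
        fp = sum(rng.random() < alpha for _ in range(m))
        total_fp += fp
        families_with_fp += fp > 0
    return families_with_fp / trials, total_fp / trials

p_any, mean_fp = simulate_family()
analytic = 1 - (1 - 0.05) ** 20
print(f"simulated P(at least one FP): {p_any:.3f}  (analytic: {analytic:.3f})")
print(f"mean false positives per family: {mean_fp:.2f}  (expected: 20 * 0.05 = 1)")
```

The simulated share should land close to the analytic 0.642, and the mean count close to 1. The two numbers answer different questions: one false positive on average, yet roughly a third of families contain none at all.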

The Anchor

Scenario 1: Feature selection — 20 features tested for association with the target. α=0.05 each.

Scenario 2: Pairwise model comparison — 5 ML models → 5×4/2 = 10 pairwise tests.
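As a small sketch of Scenario 2, the pair count can be enumerated directly. The model names are hypothetical placeholders, and the FWER figure assumes independent tests, which pairwise comparisons that share models are not, so treat it as indicative.

```python
from itertools import combinations

# Hypothetical model names; any five labels would do.
models = ["model_a", "model_b", "model_c", "model_d", "model_e"]

pairs = list(combinations(models, 2))   # every unordered pair, exactly once
m = len(pairs)                          # 5 * 4 / 2 = 10
alpha = 0.05
fwer_if_independent = 1 - (1 - alpha) ** m

print(f"{m} pairwise tests")
print(f"uncorrected FWER (independence assumed): {fwer_if_independent:.3f}")
```

Ten comparisons already push the uncorrected chance of a spurious "winner" to about 40%.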

Family-Wise Error Rate (FWER)

The FWER is the probability of at least one false positive across all m tests. When all m null hypotheses are true and the tests are independent, each run at level α:

FWER = 1 − (1 − α)^m
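The formula follows in two steps, assuming the m tests are independent and every null hypothesis is true, so that each test is a false positive with probability α:

```latex
\begin{aligned}
P(\text{no false positive in } m \text{ tests}) &= \prod_{i=1}^{m} (1 - \alpha) = (1 - \alpha)^m \\
\mathrm{FWER} = P(\text{at least one false positive}) &= 1 - (1 - \alpha)^m
\end{aligned}
```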

m (tests) | FWER at α=0.05
1         | 5.0%
5         | 22.6%
10        | 40.1%
20        | 64.2%
50        | 92.3%
100       | 99.4%
[Figure: FWER vs. number of tests (m), rising rapidly toward 1. Marked points: 5% at m=1, 64% at m=20, 92% at m=50. x-axis: number of tests (m), 0 to 100; y-axis: FWER, 0% to 100%.]

ML contexts where multiple testing arises:

  1. Feature selection — testing each of p features for significance
  2. Hyperparameter search — testing each configuration against the baseline
  3. A/B testing — monitoring multiple metrics (CTR, session time, revenue) simultaneously
  4. Post-hoc ANOVA comparisons — k models → k(k−1)/2 pairwise tests
  5. Genomics / text features — thousands of features tested simultaneously

Two Frameworks: FWER vs FDR

Family-Wise Error Rate (FWER): probability of at least one false positive. Use when a single false positive has serious consequences — medical trials, drug approvals, structural breaks in production models.

False Discovery Rate (FDR): expected proportion of false positives among all rejections, E[FP/R], where R is the number of rejections and the ratio is taken as 0 when R = 0. Use when some false positives are acceptable and you need power to detect true effects: genomics, feature selection in exploratory ML, A/B tests with many metrics.
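To make the two quantities concrete, here is a minimal sketch that scores a single analysis where the ground truth is known, which in practice it is not. The function name `error_summary` and the example outcome are hypothetical. FWER is the probability, over repeated experiments, that `any_false_positive` comes out True; FDR is the average of `false_discovery_proportion`.

```python
def error_summary(rejected, null_is_true):
    """Summarise one analysis in which the ground truth is known.

    rejected[i]:     True if hypothesis i was rejected
    null_is_true[i]: True if hypothesis i is a true null (no real effect)
    """
    false_positives = sum(r and n for r, n in zip(rejected, null_is_true))
    rejections = sum(rejected)
    # Convention: the false discovery proportion is 0 when nothing is rejected.
    fdp = false_positives / rejections if rejections else 0.0
    return {
        "any_false_positive": false_positives > 0,   # the event FWER counts
        "false_discovery_proportion": fdp,           # the quantity FDR averages
    }

# Hypothetical outcome: 10 tests, 5 rejections, one of them a true null.
rejected     = [True, True, True, True, True, False, False, False, False, False]
null_is_true = [False, False, False, False, True, True, True, True, True, True]
summary = error_summary(rejected, null_is_true)
print(summary)
```

This outcome counts as a failure under FWER (one false positive is one too many) but only as a 20% false discovery proportion under the FDR view.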

Bonferroni Correction (FWER control)

Method: test each hypothesis at α_adjusted = α / m.

For 20 features at α=0.05: α_adjusted = 0.05/20 = 0.0025. Reject H₀ for feature i if p_i < 0.0025.
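The rule is short enough to write by hand. The sketch below is one possible implementation, not a library API: the function name `bonferroni` is an assumption, and it uses the strict `<` comparison from the text above. It is applied to the five p-values that the Holm walk-through later in this post uses, so m=5 and the threshold is 0.05/5 = 0.01.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (reject flags, adjusted p-values), both in the input order."""
    m = len(p_values)
    threshold = alpha / m
    reject = [p < threshold for p in p_values]
    # Equivalent view: inflate each p-value by m (capped at 1) and compare to alpha.
    adjusted = [min(1.0, m * p) for p in p_values]
    return reject, adjusted

# The five p-values from the Holm walk-through later in this post (m=5).
p5 = [0.003, 0.018, 0.042, 0.061, 0.092]
reject, adjusted = bonferroni(p5)
print("threshold:", 0.05 / len(p5))
print("reject:   ", reject)
print("adjusted: ", [round(p, 3) for p in adjusted])
```

Only the smallest p-value (0.003) clears the 0.01 threshold. The adjusted p-values give the same decision when compared against the original α.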

Derivation from union bound:

P(any false positive) = P(⋃ᵢ {pᵢ < α/m}) ≤ Σᵢ P(pᵢ < α/m) = m × (α/m) = α

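The inequality (≤) is the union bound (Boole's inequality). It holds under any dependence between tests and is tight only when the false-positive events are mutually exclusive. For independent tests the FWER is 1 − (1 − α/m)^m, slightly below α (about 0.0488 for m=20 and α=0.05). For positively correlated features (common in practice), the actual FWER with Bonferroni can fall well below α, so the correction is conservative and sacrifices power.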

Šidák correction (exact for independent tests): α_Šidák = 1 − (1−α)^{1/m}. For m=20: α_Šidák = 0.00256 vs Bonferroni = 0.0025 — nearly identical. Use Holm over either.
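A quick numeric check of the two per-test levels, as a sketch under the same assumptions as the formulas above (independent tests, all nulls true). The helper name `fwer_independent` is an assumption introduced for illustration.

```python
alpha, m = 0.05, 20

bonferroni_level = alpha / m
sidak_level = 1 - (1 - alpha) ** (1 / m)

def fwer_independent(per_test_level, m):
    """FWER for m independent tests with all nulls true, each run at per_test_level."""
    return 1 - (1 - per_test_level) ** m

print(f"Bonferroni per-test level: {bonferroni_level:.5f}")
print(f"Sidak per-test level:      {sidak_level:.5f}")
print(f"FWER with Bonferroni:      {fwer_independent(bonferroni_level, m):.4f}")
print(f"FWER with Sidak:           {fwer_independent(sidak_level, m):.4f}")
```

Under independence Šidák hits α exactly, while Bonferroni lands just below it. The gap is too small to matter in practice, which is why the two are treated as interchangeable here.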

Holm-Bonferroni (Sequential FWER Control)

Method: sort p-values ascending p₍₁₎ ≤ p₍₂₎ ≤ ... ≤ p₍ₘ₎. For the k-th smallest, use threshold α/(m−k+1).

Step-by-step on 5 p-values (α=0.05, m=5):

k | p₍ₖ₎  | Threshold α/(m−k+1) | Decision
1 | 0.003 | 0.05/5 = 0.0100     | Reject (0.003 < 0.010)
2 | 0.018 | 0.05/4 = 0.0125     | Stop (0.018 > 0.0125)
3 | 0.042 | -                   | Not rejected (stopped)
4 | 0.061 | -                   | Not rejected (stopped)
5 | 0.092 | -                   | Not rejected (stopped)

Once the first non-rejection occurs, all subsequent hypotheses are not rejected. Only the first feature (p=0.003) is significant.

Why Holm beats Bonferroni: Holm is uniformly more powerful — it rejects at least as many hypotheses as Bonferroni, sometimes more. Holm still controls FWER at α. Always prefer Holm over Bonferroni.
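The step-down rule can be sketched in a few lines. This is an illustrative implementation rather than a library routine: the function name `holm` is an assumption, it uses the strict `<` comparison from the walk-through above, and the second set of p-values is hypothetical, chosen to show a case where Holm rejects more than Bonferroni.

```python
def holm(p_values, alpha=0.05):
    """Holm step-down procedure. Returns reject flags in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if p_values[i] < alpha / (m - rank + 1):
            reject[i] = True
        else:
            break  # first failure: this and every larger p-value stay unrejected
    return reject

# The walk-through above: only p=0.003 is rejected.
walkthrough = [0.003, 0.018, 0.042, 0.061, 0.092]
print(holm(walkthrough))

# Hypothetical p-values (unsorted on purpose) where Holm gains a rejection.
example = [0.04, 0.001, 0.6, 0.012, 0.3]
bonferroni_reject = [p < 0.05 / len(example) for p in example]
holm_reject = holm(example)
print("Bonferroni:", bonferroni_reject)
print("Holm:      ", holm_reject)
```

In the second example Bonferroni rejects only p=0.001, because 0.012 misses the fixed threshold of 0.01. Holm also rejects p=0.012, since its second threshold is 0.05/4 = 0.0125.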

Benjamini-Hochberg (FDR Control)

Method: sort p-values ascending. Find the largest k such that p₍ₖ₎ ≤ k × q / m, where q is the target FDR level. Reject all hypotheses 1 through k.

For m=20 features at q=0.10 (allow up to 10% false discoveries among rejections):

The BH threshold for the k-th test is k × 0.10 / 20 = k × 0.005.

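If BH rejects, say, k=6 features, the guarantee is E[FP/R] ≤ 0.10: averaged over repeated experiments, at most 10% of rejections are false, which corresponds to roughly 0.6 false positives among 6 rejections. The guarantee holds for independent or positively dependent tests.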

[Figure: BH procedure. Sorted p-values (dots) plotted against rank k (ascending, 1 to 20), with the BH threshold line in amber. Green points fall below the line and are rejected: 5 features selected.]

FDR interpretation: if you reject 5 features at q=0.10, you expect roughly 0.5 false positives (5 × 10%) among them on average; in any single analysis the actual number could be 0, 1, or more. FDR controls the rate of incorrect rejections among all rejections, which is far more practical for large-scale feature selection.
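The step-up rule described above can be sketched as follows. The function name `benjamini_hochberg` is an assumption and this is an illustration, not a library routine. The 20 p-values are the ones used in this post's code section; the short second list is hypothetical and shows the step-up behaviour.

```python
def benjamini_hochberg(p_values, q=0.10):
    """BH step-up procedure. Returns reject flags in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose p-value sits at or below the line k * q / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True  # everything up to rank k_max is rejected
    return reject

# The 20 p-values from this post's code section.
p_values = [
    0.003, 0.018, 0.042, 0.061, 0.092, 0.21, 0.15, 0.034,
    0.002, 0.87, 0.43, 0.067, 0.23, 0.54, 0.011, 0.32,
    0.78, 0.009, 0.44, 0.056,
]
flags = benjamini_hochberg(p_values, q=0.10)
selected = sorted(p for p, r in zip(p_values, flags) if r)
print(len(selected), selected)

# Hypothetical case: 0.03 misses its own threshold (0.025) but is still
# rejected, because a larger p-value passes further up the line.
step_up = benjamini_hochberg([0.03, 0.04, 0.045, 0.9], q=0.10)
print(step_up)
```

On the 20 p-values the sixth-smallest (0.034) is above its threshold of 0.030 and no later rank passes, so five features are selected. The second call shows why BH searches for the largest passing rank instead of stopping at the first failure, as Holm does.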

Comparison Table

Method             | Controls | Power  | Conservative?     | Use when
Bonferroni         | FWER     | Low    | Yes (union bound) | Few tests, strict FP control
Holm-Bonferroni    | FWER     | Better | Less              | Always prefer over Bonferroni
Benjamini-Hochberg | FDR      | High   | No                | Many tests, some FP tolerable

p-Hacking and Pre-Registration

Multiple testing corrections fix the analysis. Pre-registration fixes the design.

p-hacking: run many tests, report only the significant ones, or choose which test is "primary" after seeing results. This directly exploits the multiple testing problem. Every unreported test increases the actual FWER without correction.

Concrete ML example: train 10 model variants, test on 5 evaluation metrics each → 50 potential tests. If you report only the (model, metric) pair that gives p=0.03, your actual false positive rate is far above 5%.
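A simulation makes the size of the problem visible. The sketch below assumes all 50 tests are independent and every null is true, so each p-value is uniform on [0, 1]. Metrics computed on the same models are correlated in practice, so the exact numbers are indicative only. The function name `best_of_many`, the trial count, and the seed are arbitrary.

```python
import random

def best_of_many(n_tests=50, alpha=0.05, trials=20_000, seed=1):
    """Simulate reporting only the smallest of n_tests null p-values.

    Returns (share of experiments where the best p-value is below alpha,
             share where it is below the Bonferroni level alpha / n_tests).
    """
    rng = random.Random(seed)
    naive_hits = 0
    corrected_hits = 0
    for _ in range(trials):
        best_p = min(rng.random() for _ in range(n_tests))
        naive_hits += best_p < alpha
        corrected_hits += best_p < alpha / n_tests
    return naive_hits / trials, corrected_hits / trials

naive_rate, corrected_rate = best_of_many()
print(f"best-of-50 'significant' at 0.05:       {naive_rate:.3f}")
print(f"best-of-50 'significant' at 0.05 / 50:  {corrected_rate:.3f}")
```

With nothing real to find, the cherry-picked result clears α=0.05 in about 92% of experiments. Judging the same result against the Bonferroni level brings the rate back to just under 5%.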

Pre-registration: commit to the number and type of tests before collecting data. Primary metric chosen before any analysis. All pre-specified tests reported, not just the winners.

For ML: specify the primary evaluation metric and the number of model variants to compare before running experiments. Report all comparisons with corrections applied, including non-significant ones.

Code and Output

python
import numpy as np
from statsmodels.stats.multitest import multipletests

# 20 p-values from feature significance tests
p_values = np.array([
    0.003, 0.018, 0.042, 0.061, 0.092, 0.21, 0.15, 0.034,
    0.002, 0.87, 0.43, 0.067, 0.23, 0.54, 0.011, 0.32,
    0.78, 0.009, 0.44, 0.056
])

m = len(p_values)
alpha = 0.05
print(f"Number of tests: {m}")
print(f"FWER without correction: 1-(1-0.05)^{m} = {1-(1-alpha)**m:.3f}")
print(f"Bonferroni threshold: {alpha/m:.4f}\n")

# Bonferroni: reject when p <= alpha / m
rej_bon, p_bon, _, _ = multipletests(p_values, alpha=alpha, method='bonferroni')
print(f"Bonferroni rejections: {rej_bon.sum()}")
# .tolist() converts NumPy scalars to plain floats so the list prints cleanly
print(f"  Rejected p-values: {sorted(p_values[rej_bon].tolist())}")

# Holm-Bonferroni: step-down version of Bonferroni
rej_holm, p_holm, _, _ = multipletests(p_values, alpha=alpha, method='holm')
print(f"\nHolm rejections: {rej_holm.sum()}")
print(f"  Rejected p-values: {sorted(p_values[rej_holm].tolist())}")

# Benjamini-Hochberg (FDR q=0.10); the alpha argument is the target FDR level
q = 0.10
rej_bh, p_bh, _, _ = multipletests(p_values, alpha=q, method='fdr_bh')
print(f"\nBH FDR rejections (q={q}): {rej_bh.sum()}")
print(f"  Rejected p-values: {sorted(p_values[rej_bh].tolist())}")

# Holm step-by-step on 5 p-values
print("\n--- Holm step-by-step (5 p-values) ---")
p5 = np.array([0.003, 0.018, 0.042, 0.061, 0.092])
m5 = len(p5)
sorted_p = np.sort(p5)
for k, p in enumerate(sorted_p, 1):
    threshold = alpha / (m5 - k + 1)
    decision = "Reject" if p < threshold else "STOP"
    print(f"  k={k}: p={p:.3f}, threshold={threshold:.4f} → {decision}")
    if decision == "STOP":
        break

# FWER table
print("\nFWER vs m (α=0.05):")
for m_val in [1, 5, 10, 20, 50, 100]:
    fwer = 1 - (1-alpha)**m_val
    print(f"  m={m_val:>3}: FWER={fwer:.3f}")

Output:

Number of tests: 20
FWER without correction: 1-(1-0.05)^20 = 0.642
Bonferroni threshold: 0.0025

Bonferroni rejections: 1
  Rejected p-values: [0.002]

Holm rejections: 1
  Rejected p-values: [0.002]

BH FDR rejections (q=0.1): 5
  Rejected p-values: [0.002, 0.003, 0.009, 0.011, 0.018]

--- Holm step-by-step (5 p-values) ---
  k=1: p=0.003, threshold=0.0100 → Reject
  k=2: p=0.018, threshold=0.0125 → STOP

FWER vs m (α=0.05):
  m=  1: FWER=0.050
  m=  5: FWER=0.226
  m= 10: FWER=0.401
  m= 20: FWER=0.642
  m= 50: FWER=0.923
  m=100: FWER=0.994

On this data Bonferroni and Holm each reject a single feature (p=0.002). The second-smallest p-value, 0.003, exceeds both the Bonferroni threshold (0.05/20 = 0.0025) and Holm's second threshold (0.05/19 ≈ 0.0026), so Holm gains nothing here, although it never rejects fewer hypotheses than Bonferroni. BH rejects 5 features at a 10% FDR target. The choice of method depends on whether any false positive is intolerable (FWER) or a small fraction is acceptable (FDR).

Test Your Understanding

  1. You test 100 features at α=0.05 and find 8 significant results. Before applying any correction, compute the expected number of false positives if all 100 null hypotheses were true. After Bonferroni correction (α=0.0005), what must a p-value satisfy for one of the original 8 to remain significant? What does this trade-off cost you?

  2. A team argues: "We pre-specified our primary metric (accuracy) before the experiment, so we don't need multiple testing corrections." They also evaluated 4 secondary metrics (precision, recall, F1, AUC) informally after the experiment. Is their claim correct? What should they do?

  3. Bonferroni and Holm both control FWER, but Holm is always at least as powerful. Explain why Holm is more powerful by describing what happens in the Holm procedure when the first k p-values all satisfy p₍ₖ₎ < α/(m−k+1). What does Bonferroni do instead?

  4. You use BH at q=0.05 and reject 12 features out of 200 tested. A stakeholder asks: "How many of those 12 features are truly associated with the target?" What is the BH guarantee? Why can't you identify which specific features are false discoveries?

  5. A dataset has 1,000 features. You apply BH at q=0.05 and reject 40 features. You then test those 40 features on a new validation set at α=0.05. Why is applying BH once on the original 1,000 tests and then using the 40 features in a new analysis more appropriate than re-applying correction on the 40 re-tests?
