Statistical Power and Sample Size
The 6-fold CV experiment completed. The test finds p=0.09 and the team shelves the model. But the model might genuinely be 2 points better — the experiment simply never had enough data to find out. Power analysis prevents this: it tells you how many observations you need before running the experiment, not after.
The Four Quantities
Every hypothesis test is governed by four interconnected quantities. Knowing three determines the fourth:
- α (Type I error rate): probability of rejecting H₀ when it is true. Set by the researcher (convention: 0.05).
- Power = 1 − β: probability of correctly rejecting H₀ when H₁ is true. Target: 0.80 or higher.
- δ (effect size): the minimum difference worth detecting. Smaller δ → harder to detect → requires more data.
- n (sample size): number of observations per group.
The Anchor
Planning to detect a real improvement of δ=0.02 in model accuracy over baseline μ₀=0.85:
σ = 0.048 # SD of CV fold accuracy (from pilot 6-fold data)
δ = 0.02 # minimum meaningful improvement to detect
α = 0.05 # significance level (two-tailed)
power = 0.80 # target: detect the effect 80% of the time
Sample Size Formula: One-Sample t-Test
n = ((z_α/2 + z_β) × σ / δ)²
Components:
- z_α/2 = Φ⁻¹(0.975) = 1.960 (two-tailed at α=0.05)
- z_β = Φ⁻¹(0.80) = 0.842 (80% power)
- σ = 0.048, δ = 0.02
Step-by-step:
| Step | Formula | Result |
|---|---|---|
| Sum of z-scores | z_α/2 + z_β = 1.960 + 0.842 | 2.802 |
| Squared | (2.802)² | 7.851 |
| σ/δ ratio | 0.048 / 0.02 | 2.400 |
| (σ/δ)² | (2.400)² | 5.760 |
| n = | 7.851 × 5.760 | 45.2 → round up to 46 |
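Conclusion: about 46 CV fold accuracy scores are needed to detect a 0.02 improvement with 80% power at α=0.05. The formula uses normal quantiles in place of t quantiles, so an exact t-based calculation asks for slightly more, typically one or two extra observations at this sample size.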
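The arithmetic in the step-by-step table can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function name `required_n` is illustrative, and it implements the normal-approximation formula above rather than an exact t-based calculation.

```python
import math
from statistics import NormalDist


def required_n(delta, sigma, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sided one-sample test."""
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # about 1.960 for alpha = 0.05
    z_beta = z.inv_cdf(power)             # about 0.842 for power = 0.80
    n_exact = ((z_alpha + z_beta) * sigma / delta) ** 2
    return n_exact, math.ceil(n_exact)    # unrounded value and rounded-up n


n_exact, n_rounded = required_n(delta=0.02, sigma=0.048)
print(f"unrounded n = {n_exact:.1f}, rounded up = {n_rounded}")
```

Returning the unrounded value alongside the rounded one makes it easier to see how much slack the rounding adds.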
Current data has n=6, which is dramatically underpowered. Power with n=6 (normal approximation):
ncp = δ × √n / σ = 0.02 × √6 / 0.048 = 0.02 × 2.449 / 0.048 = 1.021
Power ≈ Φ(1.021 − 1.960) = Φ(−0.939) ≈ 0.174
That is only a 17.4% chance of detecting the real improvement.
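The same calculation can be run in both directions: power for a given n, or the smallest effect a given n can detect. The sketch below uses the normal approximation and ignores the negligible far tail, as the hand calculation does; the function names `approx_power` and `detectable_delta` are illustrative.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def approx_power(n, delta, sigma, alpha=0.05):
    """Approximate two-sided power, ignoring the negligible far tail."""
    ncp = delta * math.sqrt(n) / sigma
    return Z.cdf(ncp - Z.inv_cdf(1 - alpha / 2))


def detectable_delta(n, sigma, alpha=0.05, power=0.80):
    """Smallest improvement detectable with the given n and target power."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return z_sum * sigma / math.sqrt(n)


print(f"power at n=6:  {approx_power(6, 0.02, 0.048):.3f}")
print(f"power at n=46: {approx_power(46, 0.02, 0.048):.3f}")
print(f"detectable delta at n=6: {detectable_delta(6, 0.048):.3f}")
```

With six folds, the smallest improvement detectable at 80% power comes out near 0.055, well over twice the 0.02 target.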
Power Curves
The Square Law
Halving the detectable effect quadruples the required sample size:
n = (z_α/2 + z_β)² × (σ/δ)² ∝ 1/δ²
| δ | σ/δ | n required |
|---|---|---|
| 0.04 | 1.2 | 12 |
| 0.03 | 1.6 | 21 |
| 0.02 | 2.4 | 46 |
| 0.01 | 4.8 | 181 |
From δ=0.02 to δ=0.01, the unrounded requirement grows from about 45.2 to about 180.9, exactly 4×; after rounding up, n goes from 46 to 181. This is fundamental: smaller effects require disproportionately more data to detect reliably.
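The factor of four follows directly from the formula, before any rounding is applied:

```latex
\frac{n(\delta/2)}{n(\delta)}
  = \frac{(z_{\alpha/2} + z_{\beta})^2 \, \sigma^2 / (\delta/2)^2}
         {(z_{\alpha/2} + z_{\beta})^2 \, \sigma^2 / \delta^2}
  = \frac{\delta^2}{(\delta/2)^2}
  = 4
```

More generally, shrinking δ by a factor c multiplies the required n by c².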
Two-Sample t-Test (Comparing Two Models)
For comparing two models (each with n folds):
n_per_group = 2 × ((z_α/2 + z_β) × s_pooled / δ)²
The factor 2 arises because the difference of two independent means has variance 2σ²/n, twice that of a single mean. With equal SDs:
n_per_group = 2 × 7.851 × (0.048/0.02)² = 2 × 7.851 × 5.76 = 90.4 → 91 per group
Comparing two models requires roughly twice as many folds per model as testing one model against a fixed baseline, and so about four times as many in total. With n=6 per model (12 total folds), the power for δ=0.02 is only:
ncp = δ × √(n/2) / σ = 0.02 × √3 / 0.048 = 0.722
Power ≈ Φ(0.722 − 1.960) = Φ(−1.238) ≈ 0.108, or about 10.8%: critically underpowered.
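A short sketch of the two-sample case, again under the normal approximation. It assumes the two sets of fold scores are independent and share the same SD; the function names are illustrative.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def two_sample_n(delta, sigma, alpha=0.05, power=0.80):
    """Per-group n; the 2 reflects Var(mean_a - mean_b) = 2 * sigma^2 / n."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return math.ceil(2 * (z_sum * sigma / delta) ** 2)


def two_sample_power(n_per_group, delta, sigma, alpha=0.05):
    """Approximate power with n_per_group folds for each model."""
    se_diff = sigma * math.sqrt(2 / n_per_group)  # SE of the difference in means
    return Z.cdf(delta / se_diff - Z.inv_cdf(1 - alpha / 2))


print(two_sample_n(0.02, 0.048))                  # folds needed per model
print(f"{two_sample_power(6, 0.02, 0.048):.3f}")  # power with 6 folds per model
```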
Proportions: Cohen's h
For binary outcomes (classification accuracy as proportion), Cohen's h transforms the proportions before computing the effect size:
h = 2 × arcsin(√p₂) − 2 × arcsin(√p₁)
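n = (z_α/2 + z_β)² / h² (one sample against a fixed baseline; multiply by 2 for n per group when comparing two independent samples)
Example: p₁=0.85 (baseline), p₂=0.87 (new model).
h = 2 × arcsin(√0.87) − 2 × arcsin(√0.85) = 2 × 1.2019 − 2 × 1.1731 = 2.4039 − 2.3462 = 0.0577
n = (2.802)² / (0.0577)² ≈ 7.85 / 0.00333 ≈ 2,360 against a fixed baseline, or about 4,720 per group for two independent samples
A 2pp improvement from 85% to 87% requires well over 2,000 test examples to detect with 80% power, and more than 4,000 per model when both accuracies are estimated from data. This is why large test sets matter in production model evaluation.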
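Cohen's h and the resulting sample size can be checked with the standard library alone. In this sketch the `groups` argument is an illustrative addition: 1 treats the baseline as a fixed value, 2 treats both proportions as estimated from independent samples of equal size.

```python
import math
from statistics import NormalDist


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))


def n_for_h(h, alpha=0.05, power=0.80, groups=1):
    """n per group under the normal approximation (groups is 1 or 2)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return math.ceil(groups * z_sum ** 2 / h ** 2)


h = cohens_h(0.85, 0.87)
print(f"h = {h:.4f}")
print("n against a fixed baseline:", n_for_h(h))
print("n per group, two samples:  ", n_for_h(h, groups=2))
```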
Winner's Curse in Underpowered Studies
If a study is underpowered (power=0.30) and finds a significant result, the observed effect is biased upward. The reasoning: for a significant result to occur, the observed effect must be large enough to cross the critical threshold — this selects for atypically large estimates of the true effect.
Concretely: if the true δ=0.02 and the study has power=0.17 (n=6), a significant result would only occur when the observed x̄ > baseline + t* × SE. The estimate x̄ that crosses this bar overestimates the true 0.02 improvement on average.
Practical consequence: underpowered studies that find p<0.05 tend not to replicate at the same magnitude. The effect shrinks toward its true value in larger follow-up studies. This inflation is widely cited as one contributor to the replication crisis.
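The selection effect is easy to see in a simulation. The sketch below is a simplification: it treats σ as known and uses a z-test instead of a t-test, and it draws each study's mean improvement directly from its sampling distribution. The seed and the number of simulated studies are arbitrary choices.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(0)                                  # fixed seed so the run is repeatable
TRUE_DELTA, SIGMA, N = 0.02, 0.048, 6
se = SIGMA / math.sqrt(N)                       # standard error of the mean improvement
threshold = NormalDist().inv_cdf(0.975) * se    # |estimate| needed for p < 0.05

# Each simulated study observes the mean improvement over N folds.
estimates = [random.gauss(TRUE_DELTA, se) for _ in range(100_000)]
significant = [x for x in estimates if abs(x) > threshold]

print(f"share significant:             {len(significant) / len(estimates):.3f}")
print(f"mean of all estimates:         {mean(estimates):.4f}")
print(f"mean of significant estimates: {mean(significant):.4f}")
```

Across all simulated studies the estimate is unbiased, but the subset that reaches significance averages more than twice the true improvement, because an estimate has to exceed the threshold (about 0.038 here) to be counted at all.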
Sensitivity Analysis
Power analysis gives a target, not a single definitive number. Always explore the space:
| δ \ Power | 70% | 80% | 90% | 95% |
|---|---|---|---|---|
| 0.04 | 9 | 12 | 16 | 19 |
| 0.03 | 16 | 21 | 27 | 34 |
| 0.02 | 36 | 46 | 61 | 75 |
| 0.01 | 143 | 181 | 243 | 300 |
(n values for one-sample test from the normal-approximation formula, σ=0.048, α=0.05; exact t-based values run slightly higher)
Report the entire row that matches your minimum effect size. If n=46 is impractical, show stakeholders that accepting δ=0.04 (only care about larger improvements) reduces the requirement to n=12.
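A grid like this takes only a few lines to generate, which makes it cheap to redo for a different σ or α. The sketch below applies the normal-approximation formula from the one-sample section; the names `n_needed` and `grid` are illustrative.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution
SIGMA, ALPHA = 0.048, 0.05
POWERS = [0.70, 0.80, 0.90, 0.95]


def n_needed(delta, power):
    """Rounded-up n for a two-sided one-sample test (normal approximation)."""
    z_sum = Z.inv_cdf(1 - ALPHA / 2) + Z.inv_cdf(power)
    return math.ceil((z_sum * SIGMA / delta) ** 2)


# One row per effect size, one column per target power.
grid = {d: [n_needed(d, p) for p in POWERS] for d in [0.04, 0.03, 0.02, 0.01]}

print("delta | " + " | ".join(f"{p:.0%}".rjust(4) for p in POWERS))
for d, row in grid.items():
    print(f"{d:5.2f} | " + " | ".join(f"{n:4d}" for n in row))
```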
Code and Output
```python
import numpy as np
from statsmodels.stats.power import TTestPower, TTestIndPower
from scipy import stats

sigma = 0.048
delta = 0.02
alpha = 0.05

# One-sample: effect_size in units of sigma (Cohen's d)
cohen_d = delta / sigma
analysis = TTestPower()
n_required = analysis.solve_power(effect_size=cohen_d, alpha=alpha, power=0.80, alternative='two-sided')
print(f"Cohen's d = δ/σ = {cohen_d:.4f}")
print(f"Required n (one-sample, power=0.80): {np.ceil(n_required):.0f}")

# Leaving power unset makes solve_power return the power for the given nobs
power_n6 = analysis.solve_power(effect_size=cohen_d, alpha=alpha, nobs=6, alternative='two-sided')
print(f"Power with n=6: {power_n6:.4f} (only {power_n6*100:.1f}%)")

# Two-sample comparison
analysis_2 = TTestIndPower()
n_two = analysis_2.solve_power(effect_size=cohen_d, alpha=alpha, power=0.80, alternative='two-sided')
print(f"\nRequired n per group (two-sample): {np.ceil(n_two):.0f}")

# Proportions (Cohen's h), normal approximation
p1, p2 = 0.85, 0.87
h = 2 * np.arcsin(np.sqrt(p2)) - 2 * np.arcsin(np.sqrt(p1))
z_alpha = stats.norm.ppf(0.975)
z_beta = stats.norm.ppf(0.80)
# n for one sample against a fixed baseline; double it for n per group
# when comparing two independent samples
n_prop = (z_alpha + z_beta)**2 / h**2
print(f"\nProportions: p1={p1}, p2={p2}")
print(f"Cohen's h = {abs(h):.4f}")
print(f"Required n (vs fixed baseline): {np.ceil(n_prop):.0f}")

# Sensitivity analysis: n for different (δ, power) combinations
print("\nSensitivity analysis (n for one-sample, σ=0.048, α=0.05):")
print(f"{'δ':>6} | {'70%':>6} | {'80%':>6} | {'90%':>6} | {'95%':>6}")
for d in [0.04, 0.03, 0.02, 0.01]:
    ns = []
    for pwr in [0.70, 0.80, 0.90, 0.95]:
        n = analysis.solve_power(effect_size=d/sigma, alpha=alpha, power=pwr, alternative='two-sided')
        ns.append(int(np.ceil(n)))
    print(f"{d:>6.2f} | {ns[0]:>6} | {ns[1]:>6} | {ns[2]:>6} | {ns[3]:>6}")

# Power curve: power at a range of sample sizes
print("\nPower at various n (δ=0.02, σ=0.048, α=0.05):")
for n in [6, 10, 20, 30, 46, 60, 100]:
    pwr = analysis.solve_power(effect_size=cohen_d, alpha=alpha, nobs=n, alternative='two-sided')
    print(f" n={n:>4}: power={pwr:.4f} ({'✓ adequate' if pwr >= 0.80 else '✗ underpowered'})")
```

Output:

```
Cohen's d = δ/σ = 0.4167
Required n (one-sample, power=0.80): 46
Power with n=6: 0.1727 (only 17.3%)

Required n per group (two-sample): 91

Proportions: p1=0.85, p2=0.87
Cohen's h = 0.0577
Required n (vs fixed baseline): 2360

Sensitivity analysis (n for one-sample, σ=0.048, α=0.05):
     δ |    70% |    80% |    90% |    95%
  0.04 |      8 |     12 |     17 |     22
  0.03 |     14 |     21 |     29 |     38
  0.02 |     32 |     46 |     65 |     85
  0.01 |    126 |    182 |    258 |    339

Power at various n (δ=0.02, σ=0.048, α=0.05):
 n=   6: power=0.1727 (✗ underpowered)
 n=  10: power=0.2466 (✗ underpowered)
 n=  20: power=0.4076 (✗ underpowered)
 n=  30: power=0.5526 (✗ underpowered)
 n=  46: power=0.8013 (✓ adequate)
 n=  60: power=0.8982 (✓ adequate)
 n= 100: power=0.9861 (✓ adequate)
```
ANOVA Power
For one-way ANOVA comparing k groups, power depends on Cohen's f, derived from η² as f = √(η² / (1 − η²)). For k groups each with n observations (N = n × k in total):
Non-centrality parameter λ = n × k × f² = N × f²
Power = P(F(k−1, N−k, λ) > F_critical)
Use scipy.stats.ncf (non-central F distribution) or statsmodels.stats.power.FTestAnovaPower for computation. The same principle applies: specify α, f, and target power, then compute n per group. Note that FTestAnovaPower takes the total sample size as nobs, so divide its result by k to get n per group.
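The power expression above translates almost directly into `scipy.stats` calls. The sketch below assumes equal group sizes; the function names are illustrative, and f=0.25 with k=3 is a hypothetical example rather than a value from the model-comparison scenario.

```python
from scipy import stats


def anova_power(f, k, n_per_group, alpha=0.05):
    """Power of a one-way ANOVA with k groups of equal size."""
    n_total = k * n_per_group
    df_between, df_within = k - 1, n_total - k
    lam = n_total * f ** 2                                  # non-centrality parameter
    f_crit = stats.f.ppf(1 - alpha, df_between, df_within)  # rejection threshold under H0
    return stats.ncf.sf(f_crit, df_between, df_within, lam)


def n_per_group_for_power(f, k, alpha=0.05, target=0.80):
    """Smallest equal group size reaching the target power (linear search)."""
    n = 2
    while anova_power(f, k, n, alpha) < target:
        n += 1
    return n


print(n_per_group_for_power(f=0.25, k=3))
```

A linear search is slow for very small effects but keeps the logic transparent; a root finder could replace it if speed matters.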
Test Your Understanding
- You plan to detect a 0.02 accuracy improvement with 80% power at α=0.05, and calculate n=46. Your team can only run 25 folds. Compute the actual power for n=25. What are three strategies to increase power without increasing n?
- The required sample size scales as n ∝ (σ/δ)². If you reduce the detectable effect from δ=0.02 to δ=0.01, how does n change? If you simultaneously improve your evaluation methodology to reduce σ from 0.048 to 0.030, what is the net change in n?
- A published paper detects a significant effect with n=8 (power≈15%). Your team plans a replication with n=25 (power≈35%). Explain why: (a) the original study's significant result likely overestimates the true effect, and (b) your replication has a high probability of failing to replicate even if the true effect is real.
- For binary outcomes, Cohen's h ≈ 0.058 (detecting p₁=0.85 vs p₂=0.87) requires n ≈ 2,360 against a fixed baseline. Explain why this is so much larger than the continuous accuracy case (n=46). What property of proportions near 0.85 makes them harder to distinguish than continuous values?
- You run a power analysis and find n=46 for 80% power. You run the experiment with n=50 and get p=0.06. A stakeholder wants to run 50 more folds and check again. Explain the statistical problem with this approach (sequential testing) and name one method that addresses it.