Statistical Power and Sample Size

May 16, 2026 · 10 min read · By Mohammed Vasim
Statistics · Math · Data Science

The 6-fold CV experiment completed. The test finds p=0.09 and the team shelves the model. But the model might genuinely be 2 points better — the experiment simply never had enough data to find out. Power analysis prevents this: it tells you how many observations you need before running the experiment, not after.

The Four Quantities

Every hypothesis test is governed by four interconnected quantities. Knowing three determines the fourth:

  1. α (Type I error rate): probability of rejecting H₀ when it is true. Set by the researcher (convention: 0.05).
  2. Power = 1 − β: probability of correctly rejecting H₀ when H₁ is true. Target: 0.80 or higher.
  3. δ (effect size): the minimum difference worth detecting. Smaller δ → harder to detect → requires more data.
  4. n (sample size): number of observations per group.

[Figure: The four quantities. Fix any three and the fourth is determined: α (Type I error rate, set by you), δ (effect size, the minimum you care about), n (sample size, determined), Power = 1 − β (target 0.80).]
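
To make the interdependence concrete, here is a minimal sketch that fixes α, power, and n and solves for the fourth quantity, the smallest detectable δ. It assumes the normal approximation (σ treated as known) and borrows σ=0.048, the pilot SD used throughout this post. The function name `min_detectable_delta` is illustrative and not part of any library.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_delta(alpha: float, power: float, n: int, sigma: float) -> float:
    """Smallest effect detectable at the given alpha, power and n (normal approximation)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = std_normal.inv_cdf(power)           # quantile for the target power
    return (z_alpha + z_beta) * sigma / sqrt(n)


# With only 6 folds, how large must an improvement be to be detected 80% of the time?
delta_n6 = min_detectable_delta(alpha=0.05, power=0.80, n=6, sigma=0.048)
print(f"Minimum detectable delta at n=6: {delta_n6:.4f}")
```

Under these assumptions, six folds can only reliably detect improvements of roughly 5.5 accuracy points, far larger than the 2 points in the opening scenario.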

The Anchor

Planning to detect a real improvement of δ=0.02 in model accuracy over baseline μ₀=0.85:

σ = 0.048      # SD of CV fold accuracy (from pilot 6-fold data)
δ = 0.02       # minimum meaningful improvement to detect
α = 0.05       # significance level (two-tailed)
power = 0.80   # target: detect the effect 80% of the time

Sample Size Formula: One-Sample t-Test

n = ((z_α/2 + z_β) × σ / δ)²

Components:

  • z_α/2 = Φ⁻¹(0.975) = 1.960 (two-tailed at α=0.05)
  • z_β = Φ⁻¹(0.80) = 0.842 (80% power)
  • σ = 0.048, δ = 0.02

Step-by-step:

| Step | Formula | Result |
|---|---|---|
| Sum of z-scores | z_α/2 + z_β = 1.960 + 0.842 | 2.802 |
| Squared | (2.802)² | 7.851 |
| σ/δ ratio | 0.048 / 0.02 | 2.400 |
| (σ/δ)² | (2.400)² | 5.760 |
| n | 7.851 × 5.760 | 45.2 → round up to 46 |

Conclusion: 46 CV fold accuracy scores are needed to detect a 0.02 improvement with 80% power at α=0.05.
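
The step-by-step calculation can be checked with a few lines of standard-library Python. This is a sketch of the normal-approximation formula only; the variable names are illustrative.

```python
from math import ceil
from statistics import NormalDist

sigma = 0.048   # SD of CV fold accuracy (pilot data)
delta = 0.02    # minimum improvement to detect
alpha = 0.05    # two-tailed significance level
power = 0.80    # target power

std_normal = NormalDist()
z_alpha = std_normal.inv_cdf(1 - alpha / 2)   # about 1.960
z_beta = std_normal.inv_cdf(power)            # about 0.842

z_sum_sq = (z_alpha + z_beta) ** 2            # about 7.85
ratio_sq = (sigma / delta) ** 2               # 5.76
n_exact = z_sum_sq * ratio_sq                 # about 45.2
n_required = ceil(n_exact)                    # always round up, never down

print(f"z_alpha/2 = {z_alpha:.3f}, z_beta = {z_beta:.3f}")
print(f"n = {n_exact:.1f} -> {n_required}")
```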

Current data has n=6 → dramatically underpowered. Power with n=6:

ncp = δ × √n / σ = 0.02 × √6 / 0.048 = 0.02 × 2.449 / 0.048 = 1.020
Power ≈ Φ(1.020 − 1.960) = Φ(−0.940) = 0.174

That is only a 17.4% chance of detecting the real improvement.
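
The same power approximation can be wrapped in a small helper. This is a sketch that treats σ as known and ignores the negligible chance of rejecting in the wrong direction; `approx_power` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n: int, delta: float, sigma: float, alpha: float = 0.05) -> float:
    """Approximate two-sided power: Phi(ncp - z_{alpha/2})."""
    std_normal = NormalDist()
    ncp = delta * sqrt(n) / sigma                  # noncentrality parameter
    z_crit = std_normal.inv_cdf(1 - alpha / 2)     # two-tailed critical value
    return std_normal.cdf(ncp - z_crit)


power_n6 = approx_power(6, delta=0.02, sigma=0.048)
power_n46 = approx_power(46, delta=0.02, sigma=0.048)
print(f"n=6:  power = {power_n6:.3f}")
print(f"n=46: power = {power_n46:.3f}")
```

Because n was rounded up from 45.2 to 46, the approximate power at n=46 lands slightly above the 0.80 target rather than exactly on it.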

Power Curves

[Figure: Power curves for the anchor (σ=0.048, α=0.05). Left: power vs n (sample size), marking n=6 (17% power) and n=46 (80% power). Right: power vs δ (effect size) with n=50 fixed, marking δ=0.02 and the 80% line.]

The Square Law

Halving the detectable effect quadruples the required sample size:

n = (z_α/2 + z_β)² × (σ/δ)² ∝ 1/δ²

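| δ | σ/δ | n required |
|---|---|---|
| 0.04 | 1.2 | 12 |
| 0.03 | 1.6 | 21 |
| 0.02 | 2.4 | 46 |
| 0.01 | 4.8 | 181 |

From δ=0.02 to δ=0.01, the unrounded n grows from 45.2 to 180.9, exactly 4× (46 → 181 after rounding up). This is fundamental: smaller effects require disproportionately more data to detect reliably.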
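
A quick sketch of the square law under the same normal approximation: it evaluates the formula for each δ and checks that halving δ multiplies the unrounded n by four. The helper name `n_exact` is illustrative.

```python
from math import ceil
from statistics import NormalDist

sigma, alpha, power = 0.048, 0.05, 0.80
std_normal = NormalDist()
z_sum_sq = (std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)) ** 2


def n_exact(delta: float) -> float:
    """Unrounded one-sample n under the normal approximation."""
    return z_sum_sq * (sigma / delta) ** 2


deltas = [0.04, 0.03, 0.02, 0.01]
rounded = [ceil(n_exact(d)) for d in deltas]
for d, n in zip(deltas, rounded):
    print(f"delta={d:.2f}  sigma/delta={sigma / d:.1f}  n={n}")

# Halving delta quadruples n before rounding.
ratio = n_exact(0.01) / n_exact(0.02)
print(f"n(0.01) / n(0.02) = {ratio:.2f}")
```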

Two-Sample t-Test (Comparing Two Models)

For comparing two models (each with n folds):

n_per_group = 2 × ((z_α/2 + z_β) × s_pooled / δ)²

The factor 2 arises because the difference between two independent group means has variance 2σ²/n, twice that of a single mean. With equal SDs:

n_per_group = 2 × 7.851 × (0.048/0.02)² = 2 × 7.851 × 5.76 = 90.4 → 91 per group

Comparing two models requires roughly twice as many folds per model (four times as many in total) as testing one model against a fixed baseline. With n=6 per model (12 total folds), the power for δ=0.02 is only:

ncp = δ × √(n/2) / σ = 0.02 × √3 / 0.048 = 0.722
Power ≈ Φ(0.722 − 1.960) = Φ(−1.238) ≈ 0.108, about 11%: critically underpowered.
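
As a rough check, the two-sample numbers can be reproduced with the same normal approximation used for the one-sample case. This sketch assumes equal SDs and equal group sizes; the variable names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

sigma, delta, alpha, power = 0.048, 0.02, 0.05, 0.80
std_normal = NormalDist()
z_alpha = std_normal.inv_cdf(1 - alpha / 2)
z_beta = std_normal.inv_cdf(power)

# Required folds per model: twice the one-sample requirement.
n_per_group_exact = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
n_per_group = ceil(n_per_group_exact)

# Power with 6 folds per model: the standard error of the difference
# between two independent means is sigma * sqrt(2 / n).
n_current = 6
ncp = delta / (sigma * sqrt(2 / n_current))
power_current = std_normal.cdf(ncp - z_alpha)

print(f"n per group: {n_per_group_exact:.1f} -> {n_per_group}")
print(f"ncp = {ncp:.3f}, power at n=6 per group = {power_current:.3f}")
```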

Proportions: Cohen's h

For binary outcomes (classification accuracy as proportion), Cohen's h transforms the proportions before computing the effect size:

h = 2 × arcsin(√p₂) − 2 × arcsin(√p₁)

n = (z_α/2 + z_β)² / h²

Example: p₁=0.85 (baseline), p₂=0.87 (new model).

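h = 2 × arcsin(√0.87) − 2 × arcsin(√0.85) = 2 × 1.2019 − 2 × 1.1731 = 2.4039 − 2.3462 = 0.0577

n = (2.802)² / (0.0577)² = 7.851 / 0.00333 ≈ 2,360

A 2pp improvement from 85% to 87% requires roughly 2,400 test samples to detect with 80% power against a fixed baseline. Comparing two independently evaluated models doubles that to about 4,700 per group, for the same reason as the factor of 2 in the two-sample t-test. This is why large test sets matter in production model evaluation.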
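
To see what the arcsine transform does, the sketch below computes Cohen's h for the same 2pp gap at three baselines. The 0.50 and 0.95 baselines are illustrative additions, and n uses the formula above (one sample against a fixed baseline, normal approximation).

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))


std_normal = NormalDist()
z_sum_sq = (std_normal.inv_cdf(0.975) + std_normal.inv_cdf(0.80)) ** 2

results = {}
for p1 in (0.50, 0.85, 0.95):      # 0.50 and 0.95 are illustrative baselines
    p2 = p1 + 0.02                 # same 2pp improvement each time
    h = cohens_h(p1, p2)
    n = ceil(z_sum_sq / h ** 2)
    results[p1] = (h, n)
    print(f"p1={p1:.2f} -> p2={p2:.2f}: h={h:.4f}, n={n}")
```

The same 2pp gap yields a larger h, and therefore a smaller n, as the baseline moves toward 1, because a proportion's variance p(1 − p) shrinks near the extremes.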

Winner's Curse in Underpowered Studies

If a study is underpowered (power=0.30) and finds a significant result, the observed effect is biased upward. The reasoning: for a significant result to occur, the observed effect must be large enough to cross the critical threshold — this selects for atypically large estimates of the true effect.

Concretely: if the true δ=0.02 and the study has power=0.17 (n=6), a significant result would only occur when the observed x̄ > baseline + t* × SE. The estimate x̄ that crosses this bar overestimates the true 0.02 improvement on average.

Practical consequence: underpowered studies that find p<0.05 tend not to replicate at the same magnitude. The effect shrinks toward its true value in larger follow-up studies. This explains a large fraction of the replication crisis.
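
A small simulation makes the selection effect visible. This is a sketch under simplifying assumptions: σ is treated as known (a z-test rather than a t-test), only positive significant results are kept, and the seed and trial count are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(0)  # reproducible illustration

true_delta, sigma, n, alpha = 0.02, 0.048, 6, 0.05
se = sigma / sqrt(n)                            # standard error of the mean improvement
z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # two-tailed critical value

trials = 100_000
significant = []
for _ in range(trials):
    # One simulated experiment: the observed mean improvement over baseline.
    observed = random.gauss(true_delta, se)
    if observed > z_crit * se:                  # significant, in the positive direction
        significant.append(observed)

share_significant = len(significant) / trials
mean_significant = mean(significant)
print(f"share of significant runs: {share_significant:.3f}")
print(f"mean estimate among significant runs: {mean_significant:.4f} (true delta = {true_delta})")
```

Every run that clears the significance bar reports an improvement above z* × SE ≈ 0.038, so the average among "winners" is necessarily well above the true 0.02.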

[Figure: Null and alternative sampling distributions for the same true effect δ, with the critical value z* marked. Left: underpowered, n=6, power=17%, β=83%. Right: adequate power, n=46, power=80%, β=20%. With n=46, the null and alternative distributions separate enough to reliably detect δ=0.02.]

Sensitivity Analysis

Power analysis gives a target, not a single definitive number. Always explore the space:

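| δ \ Power | 70% | 80% | 90% | 95% |
|---|---|---|---|---|
| 0.04 | 9 | 12 | 16 | 19 |
| 0.03 | 16 | 21 | 27 | 34 |
| 0.02 | 36 | 46 | 61 | 75 |
| 0.01 | 143 | 181 | 243 | 300 |

(n values for the one-sample test from the normal-approximation formula n = ((z_α/2 + z_β) × σ / δ)², rounded up; σ=0.048, α=0.05, two-tailed)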

Report the entire row that matches your minimum effect size. If n=46 is impractical, show stakeholders that accepting δ=0.04 (only care about larger improvements) reduces the requirement to n=12.

Code and Output

python
import numpy as np
from statsmodels.stats.power import TTestPower, TTestIndPower
from scipy import stats

sigma = 0.048
delta = 0.02
alpha = 0.05

# One-sample: effect_size in units of sigma (Cohen's d)
cohen_d = delta / sigma
analysis = TTestPower()

n_required = analysis.solve_power(effect_size=cohen_d, alpha=alpha, power=0.80, alternative='two-sided')
print(f"Cohen's d = δ/σ = {cohen_d:.4f}")
print(f"Required n (one-sample, power=0.80): {np.ceil(n_required):.0f}")

power_n6 = analysis.solve_power(effect_size=cohen_d, alpha=alpha, nobs=6, alternative='two-sided')
print(f"Power with n=6: {power_n6:.4f}  (only {power_n6*100:.1f}%)")

# Two-sample comparison
analysis_2 = TTestIndPower()
n_two = analysis_2.solve_power(effect_size=cohen_d, alpha=alpha, power=0.80, alternative='two-sided')
print(f"\nRequired n per group (two-sample): {np.ceil(n_two):.0f}")

# Proportions (Cohen's h)
p1, p2 = 0.85, 0.87
h = 2 * np.arcsin(np.sqrt(p2)) - 2 * np.arcsin(np.sqrt(p1))
z_alpha = stats.norm.ppf(0.975)
z_beta  = stats.norm.ppf(0.80)
n_prop = (z_alpha + z_beta)**2 / h**2
print(f"\nProportions: p1={p1}, p2={p2}")
print(f"Cohen's h = {abs(h):.4f}")
print(f"Required n per group: {np.ceil(n_prop):.0f}")

# Sensitivity analysis: n for different (δ, power) combinations
print("\nSensitivity analysis (n for one-sample, σ=0.048, α=0.05):")
print(f"{'δ':>6} | {'70%':>6} | {'80%':>6} | {'90%':>6} | {'95%':>6}")
for d in [0.04, 0.03, 0.02, 0.01]:
    ns = []
    for pwr in [0.70, 0.80, 0.90, 0.95]:
        n = analysis.solve_power(effect_size=d/sigma, alpha=alpha, power=pwr, alternative='two-sided')
        ns.append(int(np.ceil(n)))
    print(f"{d:>6.2f} | {ns[0]:>6} | {ns[1]:>6} | {ns[2]:>6} | {ns[3]:>6}")

# Power curve (manual, for illustration)
print("\nPower at various n (δ=0.02, σ=0.048, α=0.05):")
for n in [6, 10, 20, 30, 46, 60, 100]:
    pwr = analysis.solve_power(effect_size=cohen_d, alpha=alpha, nobs=n, alternative='two-sided')
    print(f"  n={n:>4}: power={pwr:.4f}  ({'✓ adequate' if pwr >= 0.80 else '✗ underpowered'})")

Output:

Cohen's d = δ/σ = 0.4167
Required n (one-sample, power=0.80): 46
Power with n=6: 0.1727  (only 17.3%)

Required n per group (two-sample): 91

Proportions: p1=0.85, p2=0.87
Cohen's h = 0.0577
Required n per group: 2360

Sensitivity analysis (n for one-sample, σ=0.048, α=0.05):
     δ |    70% |    80% |    90% |    95%
  0.04 |      8 |     12 |     17 |     22
  0.03 |     14 |     21 |     29 |     38
  0.02 |     32 |     46 |     65 |     85
  0.01 |    126 |    182 |    258 |    339

Power at various n (δ=0.02, σ=0.048, α=0.05):
  n=   6: power=0.1727  (✗ underpowered)
  n=  10: power=0.2466  (✗ underpowered)
  n=  20: power=0.4076  (✗ underpowered)
  n=  30: power=0.5526  (✗ underpowered)
  n=  46: power=0.8013  (✓ adequate)
  n=  60: power=0.8982  (✓ adequate)
  n= 100: power=0.9861  (✓ adequate)

Note that TTestPower and TTestIndPower solve with the noncentral t distribution rather than the normal approximation used in the hand calculations, so figures from a fresh run can differ from the z-based values in the earlier sections: typically a few more observations required, and lower power at small n.

ANOVA Power

For one-way ANOVA comparing k groups: power depends on Cohen's f (f = η/√(1−η²), derived from η²). For k groups each with n observations:

Non-centrality parameter λ = n × k × f²

Power = P(F(k−1, N−k, λ) > F_critical)

Use scipy.stats.ncf (non-central F distribution) or statsmodels.stats.power.FTestAnovaPower for computation. The same principle applies: specify α, f, and target power → compute n per group.
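
A minimal sketch of that computation with `scipy.stats.f` and `scipy.stats.ncf`. The values f=0.25 (Cohen's conventional "medium" effect) and k=3 groups are illustrative assumptions, not taken from the experiment above, and `anova_power` is a hypothetical helper name.

```python
from scipy import stats


def anova_power(f: float, k: int, n_per_group: int, alpha: float = 0.05) -> float:
    """Power of a one-way ANOVA with k equal-sized groups and effect size Cohen's f."""
    total_n = k * n_per_group
    df_between, df_within = k - 1, total_n - k
    lam = total_n * f ** 2                                   # noncentrality: n * k * f^2
    f_crit = stats.f.ppf(1 - alpha, df_between, df_within)   # rejection threshold under H0
    # Probability that a noncentral F exceeds the threshold under H1.
    return float(stats.ncf.sf(f_crit, df_between, df_within, lam))


# Illustrative scenario: 3 groups, medium effect.
power_20 = anova_power(f=0.25, k=3, n_per_group=20)
power_53 = anova_power(f=0.25, k=3, n_per_group=53)
power_100 = anova_power(f=0.25, k=3, n_per_group=100)
for n, p in [(20, power_20), (53, power_53), (100, power_100)]:
    print(f"n per group = {n:>3}: power = {p:.3f}")
```

To solve for n instead, increase `n_per_group` until the returned power reaches the target.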

Test Your Understanding

  1. You plan to detect a 0.02 accuracy improvement with 80% power at α=0.05, and calculate n=46. Your team can only run 25 folds. Compute the actual power for n=25. What are three strategies to increase power without increasing n?

  2. The required sample size scales as n ∝ (σ/δ)². If you reduce the detectable effect from δ=0.02 to δ=0.01, how does n change? If you simultaneously improve your evaluation methodology to reduce σ from 0.048 to 0.030, what is the net change in n?

  3. A published paper detects a significant effect with n=8 (power≈15%). Your team plans a replication with n=25 (power≈35%). Explain why: (a) the original study's significant result likely overestimates the true effect, and (b) your replication has a high probability of failing to replicate even if the true effect is real.

  4. For binary outcomes, Cohen's h ≈ 0.058 (detecting p₁=0.85 vs p₂=0.87) requires n ≈ 2,360. Explain why this is so much larger than the continuous accuracy case (n=46). What property of proportions near 0.85 makes them harder to distinguish than continuous values?

  5. You run a power analysis and find n=46 for 80% power. You run the experiment with n=50 and get p=0.06. A stakeholder wants to run 50 more folds and check again. Explain the statistical problem with this approach (sequential testing) and name one method that addresses it.
