Statistical Power and Sample Size

May 16, 2026 • 10 min read • By Mohammed Vasim

Tags: Statistics, Math, Data Science

A 6-fold CV experiment finishes, the test returns p=0.09, and the team shelves the model. But the model might genuinely be 2 points better: the experiment simply never had enough data to find out. Power analysis prevents this by telling you how many observations you need before you run the experiment, not after.

The Four Quantities

Every hypothesis test is governed by four interconnected quantities. Knowing three determines the fourth:

α (Type I error rate): probability of rejecting H₀ when it is true. Set by the researcher (convention: 0.05).
Power = 1 − β: probability of correctly rejecting H₀ when H₁ is true. Target: 0.80 or higher.
δ (effect size): the minimum difference worth detecting. Smaller δ → harder to detect → requires more data.
n (sample size): number of observations per group.

The Anchor

Planning to detect a real improvement of δ=0.02 in model accuracy over baseline μ₀=0.85:

σ = 0.048    # SD of CV fold accuracy (from pilot 6-fold data)
δ = 0.02     # minimum meaningful improvement to detect
α = 0.05     # significance level (two-tailed)
power = 0.80 # target: detect the effect 80% of the time
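
"Knowing three determines the fourth" can be made concrete by solving for the one quantity the later sections do not compute: the smallest effect a given n can detect. The sketch below assumes the normal approximation (z quantiles rather than t quantiles), and the function name min_detectable_effect is illustrative, not part of any library.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_effect(n, sigma, alpha=0.05, power=0.80):
    """Smallest true difference detectable with n observations.

    Normal approximation: delta = (z_{alpha/2} + z_beta) * sigma / sqrt(n).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = std_normal.inv_cdf(power)           # quantile for the target power
    return (z_alpha + z_beta) * sigma / sqrt(n)


# Anchor values: sigma = 0.048 and the 6 folds already collected
mde_6 = min_detectable_effect(n=6, sigma=0.048)
print(f"Smallest detectable improvement at n=6: {mde_6:.3f}")
```

Under these assumptions, six folds can reliably detect an improvement of roughly 0.055, nearly three times the 0.02 the team actually cares about.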

Sample Size Formula: One-Sample t-Test

Using the normal approximation to the t distribution:

n = ((z_α/2 + z_β) × σ / δ)²

Components:

z_α/2 = Φ⁻¹(0.975) = 1.960 (two-tailed at α=0.05)
z_β = Φ⁻¹(0.80) = 0.842 (80% power)
σ = 0.048, δ = 0.02

Step-by-step:

Step	Formula	Result
Sum of z-scores	z_α/2 + z_β = 1.960 + 0.842	2.802
Squared	(2.802)²	7.851
σ/δ ratio	0.048 / 0.02	2.400
(σ/δ)²	(2.400)²	5.760
n =	7.851 × 5.760	45.2 → round up to 46

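Conclusion: about 46 CV fold accuracy scores are needed to detect a 0.02 improvement with 80% power at α=0.05. Because the formula uses normal quantiles in place of t quantiles, an exact t-based calculation asks for slightly more, roughly two extra observations.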
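
The step-by-step table can be reproduced in a few lines. This is a minimal sketch of the normal-approximation formula only; required_n is a hypothetical helper name, and it uses the Python standard library rather than statsmodels.

```python
from math import ceil
from statistics import NormalDist


def required_n(delta, sigma, alpha=0.05, power=0.80):
    """Return (unrounded n, n rounded up) for a one-sample test.

    Normal approximation: n = ((z_{alpha/2} + z_beta) * sigma / delta) ** 2.
    """
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    n_raw = (z_sum * sigma / delta) ** 2
    return n_raw, ceil(n_raw)


n_raw, n_needed = required_n(delta=0.02, sigma=0.048)
print(f"raw n = {n_raw:.1f}, rounded up = {n_needed}")
```

Always round up: rounding down would leave the study just short of the target power.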

The current data has n=6, which is dramatically underpowered. Power with n=6, using the same normal approximation:

ncp = δ × √n / σ = 0.02 × √6 / 0.048 = 0.02 × 2.449 / 0.048 = 1.021
Power ≈ Φ(1.021 − 1.960) = Φ(−0.939) ≈ 0.174

That is only about a 17% chance of detecting the real improvement.
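
The same power calculation as a small function. It is a sketch under the normal approximation, so it will read slightly higher than an exact t-based figure at small n. The name approx_power is illustrative. The second return value adds the small probability of rejecting in the wrong direction, which the hand calculation ignores.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n, delta, sigma, alpha=0.05):
    """Approximate power of a two-sided one-sample test (normal approximation).

    Returns (upper_tail, two_sided): the chance of rejecting in the direction
    of the true effect, and that chance plus the wrong-direction tail.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    ncp = delta * sqrt(n) / sigma                 # non-centrality parameter
    upper = std_normal.cdf(ncp - z_crit)          # reject, correct direction
    lower = std_normal.cdf(-ncp - z_crit)         # reject, wrong direction
    return upper, upper + lower


upper_tail, two_sided = approx_power(n=6, delta=0.02, sigma=0.048)
print(f"Approximate power at n=6: {upper_tail:.3f}")
```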

Power Curves

The Square Law

Halving the detectable effect quadruples the required sample size:

n = (z_α/2 + z_β)² × (σ/δ)² ∝ 1/δ²

δ	σ/δ	n required
0.04	1.2	12
0.03	1.6	21
0.02	2.4	46
0.01	4.8	181

From δ=0.02 to δ=0.01, the unrounded n grows from 45.2 to 180.8, exactly 4× (46 to 181 after rounding up). This is fundamental: smaller effects require disproportionately more data to detect reliably.
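
The factor of four follows directly from the formula. A short derivation, using the same symbols as above and assuming α, power, and σ are held fixed:

```latex
\frac{n(\delta/2)}{n(\delta)}
  = \frac{(z_{\alpha/2} + z_\beta)^2 \, \sigma^2 / (\delta/2)^2}
         {(z_{\alpha/2} + z_\beta)^2 \, \sigma^2 / \delta^2}
  = \frac{\delta^2}{(\delta/2)^2}
  = 4
```

More generally, shrinking δ by a factor c multiplies n by c². The ratio is exact for the unrounded n; rounding each entry up to a whole observation can move the tabulated ratio slightly off 4.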

Two-Sample t-Test (Comparing Two Models)

For comparing two models (each with n folds):

n_per_group = 2 × ((z_α/2 + z_β) × s_pooled / δ)²

The factor 2 arises because the difference between two independent means has variance 2σ²/n, twice that of a single mean. With equal SDs:

n_per_group = 2 × 7.851 × (0.048/0.02)² = 2 × 7.851 × 5.76 = 90.4 → 91 per group

Comparing two models requires roughly twice as many folds per model, and so four times as many in total, as testing one model against a fixed baseline. With n=6 per model (12 total folds), the power for δ=0.02 is only:

ncp = δ × √(n/2) / σ = 0.02 × √3 / 0.048 = 0.722
Power ≈ Φ(0.722 − 1.960) = Φ(−1.238) ≈ 0.108

About 11%: critically underpowered.
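
Where the factor 2 and the √(n/2) in the ncp come from, sketched under the assumption of two independent groups with a common SD σ and n observations each:

```latex
\operatorname{Var}(\bar{x}_1 - \bar{x}_2)
  = \frac{\sigma^2}{n} + \frac{\sigma^2}{n}
  = \frac{2\sigma^2}{n}
\quad\Longrightarrow\quad
\text{ncp} = \frac{\delta}{\sigma\sqrt{2/n}} = \frac{\delta\sqrt{n/2}}{\sigma}

\text{ncp} = z_{\alpha/2} + z_\beta
\quad\Longrightarrow\quad
n = 2\left(\frac{(z_{\alpha/2} + z_\beta)\,\sigma}{\delta}\right)^2
```

If both models are scored on the same folds, the observations are paired rather than independent, and this derivation does not apply as written.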

Proportions: Cohen's h

For binary outcomes (classification accuracy as proportion), Cohen's h transforms the proportions before computing the effect size:

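h = 2 × arcsin(√p₂) − 2 × arcsin(√p₁)

n_per_group = 2 × (z_α/2 + z_β)² / h²

The factor 2 appears for the same reason as in the two-sample t-test. Against a fixed, known baseline (a one-sample test) it drops out.

Example: p₁=0.85 (baseline), p₂=0.87 (new model).

h = 2 × (arcsin(√0.87) − arcsin(√0.85)) = 2 × (1.20193 − 1.17310) = 2 × 0.02883 ≈ 0.0577

n_per_group = 2 × 7.849 / 0.003326 ≈ 4,720 (using unrounded z-scores)

A 2pp improvement from 85% to 87% requires roughly 4,700 samples per group to detect with 80% power, or about 2,360 if the new model is tested against a fixed 85% baseline. This is why large test sets matter in production model evaluation.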
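
A quick cross-check of the Cohen's h calculation against the untransformed two-proportion formula, n = z² × (p₁q₁ + p₂q₂) / (p₂ − p₁)². This sketch assumes two independent groups of equal size and the normal approximation; the function names are illustrative.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def cohens_h(p1, p2):
    """Effect size for two proportions via the arcsine transform."""
    return 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))


def z_sum_squared(alpha=0.05, power=0.80):
    std_normal = NormalDist()
    return (std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)) ** 2


def n_per_group_from_h(h, alpha=0.05, power=0.80):
    """Two independent groups: n = 2 * (z_{alpha/2} + z_beta)^2 / h^2."""
    return ceil(2 * z_sum_squared(alpha, power) / h ** 2)


def n_per_group_direct(p1, p2, alpha=0.05, power=0.80):
    """Same design without the transform, using each group's own variance."""
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z_sum_squared(alpha, power) * variance_sum / (p2 - p1) ** 2)


h = cohens_h(0.85, 0.87)
n_h = n_per_group_from_h(h)
n_direct = n_per_group_direct(0.85, 0.87)
print(f"h = {h:.4f}, n via h = {n_h}, n via direct formula = {n_direct}")
```

The two routes agree to within a few observations. The direct formula also shows where the large n comes from: each binary outcome has an SD of √(p(1−p)) ≈ 0.35, far larger than the 0.048 SD of fold-level accuracy scores.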

Winner's Curse in Underpowered Studies

If a study is underpowered (power=0.30) and finds a significant result, the observed effect is biased upward. The reasoning: for a significant result to occur, the observed effect must be large enough to cross the critical threshold — this selects for atypically large estimates of the true effect.

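Concretely: if the true δ=0.02 and the study has power=0.17 (n=6), a significant result occurs only when the observed x̄ > baseline + t* × SE. With a sample SD near 0.048, SE ≈ 0.048/√6 ≈ 0.0196 and t* = 2.571 (5 degrees of freedom), so the observed improvement must exceed about 0.050, two and a half times the true effect. The estimates x̄ that cross this bar overestimate the true 0.02 improvement on average.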

Practical consequence: underpowered studies that find p<0.05 tend not to replicate at the same magnitude. The effect shrinks toward its true value in larger follow-up studies. This is a widely cited contributor to the replication crisis.
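
A small Monte Carlo run makes the bias visible. This is a sketch under stated assumptions: fold scores are drawn as independent normal values with the anchor's δ and σ, each simulated experiment runs a one-sample t-test on 6 improvements, and only results significant in the positive direction are kept. The critical value 2.571 is hard-coded for 5 degrees of freedom and would need changing for any other n.

```python
import random
from math import sqrt
from statistics import mean, stdev

TRUE_DELTA = 0.02   # true improvement over baseline
SIGMA = 0.048       # SD of fold accuracy
N_FOLDS = 6
T_CRIT = 2.571      # two-sided 5% critical value of t, 5 degrees of freedom


def simulate(n_experiments=20_000, seed=0):
    """Return (share of significant experiments, mean estimate among them)."""
    rng = random.Random(seed)
    significant_estimates = []
    for _ in range(n_experiments):
        # Observed improvements over baseline in one experiment
        diffs = [rng.gauss(TRUE_DELTA, SIGMA) for _ in range(N_FOLDS)]
        estimate = mean(diffs)
        std_err = stdev(diffs) / sqrt(N_FOLDS)
        if estimate / std_err > T_CRIT:  # significant improvement
            significant_estimates.append(estimate)
    share = len(significant_estimates) / n_experiments
    return share, mean(significant_estimates)


share_significant, mean_significant = simulate()
print(f"Share significant: {share_significant:.3f}")
print(f"True effect: {TRUE_DELTA:.3f}, mean significant estimate: {mean_significant:.3f}")
```

Only a small share of the simulated experiments reach significance, and the average estimate among those sits well above the true 0.02.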

Sensitivity Analysis

Power analysis gives a target, not a single definitive number. Always explore the space:

δ \ Power	70%	80%	90%	95%
0.04	9	12	16	19
0.03	16	21	27	34
0.02	36	46	61	75
0.01	143	181	243	300

(n values for one-sample test from the normal-approximation formula, σ=0.048, α=0.05; exact t-based values run about two observations higher)

Report the entire row that matches your minimum effect size. If n=46 is impractical, show stakeholders that accepting δ=0.04 (only care about larger improvements) reduces the requirement to n=12.

Code and Output

python

import numpy as np
from statsmodels.stats.power import TTestPower, TTestIndPower
from scipy import stats

sigma = 0.048
delta = 0.02
alpha = 0.05

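# One-sample: effect_size in units of sigma (Cohen's d).
# solve_power works with the noncentral t distribution, so its results can
# differ slightly from the normal-approximation hand calculation.
cohen_d = delta / sigma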
analysis = TTestPower()

n_required = analysis.solve_power(effect_size=cohen_d, alpha=alpha, power=0.80, alternative='two-sided')
print(f"Cohen's d = δ/σ = {cohen_d:.4f}")
print(f"Required n (one-sample, power=0.80): {np.ceil(n_required):.0f}")

power_n6 = analysis.solve_power(effect_size=cohen_d, alpha=alpha, nobs=6, alternative='two-sided')
print(f"Power with n=6: {power_n6:.4f}  (only {power_n6*100:.1f}%)")

# Two-sample comparison
analysis_2 = TTestIndPower()
n_two = analysis_2.solve_power(effect_size=cohen_d, alpha=alpha, power=0.80, alternative='two-sided')
print(f"\nRequired n per group (two-sample): {np.ceil(n_two):.0f}")

# Proportions (Cohen's h), two independent groups of equal size
p1, p2 = 0.85, 0.87
h = 2 * np.arcsin(np.sqrt(p2)) - 2 * np.arcsin(np.sqrt(p1))
z_alpha = stats.norm.ppf(0.975)
z_beta  = stats.norm.ppf(0.80)
# Factor 2: the variance of a difference between two independent groups
n_prop = 2 * (z_alpha + z_beta)**2 / h**2
print(f"\nProportions: p1={p1}, p2={p2}")
print(f"Cohen's h = {abs(h):.4f}")
print(f"Required n per group: {np.ceil(n_prop):.0f}")

# Sensitivity analysis: n for different (δ, power) combinations
print("\nSensitivity analysis (n for one-sample, σ=0.048, α=0.05):")
print(f"{'δ':>6} | {'70%':>6} | {'80%':>6} | {'90%':>6} | {'95%':>6}")
for d in [0.04, 0.03, 0.02, 0.01]:
    ns = []
    for pwr in [0.70, 0.80, 0.90, 0.95]:
        n = analysis.solve_power(effect_size=d/sigma, alpha=alpha, power=pwr, alternative='two-sided')
        ns.append(int(np.ceil(n)))
    print(f"{d:>6.2f} | {ns[0]:>6} | {ns[1]:>6} | {ns[2]:>6} | {ns[3]:>6}")

# Power at several sample sizes (points on the power curve)
print("\nPower at various n (δ=0.02, σ=0.048, α=0.05):")
for n in [6, 10, 20, 30, 46, 60, 100]:
    pwr = analysis.solve_power(effect_size=cohen_d, alpha=alpha, nobs=n, alternative='two-sided')
    print(f"  n={n:>4}: power={pwr:.4f}  ({'✓ adequate' if pwr >= 0.80 else '✗ underpowered'})")

Output:

Cohen's d = δ/σ = 0.4167
Required n (one-sample, power=0.80): 46
Power with n=6: 0.1727  (only 17.3%)

Required n per group (two-sample): 91

Proportions: p1=0.85, p2=0.87
Cohen's h = 0.0577
Required n per group: 4720

Sensitivity analysis (n for one-sample, σ=0.048, α=0.05):
     δ |    70% |    80% |    90% |    95%
  0.04 |      8 |     12 |     17 |     22
  0.03 |     14 |     21 |     29 |     38
  0.02 |     32 |     46 |     65 |     85
  0.01 |    126 |    182 |    258 |    339

Power at various n (δ=0.02, σ=0.048, α=0.05):
  n=   6: power=0.1727  (✗ underpowered)
  n=  10: power=0.2466  (✗ underpowered)
  n=  20: power=0.4076  (✗ underpowered)
  n=  30: power=0.5526  (✗ underpowered)
  n=  46: power=0.8013  (✓ adequate)
  n=  60: power=0.8982  (✓ adequate)
  n= 100: power=0.9861  (✓ adequate)

ANOVA Power

For one-way ANOVA comparing k groups: power depends on Cohen's f (f = η/√(1−η²), derived from η²). For k groups each with n observations:

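Non-centrality parameter λ = n × k × f² = N × f², where N = n × k is the total sample size

Power = P(F(k−1, N−k, λ) > F_critical), where F_critical is the 1 − α quantile of the central F(k−1, N−k) distribution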

Use scipy.stats.ncf (non-central F distribution) or statsmodels.stats.power.FTestAnovaPower for computation. The same principle applies: specify α, f, and target power → compute n per group.
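
A minimal sketch of the ANOVA power calculation with scipy.stats.f and scipy.stats.ncf, following the two formulas above. The function name anova_power and the example values (Cohen's conventional "medium" f = 0.25, k = 3 groups) are illustrative assumptions, not figures from the anchor experiment.

```python
from scipy import stats


def anova_power(f_effect, k, n_per_group, alpha=0.05):
    """Power of a one-way ANOVA with k groups of n_per_group observations."""
    n_total = k * n_per_group
    df_between = k - 1
    df_within = n_total - k
    lam = n_total * f_effect ** 2                      # non-centrality parameter
    f_crit = stats.f.ppf(1 - alpha, df_between, df_within)
    # Probability that the non-central F statistic exceeds the critical value
    return stats.ncf.sf(f_crit, df_between, df_within, lam)


def n_per_group_for_power(f_effect, k, target=0.80, alpha=0.05, n_max=10_000):
    """Smallest n per group reaching the target power (simple linear search)."""
    for n in range(2, n_max + 1):
        if anova_power(f_effect, k, n, alpha) >= target:
            return n
    raise ValueError("target power not reached within n_max")


power_53 = anova_power(f_effect=0.25, k=3, n_per_group=53)
print(f"Power with 53 per group: {power_53:.3f}")
```

The linear search is deliberately simple. A root finder would be faster, but power is monotone in n, so stepping through integers is easy to reason about and returns a whole number directly.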

Test Your Understanding

You plan to detect a 0.02 accuracy improvement with 80% power at α=0.05, and calculate n=46. Your team can only run 25 folds. Compute the actual power for n=25. What are three strategies to increase power without increasing n?
The required sample size scales as n ∝ (σ/δ)². If you reduce the detectable effect from δ=0.02 to δ=0.01, how does n change? If you simultaneously improve your evaluation methodology to reduce σ from 0.048 to 0.030, what is the net change in n?
A published paper detects a significant effect with n=8 (power≈15%). Your team plans a replication with n=25 (power≈35%). Explain why: (a) the original study's significant result likely overestimates the true effect, and (b) your replication has a high probability of failing to replicate even if the true effect is real.
For binary outcomes, Cohen's h ≈ 0.058 (detecting p₁=0.85 vs p₂=0.87) requires n ≈ 4,720 per group. Explain why this is so much larger than the continuous accuracy case (n=46). What property of proportions near 0.85 makes them harder to distinguish than continuous values?
You run a power analysis and find n=46 for 80% power. You run the experiment with n=50 and get p=0.06. A stakeholder wants to run 50 more folds and check again. Explain the statistical problem with this approach (sequential testing) and name one method that addresses it.