
Type I and Type II Errors

Apr 11, 2026 • 10 min read • By Mohammed Vasim

Statistics, Math, Data Science

Your A/B test returned p=0.03 and you shipped a new model. Two months later a replication found no effect: the original result was probably a false positive. Or: your test returned p=0.12 and you shelved the model. A competitor built the same thing and gained market share: you missed a real effect. Both mistakes have costs. Every experiment navigates the tradeoff between them.

The Anchor

text

A/B test: new ranking model vs. control
σ = 0.08         # CTR standard deviation (from historical data)
δ_true = 0.005   # true effect: 0.5pp CTR improvement (if real)
α = 0.05         # significance level
n = 1,000        # users per arm

The Error Matrix

Before any formulas: four possible outcomes depending on ground truth × decision.

	H₀ is True (no real effect)	H₀ is False (real effect = 0.5pp)
Reject H₀	Type I Error (false positive) — rate = α	Correct (true positive) — rate = Power = 1−β
Fail to Reject H₀	Correct (true negative) — rate = 1−α	Type II Error (false negative) — rate = β

In the anchor context:

True positive (power): real 0.5pp improvement detected → ship the model
False positive (Type I): no real improvement, test says yes → ship a useless model
True negative: no improvement, test says no → correct, don't ship
False negative (Type II): real 0.5pp improvement missed → model sits on shelf, competitor ships it
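
The four cells can also be estimated by brute force, before any formulas. The sketch below simulates many A/B tests with the anchor parameters, once with no real effect and once with the 0.5pp effect, and counts how often a two-tailed z-test at α=0.05 rejects. The function name `simulate_outcomes`, the number of simulations, the seed, and the shortcut of drawing each arm's sample mean directly from a normal distribution are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def simulate_outcomes(delta, sigma=0.08, n=1000, alpha=0.05, n_sims=20_000, seed=0):
    """Fraction of simulated A/B tests that reject H0 (two-tailed z-test, known sigma)."""
    rng = np.random.default_rng(seed)
    se_mean = sigma / np.sqrt(n)  # standard error of one arm's sample mean
    # Assumption: sample means are drawn directly from their normal sampling
    # distribution; the control baseline is set to 0 because only the
    # difference between arms matters for the test.
    control = rng.normal(0.0, se_mean, n_sims)
    treatment = rng.normal(delta, se_mean, n_sims)
    z = (treatment - control) / (sigma * np.sqrt(2 / n))
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return float(np.mean(np.abs(z) > z_crit))

type_i_rate = simulate_outcomes(delta=0.0)   # H0 true: every rejection is a false positive
power = simulate_outcomes(delta=0.005)       # H1 true: every rejection is a true positive

print(f"H0 true: reject {type_i_rate:.3f} (Type I), fail to reject {1 - type_i_rate:.3f}")
print(f"H1 true: reject {power:.3f} (power), fail to reject {1 - power:.3f} (Type II)")
```

Under H₀ the rejection rate should land close to α. Under H₁ it estimates the power, and the remainder is β.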

Type I Error: False Positive Rate α

Definition: reject H₀ when it is actually true. You conclude there is an effect when there is none.

Rate: you set α directly. With α=0.05, if the null is true and you repeat the experiment 100 times, you incorrectly reject about 5 times on average. This is the cost you accept for the ability to detect real effects.

Why α=0.05 is a convention, not a law:

α	Context
0.10	Exploratory analysis — missing real effects is more costly than false alarms
0.05	Standard ML experimentation — balanced trade-off
0.01	High-stakes decisions — major product changes, medical applications
2.9×10⁻⁷ (5σ)	Particle physics — a new particle claim requires overwhelming evidence

Relationship to critical value: for a two-tailed test, α = P(|Z| > z* | H₀ true), so α=0.05 gives z* = 1.960. For a one-tailed test, α = P(Z > z* | H₀ true), so α=0.05 gives z* = 1.645.
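
Critical values for any α come straight from the inverse of the standard normal tail. A minimal sketch using `scipy.stats.norm`; the helper name `critical_value` is hypothetical:

```python
from scipy import stats

def critical_value(alpha, two_tailed=True):
    """z* such that the rejection region has total probability alpha under H0."""
    tail = alpha / 2 if two_tailed else alpha
    return stats.norm.isf(tail)  # inverse survival function: P(Z > z*) = tail

for alpha in [0.10, 0.05, 0.01]:
    print(f"alpha={alpha:.2f}: one-tailed z*={critical_value(alpha, False):.3f}, "
          f"two-tailed z*={critical_value(alpha):.3f}")

# The 5-sigma convention is the one-tailed tail area beyond z = 5.
five_sigma_alpha = stats.norm.sf(5)
print(f"5 sigma: alpha={five_sigma_alpha:.2e}")
```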

Type II Error: False Negative Rate β

Definition: fail to reject H₀ when it is actually false. A real effect exists but the test misses it.

Rate: β is NOT set directly. It is determined by four factors:

n — larger sample → smaller β (more sensitive test)
δ — larger effect → smaller β (easier to detect)
σ — smaller variability → smaller β (cleaner signal)
α — larger α → smaller β (but more false positives)

Computing β for the anchor:

Under H₁, the test statistic is still normal with unit variance, but its mean shifts from 0 to the non-centrality parameter:

ncp = δ / SE = δ / (σ × √(2/n)) = 0.005 / (0.08 × √(2/1000)) = 0.005 / 0.003578 ≈ 1.398

For the one-tailed test at α=0.05 (z* = 1.645):

β = P(Z < z* | Z ~ N(ncp, 1)) = Φ(z* − ncp) = Φ(1.645 − 1.398) = Φ(0.247) ≈ 0.598

Power = 1 − β ≈ 0.402

With n=1,000 per arm, the test has only 40% power (one-tailed). It would miss the real 0.5pp improvement 60% of the time.
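
The one-tailed calculation above fits in a small function, which also makes it easy to see how each of the four factors moves β. This is a sketch under the same assumptions as the anchor (known σ, equal arm sizes, normal test statistic); the function name `type_ii_rate_one_tailed` is hypothetical.

```python
import numpy as np
from scipy import stats

def type_ii_rate_one_tailed(delta, sigma, n, alpha):
    """beta for a one-tailed, two-sample z-test with known sigma and n users per arm."""
    se = sigma * np.sqrt(2 / n)          # standard error of the difference in means
    ncp = delta / se                     # mean of Z under H1
    z_crit = stats.norm.ppf(1 - alpha)   # one-tailed critical value
    return float(stats.norm.cdf(z_crit - ncp))

beta = type_ii_rate_one_tailed(delta=0.005, sigma=0.08, n=1000, alpha=0.05)
beta_4x_n = type_ii_rate_one_tailed(delta=0.005, sigma=0.08, n=4000, alpha=0.05)

print(f"n=1000: beta = {beta:.3f}, power = {1 - beta:.3f}")
print(f"n=4000: beta = {beta_4x_n:.3f}, power = {1 - beta_4x_n:.3f}")
```

Quadrupling n doubles ncp, which is why β drops so sharply in the second line of output.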

Statistical Power

Power = 1 − β = P(reject H₀ | H₁ true). For a one-tailed test this is P(Z > z* | Z ~ N(ncp, 1)).

Power formula (two-tailed): Power = Φ(−z* + ncp) + Φ(−z* − ncp)

For the anchor: Power = Φ(1.398 − 1.960) + Φ(−1.398 − 1.960) = Φ(−0.562) + Φ(−3.358) ≈ 0.287 + 0.000 = 0.287 (two-tailed, n=1000)

Power vs n table (two-tailed, α=0.05):

n per arm	ncp	Power
500	0.988	16.7%
1,000	1.398	28.7%
2,000	1.976	50.7%
4,019	2.802	80.0%
6,000	3.423	92.8%

α–β Trade-off at Fixed n

Reducing α (stricter threshold) increases β (more missed effects) for fixed n. With δ and σ fixed, the only way to reduce both simultaneously is to increase n.

α	z* (two-tailed)	Power	β
0.01	2.576	11.9%	0.881
0.05	1.960	28.7%	0.713
0.10	1.645	40.4%	0.596
0.20	1.282	55.0%	0.450

Sample Size Calculation

Goal: n per arm to achieve target power at given δ, σ, α.

Formula: n = (z_α/2 + z_β)² × 2σ² / δ²

Where z_β = Φ⁻¹(1−β) (quantile for the target power).
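
The formula comes from inverting the two-tailed power expression after dropping the far-tail term Φ(−z* − ncp), which is negligible at the powers of interest. That is an approximation, though a tight one here. A sketch of the algebra in the same symbols:

```latex
\begin{aligned}
1-\beta &\approx \Phi\!\left(\mathrm{ncp} - z_{\alpha/2}\right)
  && \text{(far tail } \Phi(-z_{\alpha/2}-\mathrm{ncp}) \text{ dropped)} \\
z_\beta = \Phi^{-1}(1-\beta) &\approx \mathrm{ncp} - z_{\alpha/2} \\
\frac{\delta}{\sigma\sqrt{2/n}} &\approx z_{\alpha/2} + z_\beta \\
n &\approx \frac{\left(z_{\alpha/2}+z_\beta\right)^{2}\, 2\sigma^{2}}{\delta^{2}}
\end{aligned}
```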

Substitution for 80% power (z_β=0.842), α=0.05 (z_α/2=1.960):

n = (1.960 + 0.842)² × 2 × 0.08² / 0.005² = (2.802)² × 0.0128 / 0.000025 = 7.851 × 512 ≈ 4,020 per arm (4,019 when the quantiles are not rounded to three decimals)

Sample size table for different target powers:

Target Power	z_β	n per arm
70%	0.524	3,161
80%	0.842	4,019
90%	1.282	5,380
95%	1.645	6,654

Statistical vs Practical Significance

Large-n inflation: with enough data, even a negligible effect becomes statistically significant. Take δ=0.0001 (0.01pp). At n=500,000 per arm, ncp=0.625 and power is only about 10% (two-tailed, α=0.05), so this tiny effect is hard to detect even with half a million users. Push n high enough and the test will find p<0.05 reliably, yet a 0.01pp CTR improvement has no practical value in most product contexts.

The rule: always report effect size alongside p-value. A significant result with negligible effect should not drive deployment decisions.
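
One way to apply the rule is to report the effect estimate and its confidence interval next to the p-value, then compare the interval against a minimum effect worth shipping. The sketch below assumes the same known-σ z-test as the anchor. The function name `summarize_ab`, the observed lift, and the `min_practical_effect` threshold of 0.2pp are hypothetical values chosen for illustration.

```python
import numpy as np
from scipy import stats

def summarize_ab(observed_diff, sigma, n, alpha=0.05, min_practical_effect=0.002):
    """Report p-value, effect size, and confidence interval for a two-sample z-test.

    min_practical_effect is a hypothetical business threshold, not a value
    taken from the anchor.
    """
    se = sigma * np.sqrt(2 / n)
    z = observed_diff / se
    p_value = float(2 * stats.norm.sf(abs(z)))       # two-tailed p-value
    half_width = stats.norm.ppf(1 - alpha / 2) * se
    ci = (observed_diff - half_width, observed_diff + half_width)
    return {
        "p_value": p_value,
        "effect": observed_diff,
        "ci": ci,
        "statistically_significant": bool(p_value < alpha),
        # Practically significant only if the whole interval clears the threshold.
        "practically_significant": bool(ci[0] >= min_practical_effect),
    }

# Illustrative numbers: a 0.04pp observed lift with 500,000 users per arm.
report = summarize_ab(observed_diff=0.0004, sigma=0.08, n=500_000)
print(f"p = {report['p_value']:.4f}, effect = {report['effect']:.4f}, "
      f"95% CI = ({report['ci'][0]:.5f}, {report['ci'][1]:.5f})")
print(f"statistically significant: {report['statistically_significant']}, "
      f"practically significant: {report['practically_significant']}")
```

With these inputs the result is significant at α=0.05 but falls well short of the practical threshold, which is the case the rule is meant to catch.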

Reference Tables

Error types:

Error	Also Called	Rate	Who Controls	ML Consequence
Type I	False positive	α	You set directly	Ship a useless model
Type II	False negative	β	Determined by n, δ, σ, α	Miss a real improvement
Power	True positive rate	1−β	Increase n or δ	Detect real improvements

Power factors:

Factor	Increases Power When	Intuition
n	Larger	More data → smaller SE → larger ncp
δ	Larger	Bigger effect → easier to see
σ	Smaller	Less noise → cleaner signal
α	Larger	Looser threshold → rejects more often

Code and Output

python

import numpy as np
from scipy import stats

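# Anchor parameters
sigma = 0.08    # CTR standard deviation
delta = 0.005   # true effect (0.5pp)
alpha = 0.05    # significance level
n = 1000        # users per arm

SE = sigma * np.sqrt(2 / n)            # standard error of the difference in means
ncp = delta / SE                       # non-centrality parameter
z_crit = stats.norm.ppf(1 - alpha/2)   # two-tailed critical value

# Two-tailed power: probability mass beyond either critical value under H1
power = stats.norm.cdf(-z_crit + ncp) + stats.norm.cdf(-z_crit - ncp)
beta = 1 - power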

print(f"SE = {SE:.6f}")
print(f"z* = {z_crit:.3f},  ncp = δ/SE = {ncp:.4f}")
print(f"Power = {power:.4f},  β = {beta:.4f}")

print("\nα-β tradeoff (n=1000 per arm, δ=0.005, σ=0.08):")
for a in [0.01, 0.05, 0.10, 0.20]:
    z_a = stats.norm.ppf(1 - a/2)
    pwr = stats.norm.cdf(-z_a + ncp) + stats.norm.cdf(-z_a - ncp)
    print(f"  α={a:.2f}: z*={z_a:.3f}, power={pwr:.3f}, β={1-pwr:.3f}")

# Sample size for 80% power: n = (z_α/2 + z_β)² × 2σ² / δ²
z_beta = stats.norm.ppf(0.80)
n_required = (z_crit + z_beta)**2 * 2 * sigma**2 / delta**2
print(f"\nRequired n per arm (power=0.80, α=0.05): {int(np.ceil(n_required))}")

print("\nSample size table:")
for target_power, label in [(0.70, "70%"), (0.80, "80%"), (0.90, "90%"), (0.95, "95%")]:
    z_b = stats.norm.ppf(target_power)
    n_req = (z_crit + z_b)**2 * 2 * sigma**2 / delta**2
    print(f"  Power={label}: n ≥ {int(np.ceil(n_req))}")

print("\nPower at different n (δ=0.005, σ=0.08, α=0.05):")
for n_test in [500, 1000, 2000, 4019, 6000]:  # 4019 is the n_required computed above
    se_t = sigma * np.sqrt(2 / n_test)
    ncp_t = delta / se_t
    pwr_t = stats.norm.cdf(-z_crit + ncp_t) + stats.norm.cdf(-z_crit - ncp_t)
    print(f"  n={n_test:>5}: ncp={ncp_t:.3f}, power={pwr_t:.3f}")