Type I and Type II Errors
Your A/B test declared a new recommendation algorithm statistically significant. You ship the feature. Two months later, an analyst reruns the experiment and finds no effect. You generated a false positive. Or: you ran an A/B test, found no significant result, and declared no difference. The feature sat on a shelf. A competitor shipped the same idea and gained market share. You missed a real effect. Both mistakes have costs. Understanding them, and the tradeoff between them, is how you design experiments that do not fail silently.
The Dataset
Throughout this post: you are testing a new ranking model in production. The null hypothesis is $H_0: \mu_{\text{new}} = \mu_{\text{old}}$ (same mean CTR). The alternative is $H_1: \mu_{\text{new}} > \mu_{\text{old}}$ (new model is better). True CTR difference, if any: $\delta = 0.005$ (0.5 percentage points). Population CTR standard deviation: $\sigma = 0.08$.
The Two Errors
| | $H_0$ is True | $H_0$ is False |
|---|---|---|
| Reject $H_0$ | Type I Error ($\alpha$) | Correct (Power = $1 - \beta$) |
| Fail to Reject $H_0$ | Correct ($1 - \alpha$) | Type II Error ($\beta$) |
Type I Error (False Positive): You conclude the new model is better when it is not. You ship a model that does not actually improve CTR. Cost: engineering time, potential harm from a bad model in production, erosion of experiment credibility.
Type II Error (False Negative): You conclude no difference when the new model is genuinely better. You leave a real improvement on the table. Cost: missed revenue, competitive disadvantage, wasted ML engineering investment.
Type I Error: Controlled by $\alpha$
You set $\alpha$ directly: it is the probability of a false positive. With $\alpha = 0.05$, you accept a 5% chance of shipping a model that is no better than the current one.
Lowering $\alpha$ to 0.01 reduces false positives but requires stronger evidence to ship anything. This is appropriate for high-stakes decisions (major algorithm changes, product launches) but conservative for exploratory experimentation.
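To make "stronger evidence" concrete, the sketch below computes the smallest observed CTR lift that would reject the null at each significance level. It assumes the running example's $\sigma = 0.08$ and $n = 1000$ requests per arm, and a one-tailed z-test with known variance; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

sigma, n = 0.08, 1000            # assumed values from the running example
se = sigma * np.sqrt(2 / n)      # standard error of the difference in mean CTR

thresholds = {}
for alpha in (0.05, 0.01):
    z_crit = stats.norm.ppf(1 - alpha)   # one-tailed critical value
    thresholds[alpha] = z_crit * se      # smallest observed lift that rejects H0
    print(f"alpha={alpha:.2f}: z_crit={z_crit:.3f}, "
          f"observed lift must exceed {thresholds[alpha] * 100:.2f} pp")
```

Under these assumptions, the bar moves from roughly 0.59 pp at $\alpha = 0.05$ to roughly 0.83 pp at $\alpha = 0.01$, which is well above the 0.5 pp true effect the experiment is trying to detect.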
Type II Error: Depends on Power
$\beta$ is the probability of missing a real effect. You do not set $\beta$ directly; it is determined by:
- Sample size $n$: larger $n$ reduces $\beta$
- Effect size $\delta$: larger effects are easier to detect
- Significance level $\alpha$: higher $\alpha$ gives more power but more false positives
- Variability $\sigma$: less noisy data means lower $\beta$
Power = $1 - \beta$ = the probability of correctly detecting a real effect.
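The four levers above can be sketched as a single function. This is a minimal illustration assuming a one-tailed two-sample z-test with known $\sigma$ and equal arm sizes; `power_one_tailed` and the scenario names are hypothetical, and the baseline values come from the running example.

```python
import numpy as np
from scipy import stats

def power_one_tailed(n, delta, sigma, alpha):
    """Power of a one-tailed two-sample z-test with n observations per arm."""
    se = sigma * np.sqrt(2 / n)
    z_alpha = stats.norm.ppf(1 - alpha)
    return 1 - stats.norm.cdf(z_alpha - delta / se)

base = dict(n=1000, delta=0.005, sigma=0.08, alpha=0.05)
scenarios = {
    "baseline": {},
    "double n": {"n": 2000},
    "double delta": {"delta": 0.010},
    "alpha = 0.10": {"alpha": 0.10},
    "halve sigma": {"sigma": 0.04},
}
for name, change in scenarios.items():
    p = power_one_tailed(**{**base, **change})   # override one lever at a time
    print(f"{name:>13}: power={p:.3f}, beta={1 - p:.3f}")
```

Doubling $\delta$ and halving $\sigma$ give the same power here, because both double the ratio $\delta / \sigma$ that the test actually depends on.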
Numerical Calculation
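For the ranking model test with one-tailed $\alpha = 0.05$, $\sigma = 0.08$, true difference $\delta = 0.005$, and $n = 1000$ per arm:
| Phase | Formula | Values | Result |
|---|---|---|---|
| Standard error | $SE = \sigma\sqrt{2/n}$ | $0.08 \times \sqrt{2/1000}$ | $0.003578$ |
| Critical value | $z_{1-\alpha}$ (one-tailed) | standard normal | $1.645$ |
| Rejection threshold | $z_{1-\alpha} \times SE$ | $1.645 \times 0.003578$ | $0.005886$ |
| Non-centrality | $\delta / SE$ | $0.005 / 0.003578$ | $\approx 1.40$ |
| Power | $1 - \Phi(z_{1-\alpha} - \delta / SE)$ | $1 - \Phi(1.645 - 1.40)$ | $\approx 0.40$ |
With $n = 1000$ per arm, you have only 40% power. If the new model is genuinely 0.5pp better, you will miss it 60% of the time. This is an underpowered experiment.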
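One way to sanity-check the analytic numbers is a quick Monte Carlo run. The sketch below is an assumption-laden shortcut: it draws the observed difference in arm means directly from its sampling distribution rather than simulating individual requests, and the seed and simulation count are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)      # arbitrary seed for reproducibility
sigma, delta, alpha, n = 0.08, 0.005, 0.05, 1000
n_sims = 100_000

se = sigma * np.sqrt(2 / n)
z_alpha = stats.norm.ppf(1 - alpha)

def rejection_rate(true_diff):
    """Share of simulated experiments that reject H0 for a given true CTR difference."""
    # Shortcut: sample the observed difference in means from N(true_diff, se).
    observed_diff = rng.normal(true_diff, se, size=n_sims)
    return np.mean(observed_diff / se > z_alpha)

type_i_rate = rejection_rate(0.0)    # H0 true: every rejection is a false positive
power_sim = rejection_rate(delta)    # H1 true: every rejection is a correct detection
print(f"Simulated Type I rate: {type_i_rate:.3f}")
print(f"Simulated power:       {power_sim:.3f}")
```

The simulated rates should land close to 0.05 and 0.40 respectively, up to Monte Carlo noise.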
The Fundamental Trade-off
For fixed $n$ and $\delta$, lowering $\alpha$ (fewer false positives) always raises $\beta$ (more false negatives), and vice versa. You cannot minimize both simultaneously with fixed resources.
The diagram below shows two distributions: $H_0$ centered at 0 and $H_1$ centered at 0.005 (the true effect). The one-tailed critical value at $\alpha = 0.05$ sits at 0.005886. Everything to the right of that line under $H_0$ is a Type I error. Everything to the left under $H_1$ is a Type II error. The green region is power.
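The areas described above can be computed directly from the two sampling distributions. This sketch assumes both are normal with the same standard error, using the running example's values ($\sigma = 0.08$, $\delta = 0.005$, $n = 1000$ per arm); moving the threshold shows the trade-off.

```python
import numpy as np
from scipy import stats

sigma, delta, n = 0.08, 0.005, 1000
se = sigma * np.sqrt(2 / n)

h0 = stats.norm(loc=0.0, scale=se)     # observed difference if there is no effect
h1 = stats.norm(loc=delta, scale=se)   # observed difference if the true effect is delta

rows = []
for alpha in (0.10, 0.05, 0.01):
    threshold = h0.ppf(1 - alpha)      # critical value on the raw CTR scale
    type_i = h0.sf(threshold)          # area right of the threshold under H0
    type_ii = h1.cdf(threshold)        # area left of the threshold under H1
    rows.append((alpha, threshold, type_i, type_ii))
    print(f"alpha={alpha:.2f}: threshold={threshold:.6f}, "
          f"beta={type_ii:.3f}, power={1 - type_ii:.3f}")
```

With everything else held fixed, $\beta$ climbs from about 0.45 to 0.60 to 0.82 as $\alpha$ is tightened from 0.10 to 0.05 to 0.01.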
Designing for Proper Power
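Before running the ranking model experiment, calculate the required sample size per arm for 80% power ($\beta = 0.20$):

$$n = \frac{2\sigma^2 \left(z_{1-\alpha} + z_{1-\beta}\right)^2}{\delta^2} = \frac{2 \times 0.08^2 \times (1.6449 + 0.8416)^2}{0.005^2} \approx 3165.5$$

For 80% power to detect a 0.5pp CTR difference, you need roughly 3,166 requests per arm after rounding up, more than three times what was initially planned.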
```python
import numpy as np
from scipy import stats

# Ranking model A/B test parameters
sigma = 0.08          # population CTR standard deviation
delta = 0.005         # minimum detectable effect (0.5 pp)
alpha = 0.05          # one-tailed significance level
target_power = 0.80

# Required sample size per arm for the target power
z_alpha = stats.norm.ppf(1 - alpha)
z_beta = stats.norm.ppf(target_power)
n_required = ((z_alpha + z_beta) ** 2 * 2 * sigma**2) / delta**2
print(f"Required n per arm: {int(np.ceil(n_required))}")

# Power at various n
for n in [500, 1000, 2000, 3165, 5000]:
    se = sigma * np.sqrt(2 / n)                 # standard error of the difference
    ncp = delta / se                            # non-centrality (effect in SE units)
    power = 1 - stats.norm.cdf(z_alpha - ncp)
    print(f"  n={n:5d}: power={power:.3f}")
```

```
Required n per arm: 3166
  n=  500: power=0.256
  n= 1000: power=0.402
  n= 2000: power=0.630
  n= 3165: power=0.800
  n= 5000: power=0.931
```
Multiple Testing Problem
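If you run 20 A/B tests simultaneously at $\alpha = 0.05$ (testing 20 different features) and none of them truly improve CTR:
Expected false positives: $20 \times 0.05 = 1$. Probability of at least one false positive, assuming the tests are independent: $1 - (1 - 0.05)^{20} \approx 0.64$.
You have a 64% chance of shipping at least one ineffective feature. Solutions:
Bonferroni: $\alpha_{\text{adj}} = \alpha / m = 0.05 / 20 = 0.0025$. Reduces false positives but lowers power for each individual test.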
False Discovery Rate (FDR): Controls the expected proportion of false discoveries among significant results. More powerful than Bonferroni when running many tests.
Pre-registration: Specify which features are being tested and their primary hypothesis before data collection. Prevents selective reporting.
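The first two corrections can be sketched in a few lines. The p-values below are hypothetical and exist only for illustration, and the FDR step uses the Benjamini-Hochberg step-up procedure, which is one common FDR method and an assumption here rather than something the post specifies.

```python
import numpy as np

m, alpha = 20, 0.05
expected_fp = m * alpha                 # expected false positives if all nulls are true
fwer = 1 - (1 - alpha) ** m             # P(at least one), assuming independent tests
bonferroni_alpha = alpha / m

def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of discoveries under the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)
    thresholds = q * np.arange(1, len(p) + 1) / len(p)
    below = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(len(p), dtype=bool)
    if below.size:
        # Reject every hypothesis up to the largest rank that passes its threshold.
        reject[order[: below[-1] + 1]] = True
    return reject

# Hypothetical p-values for 20 tests (illustration only)
p_values = np.array([0.0004, 0.002, 0.006, 0.009, 0.04]
                    + [0.2 + 0.04 * i for i in range(15)])

n_uncorrected = int((p_values <= alpha).sum())
n_bonferroni = int((p_values <= bonferroni_alpha).sum())
n_bh = int(benjamini_hochberg(p_values, q=alpha).sum())
print(f"Expected false positives: {expected_fp:.1f}, FWER: {fwer:.3f}")
print(f"Significant: uncorrected={n_uncorrected}, "
      f"Bonferroni={n_bonferroni}, BH={n_bh}")
```

On this made-up set, Benjamini-Hochberg keeps more results than Bonferroni but fewer than the uncorrected threshold, which is the usual ordering.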
What Affects Error Rates
| Factor | Increasing it does... |
|---|---|
| Sample size $n$ | Reduces both $\alpha$ error (if threshold fixed) and $\beta$ error |
| Effect size $\delta$ | Larger effects: lower $\beta$, easier to detect |
| Significance level $\alpha$ | Higher $\alpha$: lower $\beta$, but more false positives |
| Variability $\sigma$ | Lower variability: lower $\beta$ |
Three Things People Get Wrong
"$\alpha + \beta = 1$." This is wrong. $\alpha$ is the error probability under $H_0$, $\beta$ is the error probability under $H_1$. They are computed under different conditions and do not sum to anything meaningful. What sums to 1 is $\alpha + (1 - \alpha)$ and $\beta + (1 - \beta)$ separately.
"Type I error is always worse." Context determines this. For exploratory feature testing at an early product stage, Type II errors are often worse: missing real improvements is costly when you are iterating fast. For high-stakes decisions (safety systems, medical interventions), Type I errors may be worse.
"Power is the probability of a correct decision." Power is the probability of rejection given $H_1$ is true. Under $H_0$, the probability of a correct decision is $1 - \alpha$, not power.
Related Concepts
Type I and Type II errors are the formalization of the hypothesis testing setup from post 3 and the p-value interpretation from post 4. Power analysis requires knowing the test statistic distribution (Z-test post 5, t-test posts 6-7), which determines the shape of both distributions in the diagram above. The multiple testing problem connects to FDR-controlled methods used when running large-scale feature experiments. ANOVA (post 14) generalizes this framework to comparing multiple groups simultaneously, where the "all-pairs" error inflation is even more severe.
Honest Limitations
Power calculations assume you know $\sigma$ and $\delta$ in advance, which requires historical data or prior experiments. In practice, these are estimates, so your power calculation is itself uncertain. Running power analyses with optimistic effect sizes is a common source of chronically underpowered experiments. When historical variance estimates are unavailable, pilot studies or sequential designs (checking at interim points) are more honest approaches.
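A simple way to see how fragile the plan is: recompute the required sample size over a range of plausible inputs. The ranges below are hypothetical, chosen only to bracket the running example's $\delta = 0.005$ and $\sigma = 0.08$, and `required_n` is an illustrative helper using the same one-tailed z-test formula as earlier in the post.

```python
import numpy as np
from scipy import stats

def required_n(delta, sigma, alpha=0.05, power=0.80):
    """Per-arm sample size for a one-tailed two-sample z-test with known sigma."""
    z_alpha = stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    return int(np.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2))

# Hypothetical ranges around the assumed values
for delta in (0.0025, 0.005, 0.0075):
    for sigma in (0.07, 0.08, 0.09):
        print(f"delta={delta:.4f}, sigma={sigma:.2f}: "
              f"n per arm = {required_n(delta, sigma):,}")
```

Because $n$ scales with $1/\delta^2$, a true effect half as large as assumed needs about four times the traffic, which is why optimistic effect sizes are so costly.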
Test Your Understanding
- For the ranking model experiment, calculate the power when per arm and the true CTR difference is . Use and one-tailed.
- You tighten the significance threshold from to for a high-stakes ranking change. Holding everything else constant, does power increase or decrease? By how much (compute numerically for )?
- Twenty A/B tests are run. Three come back significant at . Using the Bonferroni correction, how many of these should be considered significant? What if you used the original threshold?
- A model experiment has 80% power for a 1% CTR lift. A colleague says "we are safe — we will catch 80% of real effects." What is missing from this statement that matters for experiment design decisions?
- The operations team asks: "Can we cut the experiment to 500 per arm to save resources?" For the ranking model test, what is the power at $n = 500$, and how would you explain the consequence of this cut to a non-statistician?