Type I and Type II Errors

Apr 11, 2026 · 8 min read · By mohammed.vasim
Statistics · Math · Data Science

Your A/B test declared a new recommendation algorithm significant at $\alpha = 0.05$. You ship the feature. Two months later, an analyst reruns the experiment and finds no effect. You generated a false positive.

Or: you ran an A/B test, found $p > 0.05$, and declared no difference. The feature sat on a shelf. A competitor shipped the same idea and gained market share. You missed a real effect.

Both mistakes have costs. Understanding them, and the trade-off between them, is how you design experiments that do not fail silently.

The Dataset

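Throughout this post: you are testing a new ranking model in production. The null hypothesis is $H_0: \mu_{\text{new}} = \mu_{\text{current}}$ (same mean CTR). The alternative is $H_1: \mu_{\text{new}} > \mu_{\text{current}}$ (new model is better). True CTR difference, if any: $\delta = 0.005$ (0.5 percentage points). Population CTR standard deviation: $\sigma = 0.08$.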

The Two Errors

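| | $H_0$ is True | $H_0$ is False |
|---|---|---|
| Reject $H_0$ | Type I Error ($\alpha$) | Correct (Power = $1 - \beta$) |
| Fail to Reject $H_0$ | Correct ($1 - \alpha$) | Type II Error ($\beta$) |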

Type I Error (False Positive): You conclude the new model is better when it is not. You ship a model that does not actually improve CTR. Cost: engineering time, potential harm from a bad model in production, erosion of experiment credibility.

Type II Error (False Negative): You conclude no difference when the new model is genuinely better. You leave a real improvement on the table. Cost: missed revenue, competitive disadvantage, wasted ML engineering investment.

Type I Error: Controlled by $\alpha$

You set $\alpha$ directly: it is the probability of a false positive when $H_0$ is true. With $\alpha = 0.05$, you accept a 5% chance of shipping a model that is no better than the current one.

Lowering $\alpha$ to 0.01 reduces false positives but requires stronger evidence to ship anything. This is appropriate for high-stakes decisions (major algorithm changes, product launches) but overly conservative for exploratory experimentation.
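To see what setting $\alpha$ means operationally, the sketch below simulates many A/A tests (both arms share the same true CTR, so $H_0$ holds) and counts how often a one-tailed z-test rejects. The baseline CTR of 0.10, the seed, the number of simulated experiments, and the shortcut of drawing arm means directly from their sampling distribution are illustrative assumptions, not details of the experiment described in this post.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)    # fixed seed (assumption) for repeatability

sigma = 0.08            # per-request CTR standard deviation from this post
n = 1000                # requests per arm
n_experiments = 20000   # number of simulated A/A tests (assumption)

# Under H0 each arm's sample mean is ~ Normal(mu, sigma^2 / n), so we draw the
# arm means directly instead of simulating individual requests.
se_arm = sigma / np.sqrt(n)
mean_a = rng.normal(0.10, se_arm, n_experiments)  # 0.10 baseline CTR is hypothetical
mean_b = rng.normal(0.10, se_arm, n_experiments)  # same true CTR: H0 is true

z = (mean_b - mean_a) / (sigma * np.sqrt(2 / n))  # z statistic for the difference

false_positive_rate = {}
for alpha in (0.05, 0.01):
    z_crit = stats.norm.ppf(1 - alpha)            # one-tailed critical value
    false_positive_rate[alpha] = float(np.mean(z > z_crit))
    print(f"alpha={alpha:.2f}: z_crit={z_crit:.3f}, "
          f"simulated false positive rate={false_positive_rate[alpha]:.4f}")
```

The simulated rejection rates should land close to the chosen $\alpha$ in each case, which is the sense in which $\alpha$ is a dial you control.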

Type II Error: Depends on Power

$\beta$ is the probability of missing a real effect. You do not set $\beta$ directly; it is determined by:

  • Sample size $n$: larger $n$ reduces $\beta$
  • Effect size $\delta$: larger effects are easier to detect
  • Significance level $\alpha$: higher $\alpha$ gives more power but more false positives
  • Variability $\sigma$: less noisy data means lower $\beta$

Power = $1 - \beta$ = the probability of correctly detecting a real effect.
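A minimal sketch of how each of the four factors moves power, assuming a one-tailed two-sample z-test with known standard deviation. The helper name `power_one_tailed` and the alternative scenario values (4000 per arm, a 0.010 effect, $\alpha = 0.10$, $\sigma = 0.04$) are hypothetical choices for illustration; only the baseline values come from this post.

```python
import numpy as np
from scipy import stats

def power_one_tailed(n, delta, sigma, alpha):
    """Power of a one-tailed two-sample z-test with n observations per arm."""
    se = sigma * np.sqrt(2 / n)            # standard error of the difference
    z_alpha = stats.norm.ppf(1 - alpha)    # one-tailed critical value
    return 1 - stats.norm.cdf(z_alpha - delta / se)

base = dict(n=1000, delta=0.005, sigma=0.08, alpha=0.05)
scenarios = {
    "baseline": base,
    "larger n (4000)": {**base, "n": 4000},
    "larger effect (0.010)": {**base, "delta": 0.010},
    "higher alpha (0.10)": {**base, "alpha": 0.10},
    "less noise (sigma=0.04)": {**base, "sigma": 0.04},
}

# Change one factor at a time and compare against the baseline.
powers = {name: power_one_tailed(**params) for name, params in scenarios.items()}
for name, p in powers.items():
    print(f"{name:26s} power={p:.3f}  beta={1 - p:.3f}")
```

Every scenario changes exactly one input, so any gain over the baseline can be attributed to that factor alone.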

Numerical Calculation

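For the ranking model test with one-tailed $\alpha = 0.05$, $\sigma = 0.08$, true difference $\delta = 0.005$, and $n = 1000$ per arm:

| Step | Formula | Values | Result |
|---|---|---|---|
| Standard error | $SE = \sigma\sqrt{2/n}$ | $0.08 \times \sqrt{2/1000}$ | $0.003578$ |
| Critical value (one-tailed) | $z_\alpha = \Phi^{-1}(1 - \alpha)$ | standard normal, $\alpha = 0.05$ | $1.645$ |
| Rejection threshold | $c = z_\alpha \cdot SE$ | $1.645 \times 0.003578$ | $0.00589$ |
| Non-centrality | $\lambda = \delta / SE$ | $0.005 / 0.003578$ | $1.398$ |
| Power | $1 - \Phi(z_\alpha - \lambda)$ | $1 - \Phi(0.247)$ | $0.402$ |

With $n = 1000$ per arm, you have only 40% power. If the new model is genuinely 0.5pp better, you will miss it 60% of the time. This is an underpowered experiment.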
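As a sanity check on the 40% figure, the following sketch simulates the experiment many times with a true lift of 0.5pp and counts how often the one-tailed z-test rejects. Modelling per-request outcomes as normal noise, the baseline CTR of 0.10, the seed, and the number of simulated experiments are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)   # fixed seed (assumption) for repeatability

sigma, delta, alpha, n = 0.08, 0.005, 0.05, 1000
baseline_ctr = 0.10      # hypothetical baseline; only the difference matters
n_experiments = 4000     # number of simulated A/B tests (assumption)

# Simulate per-request outcomes as normal noise around each arm's true mean.
control = rng.normal(baseline_ctr, sigma, size=(n_experiments, n))
treatment = rng.normal(baseline_ctr + delta, sigma, size=(n_experiments, n))

diff = treatment.mean(axis=1) - control.mean(axis=1)
z = diff / (sigma * np.sqrt(2 / n))        # z-test with known sigma
rejected = z > stats.norm.ppf(1 - alpha)   # one-tailed decision per experiment

simulated_power = float(rejected.mean())
print(f"Simulated power: {simulated_power:.3f}")
print(f"Simulated miss rate (beta): {1 - simulated_power:.3f}")
```

The simulated power should sit near the analytic value, with some Monte Carlo noise that shrinks as `n_experiments` grows.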

The Fundamental Trade-off

For fixed $n$ and $\delta$, lowering $\alpha$ (fewer false positives) always raises $\beta$ (more false negatives), and vice versa. You cannot minimize both simultaneously with fixed resources.
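The trade-off can be made concrete by sliding the rejection threshold along the raw CTR-difference scale and reading off both error rates. This is a sketch under the post's parameters; the set of $\alpha$ values tried is an arbitrary illustrative choice.

```python
import numpy as np
from scipy import stats

sigma, delta, n = 0.08, 0.005, 1000
se = sigma * np.sqrt(2 / n)

h0 = stats.norm(loc=0.0, scale=se)     # distribution of the estimate if H0 is true
h1 = stats.norm(loc=delta, scale=se)   # distribution if the true lift is delta

rows = []
for alpha in (0.10, 0.05, 0.01, 0.001):
    threshold = h0.isf(alpha)          # reject when the observed difference exceeds this
    beta = h1.cdf(threshold)           # chance a real effect lands below the threshold
    rows.append((alpha, threshold, beta))
    print(f"alpha={alpha:<6} threshold={threshold:.5f}  beta={beta:.3f}")
```

Each tighter $\alpha$ pushes the threshold to the right, so more of the $H_1$ distribution falls below it and $\beta$ grows.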

The diagram below shows two distributions: $H_0$ centered at 0 and $H_1$ centered at 0.005 (the true effect). The one-tailed critical value at $\alpha = 0.05$ sits at 0.00589. Everything to the right of that line under $H_0$ is a Type I error. Everything to the left under $H_1$ is a Type II error. The green region is power.

[Figure: sampling distributions of the estimated CTR difference under H0 (delta = 0) and H1 (delta = 0.005), with the critical threshold at 0.00589. Type I: alpha = 0.05. Type II: beta = 0.598. Power: 1 - beta = 0.402.]

Designing for Proper Power

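Before running the ranking model experiment, calculate the required sample size for 80% power ($\beta = 0.20$):

$$n = \frac{2\sigma^2 (z_\alpha + z_\beta)^2}{\delta^2} = \frac{2 (0.08)^2 (1.6449 + 0.8416)^2}{(0.005)^2} \approx 3165.5$$

Here $z_\alpha = \Phi^{-1}(1 - \alpha)$ and $z_\beta = \Phi^{-1}(1 - \beta)$. For 80% power to detect a 0.5pp CTR difference, you need roughly 3,166 requests per arm after rounding up, more than three times the 1,000 initially planned.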

```python
import numpy as np
from scipy import stats

# Ranking model A/B test parameters
sigma = 0.08          # per-request CTR standard deviation
delta = 0.005         # minimum detectable effect (0.5pp)
alpha = 0.05          # one-tailed significance level
target_power = 0.80

z_alpha = stats.norm.ppf(1 - alpha)      # critical value, about 1.645
z_beta = stats.norm.ppf(target_power)    # quantile for the target power, about 0.842

# Per-arm sample size for a one-tailed two-sample z-test
n_required = ((z_alpha + z_beta) ** 2 * 2 * sigma**2) / delta**2
print(f"Required n per arm: {int(np.ceil(n_required))}")

# Power at various n
for n in [500, 1000, 2000, 3165, 5000]:
    se = sigma * np.sqrt(2 / n)          # standard error of the difference in means
    ncp = delta / se                     # non-centrality: effect in SE units
    power = 1 - stats.norm.cdf(z_alpha - ncp)
    print(f"  n={n:5d}: power={power:.3f}")
```

Output:

```
Required n per arm: 3166
  n=  500: power=0.256
  n= 1000: power=0.402
  n= 2000: power=0.630
  n= 3165: power=0.800
  n= 5000: power=0.931
```

Multiple Testing Problem

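If you run 20 A/B tests simultaneously at $\alpha = 0.05$ (testing 20 different features) and none of them truly improves CTR:

Expected false positives: $20 \times 0.05 = 1$

Probability of at least one false positive, assuming independent tests: $1 - (1 - 0.05)^{20} \approx 0.64$

You have a 64% chance of shipping at least one ineffective feature. Solutions:

Bonferroni: $\alpha_{\text{adj}} = \alpha / m = 0.05 / 20 = 0.0025$. Reduces false positives but lowers power for each individual test.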
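The arithmetic behind these numbers is short enough to check directly. The sketch assumes the 20 tests are independent, which is what the "at least one" probability relies on.

```python
m = 20          # number of simultaneous tests
alpha = 0.05    # per-test significance level

expected_false_positives = m * alpha
fwer_uncorrected = 1 - (1 - alpha) ** m   # P(at least one false positive), independent tests

alpha_bonferroni = alpha / m              # Bonferroni-adjusted per-test level
fwer_bonferroni = 1 - (1 - alpha_bonferroni) ** m

print(f"Expected false positives:   {expected_false_positives:.1f}")
print(f"FWER without correction:    {fwer_uncorrected:.3f}")
print(f"Bonferroni per-test alpha:  {alpha_bonferroni:.4f}")
print(f"FWER with Bonferroni:       {fwer_bonferroni:.3f}")
```

With the adjusted threshold, the family-wise error rate drops back to just under the original 5%.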

False Discovery Rate (FDR): Controls the expected proportion of false discoveries among significant results. More powerful than Bonferroni when running many tests.
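One common way to control FDR is the Benjamini-Hochberg step-up procedure; the post does not say which FDR method it has in mind, so treat this as one possible instance. The helper name `benjamini_hochberg` and the list of p-values are hypothetical, made up purely to illustrate the comparison with Bonferroni.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean array marking which hypotheses BH rejects at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices that sort p ascending
    thresholds = q * np.arange(1, m + 1) / m     # BH line: q * k / m for rank k
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k_max = np.max(np.nonzero(below)[0])     # largest rank under the BH line
        rejected[order[: k_max + 1]] = True      # reject everything up to that rank
    return rejected

# Hypothetical p-values from 20 feature tests (illustrative, not real data).
p_values = [0.0004, 0.002, 0.006, 0.009, 0.03, 0.04, 0.12, 0.21, 0.25, 0.33,
            0.38, 0.41, 0.47, 0.52, 0.60, 0.68, 0.74, 0.81, 0.90, 0.97]

bh = benjamini_hochberg(p_values, q=0.05)
bonferroni = np.asarray(p_values) <= 0.05 / len(p_values)
print(f"Bonferroni rejects {bonferroni.sum()} tests, BH rejects {bh.sum()} tests")
```

On these made-up p-values, Bonferroni keeps two results and Benjamini-Hochberg keeps four, which shows the extra power that comes from controlling the proportion of false discoveries rather than the chance of any false positive.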

Pre-registration: Specify which features are being tested and their primary hypothesis before data collection. Prevents selective reporting.

What Affects Error Rates

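| Factor | Effect on error rates |
|---|---|
| Sample size $n$ | Larger $n$: lower $\beta$ at a fixed $\alpha$; if the raw rejection threshold is held fixed instead, $\alpha$ shrinks |
| Effect size $\delta$ | Larger effects: lower $\beta$, easier to detect |
| Significance level $\alpha$ | Higher $\alpha$: lower $\beta$, but more false positives |
| Variability $\sigma$ | Lower variability: lower $\beta$ |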

Three Things People Get Wrong

"$\alpha + \beta = 1$." This is wrong. $\alpha$ is the error probability under $H_0$; $\beta$ is the error probability under $H_1$. They are computed under different conditions and do not sum to anything meaningful. In the ranking example, $\alpha + \beta = 0.05 + 0.598 = 0.648$. What sums to 1 is $\alpha + (1 - \alpha)$ and $\beta + (1 - \beta)$ separately.

"Type I error is always worse." Context determines this. For exploratory feature testing at an early product stage, Type II errors are often worse: missing real improvements is costly when you are iterating fast. For high-stakes decisions (safety systems, medical interventions), Type I errors may be worse.

"Power is the probability of a correct decision." Power is the probability of rejecting $H_0$ given that $H_1$ is true. Under $H_0$, the probability of a correct decision is $1 - \alpha$, not power.

Type I and Type II errors are the formalization of the hypothesis testing setup from post 3 and the p-value interpretation from post 4. Power analysis requires knowing the test statistic distribution (Z-test post 5, t-test posts 6-7), which determines the shape of both distributions in the diagram above. The multiple testing problem connects to FDR-controlled methods used when running large-scale feature experiments. ANOVA (post 14) generalizes this framework to comparing multiple groups simultaneously, where the "all-pairs" error inflation is even more severe.

Honest Limitations

Power calculations assume you know $\sigma$ and $\delta$ in advance, which requires historical data or prior experiments. In practice, these are estimates, so your power calculation is itself uncertain. Running power analyses with optimistic effect sizes is a common source of chronically underpowered experiments. When historical variance estimates are unavailable, pilot studies or sequential designs (checking at pre-planned interim points with adjusted thresholds) are more honest approaches.
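A quick way to see how fragile the plan is: recompute the required sample size across a range of guesses for the effect size and the noise level. The helper name `required_n_per_arm` and the alternative guesses (0.003, 0.008, 0.06, 0.10) are hypothetical; only $\delta = 0.005$ and $\sigma = 0.08$ come from this post.

```python
import numpy as np
from scipy import stats

def required_n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Per-arm sample size for a one-tailed two-sample z-test."""
    z_alpha = stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    return int(np.ceil(2 * (z_alpha + z_beta) ** 2 * sigma**2 / delta**2))

sigma_guess = 0.08
for delta_guess in (0.003, 0.005, 0.008):      # pessimistic, planned, optimistic
    n_req = required_n_per_arm(delta_guess, sigma_guess)
    print(f"assumed delta={delta_guess:.3f}: n per arm = {n_req}")

delta_guess = 0.005
for sigma_guess in (0.06, 0.08, 0.10):         # plausible range for the noise estimate
    n_req = required_n_per_arm(delta_guess, sigma_guess)
    print(f"assumed sigma={sigma_guess:.2f}: n per arm = {n_req}")
```

Because the required $n$ scales with $\sigma^2 / \delta^2$, halving the assumed effect size roughly quadruples the sample you need, which is why an optimistic $\delta$ is so costly.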

Test Your Understanding

  1. For the ranking model experiment, calculate the power when per arm and the true CTR difference is . Use and one-tailed.
  2. You tighten the significance threshold from $\alpha = 0.05$ to $\alpha = 0.01$ for a high-stakes ranking change. Holding everything else constant, does power increase or decrease? By how much (compute numerically for $n = 1000$ per arm)?
  3. Twenty A/B tests are run. Three come back significant at . Using the Bonferroni correction, how many of these should be considered significant? What if you used the original threshold?
  4. A model experiment has 80% power for a 1% CTR lift. A colleague says "we are safe — we will catch 80% of real effects." What is missing from this statement that matters for experiment design decisions?
  5. The operations team asks: "Can we cut the experiment to 500 per arm to save resources?" For the ranking model test, what is the power at $n = 500$, and how would you explain the consequence of this cut to a non-statistician?
