A/B Testing

Your new recommendation model shows 14% click-through rate (CTR) in offline evaluation, versus 12% for the current model. Can you ship it? Not yet — offline evaluation doesn't tell you what happens when real users interact with the live system. An A/B test does. And there is a specific reason you cannot simply observe which model users happened to see: without randomized assignment, you cannot distinguish the model's effect from the effect of which users saw it.

What A/B Testing Is (and What It Is Not)

An A/B test is a randomized controlled experiment: users are randomly assigned to group A (control, existing model) or group B (treatment, new model). Random assignment is what makes causal inference possible: it breaks any link between user characteristics and which model a user sees, so confounders are balanced across groups on average. Users in groups A and B are statistically identical in expectation on every measured and unmeasured characteristic.

The causal claim: if B outperforms A in a well-designed A/B test, you can say B causes higher CTR — not just that it correlates with it. An observational study (comparing users who happened to use model A vs model B) cannot support this claim.
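A small simulation makes the difference concrete. This is a sketch under invented assumptions: a hypothetical "heavy user" confounder, made-up click rates, and two models that are in truth equally good. The observational comparison still reports a large gap, while the randomized comparison reports roughly none.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical confounder: heavy users click more regardless of which model they see.
heavy = rng.random(n) < 0.5
base_ctr = np.where(heavy, 0.20, 0.05)

def simulate_clicks(sees_b):
    # Assumption for this sketch: model B has NO true effect on CTR.
    true_effect = 0.0
    return rng.random(n) < base_ctr + true_effect * sees_b

# Observational: heavy users are much more likely to end up on model B.
sees_b_obs = rng.random(n) < np.where(heavy, 0.8, 0.2)
clicks_obs = simulate_clicks(sees_b_obs)
gap_obs = clicks_obs[sees_b_obs].mean() - clicks_obs[~sees_b_obs].mean()

# Randomized: a coin flip decides, independent of user type.
sees_b_rct = rng.random(n) < 0.5
clicks_rct = simulate_clicks(sees_b_rct)
gap_rct = clicks_rct[sees_b_rct].mean() - clicks_rct[~sees_b_rct].mean()

print(f"Observational gap (B - A): {gap_obs:+.4f}")
print(f"Randomized gap    (B - A): {gap_rct:+.4f}")
```

The observational gap comes entirely from who saw model B, not from anything model B did.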

What A/B testing requires:

Random assignment — not self-selection or convenience
Independence between units — one user's assignment doesn't affect another's
One primary metric pre-specified before data collection
Sample size determined before the experiment starts

The Anchor

python

# Current model (A): CTR = 0.12 (12%)
# New model (B): expected CTR = 0.14 (14%)
p_A = 0.12
p_B = 0.14
mde = p_B - p_A  # minimum detectable effect = 0.02 (2 percentage points)
alpha = 0.05     # Type I error rate
power = 0.80     # 80% power (β = 0.20)

Complete Workflow

Every A/B test has seven phases. Define all of them before running any statistics.

Hypothesis and primary metric: what is being tested and which single metric determines success?
Sample size calculation: how many users are needed to detect the MDE with sufficient power?
Experiment duration: from daily traffic, how many days to run?
Randomization: randomly assign users, check for sample ratio mismatch (SRM)
Data collection: record the primary metric for all assigned users
Statistical test: z-test for proportions (large n), compute z-statistic and p-value
Decision: reject or fail to reject H₀; verify guardrail metrics pass

Phase 1: Hypotheses and MDE

H₀: CTR_B − CTR_A = 0 (no difference)

H₁: CTR_B > CTR_A (one-tailed, we only ship if B is better)

Minimum Detectable Effect (MDE) = 0.02: this is not a statistical choice — it is a business judgment. "Would we ship a model that only improves CTR by 0.001? At what improvement does deployment become worthwhile?" The MDE is the threshold where the improvement justifies the engineering and deployment cost. Set it before the experiment.

Pre-specification is mandatory. The direction of the test (one-tailed vs two-tailed) and the primary metric must be committed to before any data is collected. Changing the tail of the test after seeing results, or adding metrics after seeing which ones moved, inflates Type I error.

Phase 2: Sample Size Calculation

For comparing two proportions (click or no click is binary):

n = (z_α + z_β)² × [p_A(1−p_A) + p_B(1−p_B)] / (p_B − p_A)²

Step by step (one-tailed test, α=0.05, power=0.80):

z_α = 1.645 (one-tailed), z_β = 0.842 → (1.645 + 0.842)² = (2.487)² = 6.185
Variance sum: p_A(1−p_A) + p_B(1−p_B) = 0.12×0.88 + 0.14×0.86 = 0.1056 + 0.1204 = 0.2260
Denominator: (0.14 − 0.12)² = (0.02)² = 0.0004
n = 6.185 × 0.2260 / 0.0004 = 1.3978 / 0.0004 = 3,494 per group → ~7,000 total

For two-tailed (z_α/2 = 1.960): n = (1.960 + 0.8416)² × 0.2260 / 0.0004 = 7.849 × 0.2260 / 0.0004 = 4,434.6, rounded up to 4,435 per group → 8,870 total

Experiment duration: if daily traffic allows 1,000 users per group: 4,435 / 1,000 ≈ 4.4 days → round up to 5 days minimum.
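The same arithmetic can be wrapped in a small helper so the one-tailed and two-tailed cases share one code path. This is a minimal sketch: the function names `sample_size_per_group` and `days_needed` are illustrative, not part of any library, and the formula is the unpooled two-proportion approximation used above.

```python
import math
from scipy import stats

def sample_size_per_group(p_a, p_b, alpha=0.05, power=0.80, two_sided=False):
    """Users needed per group to detect p_b vs p_a with the given alpha and power."""
    z_alpha = stats.norm.ppf(1 - alpha / 2) if two_sided else stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    variance_sum = p_a * (1 - p_a) + p_b * (1 - p_b)
    # Round up: a fractional user cannot be recruited.
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p_b - p_a) ** 2)

def days_needed(n_per_group, daily_users_per_group):
    """Whole days required to reach n_per_group at the given daily traffic."""
    return math.ceil(n_per_group / daily_users_per_group)

n_one = sample_size_per_group(0.12, 0.14, two_sided=False)
n_two = sample_size_per_group(0.12, 0.14, two_sided=True)
print(f"one-tailed: {n_one} per group, two-tailed: {n_two} per group")
print(f"days at 1,000 users/day per group (two-tailed): {days_needed(n_two, 1000)}")
```

Halving the MDE roughly quadruples the required sample, because the effect size enters the denominator squared.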

Never shorten the experiment because "it looks significant already" — that is peeking.

Phase 3: Randomization

Unit of randomization: the user. A user ID is hashed to deterministically assign the user to A or B, consistently across sessions.
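Hash-based assignment might look like the sketch below. The experiment name used as a salt and the 100-bucket scheme are assumptions for illustration. The property that matters is that the same user ID always lands in the same group.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "rec_model_v2") -> str:
    """Deterministically map a user to 'A' or 'B' with a 50/50 split."""
    # Salting with a (hypothetical) experiment name keeps bucket assignments
    # independent across different experiments run on the same users.
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "A" if bucket < 50 else "B"

# Same user, same answer, on every call and in every session.
print(assign_variant("user_42"), assign_variant("user_42"))

# The split over many users should be close to 50/50.
groups = [assign_variant(f"user_{i}") for i in range(10_000)]
print(groups.count("A"), groups.count("B"))
```

Python's built-in `hash()` is not suitable here because string hashing is randomized per process by default, so assignments would change between runs.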

Stratified randomization: if mobile users have lower CTR than desktop users, and groups A and B might receive different mobile/desktop ratios by chance, stratify — assign users within each platform proportionally. This reduces variance.
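One simple way to stratify, assuming the user list and each user's platform are known up front, is to shuffle within each stratum and alternate assignments. This is a sketch of that blocked scheme; the helper name `stratified_assign` and the example users are hypothetical, and a live system that assigns users on arrival would need a different mechanism.

```python
import random
from collections import defaultdict

def stratified_assign(users, seed=0):
    """users: list of (user_id, stratum) pairs. Returns {user_id: 'A' or 'B'}."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for user_id, stratum in users:
        by_stratum[stratum].append(user_id)

    assignment = {}
    for stratum, ids in by_stratum.items():
        rng.shuffle(ids)  # random order within the stratum
        for i, user_id in enumerate(ids):
            # Alternating after a shuffle gives an exact 50/50 split per stratum.
            assignment[user_id] = "A" if i % 2 == 0 else "B"
    return assignment

# Hypothetical population: roughly two thirds mobile, one third desktop.
users = [(f"u{i}", "mobile" if i % 3 else "desktop") for i in range(1000)]
assignment = stratified_assign(users)
```

Within each platform the two groups differ in size by at most one user, so platform mix cannot drift between A and B by chance.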

Sample Ratio Mismatch (SRM) check: after the experiment, verify n_A ≈ n_B. If you targeted 50/50 and got 4,000 vs 5,500, something is wrong — broken logging, selection bias in assignment, or a bug in the randomization code. Never analyze an experiment with SRM. Use a chi-square test:

python

from scipy import stats

n_a, n_b = 4418, 4415
n_total = n_a + n_b
expected = n_total / 2
chi2_srm = (n_a - expected)**2/expected + (n_b - expected)**2/expected
p_srm = 1 - stats.chi2.cdf(chi2_srm, df=1)
print(f"SRM check: chi2={chi2_srm:.4f}, p={p_srm:.4f}")
# p >> 0.05 → no SRM detected → proceed to analysis

SRM check: chi2=0.0010, p=0.9745

p=0.97 → no evidence of SRM. Groups are balanced as expected.
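For contrast, here is the same check applied to the 4,000 vs 5,500 split mentioned above, using `scipy.stats.chisquare` instead of the hand-written statistic. The wrapper `srm_check` and its 0.05 threshold are illustrative choices, not a standard API.

```python
from scipy import stats

def srm_check(n_a, n_b, expected_ratio=0.5, threshold=0.05):
    """Chi-square goodness-of-fit test of observed group sizes against the target split."""
    total = n_a + n_b
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    chi2, p = stats.chisquare([n_a, n_b], f_exp=expected)
    return chi2, p, p < threshold

chi2_bad, p_bad, flagged_bad = srm_check(4000, 5500)
chi2_ok, p_ok, flagged_ok = srm_check(4418, 4415)
print(f"4000 vs 5500: chi2={chi2_bad:.2f}, p={p_bad:.3g}, SRM flagged={flagged_bad}")
print(f"4418 vs 4415: chi2={chi2_ok:.4f}, p={p_ok:.4f}, SRM flagged={flagged_ok}")
```

The `expected_ratio` argument lets the same check cover experiments that deliberately target an uneven split such as 90/10.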

Phase 4: Statistical Test

After collecting data: n_A=4,418 users, 530 clicks (CTR=12.00%); n_B=4,415 users, 618 clicks (CTR=14.00%).

For large-n proportion comparison, use the z-test for two proportions with a pooled standard error:

p̂_pooled = (clicks_A + clicks_B) / (n_A + n_B) = (530 + 618) / 8833 = 1148/8833 = 0.1300

SE = √(p̂_pooled × (1 − p̂_pooled) × (1/n_A + 1/n_B)) = √(0.1300 × 0.8700 × (1/4418 + 1/4415)) = √(0.1131 × 0.000453) = √0.0000512 = 0.00716

z = (p̂_B − p̂_A) / SE = (0.1400 − 0.1200) / 0.00716 = 0.0200 / 0.00716 = 2.793 (carrying the unrounded proportions through gives z ≈ 2.797; the conclusion is unchanged)

p-value (one-tailed) = P(Z > 2.793) = 0.0026

z=2.793 >> z*=1.645 → Reject H₀. The new model causes significantly higher CTR at the 5% level.

95% confidence interval for the difference: (p̂_B − p̂_A) ± 1.96 × SE = 0.0200 ± 1.96 × 0.00716 = 0.0200 ± 0.0140

CI = [0.0060, 0.0340] (from 0.6pp to 3.4pp improvement)

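The interval does not include 0, consistent with rejecting H₀. The point estimate is 2.0pp, with uncertainty spanning 0.6pp to 3.4pp. Note that this is a two-sided 95% interval, so it is slightly more conservative than the one-tailed test at α = 0.05.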

Common Mistakes (All Six)

1. Peeking (optional stopping): checking significance during the experiment and stopping when p < 0.05. If you look at results daily for 10 days and stop at the first significant look, your actual Type I error rate climbs to roughly 19% for a two-sided test at the nominal 5% threshold, and it keeps rising with more looks (above 30% at around 50 looks). Solution: pre-specify duration and look at significance only at the end.

2. Multiple metrics as primary: "If CTR doesn't improve, we'll check revenue. If not revenue, session length..." This is multiple testing without correction. Solution: pick ONE primary metric before the experiment. Track others as secondary or guardrail metrics only.

3. Novelty effect: users click more on anything new. An effect that appears in week 1 may vanish by week 3 as users adapt. Solution: run the experiment long enough for novelty to decay. If the decision is deployment, aim for at least two weeks.

4. Sample ratio mismatch: n_A and n_B differ by more than chance allows when you targeted 50/50, which indicates broken randomization or a logging bug. Solution: run the SRM check before any analysis. If p_SRM < 0.05, investigate the root cause before trusting results.

5. Underpowered experiments: running for too few users, then reporting "no significant difference" as if it means the models perform equally. Underpowered = inconclusive, not null. Solution: compute sample size before the experiment. If you stopped early for external reasons, report the achieved power.

6. Simpson's paradox: mobile users have lower CTR, and if group B accidentally got more mobile users, B may appear worse even if it's better on both platforms. Solution: always segment results by major confounders (device type, geography, user tenure) and check that the overall result is consistent with within-segment results.

Rule: if any of these apply, the experiment results cannot be trusted.
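The inflation from peeking (mistake 1) is easy to see in an A/A simulation, where both groups share the same true CTR and every "significant" result is a false positive. This is a sketch under assumed parameters (500 users per group per day, 10 daily looks, a two-sided test at |z| > 1.96); the exact rate depends on those choices.

```python
import numpy as np

def peeking_false_positive_rate(n_sims=2000, n_days=10, users_per_day=500,
                                ctr=0.12, z_crit=1.96, seed=0):
    rng = np.random.default_rng(seed)
    # Cumulative clicks per arm at the end of each day; both arms have the same true CTR.
    clicks_a = rng.binomial(users_per_day, ctr, size=(n_sims, n_days)).cumsum(axis=1)
    clicks_b = rng.binomial(users_per_day, ctr, size=(n_sims, n_days)).cumsum(axis=1)
    n = users_per_day * np.arange(1, n_days + 1)  # cumulative users per arm at each look

    # Pooled two-proportion z-statistic at every daily look.
    p_pooled = (clicks_a + clicks_b) / (2 * n)
    se = np.sqrt(p_pooled * (1 - p_pooled) * 2 / n)
    z = (clicks_b / n - clicks_a / n) / se

    significant = np.abs(z) > z_crit
    stop_at_first = significant.any(axis=1).mean()  # peek daily, stop when significant
    final_only = significant[:, -1].mean()          # look once, at the planned end
    return stop_at_first, final_only

peek_rate, final_rate = peeking_false_positive_rate()
print(f"False positive rate, stopping at first significant look: {peek_rate:.3f}")
print(f"False positive rate, single look at the end:             {final_rate:.3f}")
```

The single-look rate should sit near the nominal 5%, while the stop-at-first-significance rate comes out several times higher even though no real effect exists.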

Sequential Testing (Brief)

When you must look at results frequently — high-stakes decisions, safety-critical features — use sequential testing instead of fixed-horizon testing:

Alpha spending functions: O'Brien-Fleming boundary spends very little alpha early (requires very strong evidence to stop), spends more alpha late. Pocock boundary spends alpha uniformly — easier to stop early, but requires a larger final critical value.
Always-valid inference: methods like the mixture Sequential Probability Ratio Test (mSPRT) control error rates at any stopping time. The p-value is always valid regardless of when you look.

Sequential tests cost some power compared to a fixed-horizon test: to reach the same power, the maximum sample size must be somewhat larger than the fixed-horizon n. This is the price of looking early.
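To see how differently the two boundaries release alpha, the sketch below evaluates the Lan-DeMets spending-function approximations to the O'Brien-Fleming and Pocock designs. These give the cumulative alpha spent at information fraction t (the share of the planned sample collected so far), in their common two-sided form. They are not the critical values themselves; turning spent alpha into stopping boundaries requires numerical integration that a dedicated group-sequential package would handle.

```python
import math
from scipy import stats

def obrien_fleming_spent(t, alpha=0.05):
    """Lan-DeMets O'Brien-Fleming-type spending: cumulative alpha at information fraction t."""
    z = stats.norm.ppf(1 - alpha / 2)
    return 2 * (1 - stats.norm.cdf(z / math.sqrt(t)))

def pocock_spent(t, alpha=0.05):
    """Lan-DeMets Pocock-type spending: cumulative alpha at information fraction t."""
    return alpha * math.log(1 + (math.e - 1) * t)

print("   t   OBF-type  Pocock-type")
for t in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"{t:4.1f}  {obrien_fleming_spent(t):8.5f}  {pocock_spent(t):11.5f}")
```

Both functions reach the full alpha at t = 1, but the O'Brien-Fleming-type function has spent almost nothing by the first look, while the Pocock-type function has already released a substantial share.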

Guardrail Metrics

Beyond the primary metric, define guardrail metrics before the experiment — metrics that must not degrade:

Page latency must stay under 200ms (p95)
Error rate must stay under 0.1%
Revenue per user must not decrease more than 1%

Decision rule: ship the new model ONLY if:

Primary metric (CTR) improves significantly, AND
All guardrail metrics hold within pre-specified thresholds.

A CTR improvement that causes a 50ms latency increase or elevated error rate is not a win. Guardrail metrics prevent optimizing a proxy at the expense of the actual user experience.
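The decision rule can be written down as code before the experiment starts, which makes it harder to renegotiate afterwards. This is a sketch only: the `Guardrail` class, the metric names, and the observed values are invented for illustration, and treating each limit as an inclusive upper bound is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    name: str
    observed: float
    limit: float  # assumption: the metric passes when observed <= limit

def ship_decision(primary_p_value, primary_lift, guardrails, alpha=0.05):
    """Return (ship, violations): ship only if the primary metric wins AND no guardrail breaks."""
    primary_ok = primary_p_value < alpha and primary_lift > 0
    violations = [g.name for g in guardrails if g.observed > g.limit]
    return primary_ok and not violations, violations

# Hypothetical readout: CTR wins clearly, but p95 latency breaches its limit.
guardrails = [
    Guardrail("p95_latency_ms", observed=210.0, limit=200.0),
    Guardrail("error_rate", observed=0.0005, limit=0.001),
    Guardrail("revenue_per_user_drop", observed=0.002, limit=0.01),
]
ship, violations = ship_decision(primary_p_value=0.0026, primary_lift=0.02,
                                 guardrails=guardrails)
print(f"Ship: {ship}, guardrail violations: {violations}")
```

A significant primary metric is necessary but not sufficient: here the decision is "do not ship" despite p = 0.0026 on CTR.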

Full Code

python

import numpy as np
from scipy import stats

# Anchor
p_A, p_B = 0.12, 0.14
alpha, power = 0.05, 0.80

# Sample size (two-tailed)
z_alpha2 = stats.norm.ppf(1 - alpha/2)   # 1.96
z_beta   = stats.norm.ppf(power)          # 0.842
var_sum  = p_A*(1-p_A) + p_B*(1-p_B)
n_per_group = int(np.ceil((z_alpha2 + z_beta)**2 * var_sum / (p_B - p_A)**2))
print(f"Sample size: {n_per_group} per group, {2*n_per_group} total")
print(f"Duration at 1000/day per group: {n_per_group/1000:.1f} days")

# SRM check
n_a, n_b = 4418, 4415
n_total = n_a + n_b
expected = n_total / 2
chi2_srm = (n_a - expected)**2/expected + (n_b - expected)**2/expected
p_srm = 1 - stats.chi2.cdf(chi2_srm, df=1)
print(f"\nSRM check: chi2={chi2_srm:.4f}, p={p_srm:.4f}")

# Z-test for proportions
clicks_a, clicks_b = 530, 618
p_hat_a = clicks_a / n_a
p_hat_b = clicks_b / n_b
p_pooled = (clicks_a + clicks_b) / (n_a + n_b)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_a + 1/n_b))
z_stat = (p_hat_b - p_hat_a) / se
p_value_one = 1 - stats.norm.cdf(z_stat)

print(f"\np_hat_A={p_hat_a:.4f}, p_hat_B={p_hat_b:.4f}")
print(f"p_pooled={p_pooled:.4f}, SE={se:.5f}")
print(f"z={z_stat:.4f}, p (one-tailed)={p_value_one:.4f}")

# 95% CI for the difference
diff = p_hat_b - p_hat_a
ci_lower = diff - 1.96*se
ci_upper = diff + 1.96*se
print(f"Observed difference: {diff:.4f} ({diff*100:.2f}pp)")
print(f"95% CI for difference: [{ci_lower:.4f}, {ci_upper:.4f}]")

Sample size: 4435 per group, 8870 total
Duration at 1000/day per group: 4.4 days

SRM check: chi2=0.0010, p=0.9745

p_hat_A=0.1200, p_hat_B=0.1400
p_pooled=0.1300, SE=0.00716
z=2.7968, p (one-tailed)=0.0026
Observed difference: 0.0200 (2.00pp)
95% CI for difference: [0.0060, 0.0340]

Test Your Understanding

You see p=0.032 after 3 days of a planned 7-day experiment. The effect looks exactly as expected. A PM asks you to call it now. Explain what happens to the actual Type I error rate if you stop here, and what the experiment was designed to achieve at 7 days.
The primary metric (CTR) improves by 2pp (p=0.003), but page load time increases by 80ms (p=0.001), exceeding your 50ms guardrail threshold. The PM argues: "The CTR improvement is significant, let's ship." Construct the counter-argument and explain why guardrail violations are disqualifying.
Model B improves CTR on desktop by 3pp (p=0.001) but decreases CTR on mobile by 1pp (p=0.08). The overall test shows p=0.04 in favor of B. What is happening (hint: the treatment effect differs by segment, so the aggregate hides a possible mobile regression; contrast this with Simpson's paradox), and what is the correct decision? What does "ship B" mean for mobile users?
You ran a 5-day experiment with 3,000 users per group and observed no significant improvement (p=0.21). A director declares "Model B is no better than A." Why is this conclusion not supported by the data? Compute the power of this experiment if the true effect is 2pp, p_A=0.12, α=0.05.
An A/B testing framework offers "continuous monitoring with always-valid p-values." A colleague says this solves the peeking problem entirely — you can stop whenever you want. Is this correct? What does the always-valid property guarantee, and what does it cost compared to a fixed-horizon test?
