Hypothesis Testing

Apr 11, 2026 | 7 min read | By mohammed.vasim
Tags: Statistics, Math, Data Science

You ran an A/B test: version A of your recommendation algorithm served 5,000 users and produced a mean click-through rate of 3.2%, while version B served another 5,000 users and produced 3.7%. Version B looks better. But is that 0.5-percentage-point difference real, or is it just random fluctuation you would expect even if both algorithms performed identically?

This is the question hypothesis testing exists to answer. It forces you to be precise about what "real" means by asking: if there were truly no difference, how surprising would this data be? You do not prove things directly — you rule out the "nothing is happening" explanation indirectly.
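One way to make "how surprising would this data be?" concrete is to simulate it. The sketch below is illustrative only: it assumes both versions share a single true CTR of 3.45% (the combined rate of the two groups), and the seed and number of simulated experiments are arbitrary choices, not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

n_per_group = 5000           # users per version, from the example above
observed_gap_clicks = 25     # 185 clicks vs 160 clicks, i.e. 0.5 percentage points
shared_rate = 0.0345         # assumed common CTR when "nothing is happening"
n_sims = 20_000              # hypothetical number of simulated experiments

# Simulate many A/B tests in which both versions have the same true CTR
sim_clicks_a = rng.binomial(n_per_group, shared_rate, size=n_sims)
sim_clicks_b = rng.binomial(n_per_group, shared_rate, size=n_sims)

# Share of no-difference experiments whose gap is at least as large as the observed one
frac_as_extreme = np.mean(np.abs(sim_clicks_b - sim_clicks_a) >= observed_gap_clicks)
print(f"Share of null experiments with a gap of 0.5pp or more: {frac_as_extreme:.3f}")
```

The printed share is a simulation-based stand-in for the p-value that the rest of the post derives analytically with a Normal approximation.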

The Two-Algorithm Dataset

Throughout this post, the dataset is:

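  • Version A: $n_A = 5000$, mean CTR $\hat{p}_A = 0.032$, observed clicks $x_A = 160$
  • Version B: $n_B = 5000$, mean CTR $\hat{p}_B = 0.037$, observed clicks $x_B = 185$
  • Observed difference: $\hat{p}_B - \hat{p}_A = 0.005$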

The Setup

Every hypothesis test starts with two competing claims:

The Null Hypothesis ($H_0$) is the status quo: "nothing special is happening." For the A/B test: $H_0: p_A = p_B$, meaning both algorithms perform identically.

The Alternative Hypothesis ($H_1$) is what you suspect is true: $H_1: p_A \neq p_B$, meaning the algorithms differ in CTR.

You collect data, calculate a test statistic, and ask: if $H_0$ were true, how surprising would this data be? If it is very surprising, you reject $H_0$.

One-Tailed vs Two-Tailed: Choose Before Seeing Data

This choice must be made before you look at results.

Two-tailed ($H_1: p_A \neq p_B$): You care about any difference; version B could be better or worse. Use this when you have no prior reason to expect a specific direction. It is the conservative, default choice for A/B tests because you do not know in advance which direction the effect will go.

One-tailed ($H_1: p_B > p_A$): You care only about version B being better. Use this only when (1) the opposite direction is impossible or irrelevant, and (2) you specified this before collecting data. Switching to a one-tailed test after you see the direction of the effect is a form of p-hacking.

For this A/B test, use two-tailed: you want to know if the algorithms differ, not just if B is better.
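To see why the tail choice has to be fixed in advance, it helps to compute both p-values for the same data. This is a minimal sketch using the click counts from this post and the pooled two-proportion z statistic; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

clicks_a, n_a = 160, 5000
clicks_b, n_b = 185, 5000

p_a, p_b = clicks_a / n_a, clicks_b / n_b
p_pool = (clicks_a + clicks_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

# Two-tailed: evidence for any difference (H1: p_A != p_B)
p_two_tailed = 2 * stats.norm.sf(abs(z))
# One-tailed: evidence only for B being better (H1: p_B > p_A)
p_one_tailed = stats.norm.sf(z)

print(f"two-tailed p = {p_two_tailed:.4f}")
print(f"one-tailed p = {p_one_tailed:.4f}")
```

When the observed effect points in the hoped-for direction, the one-tailed p-value is exactly half the two-tailed one. Picking the tail after seeing the direction therefore halves the p-value for free, which is why it counts as p-hacking.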

The Six-Step Procedure

Step 1: State hypotheses and choose tail direction.

$H_0: p_A = p_B$ versus $H_1: p_A \neq p_B$ (two-tailed)

Significance level: $\alpha = 0.05$

Step 2: Choose the appropriate test.

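Both groups have large $n$ and binary outcomes. Use a two-proportion z-test. The pooled proportion under $H_0$ (total clicks over total users) is:

$$\hat{p} = \frac{x_A + x_B}{n_A + n_B} = \frac{160 + 185}{5000 + 5000} = 0.0345$$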

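Step 3: Calculate the test statistic.

With pooled proportion $\hat{p} = 0.0345$, the standard error of the difference is:

$$SE = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)} = \sqrt{0.0345 \times 0.9655 \times 0.0004} \approx 0.00365$$

$$Z = \frac{\hat{p}_B - \hat{p}_A}{SE} = \frac{0.005}{0.00365} \approx 1.370$$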

Step 4: Find the critical value and rejection region.

For two-tailed $\alpha = 0.05$: $z_{\alpha/2} = 1.96$.

Reject $H_0$ if $|Z| > 1.96$.
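The critical value comes from the inverse CDF of the standard Normal distribution. A minimal sketch with SciPy follows; the helper `in_rejection_region` is a hypothetical name used only for illustration.

```python
from scipy import stats

alpha = 0.05

# Two-tailed: split alpha across both tails, so look up the 1 - alpha/2 quantile
z_crit_two_tailed = stats.norm.ppf(1 - alpha / 2)
# One-tailed: all of alpha sits in a single tail
z_crit_one_tailed = stats.norm.ppf(1 - alpha)

def in_rejection_region(z, z_crit=z_crit_two_tailed):
    """Return True when |z| falls beyond the two-tailed critical value."""
    return abs(z) > z_crit

print(f"two-tailed critical value: {z_crit_two_tailed:.3f}")
print(f"one-tailed critical value: {z_crit_one_tailed:.3f}")
```

The one-tailed cutoff is lower than the two-tailed one, which is another way to see why a one-tailed test rejects more easily in the chosen direction.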

Step 5: Calculate the p-value.

$$p = 2 \times P(Z > |Z_{obs}|) = 2 \times P(Z > 1.370) \approx 2 \times 0.0854 \approx 0.171$$

Step 6: Decision and interpretation.

$|Z| = 1.370 < 1.96$ and $p = 0.171 > 0.05$: fail to reject $H_0$. The observed CTR difference is not statistically significant at $\alpha = 0.05$. The data are consistent with random variation.

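| Phase | Formula | Values | Result |
| --- | --- | --- | --- |
| Pooled proportion | $\hat{p} = \frac{x_A + x_B}{n_A + n_B}$ | $\frac{160 + 185}{5000 + 5000}$ | 0.0345 |
| Standard error | $\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}$ | $\sqrt{0.0345 \times 0.9655 \times 0.0004}$ | 0.00365 |
| Z statistic | $\frac{\hat{p}_B - \hat{p}_A}{SE}$ | $\frac{0.005}{0.00365}$ | 1.370 |
| p-value (two-tailed) | $2 \times P(Z > \lvert Z_{obs} \rvert)$ | $2 \times 0.0854$ | 0.171 |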
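A quick way to sanity-check a hand-computed two-proportion z-test is to compare it with a Pearson chi-square test on the 2x2 table of clicks and non-clicks. Without the continuity correction, the chi-square statistic equals $Z^2$ and the two p-values agree. The sketch below assumes SciPy is available; it is a cross-check, not part of the original procedure.

```python
import numpy as np
from scipy import stats

clicks_a, n_a = 160, 5000
clicks_b, n_b = 185, 5000

# Pooled two-proportion z-test, as in the steps above
p_pool = (clicks_a + clicks_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (clicks_b / n_b - clicks_a / n_a) / se
p_z = 2 * stats.norm.sf(abs(z))

# Same data as a 2x2 contingency table: rows = versions, columns = click / no click
table = np.array([
    [clicks_a, n_a - clicks_a],
    [clicks_b, n_b - clicks_b],
])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table, correction=False)

print(f"z^2  = {z**2:.4f}, p = {p_z:.4f}")
print(f"chi2 = {chi2:.4f}, p = {p_chi2:.4f}")
```

If the two lines disagree, the usual culprits are a wrong standard error or a continuity correction left switched on.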

Rejection Region and p-Value SVGs

The rejection region is the set of Z values where you would reject $H_0$. The observed statistic sits inside the non-rejection zone.

[Figure: standard Normal curve. Red shading marks the rejection region for $\alpha = 0.05$ two-tailed (2.5% below $-1.96$ and 2.5% above $+1.96$); an amber marker shows the observed $Z = 1.370$, inside the non-rejection zone.]

The p-value is the probability of observing Z at least as extreme as 1.370 under $H_0$.

[Figure: standard Normal curve. Amber shading marks both tails beyond $\pm 1.370$, about 8.54% each, so p-value = 2 × 8.54% ≈ 17.1%.]

Power: Can This Test Detect a Real Effect?

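Power is the probability of correctly rejecting $H_0$ when a real difference exists. For the A/B test with $n = 5000$ per group, $\alpha = 0.05$, and true difference $\delta = 0.005$, the standard error is $SE \approx 0.00365$, so the effect corresponds to $\delta / SE \approx 1.37$ standard errors. The non-centrality parameter is 1.37.

$$\text{Power} \approx \Phi(1.37 - 1.96) = \Phi(-0.59) \approx 0.28$$

The test has only about 28% power for this effect size: it would miss a real 0.5pp difference roughly seven times out of ten. To achieve 80% power, you would need roughly $n \approx 21{,}000$ per group.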
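The power and sample-size figures can be reproduced with a short Normal-approximation sketch. The function names `approx_power` and `n_for_power` are hypothetical, and the calculation assumes the true rates are exactly 3.2% and 3.7% and uses the unpooled standard error under the alternative, which is one common simplification rather than the only valid choice.

```python
import numpy as np
from scipy import stats

def approx_power(p_a, p_b, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed two-proportion z-test (Normal approximation)."""
    se = np.sqrt(p_a * (1 - p_a) / n_per_group + p_b * (1 - p_b) / n_per_group)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    shift = abs(p_b - p_a) / se  # true effect measured in standard errors
    # Probability that Z lands in either rejection tail when the effect is real
    return stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)

def n_for_power(p_a, p_b, power=0.80, alpha=0.05):
    """Approximate per-group sample size needed to reach the target power."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    variance_sum = p_a * (1 - p_a) + p_b * (1 - p_b)
    return int(np.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p_b - p_a) ** 2))

power_now = approx_power(0.032, 0.037, n_per_group=5000)
n_needed = n_for_power(0.032, 0.037, power=0.80)

print(f"Power at n = 5000 per group: {power_now:.3f}")
print(f"n per group for 80% power:   {n_needed}")
```

Running the power function again at the returned sample size is a useful check that the two formulas agree with each other.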

Python Code

```python
import numpy as np
from scipy import stats

# A/B test data: clicks and users for each version
clicks_a, n_a = 160, 5000
clicks_b, n_b = 185, 5000

# Observed click-through rates and the pooled rate assumed under H0
p_a = clicks_a / n_a
p_b = clicks_b / n_b
p_pool = (clicks_a + clicks_b) / (n_a + n_b)

# Pooled standard error, Z statistic, and two-tailed p-value
se = np.sqrt(p_pool * (1 - p_pool) * (1/n_a + 1/n_b))
z_stat = (p_b - p_a) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

print(f"CTR A: {p_a:.4f}, CTR B: {p_b:.4f}")
print(f"Pooled proportion: {p_pool:.4f}")
print(f"Standard error: {se:.5f}")
print(f"Z statistic: {z_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Reject H0: {p_value < 0.05}")
```

Output:

```
CTR A: 0.0320, CTR B: 0.0370
Pooled proportion: 0.0345
Standard error: 0.00365
Z statistic: 1.3698
p-value: 0.1708
Reject H0: False
```

Common Mistakes

Testing after seeing data: Never define your hypotheses or choose one-tailed vs two-tailed after looking at results. This invalidates the inference.

Equating statistical significance with practical significance: A 0.005 CTR difference might be economically meaningful at scale even if $p > 0.05$. Calculate effect size and business impact separately.

Confusing "fail to reject" with "accept $H_0$": Not rejecting $H_0$ means the data is consistent with no difference; it does not mean no difference exists.

Ignoring power: The A/B test above has less than 50% power for a 0.5pp effect. An underpowered study is ethically questionable: you are spending resources on a study unlikely to detect real effects.

Multiple testing without correction: Running 20 A/B tests simultaneously at $\alpha = 0.05$ gives roughly 1 expected false positive ($20 \times 0.05 = 1$) even if no algorithm actually differs.
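The arithmetic behind the multiple-testing warning fits in a few lines. The sketch below assumes the 20 tests are independent and that every null hypothesis is true. It also shows the Bonferroni threshold, one common correction chosen here for illustration, not a method the post prescribes.

```python
n_tests = 20
alpha = 0.05

# Expected number of false positives when every null hypothesis is true
expected_false_positives = n_tests * alpha

# Chance of at least one false positive, assuming the tests are independent
family_wise_error_rate = 1 - (1 - alpha) ** n_tests

# Bonferroni: test each hypothesis at alpha / n_tests to cap the family-wise rate at alpha
bonferroni_alpha = alpha / n_tests

print(f"Expected false positives:       {expected_false_positives:.2f}")
print(f"P(at least one false positive): {family_wise_error_rate:.3f}")
print(f"Bonferroni per-test threshold:  {bonferroni_alpha:.4f}")
```

Bonferroni is conservative, so it trades some power for that guarantee; other corrections exist and may suit a given experiment better.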

Hypothesis testing builds on the CLT (post 1) to justify using Normal or t distributions for test statistics. It uses estimation concepts (post 2) for computing standard errors. The p-value mechanics are expanded in post 4, and the specific test statistics for means and proportions are developed in posts 5 through 8. Type I and Type II errors (post 9) formalize the tradeoffs introduced here with power analysis. Confidence intervals (post 11) are the complement: everything this test rejects at level $\alpha$ is outside the $100(1-\alpha)\%$ confidence interval.
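The link between tests and confidence intervals can be checked on the A/B data. This is a minimal sketch of a 95% Wald interval for the difference in CTR. It uses the unpooled standard error, so its agreement with the pooled z-test is approximate rather than exact.

```python
import numpy as np
from scipy import stats

clicks_a, n_a = 160, 5000
clicks_b, n_b = 185, 5000
p_a, p_b = clicks_a / n_a, clicks_b / n_b

diff = p_b - p_a
# Unpooled standard error: each group keeps its own variance estimate
se_unpooled = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z_crit = stats.norm.ppf(0.975)  # two-tailed, alpha = 0.05

ci_low = diff - z_crit * se_unpooled
ci_high = diff + z_crit * se_unpooled
contains_zero = ci_low <= 0 <= ci_high

print(f"95% CI for p_B - p_A: [{ci_low:.5f}, {ci_high:.5f}]")
print(f"Interval contains 0: {contains_zero}")
```

An interval that contains 0 corresponds to failing to reject $H_0$ at the same $\alpha$, and it has the added benefit of showing the range of effect sizes the data cannot rule out.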

Honest Limitations

Hypothesis testing answers one specific question: is this data consistent with the null hypothesis? It says nothing about why an effect exists, whether it will replicate, or whether it matters. An A/B test that narrowly fails to reject $H_0$ might still be worth acting on if the effect size is large enough and the cost of inaction is high. Statistical significance is a tool, not a decision-making algorithm.

Test Your Understanding

  1. In the A/B test above, if you increase the sample to per group with the same observed CTR rates, would the result change? Calculate the new Z statistic.
  2. A team argues they should use a one-tailed test because "we expect version B to be better." What information would you need to know before accepting this choice of test?
  3. The A/B test had less than 50% power for a true 0.5pp difference. What does this mean in plain language for a product manager who must decide whether to run the experiment?
  4. You run 10 A/B tests simultaneously at $\alpha = 0.05$. Two come back significant. How many would you expect to be false positives if none of the algorithms actually differ?
  5. Suppose a test returns $p = 0.0526$. A colleague rounds it to "basically 5%" and declares the result significant. What is wrong with this reasoning?
