P-Value

Apr 11, 2026 • 11 min read • By Mohammed Vasim

Statistics, Math, Data Science

The p-value might be the most misunderstood number in statistics. It appears in nearly every A/B test, model evaluation, and research paper, yet practitioners routinely misinterpret it. The misinterpretation is not subtle: people believe the p-value measures the probability that a finding is real. It does not.

The p-value answers exactly one question: if the null hypothesis were true, how surprising would this data be? That is all it tells you.

The Anchor

An A/B test comparing two recommendation model versions:

```text
Version A (control):   5,000 users, CTR = 3.20% (160 clicks)
Version B (treatment): 5,000 users, CTR = 3.70% (185 clicks)
Observed difference:   0.50 percentage points
```

Precise Definition

p-value = P(|Z| ≥ |z_obs| | H₀ true)

For the anchor (two-tailed test, H₀: CTR_A = CTR_B):

p = P(|Z| ≥ 1.370 | CTR_A = CTR_B) = 2 × P(Z ≥ 1.370) = 2 × 0.08537 = 0.1707

In plain English: if versions A and B truly had the same CTR, there would be about a 17% chance of observing a difference of 0.50pp or larger, in either direction, across two groups of 5,000 users. The data is only mildly surprising under H₀ and far from decisive.
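The definition can also be checked without the normal approximation. The sketch below assumes both arms share a CTR of 0.0345 (the pooled estimate, an assumption, since H₀ itself does not fix the common value) and computes the probability that two binomial click counts differ by at least the observed 25 clicks. The helper name `binomial_pmf` and the cutoff `k_max` are illustrative choices, not part of the original analysis.

```python
def binomial_pmf(n: int, p: float, k_max: int) -> list[float]:
    """pmf of Binomial(n, p) for k = 0..k_max via the ratio recurrence."""
    pmf = [(1.0 - p) ** n]
    for k in range(k_max):
        pmf.append(pmf[-1] * (n - k) / (k + 1) * p / (1.0 - p))
    return pmf

n, shared_ctr = 5000, 0.0345   # H0: both arms share one CTR (pooled estimate, assumed)
observed_gap = 185 - 160       # observed difference in click counts
k_max = 500                    # far above the mean of ~172 clicks; remaining tail is negligible

pmf = binomial_pmf(n, shared_ctr, k_max)

# P(|X_B - X_A| < gap): for each x, add up pmf(y) for y within gap - 1 of x
p_closer = 0.0
for x, px in enumerate(pmf):
    lo = max(0, x - observed_gap + 1)
    hi = min(k_max, x + observed_gap - 1)
    p_closer += px * sum(pmf[lo:hi + 1])

p_exact = 1.0 - p_closer
print(f"P(|click gap| >= {observed_gap} | H0) = {p_exact:.4f}")
```

Because click counts are discrete, this figure will not match the z-test p-value digit for digit, but the two should be close. Both answer the same question: how often would data at least this extreme appear if nothing had changed?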

Step-by-Step Computation Pipeline

Step 1: Pooled proportion under H₀: p̂_pool = (160 + 185) / (5000 + 5000) = 345 / 10000 = 0.0345

Step 2: Standard error: SE = √(p̂_pool × (1 − p̂_pool) × (1/n_A + 1/n_B)) = √(0.0345 × 0.9655 × 2/5000) = √(0.0000133239) = 0.003650

Step 3: Test statistic: Z = (p̂_B − p̂_A) / SE = (0.037 − 0.032) / 0.003650 = 0.005 / 0.003650 = 1.370

Step 4: Tail probability: P(Z > 1.370) = 1 − Φ(1.370) = 1 − 0.91463 = 0.08537

Step 5: Two-tailed p-value: p = 2 × 0.08537 = 0.1707

Step 6: Decision: 0.1707 > 0.05 → fail to reject H₀

Step	Formula	Values	Result
Pooled proportion	(x_A + x_B)/(n_A + n_B)	(160 + 185)/10000	0.0345
Standard error	√(p̂(1−p̂)(1/n_A + 1/n_B))	√(0.0345 × 0.9655 × 0.0004)	0.003650
Z statistic	(p̂_B − p̂_A)/SE	0.005/0.003650	1.370
One-tail area	1 − Φ(|Z|)	1 − 0.91463	0.08537
p-value (two-tailed)	2 × one-tail	2 × 0.08537	0.1707

The Five Misinterpretations — Stated and Corrected

Misinterpretation 1: "p = 0.1707 means there is a 17.07% probability that H₀ is true."

Wrong. The p-value is P(data at least this extreme | H₀), not P(H₀ | data). Computing P(H₀ | data) requires Bayes' theorem and a prior on H₀. If you believe 90% of model tweaks produce no real improvement (P(H₀) = 0.90), even p = 0.03 might correspond to P(H₀ | data) = 0.60, depending on the power of the test. The p-value is about the data given H₀; it is not about H₀ given the data.
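To make the gap between P(data | H₀) and P(H₀ | data) concrete, here is a minimal sketch of Bayes' theorem applied to the event "the test came out significant". The 90% prior comes from the text above. The 30% power and the function name are illustrative assumptions, and conditioning on "p < α" rather than on one exact p-value is a simplification.

```python
def prob_null_given_significant(prior_null: float, alpha: float, power: float) -> float:
    """P(H0 | significant result) via Bayes' theorem."""
    false_pos = alpha * prior_null          # P(significant and H0 true)
    true_pos = power * (1.0 - prior_null)   # P(significant and H0 false)
    return false_pos / (false_pos + true_pos)

# Assumed inputs: 90% of tweaks do nothing, alpha = 0.05, 30% power
posterior = prob_null_given_significant(prior_null=0.90, alpha=0.05, power=0.30)
print(f"P(H0 | significant) = {posterior:.2f}")
```

Under these assumed inputs a significant result still leaves H₀ more likely than not. A better-powered test or a less skeptical prior would lower that figure, which is exactly why the p-value alone cannot supply it.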

Misinterpretation 2: "p = 0.1707 means the result is due to chance."

Wrong. The p-value measures how surprising the data would be if H₀ were true; it does not tell you whether H₀ is true. With p = 0.1707, a gap this large would show up in roughly one of every six experiments with no real difference, so the data is not very surprising under H₀. That does not show that chance produced it: the observed 0.50pp difference could also reflect a real effect that this sample is too small to separate from noise.

Misinterpretation 3: "p = 0.1707 > 0.05 means there is no effect."

Wrong. "Fail to reject H₀" is not "accept H₀." The test had only about 28% power to detect a true 0.50pp difference, so this experiment was underpowered. Failing to find significance is consistent with both "no effect" and "real effect, insufficient power." Absence of evidence is not evidence of absence.
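The power of the anchor test can be estimated with the usual normal approximation. The sketch below assumes the true CTRs equal the observed 3.2% and 3.7%, and uses only the standard library's `statistics.NormalDist`. The 80% target and the sample-size formula are the conventional textbook choices, not something the experiment specified.

```python
from math import ceil, sqrt
from statistics import NormalDist

norm = NormalDist()  # standard normal
p_a, p_b, n, alpha = 0.032, 0.037, 5000, 0.05
delta = p_b - p_a
z_crit = norm.inv_cdf(1 - alpha / 2)

# SE under H1 (unpooled): each arm keeps its own variance
se_alt = sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)

# Power = P(|Z| > z_crit) when Z is centered at delta / SE instead of 0
shift = delta / se_alt
power = norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Per-arm sample size for 80% power (normal-approximation formula)
p_bar = (p_a + p_b) / 2
z_beta = norm.inv_cdf(0.80)
n_needed = ceil(
    (z_crit * sqrt(2 * p_bar * (1 - p_bar))
     + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2 / delta ** 2
)

print(f"power at n={n} per arm: {power:.3f}")
print(f"n per arm for 80% power: {n_needed}")
```

Running a power calculation like this before the experiment, rather than after, is what prevents a non-significant result from being read as "no effect".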

Misinterpretation 4: "A smaller p-value means a larger effect."

Wrong. The p-value depends on both effect size AND sample size. With n = 50,000 per arm and the same 0.50pp difference, Z ≈ 4.33 and p < 0.0001, which is highly significant. The effect is identical; only the precision of estimation changed. Effect size (Cohen's h, d) measures magnitude. The p-value measures evidence given n.

Misinterpretation 5: "p < 0.05 is the dividing line between real and not real."

Wrong. 0.05 is a historical convention (Fisher's informal suggestion). p = 0.049 and p = 0.051 are effectively the same evidence — do not treat them as categorically different. Some fields (particle physics) require p < 2.9×10⁻⁷ (5σ). Some exploratory ML work uses p < 0.10. Report exact p-values; state your pre-specified α and the rationale for it.
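Thresholds quoted in "sigmas" translate to p-values through the normal tail area. The short sketch below does the conversion with the standard library; the helper `sigma_to_p` is a hypothetical name. Note that the particle-physics 5σ figure is a one-sided tail, whereas the α = 0.05 used in this post is two-sided.

```python
from statistics import NormalDist

norm = NormalDist()  # standard normal

def sigma_to_p(sigma: float, two_sided: bool = False) -> float:
    """Tail probability beyond `sigma` standard deviations."""
    tail = norm.cdf(-sigma)
    return 2 * tail if two_sided else tail

for sigma in (1.96, 3, 5):
    print(f"{sigma} sigma: one-sided p = {sigma_to_p(sigma):.2e}, "
          f"two-sided p = {sigma_to_p(sigma, two_sided=True):.2e}")

p_five_sigma = sigma_to_p(5)
```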

P-Value and Sample Size Dependency

The same 0.50pp CTR difference at different sample sizes:

n per arm	SE	Z	p-value	Decision (α=0.05)
500	0.01154	0.433	0.665	Fail to reject
1,000	0.00816	0.613	0.540	Fail to reject
2,000	0.00577	0.866	0.386	Fail to reject
5,000	0.00365	1.370	0.171	Fail to reject
10,000	0.00258	1.937	0.053	Fail to reject
50,000	0.00115	4.332	<0.0001	Reject

The 0.50pp effect is identical in every row. Only n changes. p-value shrinks because SE ∝ 1/√n — larger n → smaller SE → Z grows → p shrinks. With large enough n, any nonzero effect becomes "statistically significant." This is why effect size is required.
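The same relationship can be inverted to ask how large n must be before a fixed 0.50pp gap crosses the α = 0.05 line. This is a minimal sketch assuming equal arms, so the pooled proportion stays at 0.0345. It holds the observed difference fixed, as the table does, which is an idealization.

```python
from math import ceil
from statistics import NormalDist

p_a, p_b, alpha = 0.032, 0.037, 0.05
delta = p_b - p_a
p_pool = (p_a + p_b) / 2  # equal arms, so the pooled proportion is the plain average
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Z = delta / sqrt(p_pool * (1 - p_pool) * 2 / n) >= z_crit, solved for n
n_min = ceil(z_crit ** 2 * 2 * p_pool * (1 - p_pool) / delta ** 2)
print(f"smallest n per arm with p < {alpha}: {n_min}")
```

Nothing about the effect changes at that sample size; only the standard error does.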

Effect Size: Required Alongside p-value

The p-value for the anchor is 0.1707, not significant at α = 0.05. Is the CTR difference practically meaningful?

Cohen's h for proportions: h = 2 × arcsin(√p_B) − 2 × arcsin(√p_A)

h = 2 × arcsin(√0.037) − 2 × arcsin(√0.032) = 2 × 0.19356 − 2 × 0.17985 = 0.38712 − 0.35971 = 0.0274

Interpretation: |h| = 0.027 is a very small effect (small=0.2, medium=0.5, large=0.8).

The 0.50pp CTR difference is statistically inconclusive AND small by Cohen's benchmarks. Even if p < 0.05 with more data, a 0.5pp CTR lift may not justify engineering cost. Always pair the decision with effect size.

P-Values Under H₀: Uniform Distribution

Under H₀, and with a continuous test statistic, p-values are uniformly distributed on [0, 1]. A fraction α of them therefore falls below α by chance alone, which is exactly the Type I error rate. Under H₁, p-values concentrate near 0; the stronger the effect and the larger the sample, the more concentrated.

Practical implication: if you run many tests in which H₀ is known to hold (A/A tests, for example) and the p-value histogram is not flat, something is wrong: p-hacking, data leakage, or optional stopping.
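A flat histogram is easy to check in a simulation. The sketch below idealizes the null case by drawing the z statistic directly from a standard normal (an assumption; real A/A data would only be approximately normal) and reports the share of p-values landing in each tenth of [0, 1].

```python
import random
from statistics import NormalDist

rng = random.Random(0)   # fixed seed for reproducibility
norm = NormalDist()
n_tests = 20_000

# Under H0 the z statistic is (approximately) standard normal
p_values = [2 * (1 - norm.cdf(abs(rng.gauss(0, 1)))) for _ in range(n_tests)]

# Share of p-values in each tenth of [0, 1]; a flat histogram puts ~10% in each
bins = [0] * 10
for p in p_values:
    bins[min(int(p * 10), 9)] += 1
shares = [count / n_tests for count in bins]

print("decile shares:", [round(s, 3) for s in shares])
print("share below 0.05:", sum(p < 0.05 for p in p_values) / n_tests)
```

A pile-up in the lowest bin with no matching real effect is the usual signature of peeking or selective reporting.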

Multiple Testing: P-Value Inflation

Running 20 A/B tests simultaneously at α=0.05:

P(at least one false positive) = 1 − (1 − α)^k = 1 − 0.95^20 = 0.64

If all 20 null hypotheses are true and the tests are independent, at least one test comes out falsely significant about 64% of the time.

Bonferroni correction: α_corrected = α/k = 0.05/20 = 0.0025. Conservative but simple. Use when k is small.

Benjamini-Hochberg (FDR): controls the expected proportion of false discoveries among significant results. More powerful than Bonferroni when many tests are run. A common choice for ML feature selection and model comparison.
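The two corrections can be compared side by side. Below is a minimal pure-Python sketch of the Benjamini-Hochberg step-up procedure next to Bonferroni. The ten p-values are made up for illustration, and the function name is an assumption rather than a library API.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * q / m
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject

# Hypothetical p-values from 10 tests (invented for this example)
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
bh = benjamini_hochberg(pvals, q=0.05)
bonferroni = [p <= 0.05 / len(pvals) for p in pvals]

print("BH rejections:", sum(bh))
print("Bonferroni rejections:", sum(bonferroni))
```

On this made-up input BH rejects two hypotheses where Bonferroni rejects one, which illustrates the extra power that comes from controlling the false discovery rate instead of the family-wise error rate.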

Misconceptions Reference Table

Misconception	What p-value actually is	Why the misconception is wrong
P(H₀ is true) = 0.1707	P(data | H₀), not P(H₀ | data)	P(H₀ | data) requires Bayes' theorem and a prior on H₀
The result is due to chance	How surprising data is under H₀	H₀ might still be true after p < 0.05
p ≥ 0.05 means no effect	Evidence threshold, not effect presence	May be underpowered; absence of evidence ≠ evidence of absence
Smaller p = larger effect	p depends on n AND effect	Use Cohen's h/d for magnitude
0.05 is the truth threshold	Historical convention	Report exact p; state your α rationale

Code and Output

```python
import numpy as np
from scipy import stats

n1, n2 = 5000, 5000
clicks_a, clicks_b = 160, 185
p_a, p_b = clicks_a / n1, clicks_b / n2

# Step by step: pooled two-proportion z-test
p_pool = (clicks_a + clicks_b) / (n1 + n2)
se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
z_obs = (p_b - p_a) / se
# norm.sf(z) equals 1 - cdf(z) but stays accurate in the far tail
p_value = 2 * stats.norm.sf(abs(z_obs))

print(f"p_A={p_a:.4f}, p_B={p_b:.4f}, diff={p_b-p_a:.4f}")
print(f"Pooled p̂: {p_pool:.4f}")
print(f"SE: {se:.6f}")
print(f"Z: {z_obs:.4f}")
print(f"p-value (two-tailed): {p_value:.4f}")

# Effect size: Cohen's h
h = 2*np.arcsin(np.sqrt(p_b)) - 2*np.arcsin(np.sqrt(p_a))
print(f"\nCohen's h: {abs(h):.4f}  (small=0.2, medium=0.5, large=0.8)")

# p-value vs n for same effect (equal arms, so the pooled proportion is unchanged)
print("\np-value at different sample sizes (same 0.50pp effect):")
for n in [500, 1000, 2000, 5000, 10000, 50000]:
    # Reuse p_pool directly; int(n * p) would truncate fractional click counts
    se_n = np.sqrt(p_pool * (1 - p_pool) * (2/n))
    z_n = (p_b - p_a) / se_n
    pv_n = 2 * stats.norm.sf(abs(z_n))
    sig = "** reject **" if pv_n < 0.05 else "fail to reject"
    print(f"  n={n:>6}: SE={se_n:.5f}, Z={z_n:.3f}, p={pv_n:.4f}  {sig}")

# Multiple testing (assumes independent tests, all nulls true)
k = 20
alpha = 0.05
family_wise = 1 - (1 - alpha)**k
alpha_bonf = alpha / k
print(f"\nMultiple testing ({k} tests at α={alpha}):")
print(f"  P(at least one false positive): {family_wise:.3f}")
print(f"  Bonferroni correction: α = {alpha_bonf:.4f}")

# p-values are approximately uniform under H0: simulate 10,000 A/A tests
rng = np.random.default_rng(42)
n_sim, n_users, true_ctr = 10_000, 5000, 0.035
a_sim = rng.binomial(n_users, true_ctr, size=n_sim) / n_users
b_sim = rng.binomial(n_users, true_ctr, size=n_sim) / n_users
pp = (a_sim + b_sim) / 2  # pooled proportion (equal arms)
se_sim = np.sqrt(pp * (1 - pp) * 2 / n_users)
# Guard against a zero SE (both arms with zero clicks)
z_sim = np.divide(b_sim - a_sim, se_sim, out=np.zeros(n_sim), where=se_sim > 0)
p0_vals = 2 * stats.norm.sf(np.abs(z_sim))
print(f"\nUnder H₀, fraction of p-values below 0.05: {np.mean(p0_vals < 0.05):.4f}  (expected: 0.05)")
```