P-Value
The p-value might be the most misunderstood statistic in existence. It appears in every A/B test report, every model evaluation, and virtually every scientific paper — yet professional data scientists routinely interpret it wrong. The misinterpretation is not subtle: people believe the p-value measures the probability that their finding is real. It does not.
The p-value answers exactly one question: if the null hypothesis were true, how surprising would this data be? That is all it tells you.
The Dataset
Returning to the A/B test from post 3: version A served 5,000 users with CTR 3.20%, version B served 5,000 users with CTR 3.70%. The Z statistic was 1.938. We will use this concrete example throughout.
The Formal Definition
The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct:
$$p = P\left(\lvert Z \rvert \geq \lvert z_{\text{obs}} \rvert \mid H_0\right)$$
For our A/B test with $Z = 1.938$ (two-tailed):
$$p = 2 \times \left(1 - \Phi(1.938)\right) = 2 \times 0.0263 = 0.0526$$
This means: if version A and version B truly performed identically, there is a 5.26% chance of observing a CTR difference of 0.5pp or larger in either direction just from random sampling. The data is mildly surprising under $H_0$, but not decisively so.
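The tail-area arithmetic can be checked numerically. The following is a minimal sketch, assuming SciPy is installed and taking the Z statistic of 1.938 as given rather than recomputing it; the variable names are illustrative.

```python
from scipy import stats

z_obs = 1.938  # Z statistic quoted in the text, taken as given

# Survival function sf(z) = 1 - cdf(z): area in the upper tail beyond |z|
one_tail = stats.norm.sf(abs(z_obs))

# Two-tailed test: extreme results in either direction count
p_two_tailed = 2 * one_tail

print(f"One-tail area:      {one_tail:.4f}")
print(f"Two-tailed p-value: {p_two_tailed:.4f}")
```

Using `sf` instead of `1 - cdf` is a small design choice: it avoids losing precision when the tail area is very small.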
What P-Values Are NOT
These misconceptions appear constantly in data science practice:
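1. The p-value is NOT the probability that $H_0$ is true.
This would be $P(H_0 \mid \text{data})$, which requires Bayesian inference. What we have is $P(\text{data} \mid H_0)$. By Bayes' theorem, $P(H_0 \mid \text{data}) = \dfrac{P(\text{data} \mid H_0)\, P(H_0)}{P(\text{data})}$. Even a small p-value can correspond to a high probability that $H_0$ is true, if the prior probability of an effect is low.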
2. The p-value is NOT the probability that results are due to chance.
It tells you the probability of the data under the chance model, not whether chance produced the result.
3. The p-value is NOT a measure of effect size.
A smaller p-value does not mean a larger CTR difference. With large $n$, even a 0.001 CTR difference produces tiny p-values. Effect size and p-value are related through sample size, not directly.
4. The p-value is NOT the probability of replication.
If a study produces $p = 0.05$, repeating it gives $p < 0.05$ only about 50% of the time, even if the effect is real and exactly as large as the one observed. Replication probability depends on power, not the p-value.
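The first misconception can be made concrete with a few lines of arithmetic. This is a rough sketch, not part of the original analysis: it conditions on "the test came out significant at $\alpha$" rather than on an exact p-value, and the prior (10% of tested ideas have a real effect) and the 50% power are hypothetical inputs chosen only for illustration.

```python
def prob_null_given_significant(prior_effect, power, alpha):
    """P(H0 is true | test is significant), via Bayes' theorem."""
    prior_null = 1.0 - prior_effect
    false_pos = alpha * prior_null    # P(significant and H0 true)
    true_pos = power * prior_effect   # P(significant and effect real)
    return false_pos / (false_pos + true_pos)

# Hypothetical inputs: 10% of tested ideas work, alpha = 0.05, power = 50%
posterior_null = prob_null_given_significant(prior_effect=0.10, power=0.50, alpha=0.05)
print(f"P(H0 | significant) = {posterior_null:.3f}")
```

Under these assumed inputs, nearly half of the significant results come from true nulls, even though every individual test used $\alpha = 0.05$.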
Calculating the P-Value: Six-Step Walkthrough
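For the A/B test (two-tailed, $\alpha = 0.05$):
Step 1: Compute the pooled proportion under $H_0$: $\hat{p}_{\text{pool}} = \dfrac{x_A + x_B}{n_A + n_B}$, where $x$ is the number of clicks and $n$ the number of users in each group.
Step 2: Compute the standard error: $\mathrm{SE} = \sqrt{\hat{p}_{\text{pool}}\,(1 - \hat{p}_{\text{pool}})\left(\dfrac{1}{n_A} + \dfrac{1}{n_B}\right)}$
Step 3: Compute the Z statistic: $Z = \dfrac{\hat{p}_B - \hat{p}_A}{\mathrm{SE}} = 1.938$
Step 4: Find the tail probability: $P(Z > 1.938) = 1 - \Phi(1.938) = 0.0263$
Step 5: Double for two-tailed: $p = 2 \times 0.0263 = 0.0526$
Step 6: Compare to $\alpha$: $0.0526 > 0.05$, fail to reject $H_0$.
| Phase | Formula | Values | Result |
|---|---|---|---|
| Standard error | $\sqrt{\hat{p}_{\text{pool}}(1 - \hat{p}_{\text{pool}})(1/n_A + 1/n_B)}$ | | |
| Z statistic | $(\hat{p}_B - \hat{p}_A) / \mathrm{SE}$ | | 1.938 |
| One-tail area | $1 - \Phi(\lvert Z \rvert)$ | $1 - \Phi(1.938)$ | 0.0263 |
| p-value (two-tailed) | $2 \times$ one-tail area | $2 \times 0.0263$ | 0.0526 |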
The Rejection Region and P-Value Visualized
For this A/B test with $\alpha = 0.05$, the observed statistic $Z = 1.938$ sits just inside the non-rejection zone, which runs from $-1.96$ to $1.96$. The p-value is the total probability in both tails beyond the observed Z.
P-Value Distribution Under $H_0$ and $H_1$
Under $H_0$, p-values are uniformly distributed on $[0, 1]$. This means a fraction $\alpha$ of all p-values falls below $\alpha$ by chance, which is the definition of the Type I error rate. Under $H_1$, p-values skew toward zero; the stronger the true effect and the larger the sample, the more concentrated they are near zero.
This uniform-under-$H_0$ property is a useful diagnostic: if you run many tests where no real effect exists (A/A tests, for example) and the p-value histogram is not roughly uniform, something is wrong (p-hacking, data leakage, or optional stopping).
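The contrast between the two distributions can be simulated directly. This is a rough sketch under simplifying assumptions: the test statistic is treated as exactly Normal with unit variance, and the true effect under $H_1$ is a hypothetical 3 standard errors. `simulate_p_values` is an illustrative helper, not something from the original post.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims = 100_000

def simulate_p_values(shift):
    """Two-tailed p-values for Z ~ Normal(shift, 1); shift = 0 is H0."""
    z = rng.normal(loc=shift, scale=1.0, size=n_sims)
    return 2 * stats.norm.sf(np.abs(z))

p_h0 = simulate_p_values(shift=0.0)
p_h1 = simulate_p_values(shift=3.0)  # hypothetical true effect of 3 standard errors

bins = np.linspace(0, 1, 11)  # ten equal-width bins on [0, 1]
share_h0 = np.histogram(p_h0, bins=bins)[0] / n_sims
share_h1 = np.histogram(p_h1, bins=bins)[0] / n_sims

print("H0 bin shares:", np.round(share_h0, 3))
print("H1 bin shares:", np.round(share_h1, 3))
```

Under $H_0$ each bin should hold roughly a tenth of the p-values, while under the assumed $H_1$ most of the mass piles into the first bin.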
Power: What p = 0.0526 Also Tells You
The A/B test marginally failed significance. Does that mean version B is no better? Not necessarily. Power analysis (computed in post 3) showed only 49% power for a true 0.5pp difference. The test had a coin-flip chance of detecting the effect even if it were real. A p-value just above 0.05 with low power is weak evidence of no effect — it is mainly evidence of an underpowered experiment.
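To see where a figure near 49% can come from, here is a minimal sketch of the standard Normal-approximation power formula for a two-sided Z test. It assumes the true effect equals the observed one, so the expected Z statistic is 1.938; that is an illustrative assumption, not necessarily the calculation used in post 3, and `two_sided_power` is a hypothetical helper name.

```python
from scipy import stats

def two_sided_power(expected_z, alpha=0.05):
    """Power of a two-sided Z test when the statistic is Normal(expected_z, 1)."""
    z_crit = stats.norm.ppf(1 - alpha / 2)        # about 1.96 for alpha = 0.05
    upper = stats.norm.sf(z_crit - expected_z)    # rejection in the upper tail
    lower = stats.norm.cdf(-z_crit - expected_z)  # rejection in the lower tail
    return upper + lower

power = two_sided_power(expected_z=1.938)
print(f"Power: {power:.2f}")
```

Because the expected Z sits almost exactly on the critical value, the test rejects about half the time, which is why a result this close to the threshold says more about the experiment's size than about the effect.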
The Multiple Testing Problem
If you run 20 A/B tests across 20 different features simultaneously at $\alpha = 0.05$, and all null hypotheses are true, you expect about $20 \times 0.05 = 1$ significant result purely by chance.
Solutions:
- Bonferroni: Set $\alpha_{\text{adj}} = \alpha / m$, where $m$ is the number of tests. For 20 tests: $\alpha_{\text{adj}} = 0.05 / 20 = 0.0025$. Conservative but simple.
- False Discovery Rate (FDR): Control the expected proportion of false discoveries among significant results. More powerful than Bonferroni when running many tests.
- Pre-registration: Specify hypotheses before analyzing data, preventing selective reporting.
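The first two corrections can be sketched in a few lines. This example assumes independent tests for the family-wise error calculation, uses the Benjamini-Hochberg procedure as one common way to control the FDR (the post does not name a specific method), and runs on hypothetical p-values invented purely for illustration.

```python
import numpy as np

alpha, m = 0.05, 20

expected_false_positives = alpha * m  # about 1 when all 20 nulls are true
fwer = 1 - (1 - alpha) ** m           # P(at least one false positive), independent tests
bonferroni_alpha = alpha / m          # per-test threshold under Bonferroni

def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of discoveries under the Benjamini-Hochberg procedure."""
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)
    thresholds = q * np.arange(1, p.size + 1) / p.size  # q * k / m for rank k
    passed = p[order] <= thresholds
    reject = np.zeros(p.size, dtype=bool)
    if passed.any():
        k_max = np.nonzero(passed)[0].max()  # largest rank meeting its threshold
        reject[order[: k_max + 1]] = True
    return reject

# Hypothetical p-values from five tests, for illustration only
p_vals = [0.001, 0.012, 0.028, 0.045, 0.600]
bh_reject = benjamini_hochberg(p_vals)
bonf_reject = np.array(p_vals) <= alpha / len(p_vals)

print(f"Family-wise error rate for {m} tests: {fwer:.3f}")
print("Bonferroni rejects:", bonf_reject.tolist())
print("BH rejects:        ", bh_reject.tolist())
```

On these made-up inputs Bonferroni keeps one result and Benjamini-Hochberg keeps three, which illustrates the power difference described above.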
P-Values and Effect Size: They Are Not the Same
Statistical significance is not practical significance. Consider two scenarios comparing our A/B test CTR difference:
| Scenario | $n$ per group | True CTR diff | Z stat | p-value |
|---|---|---|---|---|
| Large experiment | 500,000 | 0.01% | 2.24 | 0.025 |
| Current test | 5,000 | 0.50% | 1.94 | 0.053 |
The large experiment achieves significance for a 0.01% CTR difference. The current test barely misses significance for a 0.5% difference. Which finding matters more for your product? The 0.5% difference at $p = 0.053$, clearly. Always report and consider effect sizes alongside p-values.
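The dependence of the p-value on sample size for a fixed effect can be sketched as follows. The baseline CTR (10.0%) and the lift (0.1pp) are hypothetical numbers chosen only for illustration, and `two_proportion_p_value` is an assumed helper that plugs the expected proportions into the pooled two-proportion Z test rather than simulating data.

```python
import numpy as np
from scipy import stats

def two_proportion_p_value(p_a, p_b, n_per_group):
    """Two-tailed p-value of a pooled Z test, given observed proportions."""
    p_pool = (p_a + p_b) / 2  # equal group sizes, so the pool is the average
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_per_group)
    z = (p_b - p_a) / se
    return 2 * stats.norm.sf(abs(z))

# Hypothetical scenario: the same 0.1pp lift observed at growing sample sizes
sample_sizes = [10_000, 100_000, 1_000_000, 10_000_000]
p_by_n = [two_proportion_p_value(0.100, 0.101, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_by_n):
    print(f"n per group = {n:>10,}  p-value = {p:.3g}")
```

The effect size never changes across the rows; only the sample size does, and the p-value moves from clearly non-significant to vanishingly small.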
Python Code
```python
import numpy as np
from scipy import stats

# A/B test: same data as post 3
clicks_a, n_a = 160, 5000
clicks_b, n_b = 185, 5000

p_a = clicks_a / n_a
p_b = clicks_b / n_b

# Pooled proportion and standard error under H0 (no difference)
p_pool = (clicks_a + clicks_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1/n_a + 1/n_b))
z_obs = (p_b - p_a) / se

# Upper-tail area beyond |Z|, doubled for the two-tailed p-value
one_tail_area = 1 - stats.norm.cdf(abs(z_obs))
p_value = 2 * one_tail_area

print(f"Z statistic: {z_obs:.4f}")
print(f"One-tail area: {one_tail_area:.4f}")
print(f"Two-tailed p-value: {p_value:.4f}")
print(f"Reject at alpha=0.05: {p_value < 0.05}")

# Demonstrate uniform distribution of p-values under H0
np.random.seed(42)
p_values_under_h0 = []
for _ in range(10000):
    # Both arms share the same true CTR, so H0 holds by construction
    a = np.random.binomial(5000, 0.035) / 5000
    b = np.random.binomial(5000, 0.035) / 5000
    p_hat = (5000*a + 5000*b) / 10000
    se_sim = np.sqrt(p_hat * (1-p_hat) * 2/5000)
    z = (b - a) / se_sim if se_sim > 0 else 0
    p_values_under_h0.append(2 * (1 - stats.norm.cdf(abs(z))))

below_alpha = np.mean(np.array(p_values_under_h0) < 0.05)
print(f"\nUnder H0, fraction of p-values below 0.05: {below_alpha:.4f} (should be ~0.05)")
```

Output:

```
Z statistic: 1.9380
One-tail area: 0.0263
Two-tailed p-value: 0.0526
Reject at alpha=0.05: False

Under H0, fraction of p-values below 0.05: 0.0497 (should be ~0.05)
```
The Replication Crisis Connection
The reproducibility problems in psychology, medicine, and data science trace in large part to p-value misuse. Common issues:
- P-hacking: Running multiple analyses until something significant appears, then reporting only that analysis as if it were pre-planned.
- HARKing: Hypothesizing After the Results are Known, that is, presenting post-hoc hypotheses as if pre-specified.
- Publication bias: Journals preferentially publish significant results, creating a literature that overestimates effect sizes.
- Optional stopping: Checking results after each batch of users and stopping as soon as $p < 0.05$. This can inflate the Type I error rate to around 26% even with a nominal $\alpha = 0.05$; the exact figure depends on how many times you peek.
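The effect of optional stopping is easy to reproduce in simulation. This is a rough sketch under stated assumptions: the null hypothesis is true, the experimenter peeks after each of 20 equally sized batches (a hypothetical number of looks), and each batch contributes an independent standard Normal increment to the running Z statistic.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_looks = 20_000, 20  # hypothetical: 20 equally sized batches
z_crit = 1.96                 # two-sided critical value for alpha = 0.05

# Under H0 each batch adds an independent standard normal increment
increments = rng.standard_normal((n_sims, n_looks))

# Z statistic after k batches: cumulative sum scaled by sqrt(k)
z_paths = np.cumsum(increments, axis=1) / np.sqrt(np.arange(1, n_looks + 1))

# Peeking: declare a win if |Z| crosses the threshold at any look
peeking_rate = np.mean(np.any(np.abs(z_paths) > z_crit, axis=1))

# Single look at the end: the test as originally designed
final_rate = np.mean(np.abs(z_paths[:, -1]) > z_crit)

print(f"False positive rate with peeking:     {peeking_rate:.3f}")
print(f"False positive rate, final look only: {final_rate:.3f}")
```

The final-look rate should stay near the nominal 5%, while the peeking rate should come out several times larger. Changing `n_looks` shows how the inflation grows with the number of peeks.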
The American Statistical Association has released statements urging: do not use p-values as thresholds for truth, always consider context and study quality, and report effect sizes and confidence intervals alongside p-values.
Related Concepts
P-values connect back to the full hypothesis testing procedure (post 3) and the CLT (post 1) that justifies using the Normal distribution for test statistics. They connect forward to Type I and Type II errors (post 9), where the relationship between $\alpha$, $\beta$, and power is developed numerically. Confidence intervals (post 11) provide the same information in a more useful form: rather than a binary "reject or not," an interval shows the range of plausible effect sizes. For categorical outcomes, the chi-square test (post 12) uses the same p-value logic applied to a different test statistic.
Honest Limitations
P-values are designed for one purpose: quantifying discrepancy between data and a null hypothesis. They cannot tell you whether a finding will replicate, whether an effect is practically meaningful, or whether your model is correctly specified. A single p-value from a single experiment is weak evidence. Replication, pre-registration, effect size reporting, and confidence intervals together are stronger evidence.
Test Your Understanding
- For the A/B test with $Z = 1.938$, compute the one-tailed p-value for $H_1: p_B > p_A$. Is it significant at $\alpha = 0.05$? Should you use this one-tailed test for your conclusion?
- A researcher runs 100 A/B tests, all genuinely testing null effects, at $\alpha = 0.05$. How many significant results do you expect? If they publish only those results, what is the published false discovery rate?
- Two A/B tests compare different features: Test 1 has and a 0.02% CTR lift on . Test 2 has and a 1.5% CTR lift on . Which result is more actionable, and why?
- Explain in one sentence why $P(\text{data} \mid H_0) \neq P(H_0 \mid \text{data})$, and give a scenario from model evaluation where this distinction matters.
- The p-value for a model comparison is $p = 0.001$. A colleague says this means the probability of a false positive is 0.1%. Is this correct? What does the p-value actually mean in this context?