Z-Test vs t-Test
You are comparing two model versions. You run the experiment and face the first choice: Z-test or t-test? The short answer: use the t-test almost always. But understanding why — and exactly where the boundary lies — reveals something important about how uncertainty compounds in inference.
The Dataset
Two recommendation algorithm variants are compared on click-through rate (CTR). This is the same A/B test scenario used in posts 3 and 4, now viewed through the lens of which test is appropriate:
- Small pilot study: $n = 12$ requests per arm, CTR_A = 8/12 = 0.667, CTR_B = 10/12 = 0.833
- Large production run: requests per arm (post 3's data)
For the pilot, the team must decide between Z and t. For the production run, the choice barely matters.
What Actually Differs
Both tests answer the same question (is the observed difference consistent with $H_0$?), but they make different assumptions:
| Feature | Z-test | t-test |
|---|---|---|
| Population SD ($\sigma$) | Known | Unknown (estimated by the sample SD $s$) |
| Distribution | Standard Normal | t-distribution |
| Critical values | Fixed (e.g., 1.96) | Depends on df |
| Tail weight | Lighter | Heavier (accounts for uncertainty in $s$) |
| When to use | $\sigma$ known, or large $n$ | Almost always |
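Comparing the two distributions shows why the choice matters at small $n$: the t-distribution has heavier tails than the standard Normal, and the gap closes as $n$ grows.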
The Key Insight: Unknown Variance Compounds Uncertainty
When you use $s$ in place of $\sigma$, you are plugging in a random variable: $s$ itself varies from sample to sample. The t-distribution absorbs this extra uncertainty through heavier tails. With small samples, $s$ can be a poor estimate of $\sigma$, so the tails are very heavy. As $n$ grows, $s$ converges to $\sigma$, and the t-distribution converges to the standard Normal.
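To make the "the sample SD is itself random" point concrete, here is a minimal simulation sketch. The true mean (0.70), true SD (0.20), seed, and sample sizes are illustrative assumptions, not values from the A/B test.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, only for reproducibility
mu, sigma = 0.70, 0.20           # assumed true mean and SD (illustrative)
n_draws = 20_000                 # simulated samples per sample size

spread_of_s = {}
for n in (5, 12, 100):
    # Each row is one simulated sample of size n
    samples = rng.normal(loc=mu, scale=sigma, size=(n_draws, n))
    s = samples.std(axis=1, ddof=1)   # one sample SD per simulated sample
    spread_of_s[n] = s.std()          # how much s varies across samples
    print(f"n={n:>3}: mean(s)={s.mean():.3f}, sd(s)={spread_of_s[n]:.3f}")
```

The spread of the sample SD shrinks as $n$ grows, which is the extra uncertainty the t-distribution's heavier tails account for at small $n$.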
Pilot Study: Where the Choice Matters
For the pilot with $n = 12$ per arm, suppose you treat CTR as approximately Normal (not ideal for binary data, but illustrative):
Sample means: $\bar{x}_A = 0.667$, $\bar{x}_B = 0.833$
Sample SDs: $s_A$, $s_B$ (pooled: $s_p$)
Using t (correct for $n = 12$): $t_{crit}(df = 22) = 2.074$. Since $|t| < 2.074$: fail to reject $H_0$.
Using Z (incorrect): $z_{crit} = 1.96$. Since $|t| < 1.96$: still fail to reject here, but the margin is tighter, and for slightly larger observed differences Z would incorrectly reject.
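| Phase | Formula | Values | Result |
|---|---|---|---|
| Pooled std | $s_p = \sqrt{\dfrac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2}}$ | | |
| Standard error | $SE = s_p\sqrt{\dfrac{1}{n_A} + \dfrac{1}{n_B}}$ | | |
| t statistic | $t = \dfrac{\bar{x}_A - \bar{x}_B}{SE}$ | | |
| Decision | $\lvert t \rvert < t_{crit}(df=22) = 2.074$ | | Fail to reject $H_0$ |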
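The pooled calculation can be reproduced from the pilot counts. This is a sketch under one explicit assumption: each request is a 0/1 click outcome and the SDs use the $n - 1$ denominator. The post's own intermediate values may have been computed differently.

```python
import numpy as np
from scipy import stats

# Assumption: 0/1 click outcomes rebuilt from the pilot counts (8/12 and 10/12)
arm_a = np.array([1] * 8 + [0] * 4)
arm_b = np.array([1] * 10 + [0] * 2)

n_a, n_b = len(arm_a), len(arm_b)
s_a = arm_a.std(ddof=1)   # sample SD with n - 1 denominator
s_b = arm_b.std(ddof=1)
df = n_a + n_b - 2

# Pooled SD, standard error of the difference, and the test statistic
s_pooled = np.sqrt(((n_a - 1) * s_a**2 + (n_b - 1) * s_b**2) / df)
se = s_pooled * np.sqrt(1 / n_a + 1 / n_b)
t_stat = (arm_a.mean() - arm_b.mean()) / se

t_crit = stats.t.ppf(0.975, df)   # two-sided, alpha = 0.05
z_crit = stats.norm.ppf(0.975)
reject_t = abs(t_stat) > t_crit
reject_z = abs(t_stat) > z_crit

print(f"s_A={s_a:.3f}, s_B={s_b:.3f}, pooled={s_pooled:.3f}, SE={se:.3f}")
print(f"t={t_stat:.3f}, t_crit(df={df})={t_crit:.3f}, z_crit={z_crit:.3f}")
print(f"reject with t: {reject_t}, reject with Z: {reject_z}")
```

The same statistic is compared against both critical values, so the only difference between the two tests here is the threshold.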
The Error Inflation at Small n
The table below quantifies what happens if you use Z critical values when you should use t:
| Sample size $n$ | df | t-critical | Z-critical | True Type I error using Z |
|---|---|---|---|---|
| 5 | 4 | 2.776 | 1.96 | ~12% instead of 5% |
| 10 | 9 | 2.262 | 1.96 | ~8% instead of 5% |
| 20 | 19 | 2.093 | 1.96 | ~6.5% instead of 5% |
| 30 | 29 | 2.045 | 1.96 | ~6% instead of 5% |
| 100 | 99 | 1.984 | 1.96 | ~5.3% instead of 5% |
With $n = 5$, using Z more than doubles your false positive rate. With $n = 100$, the difference is negligible.
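The inflation can be computed exactly rather than simulated: if the statistic truly follows a t-distribution with $df = n - 1$, the chance of exceeding the Z cutoff is the two-sided t tail beyond 1.96. A short sketch for the one-sample setting used in the table:

```python
from scipy import stats

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)   # about 1.96

true_error = {}
for n in (5, 10, 20, 30, 100):
    df = n - 1
    # P(|T_df| > z_crit): actual false positive rate when the Z cutoff is used
    true_error[n] = 2 * stats.t.sf(z_crit, df)
    print(f"n={n:>3}, df={df:>2}: true Type I error = {true_error[n]:.4f}")
```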
Power Analysis for Choosing Between Tests
For the production run (large $n$ per arm), both tests produce nearly identical results. For the pilot ($n = 12$ per arm), the t-test has meaningfully less power due to wider critical values. Suppose you need 80% power to detect a 0.1-unit CTR difference.
Your pilot with 12 per arm has only about 12% power for this effect. The Z vs t question becomes moot if the study is severely underpowered.
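A power calculation of this kind can be sketched as follows. The helper names `n_per_arm` and `t_test_power` are hypothetical, and `sigma_assumed = 0.30` is an illustrative guess at the per-observation SD, not a figure taken from the post. The sample-size formula is the usual Normal approximation, $n \approx 2(z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2 / \delta^2$, and the power uses the noncentral t-distribution.

```python
import math
from scipy import stats

def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided, two-sample test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

def t_test_power(n, delta, sigma, alpha=0.05):
    """Power of a two-sided pooled t-test with n observations per arm."""
    df = 2 * n - 2
    nc = delta / (sigma * math.sqrt(2 / n))   # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability of landing in either rejection region under the alternative
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

sigma_assumed = 0.30   # hypothetical SD, chosen only for illustration
required_n = n_per_arm(delta=0.10, sigma=sigma_assumed)
pilot_power = t_test_power(n=12, delta=0.10, sigma=sigma_assumed)

print(f"Required n per arm for 80% power: {required_n}")
print(f"Power of the pilot (n=12 per arm): {pilot_power:.3f}")
```

Changing `sigma_assumed` changes both numbers substantially, so the SD estimate deserves as much scrutiny as the choice of test.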
Python Code
import numpy as np
from scipy import stats

np.random.seed(42)
n_simulations = 10000
n_small = 10
mu = 0.70  # true CTR
sigma = 0.20
alpha = 0.05

z_crit = stats.norm.ppf(0.975)
t_crit_small = stats.t.ppf(0.975, n_small - 1)

rejections_z = 0
rejections_t = 0
for _ in range(n_simulations):
    # Draw a sample under H0 (true mean is mu), so every rejection is a false positive
    sample = np.random.normal(mu, sigma, n_small)
    sample_mean = np.mean(sample)
    sample_std = np.std(sample, ddof=1)
    se = sample_std / np.sqrt(n_small)
    z_stat_val = (sample_mean - mu) / se
    t_stat_val = (sample_mean - mu) / se  # same statistic, different critical value
    if abs(z_stat_val) > z_crit:
        rejections_z += 1
    if abs(t_stat_val) > t_crit_small:
        rejections_t += 1

print(f"Target Type I error: {alpha}")
print(f"Z-test Type I error: {rejections_z/n_simulations:.4f} (inflated)")
print(f"t-test Type I error: {rejections_t/n_simulations:.4f} (correct)")
print(f"\nZ critical value: {z_crit:.4f}")
print(f"t critical value (df={n_small-1}): {t_crit_small:.4f}")

Output:
Target Type I error: 0.05
Z-test Type I error: 0.0742 (inflated)
t-test Type I error: 0.0513 (correct)

Z critical value: 1.9600
t critical value (df=9): 2.2622
Mathematical Relationship
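As $df \to \infty$: $t_{df} \to \mathcal{N}(0, 1)$, and so $t_{crit}(df) \to z_{crit} = 1.96$.
The convergence is fast: by $n = 30$, the difference in critical values is less than 5%. By $n = 100$, it is less than 2%. This is why large-sample textbooks sometimes use Z throughout, but the t-test costs nothing extra computationally and remains valid at every sample size as long as its assumptions hold.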
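The speed of convergence is easy to check numerically. A small sketch that reports the relative gap between the two critical values, using $df = n - 1$ as in the earlier table:

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)

relative_gap = {}
for n in (5, 10, 20, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, n - 1)
    # How much larger the t cutoff is than the Z cutoff, as a fraction
    relative_gap[n] = (t_crit - z_crit) / z_crit
    print(f"n={n:>4}: t_crit={t_crit:.4f}, gap={relative_gap[n]:.2%}")
```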
Decision Rule
When in doubt, use the t-test. It gives the correct answer whether you have 5 observations or 5 million. The Z-test is the limiting case of the t-test as $df \to \infty$.
Related Concepts
The Z vs t distinction builds on the t-distribution (post 6), which derives why replacing $\sigma$ with $s$ requires heavier tails. It resolves a question left open by the Z-test (post 5): what happens when $\sigma$ is not known? Post 7 shows the t-test in all three variants. The same reasoning behind Welch's adjusted degrees of freedom (unequal variance reduces effective information) applies in ANOVA (posts 14–16) when comparing multiple groups. Power analysis (post 9) shows that this choice also affects how many samples you need for a given effect size.
Honest Limitations
Even the t-test has assumptions: independence of observations, and approximate normality (relaxed by the CLT for large $n$). For proportions and counts, neither Z nor t is ideal for small samples; use exact binomial tests or Fisher's exact test. For severely non-Normal data at any sample size, consider bootstrap confidence intervals or non-parametric tests (Mann-Whitney U, Wilcoxon signed-rank) rather than either Z or t.
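For the pilot's binary click data, an exact test avoids the Normal approximation altogether. A minimal sketch using `scipy.stats.fisher_exact` on the 2x2 table of clicks and non-clicks, where the table layout is my own arrangement of the pilot counts:

```python
from scipy import stats

# Rows: arm A, arm B. Columns: clicks, non-clicks (pilot counts: 8/12 and 10/12)
table = [[8, 4],
         [10, 2]]

odds_ratio, p_value = stats.fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, exact two-sided p = {p_value:.4f}")
```

The exact test reaches the same conclusion as the t-test on this pilot (no evidence of a difference), without relying on a Normal approximation for 0/1 outcomes.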
Test Your Understanding
- For the pilot study ($n = 12$ per arm), compute the exact Type I error probability if Z critical values are used with df = 22. Use the t-distribution's CDF to find $P(|T_{22}| > 1.96)$.
- A team has CTR observations per arm with , . They want to use a pooled t-test. What is the key assumption being made, and is it likely to hold given the observed standard deviations?
- You know from a year of production logs that CTR has population standard deviation $\sigma$. You are testing a new arm with $n$ observations. Are you justified in using a Z-test, and why does the answer depend on more than just knowing $\sigma$?
- Both Z and t give critical value $\approx 1.96$ for the production run (large $n$). A colleague says "so it doesn't matter which we use." Is this always true, or are there scenarios at large $n$ where the choice still matters?
- Calculate how many more samples you need in the pilot (beyond $n = 12$ per arm) to bring the pilot's power above 80% for a true 0.15-unit CTR difference, using and .