Z-Test vs t-Test

Apr 11, 2026 · 8 min read · By mohammed.vasim
Statistics · Math · Data Science

You are comparing two model versions. You run the experiment and face the first choice: Z-test or t-test? The short answer: use the t-test almost always. But understanding why — and exactly where the boundary lies — reveals something important about how uncertainty compounds in inference.

The Dataset

Two recommendation algorithm variants are compared on click-through rate (CTR). This is the same A/B test scenario used in posts 3 and 4, now viewed through the lens of which test is appropriate:

  • Small pilot study: $n = 12$ requests per arm, CTR_A = 8/12 = 0.667, CTR_B = 10/12 = 0.833
  • Large production run: requests per arm (post 3's data)

For the pilot: the team must decide between Z and t. For the production run, the choice barely matters.

What Actually Differs

Both tests answer the same question (is the observed difference consistent with $H_0$?), but they make different assumptions:

| Feature | Z-test | t-test |
|---|---|---|
| Population SD ($\sigma$) | Known | Unknown (estimated from the sample as $s$) |
| Distribution | Standard Normal | t-distribution |
| Critical values | Fixed (e.g., 1.96) | Depends on df |
| Tail weight | Lighter | Heavier (accounts for uncertainty) |
| When to use | $\sigma$ known, or $n$ large | Almost always |

The distribution comparison shows why the choice matters at small $n$:

[Figure: density f(x) against x for the Standard Normal (Z), the t-distribution with df = 4 (n = 5), and the t-distribution with df = 30, with t-crit = 2.776 (df = 4) and z-crit = 1.96 marked. Heavy tails at low df push critical values outward, so using Z would be too liberal.]

The Key Insight: Unknown Variance Compounds Uncertainty

When you use $s$ in place of $\sigma$, you are plugging in an estimate that is itself a random variable: $s$ varies from sample to sample. The t-distribution absorbs this extra uncertainty through heavier tails. With small samples, $s$ can be a poor estimate of $\sigma$, so the tails are very heavy. As $n$ grows, $s$ converges to $\sigma$, and the t-distribution converges to the standard normal.
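To make the "itself varies from sample to sample" point concrete, here is a minimal simulation sketch. The seed, the true SD of 1.0, and the sample sizes are arbitrary assumptions chosen for illustration, and `spread_of_s` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
sigma = 1.0                     # assumed true population SD (illustrative)


def spread_of_s(n, n_draws=10_000):
    """Return the 2.5th and 97.5th percentiles of the sample SD for Normal samples of size n."""
    samples = rng.normal(0.0, sigma, size=(n_draws, n))
    s = samples.std(axis=1, ddof=1)  # ddof=1 gives the usual sample SD
    return np.percentile(s, [2.5, 97.5])


for n in (5, 30, 300):
    lo, hi = spread_of_s(n)
    print(f"n={n:>3}: 95% of sample SDs fall in [{lo:.2f}, {hi:.2f}] (true sigma = {sigma})")
```

The interval for small `n` should be far wider than for large `n`, which is the extra uncertainty the t-distribution's heavier tails account for.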

Pilot Study: Where the Choice Matters

For the pilot with $n = 12$ per arm, suppose you treat CTR as approximately Normal (not ideal for binary data, but illustrative):

Sample means: $\bar{x}_A = 0.667$, $\bar{x}_B = 0.833$

Sample SDs: $s_A$, $s_B$ (pooled: $s_p$)

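Using t (the correct choice here): compare $|t|$ with $t_{crit}(df=22) = 2.074$. Since $|t| < 2.074$: fail to reject $H_0$.

Using Z (incorrect): compare the same statistic with $z_{crit} = 1.96$. Since $|z| < 1.96$: still fail to reject here, but the margin is tighter, and for slightly larger observed differences Z would incorrectly reject.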

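| Phase | Formula | Values | Result |
|---|---|---|---|
| Pooled std | $s_p = \sqrt{\frac{(n_A-1)s_A^2 + (n_B-1)s_B^2}{n_A+n_B-2}}$ | | |
| Standard error | $SE = s_p\sqrt{\frac{1}{n_A}+\frac{1}{n_B}}$ | | |
| t statistic | $t = \frac{\bar{x}_B - \bar{x}_A}{SE}$ | | |
| Decision | $\lvert t \rvert < t_{crit}(df=22) = 2.074$ | | Fail to reject $H_0$ |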
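The pilot calculation can be sketched in a few lines. This is one possible reading, not necessarily the numbers used originally: it assumes each request is coded as 1 (click) or 0 (no click) and that the sample SDs use `ddof=1`. The array names are illustrative.

```python
import numpy as np
from scipy import stats

# Assumption: each request is coded 1 (click) or 0 (no click).
arm_a = np.array([1] * 8 + [0] * 4)    # 8 clicks out of 12
arm_b = np.array([1] * 10 + [0] * 2)   # 10 clicks out of 12

n_a, n_b = len(arm_a), len(arm_b)
s_a, s_b = arm_a.std(ddof=1), arm_b.std(ddof=1)

# Pooled SD, standard error, and t statistic for the equal-variance two-sample test
s_pooled = np.sqrt(((n_a - 1) * s_a**2 + (n_b - 1) * s_b**2) / (n_a + n_b - 2))
se = s_pooled * np.sqrt(1 / n_a + 1 / n_b)
t_stat = (arm_b.mean() - arm_a.mean()) / se

df = n_a + n_b - 2
t_crit = stats.t.ppf(0.975, df)   # two-sided 5% critical value for t
z_crit = stats.norm.ppf(0.975)    # two-sided 5% critical value for Z

# Cross-check against SciPy's pooled two-sample t-test
t_scipy, p_scipy = stats.ttest_ind(arm_b, arm_a, equal_var=True)

print(f"s_pooled={s_pooled:.3f}, SE={se:.3f}, t={t_stat:.3f}, df={df}")
print(f"t_crit={t_crit:.3f}, z_crit={z_crit:.3f}, p (t-test)={p_scipy:.3f}")
print("Reject H0 with t?", abs(t_stat) > t_crit)
print("Reject H0 with Z?", abs(t_stat) > z_crit)
```

Under these assumptions both comparisons fail to reject, with the Z threshold sitting closer to the statistic than the t threshold.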

The Error Inflation at Small n

The table below quantifies what happens if you use Z critical values when you should use t:

| Sample size ($n$) | df | t-critical | Z-critical | True Type I error using Z |
|---|---|---|---|---|
| 5 | 4 | 2.776 | 1.96 | ~12% instead of 5% |
| 10 | 9 | 2.262 | 1.96 | ~8% instead of 5% |
| 20 | 19 | 2.093 | 1.96 | ~6.5% instead of 5% |
| 30 | 29 | 2.045 | 1.96 | ~6% instead of 5% |
| 100 | 99 | 1.984 | 1.96 | ~5.3% instead of 5% |

With $n = 5$, using Z more than doubles your false positive rate. With $n = 100$, the difference is negligible.
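The last column can be checked directly from the t-distribution's tail probability rather than by simulation. The sketch below assumes a one-sample test with df = n − 1 and a two-sided 5% level; `type1_error_with_z` is a hypothetical helper name.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # two-sided 5% Z critical value


def type1_error_with_z(n):
    """Probability that a t statistic with df = n - 1 exceeds z_crit in absolute value under H0."""
    return 2 * stats.t.sf(z_crit, df=n - 1)


for n in (5, 10, 20, 30, 100):
    t_crit = stats.t.ppf(0.975, df=n - 1)
    print(f"n={n:>3}  df={n - 1:>2}  t_crit={t_crit:.3f}  "
          f"true Type I error with Z: {type1_error_with_z(n):.1%}")
```

Because the statistic truly follows a t-distribution when $s$ replaces $\sigma$, this tail area is the real false positive rate of a nominal 5% Z-based rule.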

Power Analysis for Choosing Between Tests

For the production run, both tests produce nearly identical results. For the pilot ($n = 12$ per arm), the t-test has meaningfully less power because its critical values are wider. Suppose you need 80% power to detect a 0.1-unit CTR difference.

Your pilot with 12 per arm has only about 12% power for this effect. The Z vs t question becomes moot if the study is severely underpowered.
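A power figure like this can be approximated with the noncentral t-distribution. In the sketch below, `sigma_assumed = 0.30` is a placeholder chosen purely for illustration, and both function names are hypothetical. The calculation assumes equal arm sizes, a pooled two-sample t-test, and a two-sided 5% level.

```python
import numpy as np
from scipy import stats


def two_sample_t_power(delta, sigma, n_per_arm, alpha=0.05):
    """Power of a two-sided pooled two-sample t-test with equal arm sizes."""
    df = 2 * n_per_arm - 2
    ncp = delta / (sigma * np.sqrt(2.0 / n_per_arm))  # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability of landing in either rejection region under the alternative
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)


def n_for_power(delta, sigma, target=0.80, alpha=0.05):
    """Smallest per-arm sample size whose power reaches the target (simple linear search)."""
    n = 2
    while two_sample_t_power(delta, sigma, n, alpha) < target:
        n += 1
    return n


sigma_assumed = 0.30  # ASSUMPTION: illustrative SD, not a value taken from the post
power_pilot = two_sample_t_power(delta=0.1, sigma=sigma_assumed, n_per_arm=12)
n_required = n_for_power(delta=0.1, sigma=sigma_assumed)

print(f"Power with 12 per arm: {power_pilot:.1%}")
print(f"Per-arm n for 80% power: {n_required}")
```

Changing `sigma_assumed` moves both numbers substantially, so the SD estimate matters at least as much as the Z vs t choice when planning the study.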

Python Code

```python
import numpy as np
from scipy import stats

np.random.seed(42)
n_simulations = 10000
n_small = 10
mu = 0.70  # true CTR
sigma = 0.20
alpha = 0.05

# Two-sided critical values at the same alpha
z_crit = stats.norm.ppf(1 - alpha / 2)
t_crit_small = stats.t.ppf(1 - alpha / 2, n_small - 1)

rejections_z = 0
rejections_t = 0

for _ in range(n_simulations):
    # H0 is true by construction: every sample is drawn with mean mu
    sample = np.random.normal(mu, sigma, n_small)
    sample_mean = np.mean(sample)
    sample_std = np.std(sample, ddof=1)
    se = sample_std / np.sqrt(n_small)

    # Same statistic for both tests; only the critical value differs
    stat = (sample_mean - mu) / se

    if abs(stat) > z_crit:
        rejections_z += 1
    if abs(stat) > t_crit_small:
        rejections_t += 1

print(f"Target Type I error: {alpha}")
print(f"Z-test Type I error: {rejections_z/n_simulations:.4f}  (inflated)")
print(f"t-test Type I error: {rejections_t/n_simulations:.4f}  (correct)")
print(f"\nZ critical value: {z_crit:.4f}")
print(f"t critical value (df={n_small-1}): {t_crit_small:.4f}")
```

```
Target Type I error: 0.05
Z-test Type I error: 0.0742  (inflated)
t-test Type I error: 0.0513  (correct)

Z critical value: 1.9600
t critical value (df=9): 2.2622
```

Mathematical Relationship

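As $n \to \infty$ (so $df = n - 1 \to \infty$):

$$t_{df} \to N(0, 1), \qquad t_{crit}(df) \to z_{crit} = 1.96$$
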
The convergence is fast: by $n = 30$, the difference in critical values is less than 5%. By $n = 100$, it is less than 2%. This is why large-sample textbooks sometimes use Z throughout, but the t-test costs nothing extra computationally and stays valid at every sample size.
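One way to see how quickly the critical values converge is to compute the relative gap between them. This sketch assumes a one-sample setting (df = n − 1) and a two-sided 5% level; `relative_gap` is a hypothetical helper name.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)


def relative_gap(n):
    """Relative excess of the two-sided 5% t critical value (df = n - 1) over z_crit."""
    t_crit = stats.t.ppf(0.975, df=n - 1)
    return (t_crit - z_crit) / z_crit


for n in (5, 10, 30, 100, 1000):
    print(f"n={n:>4}: t_crit exceeds z_crit by {relative_gap(n):.2%}")
```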

Decision Rule

Is population sigma known?
  • YES: Z-test valid
  • NO: is n >= 100 (large sample)?
      • YES: Z approx valid; t better
      • NO: t-test REQUIRED
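The decision rule is small enough to write as a function. This is only a sketch of the rule as stated; `choose_test` and its `large_n` parameter are hypothetical names, and the threshold of 100 is the post's rule of thumb rather than a hard boundary.

```python
def choose_test(sigma_known: bool, n: int, large_n: int = 100) -> str:
    """Return the recommendation from the decision rule for a given situation."""
    if sigma_known:
        return "Z-test valid"
    if n >= large_n:
        return "Z approx valid; t better"
    return "t-test REQUIRED"


print(choose_test(sigma_known=False, n=12))    # pilot-sized sample
print(choose_test(sigma_known=False, n=5000))  # large sample, sigma still estimated
print(choose_test(sigma_known=True, n=12))     # sigma genuinely known
```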

When in doubt, use the t-test. It gives the correct answer whether you have 5 observations or 5 million. The Z-test is a special case of the t-test at $df = \infty$.

The Z vs t distinction builds on the t-distribution (post 6), which derives why replacing $\sigma$ with $s$ requires heavier tails. It resolves a question left open by the Z-test (post 5): what happens when $\sigma$ is not known? Post 7 shows the t-test in all three variants. The same reasoning behind Welch's adjusted degrees of freedom (that unequal variance reduces effective information) applies in ANOVA (posts 14–16) when comparing multiple groups. Power analysis (post 9) shows that this choice also affects how many samples you need for a given effect size.

Honest Limitations

Even the t-test has assumptions: independence of observations, and approximate normality (relaxed by the CLT for large $n$). For proportions and counts, neither Z nor t is ideal for small samples; use exact binomial tests or Fisher's exact test instead. For severely non-Normal data at any sample size, consider bootstrap confidence intervals or non-parametric tests (Mann-Whitney U, Wilcoxon signed-rank) rather than either Z or t.
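For the pilot's binary click data, an exact test is easy to run. The sketch below arranges the pilot counts (8/12 and 10/12 clicks) as a 2×2 table and applies `scipy.stats.fisher_exact`. The table layout and variable names are illustrative choices.

```python
from scipy import stats

# Rows: arm A, arm B. Columns: clicks, non-clicks.
table = [[8, 4],
         [10, 2]]

odds_ratio, p_value = stats.fisher_exact(table, alternative="two-sided")

print(f"Sample odds ratio: {odds_ratio:.2f}")
print(f"Fisher exact p-value: {p_value:.3f}")
print("Reject H0 at alpha = 0.05?", p_value < 0.05)
```

This avoids the Normal approximation entirely, which is the safer route when each arm has only a dozen binary observations.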

Test Your Understanding

  1. For the pilot study ($n = 12$ per arm), compute the exact Type I error probability if Z critical values are used with df = 22. Use the t-distribution's CDF to find $P(|T_{22}| > 1.96)$.
  2. A team has CTR observations per arm with , . They want to use a pooled t-test. What is the key assumption being made, and is it likely to hold given the observed standard deviations?
  3. You know from a year of production logs that CTR has a known population standard deviation $\sigma$. You are testing a new arm with $n$ observations. Are you justified in using a Z-test, and why does the answer depend on more than just knowing $\sigma$?
  4. Both Z and t give critical value $\approx 1.96$ for the production run. A colleague says "so it doesn't matter which we use." Is this always true, or are there scenarios at large $n$ where the choice still matters?
  5. Calculate how many more samples you need in the pilot (beyond $n = 12$ per arm) to bring the pilot's power above 80% for a true 0.15-unit CTR difference, using and .
