Z-Test vs t-Test

Apr 11, 2026 • 9 min read • By Mohammed Vasim

Statistics · Math · Data Science

You have six CV fold accuracy scores. Should you run a z-test or a t-test? The answer has real consequences: using the wrong one with n=6 inflates your Type I error from 5% to over 10%.

The Decision Rule — State It First

Use z-test when: σ is known AND (data is Normal OR n ≥ 30 by CLT).

Use t-test when: σ is unknown — which is almost always the case in practice.

In ML model evaluation, σ (the population standard deviation of accuracy scores) is not known in advance, so the t-test is the correct choice. The z-test applies primarily to:

Quality control (historical σ from millions of measurements)
Large-sample asymptotic inference (n ≥ 30, where s ≈ σ anyway)
Proportion tests (separate formula, similar logic)

For the accuracy anchors in this post:

```text
accuracy_small = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
# n=6, x̄=0.838, s=0.0512, σ unknown → USE T-TEST

accuracy_large: 50 CV folds, x̄=0.842, s=0.049, σ=0.05 from historical benchmarks
# n=50, σ known → Z-TEST is valid
```

The Only Formal Difference

The two test statistics differ by a single symbol:

Z-statistic: Z = (X̄ − μ₀) / (σ / √n) — divides by population SE (σ known, fixed number)

T-statistic: T = (X̄ − μ₀) / (S / √n) — divides by estimated SE (S computed from data)

That one substitution (σ → S) changes the reference distribution from N(0,1) to t(n−1).

Why S Produces Heavier Tails

S is a random variable — it varies from sample to sample. T is the ratio of two random quantities:

T = (X̄ − μ₀) / (S/√n) = Z / √(χ²(n−1)/(n−1))

This is the textbook definition of the t-distribution: a standard Normal divided by the square root of an independent chi-squared variable scaled by its degrees of freedom. The chi-squared denominator adds variability, pushing probability mass into the tails. As n grows, S stabilizes close to σ, and t(n−1) converges to N(0,1).
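A quick simulation can make the heavier tails concrete. The sketch below is illustrative rather than part of the original analysis: it assumes Normal data with arbitrarily chosen μ₀ = 0.80 and σ = 0.05, a fixed seed, and a hypothetical helper name `simulate_tail_rates`. It draws many samples of size n = 6 under H₀ and counts how often |Z| and |T| exceed 1.960.

```python
import numpy as np


def simulate_tail_rates(n=6, reps=200_000, mu_0=0.80, sigma=0.05, seed=0):
    """Return the fraction of |Z| and |T| values beyond 1.960 under H0.

    All parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    samples = rng.normal(loc=mu_0, scale=sigma, size=(reps, n))
    x_bar = samples.mean(axis=1)
    s = samples.std(axis=1, ddof=1)              # sample SD, differs per sample
    z = (x_bar - mu_0) / (sigma / np.sqrt(n))    # true sigma in the denominator
    t = (x_bar - mu_0) / (s / np.sqrt(n))        # estimated sigma in the denominator
    cutoff = 1.960
    return np.mean(np.abs(z) > cutoff), np.mean(np.abs(t) > cutoff)


z_rate, t_rate = simulate_tail_rates()
print(f"P(|Z| > 1.960) ≈ {z_rate:.3f}")
print(f"P(|T| > 1.960) ≈ {t_rate:.3f}")
```

With these settings the Z rate should land near 0.05 and the T rate at roughly double that, since the only change between the two lines is σ versus S in the denominator. The exact figures depend on the seed and the number of repetitions.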

What Happens When You Use Z Instead of T (Small n)

With n=6, df=5, α=0.05, two-tailed:

Correct critical value (t-test): t*(df=5, α=0.05) = 2.571
Wrong critical value (z-test): z* = 1.960

If you use z*=1.960 when you should use t*=2.571, you set a lower bar for rejecting H₀. Your actual Type I error rate is no longer 5%. It is P(|t(5)| > 1.960), which is about 10.7%.

Using z with df=5 more than doubles your false positive rate.
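To see where the two cutoffs part ways, here is a minimal sketch that checks one statistic against both critical values. The helper name `compare_decisions` and the example statistic 2.10 are assumptions chosen for illustration.

```python
from scipy import stats


def compare_decisions(statistic, n, alpha=0.05):
    """Check one test statistic against both two-tailed critical values."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return {
        "z_crit": float(z_crit),
        "t_crit": float(t_crit),
        "z_rejects": bool(abs(statistic) > z_crit),
        "t_rejects": bool(abs(statistic) > t_crit),
    }


# Hypothetical statistic from a sample of six observations
split = compare_decisions(2.10, n=6)
print(split)
```

Any statistic whose absolute value falls between the two critical values produces this split: the z cutoff rejects while the t cutoff does not.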

Critical Value Convergence

As df grows, t*(df) decreases toward z*=1.960:

n	df	t* (α=0.05, two-tailed)	z*	Difference
6	5	2.571	1.960	0.611
11	10	2.228	1.960	0.268
21	20	2.086	1.960	0.126
31	30	2.042	1.960	0.082
61	60	2.000	1.960	0.040
121	120	1.980	1.960	0.020
∞	∞	1.960	1.960	0.000

At df=30, the gap is only 0.082, which rarely changes a decision. This is why the "n≥30, use z" rule exists. The t-test needs no such cutoff: under the same assumptions (Normal data, or n large enough for the CLT) it is valid at any n and converges to z as n grows.
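The same convergence can be read as an error rate instead of a critical value. The sketch below, with the hypothetical helper name `actual_type1_with_z`, computes how often a t(df) statistic exceeds z* in absolute value, i.e. the real Type I error if z* is used where t* belongs.

```python
from scipy import stats

Z_CRIT = stats.norm.ppf(0.975)  # two-tailed z* for alpha = 0.05


def actual_type1_with_z(df):
    """Two-tailed rejection rate when z* is applied to a t(df) statistic."""
    return 2 * stats.t.sf(Z_CRIT, df=df)


type1_by_df = {df: actual_type1_with_z(df) for df in [5, 10, 20, 30, 60, 120]}
for df, rate in type1_by_df.items():
    print(f"df={df:>3}: actual Type I error = {rate:.4f}")
```

The rates should fall steadily toward 0.05 as df grows, mirroring the shrinking gap in the table.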

Side-by-Side Worked Example

Running both tests on the same small anchor (n=6), with H₀: μ = 0.80:

Z-test (pretending σ = 0.05 is known):

Z = (x̄ − μ₀) / (σ/√n) = (0.8383 − 0.80) / (0.05/√6) = 0.03833 / 0.02041 = 1.878
p-value (two-tailed): 2 × P(Z > 1.878) = 0.060
Critical value: z* = 1.960 → fail to reject (1.878 < 1.960)

T-test (σ unknown, use S = 0.0512):

T = (x̄ − μ₀) / (S/√n) = (0.8383 − 0.80) / (0.05115/√6) = 0.03833 / 0.02088 = 1.836
df=5, p-value (two-tailed): 2 × P(T(5) > 1.836) = 0.126
Critical value: t*(df=5) = 2.571 → fail to reject (1.836 < 2.571)

Note: T < Z here because S = 0.0512 is slightly larger than the assumed σ = 0.05. Both tests fail to reject at α=0.05, but the t-test's p-value is roughly twice the z-test's. The critical point: if the statistic fell between 1.960 and 2.571 (e.g., T=2.1), z would reject but t would not, and t would be correct.
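As a sanity check on the hand calculation, the T-statistic can be compared against SciPy's built-in one-sample t-test. This is a minimal sketch using `scipy.stats.ttest_1samp`, which is two-sided by default. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
mu_0 = 0.80

# Manual T from the formula T = (x̄ − μ₀) / (S/√n)
n = len(accuracy)
s = np.std(accuracy, ddof=1)  # ddof=1 gives the sample SD
t_manual = (np.mean(accuracy) - mu_0) / (s / np.sqrt(n))

# SciPy's one-sample t-test on the raw data
result = stats.ttest_1samp(accuracy, popmean=mu_0)

print(f"manual T = {t_manual:.3f}")
print(f"scipy  T = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

The two statistics should agree to floating-point precision. If they do not, the usual culprit is computing the SD with `ddof=0` instead of `ddof=1`.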

Large-Sample Worked Example (n=50)

Same test, larger sample — shows the convergence in action:

```text
accuracy_large: 50 CV folds, x̄=0.842, s=0.049
H₀: μ = 0.80, α = 0.05 (two-tailed)
```

Z-test (σ=0.05 known):

Z = (0.842 − 0.80) / (0.05/√50) = 0.042 / 0.00707 = 5.94

p-value: 2 × P(Z > 5.94) ≈ 0.000000003 (about 3 × 10⁻⁹) → Reject H₀

T-test (σ unknown, use S=0.049):

T = (0.842 − 0.80) / (0.049/√50) = 0.042 / 0.00693 = 6.06

df=49, p-value: 2 × P(T(49) > 6.06) ≈ 0.0000002 → Reject H₀

Critical values at n=50 (df=49):

	z-test	t-test	Difference
Critical value	1.960	2.010	0.050
Test statistic	5.94	6.06	0.12
p-value	≈ 0	≈ 0	—
Decision	Reject	Reject	Same

The critical values differ by only 0.050 — both test statistics blow past either threshold. With n=50, the choice between z and t is academic: they agree. The t-test remains technically correct (σ is still unknown), but the practical difference is negligible.

This is why the convergence table matters: at df=5, the gap is 0.611 (can change your decision); at df=49, it is 0.050 (changes your decision only if the statistic lands between 1.960 and 2.010).

The "n≥30, Use Z" Rule — and Its Failure Mode

Useful as a simplification: at df=30, the gap is only 0.082 and S ≈ σ in most cases, so the two tests are practically interchangeable.

Dangerous if followed blindly: with heavy-tailed or bimodal data, the CLT may not have kicked in by n=30, so neither z nor t is guaranteed valid. Check assumptions.

Conservative choice: default to t. Under the same assumptions the z-test needs (normality or CLT), it is valid at any n and converges to z at large n. You lose almost nothing by using t.
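The failure mode can be illustrated by simulation. The sketch below assumes lognormal data as a stand-in for heavily skewed scores; the distribution, its parameters, the seed, and the helper name `t_test_rejection_rate` are all assumptions made for illustration. It runs a two-sided t-test at n=30 against the true mean, so every rejection is a false positive.

```python
import numpy as np
from scipy import stats


def t_test_rejection_rate(n=30, reps=20_000, alpha=0.05, seed=0):
    """Fraction of two-sided one-sample t-tests that reject a true H0
    when the data are lognormal (heavily right-skewed)."""
    rng = np.random.default_rng(seed)
    # Lognormal(0, 1) has true mean exp(0.5); H0 is set to that true mean.
    true_mean = np.exp(0.5)
    samples = rng.lognormal(mean=0.0, sigma=1.0, size=(reps, n))
    result = stats.ttest_1samp(samples, popmean=true_mean, axis=1)
    return np.mean(result.pvalue < alpha)


skewed_rate = t_test_rejection_rate()
print(f"Rejection rate under a true H0, n=30: {skewed_rate:.3f} (nominal 0.050)")
```

A rate well above 0.05 here means n=30 was not enough for the CLT to repair this particular distribution. Real accuracy scores may behave better or worse, which is why checking the data matters more than checking n.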

Code

```python
from scipy import stats
import numpy as np

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
mu_0 = 0.80
sigma_assumed = 0.05  # hypothetically known
n = len(accuracy)
x_bar = np.mean(accuracy)
s = np.std(accuracy, ddof=1)

# Z-test (assuming sigma known)
Z = (x_bar - mu_0) / (sigma_assumed / np.sqrt(n))
p_z = 2 * (1 - stats.norm.cdf(abs(Z)))

# T-test (sigma unknown, use s)
T = (x_bar - mu_0) / (s / np.sqrt(n))
p_t = 2 * (1 - stats.t.cdf(abs(T), df=n-1))

print(f"x̄={x_bar:.4f}, s={s:.4f}, n={n}")
print(f"Z-test: Z={Z:.3f}, p={p_z:.4f}, critical z*=1.960")
print(f"T-test: T={T:.3f}, p={p_t:.4f}, critical t*(df=5)={stats.t.ppf(0.975, df=n-1):.3f}")

# Type I error if you use z* with df=5
z_crit = stats.norm.ppf(0.975)
actual_type1 = 2 * (1 - stats.t.cdf(z_crit, df=n-1))
print(f"\nIf you use z*={z_crit:.3f} with df=5:")
print(f"  Actual Type I error = {actual_type1:.4f} (target was 0.0500)")

# Critical value convergence table
print("\nCritical value convergence (α=0.05, two-tailed):")
print(f"{'df':>5} | {'t*':>7} | {'diff from z*':>12}")
for df in [5, 10, 20, 30, 60, 120]:
    t_star = stats.t.ppf(0.975, df=df)
    print(f"{df:>5} | {t_star:>7.3f} | {t_star - 1.960:>12.3f}")
```

```text
x̄=0.8383, s=0.0512, n=6
Z-test: Z=1.878, p=0.0604, critical z*=1.960
T-test: T=1.836, p=0.1259, critical t*(df=5)=2.571

If you use z*=1.960 with df=5:
  Actual Type I error = 0.1073 (target was 0.0500)

Critical value convergence (α=0.05, two-tailed):
   df |      t* | diff from z*
    5 |   2.571 |        0.611
   10 |   2.228 |        0.268
   20 |   2.086 |        0.126
   30 |   2.042 |        0.082
   60 |   2.000 |        0.040
  120 |   1.980 |        0.020
```

Decision Table

Situation	Use	Why
σ known, data Normal, any n	Z-test	Exact
σ known, non-normal, n ≥ 30	Z-test	CLT
σ unknown, data Normal, any n	T-test	Accounts for uncertainty in S
σ unknown, non-normal, n ≥ 30	T-test (≈ z-test)	CLT + S ≈ σ
σ unknown, non-normal, n < 30	Wilcoxon signed-rank	Non-parametric
σ unknown, n large (≥ 100)	Either	Nearly identical results

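When σ is unknown, the t-test is the safe parametric default: it is valid whenever a z-test would be and converges to z at large n. It does not rescue the small-n, non-Normal row, where a non-parametric test is the better fit.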
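The table can also be written as a small lookup function. This is a sketch under stated assumptions: the name `choose_test`, its boolean inputs, and the returned strings are hypothetical, and the one combination the table does not list (σ known, non-Normal, n < 30) is flagged rather than guessed.

```python
def choose_test(sigma_known: bool, data_normal: bool, n: int) -> str:
    """Map the decision table's rows to a recommendation string."""
    if data_normal or n >= 30:
        # Normality or the CLT justifies a parametric test
        if sigma_known:
            return "z-test"
        # Sigma unknown: t-test, which is nearly identical to z once n >= 100
        return "t-test (z-test is nearly identical)" if n >= 100 else "t-test"
    if not sigma_known:
        # Small, non-Normal sample with unknown sigma
        return "Wilcoxon signed-rank"
    return "not covered by the table"


print(choose_test(sigma_known=False, data_normal=True, n=6))
print(choose_test(sigma_known=True, data_normal=False, n=50))
print(choose_test(sigma_known=False, data_normal=False, n=10))
```

Whether the data count as Normal is a judgment the caller has to make from plots or diagnostics; the function only encodes what to do once that judgment is made.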

Test Your Understanding

With n=6 and σ unknown, you compute T=2.10. Using z* =1.960 you reject; using t*(df=5)=2.571 you fail to reject. Which decision is correct and why? What Type I error probability does the z-test decision carry in this scenario?
A colleague says "n=50 folds is plenty — z-test is fine." You look at the data and see extreme skewness (one fold has accuracy 0.45, the rest 0.85–0.92). Does n=50 guarantee z-test validity? What should you check instead?
The mathematical identity is T = Z / √(χ²(n−1)/(n−1)). As n grows, what happens to χ²(n−1)/(n−1) and why does that make T → Z? What theorem is responsible?
Both z and t fail to reject on the small anchor (n=6). A colleague argues "the tests agree, so the choice doesn't matter." Construct a scenario with n=6 where the two tests would give opposite decisions, and identify the T-statistic range where this split occurs.
For the large anchor (n=50, σ=0.05 known), Z=5.94 (highly significant). If you mistakenly used the t-test instead of z, would the conclusion change? At what threshold would using the wrong test actually matter for a conclusion?
