Z-Test

Apr 11, 2026 • 10 min read • By Mohammed Vasim

Statistics, Math, Data Science

Historical benchmarks show the baseline model achieves 80% accuracy with known σ=0.05 (computed from millions of CV runs). Your new model achieves x̄=0.838 across 6 folds. Is that 3.8pp improvement real, or within expected sampling variation?

When σ is known — genuinely known, not estimated from the current sample — the Z-test is the exact tool. When σ is estimated from small samples, use the t-test instead.

When to Use the Z-Test

Condition	Required	If Failed
σ known (or n ≥ 30, s ≈ σ)	Use z-test	Use t-test
Normal data or n ≥ 30 (CLT)	Proceed	Bootstrap or non-parametric
Independent observations	Proceed	Paired or repeated-measures test

The most common mistake: using z when you should use t. With n=6 and unknown σ, always use the t-distribution. The z-test is exact only when σ is truly known.
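To see why the distinction matters at n=6, here is a minimal simulation sketch using only the Python standard library. It assumes normally distributed fold accuracies and reuses the article's μ₀ = 0.80 and σ = 0.05; the function name, trial count, and seed are illustrative choices, not part of the original analysis. Every sample is drawn with H₀ true, the statistic is built with the sample s in place of σ, and it is then compared against the z critical value 1.645.

```python
import math
import random
import statistics


def false_positive_rate(n=6, mu0=0.80, sigma=0.05, z_crit=1.645,
                        trials=20_000, seed=0):
    """Share of H0-true samples rejected (one-tailed, upper) when the
    sample SD replaces sigma but the z critical value is still used."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        s = statistics.stdev(sample)  # sample SD (ddof = 1)
        stat = (x_bar - mu0) / (s / math.sqrt(n))
        rejections += stat > z_crit
    return rejections / trials


rate = false_positive_rate()
print(f"False-positive rate with s in place of sigma: {rate:.3f} (nominal 0.050)")
```

If substituting s for σ were harmless, the printed rate would sit near 0.05. At n=6 it should come out noticeably higher, which is the gap the heavier tails of the t-distribution are meant to close.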

The Anchors

One-sample z-test (model vs baseline):

text

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
x̄ = 0.838, n = 6
H₀: μ = 0.80 (baseline), σ = 0.05 (known from historical data)

Two-sample z-test for proportions (A/B test):

text

Version A: n=5000, clicks=160, CTR=3.20%
Version B: n=5000, clicks=185, CTR=3.70%

One-Sample Z-Test: Step by Step

Step 1 — State H₀ and H₁:

H₀: μ = 0.80 (the new model performs at baseline level)

H₁: μ > 0.80 (one-tailed: we only care about improvement, not regression)

Significance level α = 0.05.

Why one-tailed? The new model was built to improve on the baseline, so the direction is specified before seeing any data. A model that is worse than the baseline and a model that merely matches it lead to the same decision (don't deploy), so only evidence of improvement can change what we do.

Step 2 — Test statistic:

Formula: Z = (x̄ − μ₀) / (σ / √n)

Numerator (the signal): x̄ − μ₀ = 0.838 − 0.80 = 0.038

Denominator (the ruler — standard error): σ / √n = 0.05 / √6 = 0.05 / 2.449 = 0.02041

Z = 0.038 / 0.02041 = 1.863

The numerator is the distance between the observed mean and the null value. The denominator scales that distance by how much sample means vary under H₀. Z = 1.863 means the observed mean is 1.863 standard errors above the baseline.
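The claim that σ/√n measures how much sample means vary under H₀ can be checked empirically. The sketch below assumes normally distributed fold accuracies and uses the anchor values μ₀ = 0.80, σ = 0.05, n = 6; the seed and the number of simulated experiments are arbitrary. It draws many 6-fold experiments with H₀ true and compares the spread of their means with the formula.

```python
import math
import random
import statistics

mu0, sigma, n = 0.80, 0.05, 6  # values from the one-sample anchor
rng = random.Random(42)         # arbitrary seed, for reproducibility only

# Simulate many 6-fold experiments under H0 and record each sample mean
means = [
    statistics.fmean([rng.gauss(mu0, sigma) for _ in range(n)])
    for _ in range(50_000)
]

empirical_se = statistics.pstdev(means)
theoretical_se = sigma / math.sqrt(n)
print(f"empirical SD of sample means: {empirical_se:.5f}")
print(f"sigma / sqrt(n):              {theoretical_se:.5f}")
```

The two printed values should agree closely, which is why dividing the 0.038 gap by this quantity expresses it in standard-error units.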

Step 3 — P-value and critical value:

For one-tailed test (H₁: μ > μ₀), α=0.05: critical value z* = 1.645.

p-value = P(Z ≥ 1.863 | H₀ true) = 1 − Φ(1.863) = 1 − 0.9688 = 0.0312

Step 4 — Decision:

Z = 1.863 > z* = 1.645. p = 0.0312 < α = 0.05. Reject H₀.

The observed accuracy is significantly above baseline at α=0.05. The evidence is inconsistent with the model performing at 80%.

Step 5 — Effect size:

Cohen's d = (x̄ − μ₀) / σ = (0.838 − 0.80) / 0.05 = 0.76

Medium-to-large effect. The model achieves accuracy 0.76 SDs above baseline.

Trace table:

Step	Quantity	Formula	Value
Sample mean	x̄	Σxᵢ/n	0.838
Numerator	signal	x̄ − μ₀ = 0.838 − 0.80	0.038
Denominator	SE	σ/√n = 0.05/√6	0.02041
Test statistic	Z	numerator/SE	1.863
p-value	one-tailed	1 − Φ(1.863)	0.031
Effect size	Cohen's d	(x̄ − μ₀)/σ	0.76

One-Tailed vs Two-Tailed

The same Z=1.863 leads to different conclusions depending on the test type:

	One-tailed (H₁: μ > 0.80)	Two-tailed (H₁: μ ≠ 0.80)
Critical value (α=0.05)	z* = 1.645	z* = ±1.960
Z = 1.863 in region?	Yes → Reject	No (1.863 < 1.96) → FTR
p-value	0.031	0.062
When to use	Direction specified a priori	No prior directional prediction

Pre-registration rule: choose one-tailed or two-tailed before seeing the data. Switching to one-tailed after observing Z=1.863 (to achieve p<0.05) is p-hacking — it exploits the knowledge of the result.
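The cost of choosing the tail after the fact can be shown with a small simulation. This is a sketch under the assumption that Z is exactly standard normal when H₀ is true; the function name, trial count, and seed are illustrative. It compares a one-tailed test whose direction is fixed in advance with one whose direction is picked to match the sign of the observed Z.

```python
import random
from statistics import NormalDist


def type_i_error(trials=100_000, alpha=0.05, seed=1):
    """Return (pre-registered rate, post-hoc-tail rate) of false rejections.
    H0 is true in every simulated trial."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
    preregistered = 0
    tail_chosen_after = 0
    for _ in range(trials):
        z = rng.gauss(0.0, 1.0)               # Z under H0
        preregistered += z > z_crit           # direction fixed in advance
        tail_chosen_after += abs(z) > z_crit  # direction follows the data
    return preregistered / trials, tail_chosen_after / trials


honest, hacked = type_i_error()
print(f"pre-registered one-tailed:  {honest:.3f}")
print(f"tail chosen after the fact: {hacked:.3f}")
```

Picking the tail after seeing the sign is equivalent to a two-tailed test run at 2α, so the second rate should land near twice the nominal level.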

Two-Sample Z-Test for Proportions

Testing whether CTR_A ≠ CTR_B in an A/B experiment.

H₀: p_A = p_B (no difference in CTR)

H₁: p_A ≠ p_B (two-tailed)

Pooled proportion under H₀: p̂_pool = (clicks_A + clicks_B) / (n_A + n_B) = (160 + 185) / 10000 = 0.0345

Standard error: SE = √(p̂_pool × (1 − p̂_pool) × (1/n_A + 1/n_B)) = √(0.0345 × 0.9655 × 0.0004) = √(0.00001332) = 0.003650

Z statistic: Z = (p̂_B − p̂_A) / SE = (0.0370 − 0.0320) / 0.003650 = 0.0050 / 0.003650 = 1.370

p-value (two-tailed): p = 2 × P(Z ≥ 1.370) = 2 × (1 − Φ(1.370)) = 2 × 0.0854 = 0.171

Decision: p = 0.171 > α = 0.05 → fail to reject H₀. The observed 0.5pp CTR difference is well within what sampling variation alone would produce at n = 5,000 per arm.

Cohen's h: h = 2 × arcsin(√0.0370) − 2 × arcsin(√0.0320) = 0.027, a very small effect.
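The arithmetic in this section is easy to get wrong, particularly the 1/n_A + 1/n_B term, so it can help to recompute everything from the raw counts. The function below is a minimal standard-library sketch of the pooled two-proportion z-test described above; its name and return shape are illustrative, not a library API.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled z-test for H0: p_A = p_B.
    Returns (z, p_value, standard_error)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under H0 both arms share one rate, so the counts are pooled
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, se


# A/B anchor: 160 clicks of 5000 vs 185 clicks of 5000
z, p_value, se = two_proportion_ztest(160, 5000, 185, 5000)
print(f"SE = {se:.6f}, Z = {z:.3f}, p = {p_value:.3f}")
```

Returning the standard error alongside Z and p makes it straightforward to compare each intermediate quantity with a hand calculation.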

Assumption Checking

python

import numpy as np
from scipy import stats

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
mu0 = 0.80
sigma_known = 0.05  # known from historical data

# One-sample z-test
x_bar = accuracy.mean()
n = len(accuracy)
se = sigma_known / np.sqrt(n)
z_stat = (x_bar - mu0) / se
p_one_tail = 1 - stats.norm.cdf(z_stat)
z_crit = stats.norm.ppf(0.95)  # one-tailed α=0.05

print("=== One-Sample Z-Test ===")
print(f"x̄={x_bar:.4f}, μ₀={mu0}, σ={sigma_known}, n={n}")
print(f"SE = σ/√n = {sigma_known}/√{n} = {se:.5f}")
print(f"Z = (x̄ - μ₀)/SE = ({x_bar:.4f} - {mu0}) / {se:.5f} = {z_stat:.4f}")
print(f"p-value (one-tailed) = {p_one_tail:.4f}")
print(f"z* (α=0.05, one-tailed) = {z_crit:.4f}")
print(f"Decision: {'Reject H₀' if z_stat > z_crit else 'Fail to reject H₀'}")
cohens_d = (x_bar - mu0) / sigma_known
print(f"Cohen's d = {cohens_d:.4f}")

# Normality assumption check
w_stat, p_shapiro = stats.shapiro(accuracy)
print(f"\nShapiro-Wilk: W={w_stat:.4f}, p={p_shapiro:.4f}")
print("→ " + ("Normality assumption tenable" if p_shapiro > 0.05 else "Normality violated — use Wilcoxon"))

# Two-sample proportions z-test
print("\n=== Two-Sample Z-Test for Proportions ===")
n1, n2 = 5000, 5000
clicks_a, clicks_b = 160, 185
p_a, p_b = clicks_a/n1, clicks_b/n2
p_pool = (clicks_a + clicks_b) / (n1 + n2)
se_prop = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
z_prop = (p_b - p_a) / se_prop
p_two_tail = 2 * (1 - stats.norm.cdf(abs(z_prop)))

print(f"p̂_A={p_a:.4f}, p̂_B={p_b:.4f}")
print(f"p̂_pool={(p_pool):.4f}, SE={se_prop:.6f}")
print(f"Z={z_prop:.4f}, p-value (two-tailed)={p_two_tail:.4f}")
h = 2*np.arcsin(np.sqrt(p_b)) - 2*np.arcsin(np.sqrt(p_a))
print(f"Cohen's h = {abs(h):.4f}")
print(f"Decision: {'Reject H₀' if abs(z_prop) > 1.96 else 'Fail to reject H₀'}")

# One-tailed vs two-tailed comparison
print("\n=== Tail Comparison (One-sample, Z=1.863) ===")
print(f"One-tailed: z*={stats.norm.ppf(0.95):.3f}, p={p_one_tail:.4f} → {'Reject' if p_one_tail < 0.05 else 'FTR'}")
p_two = 2 * (1 - stats.norm.cdf(abs(z_stat)))
print(f"Two-tailed: z*={stats.norm.ppf(0.975):.3f}, p={p_two:.4f} → {'Reject' if p_two < 0.05 else 'FTR'}")

text

=== One-Sample Z-Test ===
x̄=0.8383, μ₀=0.8, σ=0.05, n=6
SE = σ/√n = 0.05/√6 = 0.02041
Z = (x̄ - μ₀)/SE = (0.8383 - 0.8) / 0.02041 = 1.8779
p-value (one-tailed) = 0.0302
z* (α=0.05, one-tailed) = 1.6449
Decision: Reject H₀
Cohen's d = 0.7667

Shapiro-Wilk: W=0.9478, p=0.7256
→ Normality assumption tenable

=== Two-Sample Z-Test for Proportions ===
p̂_A=0.0320, p̂_B=0.0370
p̂_pool=0.0345, SE=0.003650
Z=1.3698, p-value (two-tailed)=0.1708
Cohen's h = 0.0274
Decision: Fail to reject H₀

=== Tail Comparison (One-sample, Z=1.863) ===
One-tailed: z*=1.645, p=0.0302 → Reject
Two-tailed: z*=1.960, p=0.0604 → FTR

The script keeps the unrounded mean x̄ = 0.83833, so it reports Z = 1.8779 (one-tailed p = 0.0302) rather than the hand-computed Z = 1.863, which rounds x̄ to 0.838 first. The decisions are the same either way.

Z-Test Variants Summary

Test Type	When	SE Formula	Anchor Result
One-sample mean	Compare mean to benchmark (σ known)	σ/√n	Z=1.863, p=0.031
Two-sample means	Compare two groups (σ known)	√(σ²/n₁ + σ²/n₂)	—
Two-sample proportions	Compare CTR, click rate	√(p̂(1−p̂)(1/n₁+1/n₂))	Z=1.370, p=0.171
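The two-sample means row has no worked anchor, so here is a minimal sketch of that variant using the SE formula from the table. The input numbers are hypothetical and chosen only for illustration; the function name is likewise an assumption, and a common known σ for both groups is assumed, as in the table.

```python
import math
from statistics import NormalDist


def two_sample_mean_ztest(mean_1, n_1, mean_2, n_2, sigma):
    """Two-sided z-test for H0: mu_1 = mu_2 with a common, known sigma."""
    se = math.sqrt(sigma**2 / n_1 + sigma**2 / n_2)
    z = (mean_1 - mean_2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical inputs, not taken from the article's anchors
z, p_value = two_sample_mean_ztest(mean_1=0.84, n_1=50,
                                   mean_2=0.82, n_2=50, sigma=0.05)
print(f"Z = {z:.3f}, p = {p_value:.4f}")
```

If the two groups had different known standard deviations, the two σ² terms under the square root would simply take their own values.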

Test Your Understanding

The one-sample z-test gives p=0.031 (one-tailed) and p=0.062 (two-tailed) for the same Z=1.863. Your team decides to report the one-tailed result because "we only care about improvement." Is this valid? What needs to be true for the one-tailed test to be appropriate, and what makes it p-hacking if not?
The z-test uses σ=0.05 (known from historical data). Your new model has s=0.0512 from the 6 folds (sample standard deviation, n − 1 in the denominator). Compute the z-test using σ=0.05 and the t-test using s=0.0512. Compare the p-values and explain why they differ.
The two-sample proportions test gives Z=1.370, p=0.171 (fail to reject). With n=5,000 per arm, compute the power to detect the observed 0.50pp difference. What does low power mean for interpreting the non-significant result?
You want to test whether mean API latency decreased from μ₀=120ms after a deployment. You have 200 post-deployment measurements and σ=35ms (known from 6 months of logs). State H₀ and H₁, compute the SE, and explain at what x̄ you would reject H₀ at α=0.05 (one-tailed).
The pooled proportion p̂_pool=0.0345 is used in the two-sample SE formula rather than p̂_A or p̂_B separately. Why? What assumption about H₀ justifies this choice, and how would the SE calculation differ if you were constructing a CI rather than testing H₀?
