Z-Test

Apr 11, 2026 · 7 min read · By mohammed.vasim
Statistics · Math · Data Science

Your team has been logging API response latency for six months. You know from historical data (millions of requests) that the population mean latency is $\mu_0 = 120$ ms with population standard deviation $\sigma = 35$ ms. You deploy a new caching layer and measure latency on a sample of $n = 64$ recent requests. The sample mean drops to $\bar{x} = 108$ ms. Is that improvement real, or within normal random variation?

When you genuinely know $\sigma$, rather than just estimating it, the Z-test is the right tool. In practice, knowing $\sigma$ is rare, but this scenario legitimately qualifies: six months of production data gives you a stable, reliable estimate of $\sigma$ that can be treated as known. Understanding Z-tests thoroughly also builds the intuition that carries over to every t-test you will ever run.

The Dataset

  • Historical population: $\mu_0 = 120$ ms, $\sigma = 35$ ms (known from production logs)
  • Sample after caching: $\bar{x} = 108$ ms, $n = 64$ requests
  • Observed improvement: 12 ms (10% reduction)

When Z-Tests Are Valid

The Z-test applies when either:

  1. Population standard deviation $\sigma$ is truly known (from historical data at scale), or
  2. Sample size is large enough ($n \geq 30$ typically) that the CLT makes $\bar{X}$ approximately Normal and the distinction between Z and t becomes minor

The honest distinction: When $\sigma$ is known exactly, Z is exact. When the CLT justifies approximating with large $n$, Z is an approximation: valid, but still an approximation. The t-test is always the safer choice when $\sigma$ is estimated, because it accounts for the uncertainty in the sample standard deviation $s$.

The Z-statistic standardizes your sample mean:

$$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

Under $H_0$, this follows a standard normal distribution $N(0, 1)$.
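As a rough check of the claim that Z is approximately standard normal under the null hypothesis, the sketch below simulates many samples of 64 requests from a right-skewed latency distribution and standardizes each sample mean. The gamma latency model, the seed, and the number of simulations are illustrative assumptions, not something taken from the production logs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

mu_0, sigma, n = 120.0, 35.0, 64
n_sims = 20_000  # assumed number of simulated samples

# Assumed latency model: gamma distribution with mean 120 ms and std 35 ms.
shape = (mu_0 / sigma) ** 2
scale = sigma ** 2 / mu_0

# Each row is one simulated sample of n requests drawn with H0 true.
samples = rng.gamma(shape, scale, size=(n_sims, n))
z = (samples.mean(axis=1) - mu_0) / (sigma / np.sqrt(n))

z_crit = stats.norm.ppf(0.975)
reject_rate = np.mean(np.abs(z) > z_crit)

print(f"mean of Z: {z.mean():.3f}")
print(f"std of Z: {z.std():.3f}")
print(f"two-tailed rejection rate under H0: {reject_rate:.3f}")
```

If the Normal approximation holds, the mean of Z should land near 0, its standard deviation near 1, and the rejection rate near the nominal 0.05.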

Six-Step Procedure

Step 1: State hypotheses. We suspect latency decreased, but we want to detect a change in either direction (caching could also introduce overhead in some edge cases), so the test is two-tailed:

$H_0: \mu = 120$ ms (latency unchanged)

$H_1: \mu \neq 120$ ms (latency changed, two-tailed)

Significance level: $\alpha = 0.05$

Step 2: Verify conditions. $\sigma = 35$ ms known from historical data, $n = 64 \geq 30$. CLT conditions satisfied.

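Step 3: Calculate the test statistic.

$$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{108 - 120}{35 / \sqrt{64}} = \frac{-12}{4.375} = -2.743$$

Step 4: Find the critical value.

For $\alpha = 0.05$ two-tailed: $z_{\alpha/2} = \pm 1.96$

Step 5: Calculate the p-value.

$$p = 2 \times P(Z \leq -2.743) = 2 \times 0.0030 = 0.0061$$

Step 6: Decision.

$p = 0.0061 < \alpha = 0.05$: reject $H_0$. The latency improvement is statistically significant at $\alpha = 0.05$.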

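| Phase | Formula | Values | Result |
|---|---|---|---|
| Standard error | $\sigma / \sqrt{n}$ | $35 / \sqrt{64}$ | 4.375 ms |
| Z statistic | $(\bar{x} - \mu_0) / SE$ | $(108 - 120) / 4.375$ | $-2.743$ |
| One-tail probability | $P(Z \leq z)$ | $\Phi(-2.743)$ | 0.0030 |
| p-value (two-tailed) | $2 \times$ one-tail | $2 \times 0.0030$ | 0.0061 |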
Figure: standard normal density with the two-tailed rejection region (alpha = 0.05, beyond ±1.96) shaded and the observed Z = -2.743 marked in amber; the observed Z falls in the left rejection region, so H0 is rejected.

Effect Size

Statistical significance does not tell you how meaningful the improvement is. Cohen's d for a one-sample test:

$$d = \frac{\bar{x} - \mu_0}{\sigma} = \frac{108 - 120}{35} = -0.343$$

A Cohen's d of $|d| = 0.34$ is a small-to-medium effect. The improvement is real but modest relative to the spread of latency values.

| Cohen's d | Interpretation |
|---|---|
| 0.2 | Small |
| 0.5 | Medium |
| 0.8 | Large |

Power Analysis

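Power is the probability of correctly rejecting $H_0$ when it is false. For a two-tailed Z-test with $\alpha = 0.05$, $n = 64$, $\sigma = 35$, and true mean $\mu_1 = 108$ ms:

$$\text{Power} = \Phi(-z_{\alpha/2} + \delta) + 1 - \Phi(z_{\alpha/2} + \delta)$$

The non-centrality is $\delta = \frac{\mu_0 - \mu_1}{\sigma / \sqrt{n}} = \frac{12}{4.375} = 2.743$.

$$\text{Power} = \Phi(0.783) + 1 - \Phi(4.703) \approx 0.783$$

The test has about 78% power to detect a 12 ms improvement: decent, but below the typical 80% target. To achieve 80% power for this effect size, you would need $n \approx 67$ requests.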
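The sample size needed for a target power can be found by evaluating the power formula for increasing $n$. The sketch below does this with a simple linear search; the helper names `power_two_tailed` and `min_sample_size` are hypothetical and introduced only for this illustration.

```python
import math
from scipy import stats

def power_two_tailed(mu_0, mu_1, sigma, n, alpha=0.05):
    """Power of a two-tailed one-sample Z-test when the true mean is mu_1."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    ncp = abs(mu_0 - mu_1) / (sigma / math.sqrt(n))  # non-centrality
    # Probability of landing in either rejection tail under the true mean.
    return stats.norm.cdf(-z_crit + ncp) + stats.norm.cdf(-z_crit - ncp)

def min_sample_size(mu_0, mu_1, sigma, target_power=0.80, alpha=0.05):
    """Smallest n whose two-tailed power reaches target_power (linear search)."""
    n = 2
    while power_two_tailed(mu_0, mu_1, sigma, n, alpha) < target_power:
        n += 1
    return n

n_required = min_sample_size(120, 108, 35)
print(f"power at n = 64: {power_two_tailed(120, 108, 35, 64):.3f}")
print(f"smallest n for 80% power: {n_required}")
```

For this scenario the search stops at 67, which agrees with the closed-form approximation $n \approx \left( (z_{\alpha/2} + z_{\beta}) \, \sigma / (\mu_0 - \mu_1) \right)^2$ rounded up.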

```python
import numpy as np
from scipy import stats

mu_0 = 120   # historical mean latency (ms)
sigma = 35   # known population std (ms)
n = 64       # sample size
x_bar = 108  # sample mean after caching

se = sigma / np.sqrt(n)
z_stat = (x_bar - mu_0) / se
p_value = 2 * stats.norm.cdf(z_stat)  # z is negative, so cdf gives left tail

print(f"Standard error: {se:.3f} ms")
print(f"Z statistic: {z_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Reject H0: {p_value < 0.05}")

# 95% confidence interval
z_crit = stats.norm.ppf(0.975)
ci_lower = x_bar - z_crit * se
ci_upper = x_bar + z_crit * se
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}] ms")

# Power calculation
mu_1 = 108  # hypothesized true mean under H1
ncp = (mu_0 - mu_1) / se  # non-centrality parameter (positive)
power = stats.norm.cdf(-z_crit + ncp) + (1 - stats.norm.cdf(z_crit + ncp))
print(f"Power: {power:.4f}")

# Cohen's d
cohens_d = (x_bar - mu_0) / sigma
print(f"Cohen's d: {cohens_d:.4f}")
```

```
Standard error: 4.375 ms
Z statistic: -2.7429
p-value: 0.0061
Reject H0: True
95% CI: [99.43, 116.57] ms
Power: 0.7832
Cohen's d: -0.3429
```

The Three Variants

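One-sample Z-test: Compare a sample mean to a known population mean. Used above.

$$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

Two-sample Z-test: Compare means of two independent groups with known $\sigma_1$ and $\sigma_2$.

$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}}$$

Z-test for proportions: Compare a sample proportion $\hat{p}$ to a hypothesized value $p_0$.

$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}$$

Proportions tests are common in A/B testing (post 3 used this variant).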
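The two-sample and proportion variants can be sketched in a few lines. The function names and all input numbers below are hypothetical, chosen only to illustrate the calculations; the two-sample version tests equal means, and the proportion version uses the standard error under the null value.

```python
import math
from scipy import stats

def two_sample_z(x_bar_1, x_bar_2, sigma_1, sigma_2, n_1, n_2):
    """Two-tailed Z-test for H0: mu_1 == mu_2 with known population stds."""
    se = math.sqrt(sigma_1 ** 2 / n_1 + sigma_2 ** 2 / n_2)
    z = (x_bar_1 - x_bar_2) / se
    return z, 2 * stats.norm.sf(abs(z))

def proportion_z(successes, n, p_0):
    """Two-tailed Z-test for H0: p == p_0, using the null standard error."""
    p_hat = successes / n
    se = math.sqrt(p_0 * (1 - p_0) / n)
    z = (p_hat - p_0) / se
    return z, 2 * stats.norm.sf(abs(z))

# Hypothetical inputs, for illustration only.
z_two, p_two = two_sample_z(108, 120, 35, 35, 64, 64)
z_prop, p_prop = proportion_z(successes=130, n=1000, p_0=0.10)

print(f"two-sample:  Z = {z_two:.3f}, p = {p_two:.4f}")
print(f"proportion:  Z = {z_prop:.3f}, p = {p_prop:.4f}")
```

Note that with two groups of 64 and the same 12 ms difference, the two-sample standard error is larger than in the one-sample case, because both group means carry sampling noise.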

Critical Values Reference

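| Significance Level | Two-tailed | One-tailed (right) |
|---|---|---|
| $\alpha = 0.10$ | $\pm 1.645$ | $1.282$ |
| $\alpha = 0.05$ | $\pm 1.960$ | $1.645$ |
| $\alpha = 0.01$ | $\pm 2.576$ | $2.326$ |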
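Standard normal critical values do not need to be memorized; they can be reproduced from the inverse CDF. A minimal sketch using `scipy.stats.norm.ppf`, with the three significance levels chosen here as common examples:

```python
from scipy import stats

critical_values = {}
for alpha in (0.10, 0.05, 0.01):
    two_tailed = stats.norm.ppf(1 - alpha / 2)  # upper cutoff of the +/- pair
    one_tailed = stats.norm.ppf(1 - alpha)      # right-tailed cutoff
    critical_values[alpha] = (two_tailed, one_tailed)
    print(f"alpha = {alpha:.2f}: two-tailed = +/-{two_tailed:.3f}, "
          f"one-tailed (right) = {one_tailed:.3f}")
```

For a left-tailed test, the critical value is the negative of the right-tailed one, by symmetry of the standard normal.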

When NOT to Use Z-Tests

When $\sigma$ is unknown and $n$ is small: Use t-tests. Using Z with an estimated standard deviation inflates your Type I error rate: with samples of roughly 6 to 15 observations, your stated 5% false positive rate becomes about 7-10% in practice.

For paired data: Use paired t-tests. Treating dependent observations as independent violates the independence assumption.

For non-normal data with small samples: The CLT approximation is not reliable at small $n$. Consider non-parametric alternatives.
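The Type I error inflation from pairing an estimated standard deviation with Z critical values can be checked by simulation. The sketch below assumes Normal data, a sample size of 10, and a fixed seed; all three are illustrative choices rather than part of the latency example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

mu_0, sigma, n = 120.0, 35.0, 10  # small sample; H0 is true by construction
n_sims = 20_000

samples = rng.normal(mu_0, sigma, size=(n_sims, n))
x_bar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)   # sample std, estimated separately per sample

# Same statistic in both cases; only the critical value differs.
stat = (x_bar - mu_0) / (s / np.sqrt(n))

z_crit = stats.norm.ppf(0.975)
t_crit = stats.t.ppf(0.975, df=n - 1)

rate_z = np.mean(np.abs(stat) > z_crit)  # Z cutoff with estimated s
rate_t = np.mean(np.abs(stat) > t_crit)  # t cutoff with n - 1 degrees of freedom

print(f"false positive rate, Z cutoff: {rate_z:.3f}")
print(f"false positive rate, t cutoff: {rate_t:.3f}")
```

The t cutoff should hold the false positive rate near 5%, while the Z cutoff should come out noticeably higher, since it ignores the extra variability from estimating the standard deviation.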

Relationship to Confidence Intervals

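A $(1 - \alpha)$ confidence interval for the mean:

$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} = 108 \pm 1.96 \times 4.375 = [99.43, 116.57] \text{ ms}$$

Since $\mu_0 = 120$ ms is not in $[99.43, 116.57]$, we reject $H_0$, consistent with the test result. The interval adds information: the true mean is plausibly anywhere from 99 to 117 ms, not necessarily exactly 108 ms.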
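The agreement between the two-tailed test and the confidence interval is not a coincidence: the interval excludes the null mean exactly when the p-value falls below the significance level. The sketch below checks this for a handful of sample means; apart from 108, the means are hypothetical values chosen for illustration.

```python
import numpy as np
from scipy import stats

mu_0, sigma, n, alpha = 120.0, 35.0, 64, 0.05
se = sigma / np.sqrt(n)
z_crit = stats.norm.ppf(1 - alpha / 2)

results = []
for x_bar in (100.0, 108.0, 112.0, 118.0, 125.0, 130.0):  # hypothetical means
    p_value = 2 * stats.norm.sf(abs(x_bar - mu_0) / se)
    lower, upper = x_bar - z_crit * se, x_bar + z_crit * se
    reject = bool(p_value < alpha)
    ci_excludes_mu_0 = not (lower <= mu_0 <= upper)
    results.append((x_bar, reject, ci_excludes_mu_0))
    print(f"x_bar = {x_bar:5.1f}  CI = [{lower:6.2f}, {upper:6.2f}]  "
          f"p = {p_value:.4f}  reject = {reject}")
```

In every row, the reject decision and the "interval excludes 120" flag should match.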

The Z-test builds on the CLT (post 1) to justify Normal sampling distributions for means. It is the purest form of the hypothesis testing procedure introduced in post 3. The t-test (posts 6 and 7) is a direct generalization that handles unknown $\sigma$; the comparison is developed in post 8. Power analysis connects to Type I and Type II errors (post 9), where the tradeoff between $\alpha$ and $\beta$ is explored in full. The confidence interval for this example is developed in detail in post 11.

Honest Limitations

The Z-test's main limitation in practice is the rarity of truly known $\sigma$. Even production logs from millions of requests can have non-stationarities: latency distributions shift with load, software updates, and infrastructure changes. If $\sigma$ itself has changed since your historical baseline, the Z-test uses the wrong denominator. Always validate that your historical $\sigma$ is still representative before treating it as known.

Test Your Understanding

  1. You change the cache implementation and now have samples with ms. Using ms from history, test at . What is the Z statistic and decision?
  2. Calculate the power of the original test ($n = 64$, $\sigma = 35$, true mean 108 ms) for a one-tailed test at $\alpha = 0.05$. Is it higher or lower than the two-tailed power of 78%?
  3. A colleague argues that since the sample is large enough ($n = 64$), you could use the sample standard deviation $s$ instead of the known $\sigma$ and still use Z-critical values. Is this correct? When exactly does this approximation become acceptable?
  4. The 95% CI for latency after caching was $[99.43, 116.57]$ ms. The operations team requires latency below 110 ms to meet SLA. Does the confidence interval guarantee the SLA is met?
  5. You run the same Z-test on a one-tailed basis ($H_1: \mu < 120$) because you only care about latency decreases. How do the critical value and p-value change? When is this one-tailed framing justified?
