Standard Normal Distribution and Z-Scores

Apr 11, 2026•13 min read•By Mohammed Vasim

Statistics • Math • Data Science

Before computers, computing probabilities for any Normal distribution required numerical integration. Maintaining a separate integral table for every possible (μ, σ) pair was impractical. The solution: transform any Normal(μ, σ²) to the single canonical N(0, 1) via the z-score, and maintain one table. Today the z-score has three distinct uses in ML/DS — probability calculations, feature scaling, and outlier detection — all built on the same transformation.

The DS/ML Anchor

Prediction errors (residuals) from a regression model on house prices. Over 500 test predictions: residuals ~ Normal(μ=0, σ=18.4), in thousands of dollars. μ=0 because the model is unbiased; σ=18.4 means typical prediction error ≈ $18,400.

Questions: What fraction of predictions are off by more than $30k? What residual corresponds to the 95th percentile?

The Z-Score Transformation

$Z = \frac{X - μ}{σ}$

If X ~ Normal(μ, σ²), then Z ~ Normal(0, 1).

Algebraic proof:

E[Z] = E[(X−μ)/σ] = (E[X] − μ)/σ = (μ − μ)/σ = 0 ✓

Var[Z] = Var[(X−μ)/σ] = Var[X]/σ² = σ²/σ² = 1 ✓

Since a linear transformation of a Normal is Normal, Z ~ Normal(0, 1). The shape is identical — only the axis scale changes.
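A quick simulation can make the proof concrete. The sketch below assumes NumPy is available; the seed and sample size are arbitrary illustrative choices, and the data are synthetic draws rather than the anchor's actual residuals. It standardizes draws from Normal(0, 18.4) with the known μ and σ and checks that the result has mean near 0 and standard deviation near 1.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed, for reproducibility only
mu, sigma = 0.0, 18.4                 # anchor parameters, in $ thousands

x = rng.normal(loc=mu, scale=sigma, size=100_000)  # synthetic residuals
z = (x - mu) / sigma                               # z-score transformation

print(f"mean(X) = {x.mean():.3f}, sd(X) = {x.std():.3f}")
print(f"mean(Z) = {z.mean():.3f}, sd(Z) = {z.std():.3f}")
```

The sample mean and SD of Z will not be exactly 0 and 1, but they should sit within sampling noise of those values.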

Applied to anchor: a residual of −35 (model overestimated by $35k):

Z = (−35 − 0) / 18.4 = −1.902

Interpretation: this prediction error is 1.902 standard deviations below the mean error.

The Standard Normal PDF

$ϕ(z) = \frac{1}{\sqrt{2π}} e^{-z^{2}/2}$

Each component:

Component	Value at z=0	Role
`1/√(2π)`	0.399	Normalization: total area = 1
`e^{-z²/2}`	1	Gaussian decay: falls rapidly as |z| grows

At z=0: φ(0) = 1/√(2π) = 0.399 (peak of the bell)

At z=3: φ(3) = 0.399 × e^{-4.5} = 0.0044 (barely visible in the tails)

By symmetry: φ(−1.902) = φ(1.902) — points equidistant from zero have equal density.
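The density values quoted above are easy to reproduce with the standard library alone. This is a minimal sketch; the function name `phi` is just an illustrative choice.

```python
import math

def phi(z: float) -> float:
    """Standard normal density: exp(-z^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

print(f"phi(0)      = {phi(0):.4f}")       # peak of the bell
print(f"phi(3)      = {phi(3):.4f}")       # far into the tail
print(f"phi(-1.902) = {phi(-1.902):.4f}")  # symmetric pair
print(f"phi(1.902)  = {phi(1.902):.4f}")
```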

The Standard Normal CDF

$Φ (z) = P (Z \leq z) = \int_{- \infty}^{z} ϕ (t) d t$

The CDF cannot be written in closed form — the Gaussian integral has no antiderivative in elementary functions. It is computed numerically via the error function: Φ(z) = (1/2)[1 + erf(z/√2)].
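The error-function identity translates directly into code. The sketch below uses only `math.erf` from the standard library; `std_normal_cdf` is a hypothetical helper name, not a library function.

```python
import math

def std_normal_cdf(z: float) -> float:
    """Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (-1.96, 0.0, 1.645, 1.96, 2.576):
    print(f"Phi({z:+.3f}) = {std_normal_cdf(z):.4f}")
```

In practice a library routine such as `scipy.stats.norm.cdf` does the same job; the point here is only that Φ needs nothing more exotic than erf.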

Key values to know:

z	Φ(z)	Use
−2.576	0.005	—
−1.960	0.025	—
−1.645	0.05	—
0	0.500	By symmetry
1.645	0.950	One-tail α=0.05 critical value
1.960	0.975	Two-tail α=0.05 critical value
2.576	0.995	Two-tail α=0.01 critical value

The hypothesis test critical values z*=1.645 and z*=1.960 come directly from Φ⁻¹. When you see "z=1.96 for a 95% confidence interval," that's the inverse CDF: Φ(1.96) = 0.975, so there's 2.5% in each tail.

Probability Calculations — Four Query Types

Query 1 — P(X < a): left tail

P(residual < −35) = P(Z < −1.902) = Φ(−1.902) = 1 − Φ(1.902) = 1 − 0.971 = 0.029

2.9% of predictions overestimate price by more than $35k.

Query 2 — P(X > a): right tail

P(residual > 30) = P(Z > 30/18.4) = P(Z > 1.630) = 1 − Φ(1.630) = 1 − 0.948 = 0.052

5.2% of predictions underestimate price by more than $30k.

Query 3 — P(a < X < b): middle interval

P(−20 < residual < 20) = Φ(20/18.4) − Φ(−20/18.4) = Φ(1.087) − Φ(−1.087) = 0.8615 − 0.1385 = 0.723

72.3% of predictions are within ±$20k of the true price.

Query 4 — Inverse: find x given P(X ≤ x) = q

What residual is at the 95th percentile?

z = Φ⁻¹(0.95) = 1.645 → x = μ + z×σ = 0 + 1.645×18.4 = $30.3k

95% of errors underestimate price by less than $30.3k (or overestimate by any amount).

The Empirical Rule (68-95-99.7)

Since μ=0 and σ=1 in the standard normal, the ±kσ bands are simply ±k:

P(−1 < Z < 1) = Φ(1) − Φ(−1) = 2Φ(1) − 1 = 2(0.84134) − 1 = 0.6827

P(−2 < Z < 2) = Φ(2) − Φ(−2) = 2(0.97725) − 1 = 0.9545

P(−3 < Z < 3) = Φ(3) − Φ(−3) = 2(0.99865) − 1 = 0.9973

Applied to the anchor (each band is ±kσ = ±k × 18.4, in $ thousands):

Band	Dollar range	Fraction of predictions
±1σ	±$18.4k	68.3% within ±$18,400
±2σ	±$36.8k	95.5% within ±$36,800
±3σ	±$55.2k	99.7% within ±$55,200

This rule is exact only for Normal data. For non-Normal distributions, Chebyshev's inequality gives a weaker but universally valid bound: P(|X−μ| > kσ) ≤ 1/k². At k=2: at most 25% beyond ±2σ (versus the 4.55% from the empirical rule for Normal).
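To see how much looser Chebyshev's bound is, the sketch below tabulates the exact two-sided Normal tail 2(1 − Φ(k)) next to 1/k² for a few values of k. It relies on `statistics.NormalDist` from the standard library (Python 3.8+); the particular k values are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

print(" k   Normal tail   Chebyshev bound")
for k in (1.5, 2, 3, 4):
    normal_tail = 2 * (1 - std_normal.cdf(k))  # exact P(|Z| > k) under Normality
    chebyshev_bound = 1 / k**2                 # holds for any finite-variance distribution
    print(f"{k:>3}   {normal_tail:11.5f}   {chebyshev_bound:15.5f}")
```

The gap widens quickly with k, which is the price of a guarantee that makes no distributional assumption.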

Z-Scores for Feature Scaling in ML

The same z-score transformation applied to each feature:

$z_{ij} = \frac{x_{ij} - \bar{x}_{j}^{(train)}}{s_{j}^{(train)}}$

Why it helps gradient descent: features on different scales (house size in sq ft, price in $M, age in years) cause elongated loss contours. Gradient descent oscillates along the steep axes and moves slowly along the flat ones. After standardization, the loss surface is more spherical and convergence is faster.
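One way to quantify "elongated contours" is the condition number of the loss Hessian. For a least-squares loss without an intercept, the Hessian is XᵀX/n, so its condition number can be compared before and after standardization. The sketch below is illustrative only: the two features, their distributions, and the seed are made-up assumptions, and `hessian_condition_number` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(seed=0)      # arbitrary seed
n = 500

# Hypothetical synthetic features on very different scales
size_sqft = rng.normal(2000, 400, n)
age_years = rng.normal(30, 10, n)
X_raw = np.column_stack([size_sqft, age_years])

# Standardize each column with its own mean and standard deviation
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def hessian_condition_number(X: np.ndarray) -> float:
    """Condition number of X^T X / n, the Hessian of (1/2n) * ||Xw - y||^2."""
    return float(np.linalg.cond(X.T @ X / len(X)))

cond_raw = hessian_condition_number(X_raw)
cond_std = hessian_condition_number(X_std)
print(f"condition number, raw features:          {cond_raw:,.1f}")
print(f"condition number, standardized features: {cond_std:,.2f}")
```

A condition number near 1 corresponds to nearly circular contours; a large one means the step size that is safe along the steep direction is far too small along the flat one.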

Critical implementation detail: fit the scaler on training data only. Apply the same μ and σ to validation and test sets. Fitting on test data is data leakage.

python

from sklearn.preprocessing import StandardScaler
import numpy as np

X_train = np.array([[2100, 450000], [1800, 380000], [2400, 520000]])
X_test  = np.array([[1950, 415000]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on train, then transform
X_test_scaled  = scaler.transform(X_test)          # transform only — no fit on test
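For intuition, the same computation can be written by hand with NumPy. This sketch reuses the arrays from the example above and assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; the variable names `train_mean` and `train_std` are illustrative.

```python
import numpy as np

X_train = np.array([[2100, 450000], [1800, 380000], [2400, 520000]], dtype=float)
X_test = np.array([[1950, 415000]], dtype=float)

# Statistics come from the training set only
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)    # ddof=0 (population SD)

X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std   # reuse training statistics

print("train means after scaling:", X_train_scaled.mean(axis=0))
print("train SDs after scaling:  ", X_train_scaled.std(axis=0))
print("scaled test row:          ", X_test_scaled)
```

Note that the scaled test row is not guaranteed to have mean 0 or SD 1; it is simply expressed in units of the training distribution.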

Z-Scores for Outlier Detection

Three-sigma rule: flag any observation with |Z| > 3 as a potential outlier.

Applied to anchor: a residual with Z = −4.5 corresponds to x = 0 + (−4.5)×18.4 = −$82.8k. The model overestimated by $82,800, which is unusual enough to investigate.

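Limitation: z-score outlier detection assumes Normality. For right-skewed data (latency, salary), the |Z| > 3 threshold no longer corresponds to a 0.27% tail: ordinary values in the long right tail get flagged as outliers, while unusually low values in the compressed left tail rarely reach Z < −3 and go undetected.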

Modified z-score (robust alternative using median and MAD):

$M_{i} = \frac{0.6745 \times (x_{i} - \text{median})}{\text{MAD}}$

Flag |M| > 3.5. More robust because it uses the median and median absolute deviation rather than mean and SD — it's resistant to the outliers it's trying to detect.
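A small comparison shows why robustness matters. The sample below is made up for illustration (six ordinary values plus one planted outlier, 95), and the sketch assumes the MAD is nonzero; a real implementation would need to handle MAD = 0.

```python
import numpy as np

# Made-up sample; 95 is the planted outlier
x = np.array([10, 12, 11, 13, 12, 11, 95], dtype=float)

# Classic z-score: mean and SD are both pulled toward the outlier
z = (x - x.mean()) / x.std()

# Modified z-score: median and MAD barely move
median = np.median(x)
mad = np.median(np.abs(x - median))   # assumed nonzero for this sample
m = 0.6745 * (x - median) / mad

print("z-scores:          ", np.round(z, 2))
print("modified z-scores: ", np.round(m, 2))
print("flagged by |Z| > 3:  ", x[np.abs(z) > 3])
print("flagged by |M| > 3.5:", x[np.abs(m) > 3.5])
```

With so few points, the outlier inflates the SD enough that its own z-score stays below 3, while the modified z-score flags it easily.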

Critical Values Table

p (percentile)	z = Φ⁻¹(p)	Common use
1%	−2.326	—
2.5%	−1.960	—
5%	−1.645	—
50%	0	—
95%	1.645	One-tailed α=0.05 test
97.5%	1.960	Two-tailed α=0.05 test (95% CI)
99%	2.326	One-tailed α=0.01 test
99.5%	2.576	99% CI

The hypothesis test critical value z*=1.96 comes from Φ(1.96) = 0.975 — leaving 2.5% in each tail for a total α=0.05 two-tailed test.

Probability Query Cheatsheet

Query	Formula	Anchor result
P(X < a)	Φ((a−μ)/σ)	P(res < −35k) = 0.029
P(X > a)	1 − Φ((a−μ)/σ)	P(res > 30k) = 0.052
P(a < X < b)	Φ((b−μ)/σ) − Φ((a−μ)/σ)	P(−20k < res < 20k) = 0.723
pth percentile	μ + Φ⁻¹(p) × σ	95th percentile = 30.3k

Code

python

from scipy import stats
import numpy as np

mu, sigma = 0, 18.4  # residuals ~ Normal(μ=0, σ=18.4), in $ thousands

# Z-score transformation
residual = -35
z = (residual - mu) / sigma
print(f"Residual={residual}k, Z-score={z:.3f}")

# Probability queries
print(f"\nP(residual < -35k) = P(Z < {z:.3f}) = {stats.norm.cdf(z):.4f}")
print(f"P(residual > 30k)  = P(Z > {30/sigma:.3f}) = {1-stats.norm.cdf(30/sigma):.4f}")
print(f"P(-20k < res < 20k)= P({-20/sigma:.3f} < Z < {20/sigma:.3f}) = {stats.norm.cdf(20/sigma) - stats.norm.cdf(-20/sigma):.4f}")

# Inverse: 95th percentile
p95 = stats.norm.ppf(0.95, loc=mu, scale=sigma)
print(f"\n95th percentile of residuals: {p95:.2f}k")
print(f"  (Z = {stats.norm.ppf(0.95):.3f})")

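# Empirical rule
print()  # blank line before the empirical-rule block
for k in [1, 2, 3]:
    prob = stats.norm.cdf(k) - stats.norm.cdf(-k)
    # :.1f avoids float artifacts such as 3 * 18.4 -> 55.199999999999996
    print(f"±{k}σ (±{k*sigma:.1f}k): {prob:.4f} ({prob*100:.2f}% of predictions)")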

# Key critical values
print("\nCritical values (inverse CDF):")
for p, label in [(0.95, "one-tail α=0.05"), (0.975, "two-tail α=0.05"), (0.995, "two-tail α=0.01")]:
    print(f"  Φ⁻¹({p}) = {stats.norm.ppf(p):.3f} → {label}")

text

Residual=-35k, Z-score=-1.902

P(residual < -35k) = P(Z < -1.902) = 0.0286
P(residual > 30k)  = P(Z > 1.630) = 0.0515
P(-20k < res < 20k)= P(-1.087 < Z < 1.087) = 0.7229

95th percentile of residuals: 30.27k
  (Z = 1.645)

±1σ (±18.4k): 0.6827 (68.27% of predictions)
±2σ (±36.8k): 0.9545 (95.45% of predictions)
±3σ (±55.2k): 0.9973 (99.73% of predictions)

Critical values (inverse CDF):
  Φ⁻¹(0.95) = 1.645 → one-tail α=0.05
  Φ⁻¹(0.975) = 1.960 → two-tail α=0.05
  Φ⁻¹(0.995) = 2.576 → two-tail α=0.01

Related Concepts

Normal distribution: the general N(μ, σ²) that the z-score standardizes
t-distribution: replaces z when σ is unknown and estimated from a small sample; converges to N(0,1) as n grows
Confidence intervals: the ±1.96 in "mean ± 1.96 × SE" is Φ⁻¹(0.975) — z-score of the 97.5th percentile
Hypothesis testing: test statistics are z-scores of sample statistics under the null hypothesis

Limitations

Assumes Normality. Standardizing non-normal data doesn't make it normal. Z-score probabilities and outlier thresholds require that the underlying distribution is (or approximates) Normal.
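σ unknown in practice. When σ is estimated from Normal data, the standardized sample mean (x̄ − μ)/(s/√n) follows a t-distribution with n − 1 degrees of freedom, not N(0,1). Use t-distribution critical values whenever σ is estimated; the gap from the z values matters most for small samples (n < 30 is the usual rule of thumb).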
Mean and SD are not robust. Outliers inflate σ and shift x̄, which can make the z-score of an outlier smaller, not larger — the outlier appears less extreme after it inflates σ. The modified z-score (using median/MAD) avoids this.
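The second limitation can be made concrete by comparing critical values. The sketch below assumes SciPy is installed and uses `scipy.stats.t.ppf` with n − 1 degrees of freedom; the sample sizes are illustrative.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)   # two-tailed alpha = 0.05 under N(0, 1)
print(f"z critical value: {z_crit:.3f}")

for n in (5, 10, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, df=n - 1)   # same tail area, sigma estimated from n points
    print(f"n = {n:>4}: t critical value = {t_crit:.3f} (ratio to z: {t_crit / z_crit:.3f})")
```

The t critical value is always larger than 1.96 and shrinks toward it as n grows, so using z with a small sample produces intervals that are too narrow.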

Test Your Understanding

A prediction error of +$50k has Z = 50/18.4 = 2.72. Compute P(residual > 50k) and P(|residual| > 50k). Interpret both probabilities operationally.
The team sets a quality gate: flag any prediction with |residual| > $40k for manual review. What fraction of predictions will be flagged under the Normal(0, 18.4) model?
You are standardizing features for a neural network. You fit a StandardScaler on training data (mean=300, std=75) and apply it to a test point with value 420. What is the z-score? Why would fitting the scaler on all data (train + test) be a mistake?
A colleague claims "the 95% confidence interval formula x̄ ± 1.96 × SE is just an arbitrary rule." Explain where 1.96 comes from using the standard normal CDF, and under what conditions the formula is valid.
Residuals from a new model are right-skewed (many large positive errors, few large negative ones). A standard z-score analysis flags 0.4% of predictions as outliers. Is this count likely to be too high, too low, or about right compared to what you'd expect for actual outliers in this data? What alternative would you use?
