Normal/Gaussian Distribution
When you average enough independent random things (test scores, measurement errors, model predictions averaged across random seeds), the result tends toward the same shape regardless of the original distribution, provided that distribution has finite variance. That shape is the Normal distribution. This convergence property, the Central Limit Theorem, is why the bell curve appears so widely and why so much of statistical inference is built on Normal assumptions. Understanding not just how to use the Normal, but why it emerges and when to trust it, is one of the most consequential things you can learn in statistics.
The DS/ML anchor
Throughout this post we'll work with model accuracy scores across experiments. A team trains the same neural network architecture 80 times with different random seeds and records the validation accuracy for each run. Historically, these accuracy scores have mean μ = 0.847 and standard deviation σ = 0.031. Let accuracy_score ~ N(0.847, 0.031²).
The Mathematics
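The Normal distribution's probability density function:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Notation: X ~ N(μ, σ²)
With actual values substituted for our accuracy scores:
$$f(x) = \frac{1}{0.031\sqrt{2\pi}} \exp\left(-\frac{(x - 0.847)^2}{2 \times 0.031^2}\right)$$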
The distribution is symmetric around μ = 0.847, with the classic bell shape tapering off toward 0.75 and 0.95.
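To make the density concrete, here is a minimal sketch that evaluates the Normal PDF by hand and compares it with Python's standard-library `statistics.NormalDist`. The helper name `normal_pdf` and the 0.05 offset are illustrative choices, not part of the original analysis.

```python
import math
from statistics import NormalDist

MU, SIGMA = 0.847, 0.031  # historical mean and standard deviation from the post

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Evaluate the Normal density directly from its formula."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

accuracy = NormalDist(MU, SIGMA)

# Density at the mean: the peak of the bell curve.
peak = normal_pdf(MU, MU, SIGMA)

# Density at equal distances on either side of the mean (illustrative offset).
left = normal_pdf(MU - 0.05, MU, SIGMA)
right = normal_pdf(MU + 0.05, MU, SIGMA)

print(f"density at the mean : {peak:.3f}")
print(f"density at mu - 0.05: {left:.3f}")
print(f"density at mu + 0.05: {right:.3f}")
print(f"library value at mu : {accuracy.pdf(MU):.3f}")
```

Because σ is small, the density at the mean comes out well above 1. That is expected: a density is not a probability, and only areas under the curve have to stay at or below 1.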
The 68-95-99.7 Rule
Applied to accuracy scores:
- 68% of runs produce accuracy between 0.816 and 0.878 (μ ± σ)
- 95% fall between 0.785 and 0.909 (μ ± 2σ)
- 99.7% fall between 0.754 and 0.940 (μ ± 3σ)
A run producing accuracy below 0.75 would be statistically extraordinary — more than 3σ below the mean.
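The intervals above can be checked numerically. The following sketch uses the standard-library `statistics.NormalDist` with the post's μ and σ; the helper name `coverage` is an illustrative choice.

```python
from statistics import NormalDist

MU, SIGMA = 0.847, 0.031  # historical mean and standard deviation from the post
accuracy = NormalDist(MU, SIGMA)

def coverage(k: float) -> float:
    """Probability that a run lands within k standard deviations of the mean."""
    return accuracy.cdf(MU + k * SIGMA) - accuracy.cdf(MU - k * SIGMA)

for k in (1, 2, 3):
    low, high = MU - k * SIGMA, MU + k * SIGMA
    print(f"mu +/- {k} sigma: [{low:.3f}, {high:.3f}] covers {coverage(k):.4f}")

# How rare is a run below 0.75 under this model?
p_below_075 = accuracy.cdf(0.75)
print(f"P(accuracy < 0.75) = {p_below_075:.5f}")
```

The rule's round numbers are approximations of 68.27%, 95.45%, and 99.73%, which is what the coverage calculation returns.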
CDF
The CDF F(x) = P(accuracy_score ≤ x) tells you what fraction of runs fall at or below a given threshold. This is what you query when evaluating whether a new training run is "good enough."
Trace Table: Accuracy Score Calculations
With μ = 0.847, σ = 0.031:
| Phase | Formula | Values | Result |
|---|---|---|---|
| Z-score for 0.90 | (x − μ) / σ | (0.90 − 0.847) / 0.031 | Z = 1.710 |
| P(accuracy ≤ 0.90) | Φ(Z = 1.710) | from standard normal table | 0.9564 |
| P(accuracy > 0.90) | 1 − Φ(1.710) | 1 − 0.9564 | 0.0436 |
| 95th percentile | μ + 1.645σ | 0.847 + 1.645 × 0.031 | 0.898 |
About 4.4% of runs exceed 0.90 accuracy. The 95th percentile run achieves 0.898.
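If you'd rather not read a standard normal table, the same four quantities can be computed directly. This is a sketch using the standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

MU, SIGMA = 0.847, 0.031  # historical mean and standard deviation from the post
standard = NormalDist()            # standard Normal: mean 0, standard deviation 1
accuracy = NormalDist(MU, SIGMA)   # the accuracy-score model

threshold = 0.90

# Row 1: Z-score of the threshold.
z = (threshold - MU) / SIGMA

# Row 2: P(accuracy <= 0.90) via the standard Normal CDF.
p_at_or_below = standard.cdf(z)

# Row 3: P(accuracy > 0.90) is the complement.
p_above = 1.0 - p_at_or_below

# Row 4: 95th percentile via the inverse CDF.
pct_95 = accuracy.inv_cdf(0.95)

print(f"Z-score for 0.90    : {z:.3f}")
print(f"P(accuracy <= 0.90) : {p_at_or_below:.4f}")
print(f"P(accuracy > 0.90)  : {p_above:.4f}")
print(f"95th percentile     : {pct_95:.3f}")
```

Small differences in the fourth decimal place relative to the table are expected, because the table lookup rounds Z to 1.710 before reading off Φ.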
CLT Connection
Why might accuracy scores follow a Normal distribution? Each run involves hundreds of mini-batch gradient updates, each introducing small random perturbations from initialization and data ordering. To the extent these perturbations are roughly independent, have finite variance, and combine additively over training, the Central Limit Theorem suggests that their aggregate effect, the final accuracy, is approximately Normal regardless of the shape of the individual perturbation distributions. This is a heuristic argument rather than a guarantee, which is why the assumption gets checked against data later in the post.
Here's the CLT demonstrated concretely: even if individual batch losses follow a right-skewed distribution, the average loss over 100 batches approaches Normal.
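A small simulation makes this visible. The sketch below assumes, purely for illustration, that individual batch losses are exponentially distributed with mean 1.0 (a convenient right-skewed stand-in, not a claim about real training losses), and compares the skewness of single losses with the skewness of 100-batch averages. The helper `skewness` and the constant `N_AVERAGES` are illustrative choices.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the mean cubed z-score (near 0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

rng = random.Random(42)  # fixed seed so the sketch is repeatable

N_AVERAGES = 2000  # number of simulated averages (assumed for illustration)
BATCHES = 100      # batches per average, as in the text

# Hypothetical right-skewed batch losses: exponential with mean 1.0.
single_losses = [rng.expovariate(1.0) for _ in range(N_AVERAGES)]

# Each entry is the average of 100 such batch losses.
averaged_losses = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(BATCHES)])
    for _ in range(N_AVERAGES)
]

skew_single = skewness(single_losses)
skew_averaged = skewness(averaged_losses)

print(f"skewness of single batch losses : {skew_single:.2f}")
print(f"skewness of 100-batch averages  : {skew_averaged:.2f}")
print(f"std dev of 100-batch averages   : {statistics.pstdev(averaged_losses):.3f}")
```

The single losses stay strongly right-skewed, while the averages are close to symmetric. Their spread also shrinks to roughly 1/√100 of the original standard deviation, which is the other half of what the CLT predicts.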
Shapiro-Wilk Normality Check
Before relying on the Normal model for accuracy scores, you should verify the assumption. The Shapiro-Wilk test checks whether a sample could plausibly have come from a Normal distribution.
import numpy as np
from scipy import stats

# Simulate 80 validation-accuracy runs from the historical N(0.847, 0.031²) model.
np.random.seed(42)
accuracy_scores = np.random.normal(loc=0.847, scale=0.031, size=80)
# Keep scores inside a plausible accuracy range (roughly μ ± 3σ).
accuracy_scores = np.clip(accuracy_scores, 0.75, 0.95)

# Shapiro-Wilk: the null hypothesis is that the sample comes from a Normal distribution.
stat, p_value = stats.shapiro(accuracy_scores)
print(f"Shapiro-Wilk statistic : {stat:.4f}")
print(f"p-value : {p_value:.4f}")
if p_value > 0.05:
    print("Cannot reject normality (p > 0.05): Normal model is reasonable.")
else:
    print("Reject normality (p <= 0.05): consider a non-Normal model.")

# Estimate the parameters from the sample (ddof=1 gives the sample standard deviation).
mu_hat = accuracy_scores.mean()
sigma_hat = accuracy_scores.std(ddof=1)
print(f"\nEstimated mu = {mu_hat:.4f}")
print(f"Estimated sigma = {sigma_hat:.4f}")

# Tail probability and percentile under the fitted Normal model.
p_above_90 = 1 - stats.norm.cdf(0.90, mu_hat, sigma_hat)
print(f"P(accuracy > 0.90) = {p_above_90:.4f}")
pct_95 = stats.norm.ppf(0.95, mu_hat, sigma_hat)
print(f"95th percentile = {pct_95:.4f}")

Output:
Shapiro-Wilk statistic : 0.9892
p-value : 0.7643
Cannot reject normality (p > 0.05): Normal model is reasonable.

Estimated mu = 0.8472
Estimated sigma = 0.0308
P(accuracy > 0.90) = 0.0426
95th percentile = 0.8979
A p-value of 0.76 means the test found no evidence against normality. That is weaker than proof that the scores are Normal, but it makes the Normal model a reasonable working assumption. If the p-value were below 0.05, you'd investigate whether the scores are bimodal (some runs converged, some didn't) or skewed toward a floor.
Related Concepts
The Normal distribution builds on the PDF and CDF introduced in the first post — it is the most important continuous distribution you'll encounter. It connects directly to the Central Limit Theorem, which was hinted at in the types-of-distributions post and is the reason the Normal appears in so many natural phenomena. Understanding Normal is a prerequisite for the standard normal and z-score post that follows, for t-tests and ANOVA (which assume normally-distributed residuals), for Gaussian process regression, and for variational autoencoders (which use the Normal distribution as the latent space prior). The 68-95-99.7 rule is also the implicit foundation behind "3-sigma control limits" in process monitoring.
Honest Limitations
The Normal distribution doesn't fit everything. Financial returns have fat tails: extreme events happen far more often than the Normal predicts, which caused Normal-based risk models to catastrophically underestimate losses in 2008. Accuracy scores have natural bounds at 0 and 1, yet a Normal model assigns nonzero probability to accuracy > 1 or accuracy < 0, which is impossible. When σ is small relative to the mean's distance from the nearest boundary, this is a minor concern; for models near the boundaries, use a Beta distribution instead.
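To see how much the boundary matters, you can measure the probability mass a Normal model places above 1.0. The sketch below compares the post's model with a hypothetical model whose mean accuracy is 0.97 and whose σ is the same; that second model is invented here purely for illustration.

```python
from statistics import NormalDist

def mass_above_one(mu: float, sigma: float) -> float:
    """Probability a Normal(mu, sigma) model assigns to impossible accuracies > 1."""
    return 1.0 - NormalDist(mu, sigma).cdf(1.0)

# The post's model: the boundary is about 4.9 sigma away from the mean.
p_post_model = mass_above_one(0.847, 0.031)

# Hypothetical model near the boundary (mean 0.97 is an assumed value).
p_near_boundary = mass_above_one(0.97, 0.031)

print(f"P(accuracy > 1), mu = 0.847: {p_post_model:.2e}")
print(f"P(accuracy > 1), mu = 0.970: {p_near_boundary:.3f}")
```

For the post's model the impossible region is negligible. For the hypothetical near-boundary model, a substantial share of the probability sits above 1, which is the situation where a bounded distribution such as the Beta is the better fit.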
Any time your data is clearly skewed, multimodal, or bounded in ways that conflict with the Normal's infinite-range, symmetric shape, resist the temptation to apply it because it's convenient. Run Shapiro-Wilk, inspect a Q-Q plot, and choose accordingly.
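A Q-Q plot compares the sorted sample against the quantiles a Normal distribution would produce, and points close to a straight line suggest a good fit. As a plot-free sketch of the same idea, the code below computes the correlation between the two sets of quantiles. The helper name `qq_correlation`, the plotting positions (i − 0.5)/n (one common convention among several), and both simulated samples are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

def qq_correlation(sample):
    """Correlation between sorted data and theoretical standard Normal quantiles."""
    n = len(sample)
    ordered = sorted(sample)
    # Plotting positions (i - 0.5) / n: one common convention, assumed here.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    mean_t, mean_o = fmean(theoretical), fmean(ordered)
    cov = sum((t - mean_t) * (o - mean_o) for t, o in zip(theoretical, ordered))
    var_t = sum((t - mean_t) ** 2 for t in theoretical)
    var_o = sum((o - mean_o) ** 2 for o in ordered)
    return cov / math.sqrt(var_t * var_o)

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Simulated stand-ins: 80 Normal scores and 80 right-skewed values.
normal_like = [rng.gauss(0.847, 0.031) for _ in range(80)]
skewed = [rng.expovariate(1.0) for _ in range(80)]

r_normal = qq_correlation(normal_like)
r_skewed = qq_correlation(skewed)

print(f"Q-Q correlation, Normal sample: {r_normal:.3f}")
print(f"Q-Q correlation, skewed sample: {r_skewed:.3f}")
```

A correlation very close to 1 corresponds to points hugging the Q-Q line, while a visibly lower value flags curvature. The picture itself is still worth drawing, because it shows where the fit breaks down (tails, floor, or a second mode).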
Test Your Understanding
- Accuracy scores across 80 runs have μ = 0.847, σ = 0.031. What is the probability that a randomly selected run has accuracy above 0.91? Compute both the Z-score and the probability.
- The team sets a deployment threshold: only models with accuracy above the 90th percentile of historical runs get deployed. What is that threshold?
- A new set of 30 runs shows mean accuracy 0.862. Does this deviate meaningfully from the historical mean? What test would you use, and what Normal distribution assumptions does it require?
- Explain why the Central Limit Theorem implies that average batch losses during training should become approximately Normal even when individual batch losses are not. What condition is needed for CLT to apply?
- A colleague runs Shapiro-Wilk on 10 accuracy scores and gets p = 0.03. They conclude "accuracy scores are not Normal." What problems exist with this conclusion, and what would you recommend instead?