Normal/Gaussian Distribution

Apr 11, 2026•11 min read•By Mohammed Vasim

Statistics • Math • Data Science

Most statistical inference — hypothesis testing, confidence intervals, regression diagnostics — invokes the Normal distribution at some point. Before learning how to use it, it's worth asking: why does this particular distribution show up everywhere?

Why the Normal Distribution Is Central

Three independent reasons, not just one:

1. Central Limit Theorem. The standardized sum (or mean) of many independent, identically distributed random variables with finite variance converges in distribution to a Normal, whatever the shape of the original distribution. Since most statistics we compute (sample means, regression coefficients, log-likelihood ratios) are sums or means, their sampling distributions are approximately Normal in large samples.

2. Maximum entropy. Among all continuous distributions with a fixed mean and variance, the Normal distribution has the highest entropy — it makes the fewest additional assumptions about the data. If you know only μ and σ, the Normal is the least-committal distribution.

3. Mathematical tractability. Sums of independent normals are normal. Linear transformations of normals are normal. Conditional distributions in multivariate normal are normal. These closure properties make analytic inference possible in ways that other distributions don't allow.

The Normal is not "normal" because data is typically normally distributed — much real data isn't. It's central because statistics computed from data tend to be normally distributed.
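The CLT claim can be checked with a small simulation. The sketch below is illustrative only: the Exponential(1) source distribution, the sample size of 50, the repetition counts, and the seed are all arbitrary choices made for this example, not values from the post. It draws strongly skewed raw data, then shows that means of those draws are far less skewed and have spread close to 1/√n.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable


def skewness(values):
    """Third standardized moment (population form)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Raw draws from a right-skewed distribution: Exponential with rate 1
# (theoretical mean 1, standard deviation 1, skewness 2).
raw = [random.expovariate(1.0) for _ in range(20_000)]

# Means of n = 50 draws each. The CLT predicts these are close to Normal
# with standard deviation 1/sqrt(50) and skewness 2/sqrt(50).
n = 50
means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(5_000)
]

raw_skew = skewness(raw)
means_skew = skewness(means)
means_sd = statistics.pstdev(means)

print(f"skewness of raw draws:    {raw_skew:.2f}  (theory: 2)")
print(f"skewness of sample means: {means_skew:.2f}  (theory: {2 / n ** 0.5:.2f})")
print(f"sd of sample means:       {means_sd:.3f}  (theory: {1 / n ** 0.5:.3f})")
```

The raw data stays skewed no matter how many points are drawn; only the distribution of the means moves toward the bell shape.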

The DS/ML Anchor

Six cross-validation fold scores: accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

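μ = (0.82 + 0.79 + 0.91 + 0.85 + 0.78 + 0.88) / 6 = 5.03 / 6 ≈ 0.838

σ² = Σ(xᵢ − μ)² / (n−1) = 0.01308 / 5 ≈ 0.00262 → σ ≈ 0.051

The worked examples below use the nearby round values μ = 0.85 and σ = 0.048 as a convenient anchor: we model fold accuracy ~ Normal(μ=0.85, σ=0.048) throughout.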

The PDF

$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^{2}}{2\sigma^{2}}\right)$

Every component has a role:

Component	Role
`1/(σ√(2π))`	Normalization constant — ensures total area = 1
`(x−μ)²`	Squared distance from center — symmetric on both sides
`2σ²`	Scale — larger σ spreads the penalty, widening the bell
`exp(−...)`	Converts squared distance to density — exponential decay from center

Computing f(0.85) with our anchor:

$f(0.85) = \frac{1}{0.048 \times \sqrt{2\pi}} \times e^{0} = \frac{1}{0.048 \times 2.507} = \frac{1}{0.1203} \approx 8.31$

f(0.85) = 8.31 > 1. This is valid — f(x) is a density, not a probability. Density can exceed 1; only the area under the curve must equal 1.
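To make the density-versus-probability point concrete, here is a minimal sketch that codes the PDF formula directly and checks two things with the anchor values: the height at the center, and the total area. The helper name `normal_pdf` and the crude midpoint-rule integration are choices made for this illustration, not part of the original post.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma^2) at x, term by term as in the formula."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))  # normalization constant
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)  # scaled squared distance
    return coeff * math.exp(exponent)


mu, sigma = 0.85, 0.048  # anchor values from the text

# Height of the curve at the center: a density, so it may exceed 1.
peak = normal_pdf(mu, mu, sigma)

# Midpoint Riemann sum over mu ± 8 sigma as a rough check that the area is 1.
steps = 10_000
lo, hi = mu - 8 * sigma, mu + 8 * sigma
width = (hi - lo) / steps
area = sum(
    normal_pdf(lo + (i + 0.5) * width, mu, sigma) for i in range(steps)
) * width

print(f"f(mu) = {peak:.2f}")
print(f"area under the curve = {area:.6f}")
```

A tall, narrow curve and a unit area are compatible: the peak is high precisely because σ is small, so the mass is packed into a short interval.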

Inflection points at x = μ ± σ: take the second derivative of f(x) and set it to zero. After simplification, the second derivative is zero when (x−μ)²/σ² = 1, i.e., x = μ ± σ. These are the points where the curve transitions from concave down (near the center) to concave up (in the tails).

For the anchor: inflection points at 0.85 − 0.048 = 0.802 and 0.85 + 0.048 = 0.898.
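The simplification behind the inflection-point claim can be sketched in two lines, using only the chain and product rules on the density f(x):

$f'(x) = -\frac{x - \mu}{\sigma^{2}} f(x)$

$f''(x) = \left[\frac{(x - \mu)^{2}}{\sigma^{4}} - \frac{1}{\sigma^{2}}\right] f(x) = \frac{(x - \mu)^{2} - \sigma^{2}}{\sigma^{4}} f(x)$

Since f(x) > 0 everywhere, f''(x) = 0 exactly when (x − μ)² = σ². The sign is negative for |x − μ| < σ (concave down) and positive for |x − μ| > σ (concave up).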

CDF and Probability Computations

$F (x) = P (X \leq x) = Φ (\frac{x - μ}{σ})$

There is no closed-form antiderivative — the CDF is defined through the standard normal CDF Φ. Every query requires standardizing to a z-score first.

Four standard query types, all on the anchor (μ=0.85, σ=0.048):

Query 1 — P(X ≤ a): What fraction of folds score ≤ 0.90?

z = (0.90 − 0.85) / 0.048 = 1.042 → Φ(1.042) ≈ 0.851

Query 2 — P(X > a): What fraction of folds score above 0.88?

z = (0.88 − 0.85) / 0.048 = 0.625 → 1 − Φ(0.625) = 1 − 0.734 ≈ 0.266

Query 3 — P(a < X ≤ b): What fraction of folds score between 0.82 and 0.90?

z₁ = (0.82 − 0.85) / 0.048 = −0.625 → Φ(−0.625) = 0.266

z₂ = (0.90 − 0.85) / 0.048 = 1.042 → Φ(1.042) = 0.851

P = 0.851 − 0.266 = 0.585

Query 4 — Inverse (quantile): What accuracy corresponds to the 90th percentile?

x = μ + σ × Φ⁻¹(0.90) = 0.85 + 0.048 × 1.282 = 0.85 + 0.062 = 0.912
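The four hand calculations can be reproduced with the standard library alone. This is a sketch under the anchor assumption (μ = 0.85, σ = 0.048). The helper names `phi` and `z_score` are made up for this example. Φ is written through `math.erf`, and the quantile uses `statistics.NormalDist.inv_cdf`, because Φ has no elementary inverse.

```python
import math
from statistics import NormalDist


def phi(z):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mu, sigma = 0.85, 0.048  # anchor values from the text


def z_score(x):
    """Standardize x against the anchor distribution."""
    return (x - mu) / sigma


p_le_090 = phi(z_score(0.90))                        # Query 1: P(X <= 0.90)
p_gt_088 = 1.0 - phi(z_score(0.88))                  # Query 2: P(X > 0.88)
p_between = phi(z_score(0.90)) - phi(z_score(0.82))  # Query 3: P(0.82 < X <= 0.90)
x_90 = mu + sigma * NormalDist().inv_cdf(0.90)       # Query 4: 90th percentile

print(f"P(X <= 0.90)        = {p_le_090:.3f}")
print(f"P(X > 0.88)         = {p_gt_088:.3f}")
print(f"P(0.82 < X <= 0.90) = {p_between:.3f}")
print(f"90th percentile     = {x_90:.3f}")
```

Every query reduces to one or two evaluations of Φ after standardizing, which is the whole point of the z-score step.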

The Empirical Rule (68-95-99.7)

These numbers come from the standard normal CDF Φ — not guesswork:

$P (μ - σ < X < μ + σ) = Φ (1) - Φ (- 1) = 2Φ (1) - 1 = 2 (0.8413) - 1 = 0.6827$

$P (μ - 2 σ < X < μ + 2 σ) = Φ (2) - Φ (- 2) = 2 (0.9772) - 1 = 0.9545$

$P (μ - 3 σ < X < μ + 3 σ) = Φ (3) - Φ (- 3) = 2 (0.9987) - 1 = 0.9973$

Applied to the anchor (μ=0.85, σ=0.048):

Band	Range	Probability
±1σ	[0.802, 0.898]	68.27% of fold scores
±2σ	[0.754, 0.946]	95.45% of fold scores
±3σ	[0.706, 0.994]	99.73% of fold scores

The rule is exact only for normal data. For non-normal data, Chebyshev's inequality gives a weaker but universally valid bound: P(|X − μ| < kσ) ≥ 1 − 1/k². For k=2: at least 75% within ±2σ (versus 95.45% for Normal). For k=3: at least 88.9% within ±3σ (versus 99.73%). Chebyshev needs no normality assumption — only finite variance.
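A short comparison of the two guarantees, as a sketch: the Normal coverage 2Φ(k) − 1 next to the Chebyshev lower bound 1 − 1/k². The function names are invented for this example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def normal_coverage(k):
    """P(|X - mu| < k*sigma) when X is Normal: 2*Phi(k) - 1."""
    return 2.0 * std_normal.cdf(k) - 1.0


def chebyshev_bound(k):
    """Distribution-free lower bound on P(|X - mu| < k*sigma), for k > 0."""
    return 1.0 - 1.0 / k ** 2


for k in (1, 2, 3):
    print(
        f"k={k}: Normal {normal_coverage(k):.4f}, "
        f"Chebyshev at least {chebyshev_bound(k):.4f}"
    )
```

At k = 1 the Chebyshev bound is 0, so it says nothing at all. The bound only becomes informative for k > 1, which is the price of assuming nothing beyond finite variance.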

Mean, Variance, and Moments

E[X] = μ — by symmetry of f(x) around μ. The integral ∫(x−μ)f(x)dx = 0 because the integrand is odd.

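Var(X) = σ², which is why the parameter carries that name. Derivation: substitute z = (x−μ)/σ, then integrate ∫z²φ(z)dz by parts (φ is the standard normal density) to get 1, so ∫(x−μ)² × f(x)dx = σ².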

Skewness = 0 — symmetric distribution; all odd central moments are zero.

Excess kurtosis = 0 — by definition. Normal is the reference distribution for kurtosis. The 4th central moment E[(X−μ)⁴] = 3σ⁴, giving kurtosis = 3 and excess kurtosis = 0.

General even moments: E[(X−μ)^{2k}] = (2k−1)!! × σ^{2k}, where !! is the double factorial. For k=2: 3!! × σ⁴ = 3σ⁴ ✓
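The moment formulas can be checked numerically. The sketch below integrates (x − μ)^order against the density with a plain midpoint rule and compares the even moments with (2k−1)!! × σ^{2k}. The helper names, the step count, and the ±10σ integration range are assumptions made for this illustration.

```python
import math


def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2 (1 for n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def central_moment(order, sigma, steps=20_000, span=10.0):
    """Midpoint-rule estimate of E[(X - mu)^order] for Normal(mu, sigma^2).

    The value does not depend on mu, so the integral runs over d = x - mu.
    """
    lo = -span * sigma
    width = 2.0 * span * sigma / steps
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i in range(steps):
        d = lo + (i + 0.5) * width
        total += d ** order * coeff * math.exp(-d * d / (2.0 * sigma ** 2))
    return total * width


sigma = 0.048  # anchor value from the text

for k in (1, 2, 3):
    numeric = central_moment(2 * k, sigma)
    closed_form = double_factorial(2 * k - 1) * sigma ** (2 * k)
    print(f"k={k}: numeric {numeric:.6e}, closed form {closed_form:.6e}")

third_moment = central_moment(3, sigma)  # odd moment, expected to vanish
kurtosis = central_moment(4, sigma) / central_moment(2, sigma) ** 2
print(f"third central moment = {third_moment:.1e}")
print(f"kurtosis = {kurtosis:.4f}")
```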

Properties of Normal Distributions

1. Sum of independent normals: If X ~ N(μ₁, σ₁²) and Y ~ N(μ₂, σ₂²) independently, then X + Y ~ N(μ₁+μ₂, σ₁²+σ₂²).

Applied to the anchor: averaging 6 independent folds (property 1 for the sum, then property 2 below with a = 1/6 for the scaling), the sample mean X̄ ~ N(0.85, σ²/6) = N(0.85, 0.048²/6) ≈ N(0.85, 0.020²). The fold average has the same center as a single fold but only 1/√6 ≈ 0.41× the spread.

2. Linear transformation: aX + b ~ N(aμ + b, a²σ²). Scaling and shifting a normal variable produces another normal variable.

3. Symmetry about μ: f(μ+x) = f(μ−x) for all x. Points equidistant from the mean have equal density.

4. Mode = Median = Mean = μ. All three measures of center coincide, a property shared by every symmetric unimodal distribution with a finite mean.

5. Infinite support: f(x) > 0 for all x ∈ (−∞, +∞). Practically, P(|X − μ| > 4σ) ≈ 0.006%, but the tails extend infinitely. A Normal model therefore assigns nonzero probability to accuracy > 1, which is physically impossible. The error is negligible when σ is small relative to the distance from μ to the boundary.
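Properties 1 and 2 can be checked by simulation. In this sketch the parameters of X and Y, the constants a = 3 and b = 5, the sample sizes, and the seed are arbitrary illustrative choices. Only the final check uses the anchor values.

```python
import random
import statistics

random.seed(1)  # fixed seed for repeatability
N = 100_000

# Property 1: X ~ N(1, 2^2) and Y ~ N(-3, 1.5^2), drawn independently.
x = [random.gauss(1.0, 2.0) for _ in range(N)]
y = [random.gauss(-3.0, 1.5) for _ in range(N)]
total = [a + b for a, b in zip(x, y)]
sum_mean = statistics.fmean(total)
sum_var = statistics.pvariance(total)
print(f"X + Y:  mean {sum_mean:.3f} (theory -2), "
      f"variance {sum_var:.3f} (theory {2.0 ** 2 + 1.5 ** 2})")

# Property 2: 3X + 5 should have mean 3*1 + 5 = 8 and variance 9*4 = 36.
scaled = [3.0 * a + 5.0 for a in x]
lin_mean = statistics.fmean(scaled)
lin_var = statistics.pvariance(scaled)
print(f"3X + 5: mean {lin_mean:.3f} (theory 8), "
      f"variance {lin_var:.3f} (theory 36)")

# Anchor: the mean of 6 independent N(0.85, 0.048^2) draws
# should have standard deviation 0.048 / sqrt(6).
fold_means = [
    statistics.fmean([random.gauss(0.85, 0.048) for _ in range(6)])
    for _ in range(20_000)
]
fold_mean_sd = statistics.pstdev(fold_means)
print(f"sd of 6-fold means: {fold_mean_sd:.4f} "
      f"(theory {0.048 / 6 ** 0.5:.4f})")
```

The simulation only confirms the mean and variance. That the results are again Normal in shape is the closure property itself, which a histogram or Q-Q plot of `total` would show.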

Standard Normal and Z-Scores

Every Normal(μ, σ²) can be standardized: Z = (X − μ)/σ ~ Normal(0, 1).

This is why a single standard normal table works for all normal distributions — you standardize first, then use Φ. Full treatment: see the Standard Normal and Z-Score post.

Fitting to Data: MLE

The MLE estimates for Normal parameters are:

$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}, \qquad \hat{\sigma}^{2} = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^{2}$

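Note: MLE uses n in the denominator, not n−1. Dividing by n−1 instead (Bessel's correction) makes the variance estimator unbiased. In NumPy, np.var(x, ddof=1) and np.std(x, ddof=1) apply the correction, while the default ddof=0 gives the MLE. Strictly, only the variance is unbiased; its square root still slightly underestimates σ.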

For the six fold scores: μ̂ = 5.03/6 ≈ 0.838, σ̂ (MLE) ≈ 0.047, σ̂ (unbiased) ≈ 0.051. (The worked examples use the round anchor values μ = 0.85, σ = 0.048 instead.)
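The two denominators can be compared directly on the six fold scores. This sketch uses the standard library's `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n − 1). The variable names are chosen for this example.

```python
import math
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
n = len(accuracy)

mu_hat = statistics.fmean(accuracy)
var_mle = statistics.pvariance(accuracy)      # divides by n      (MLE)
var_unbiased = statistics.variance(accuracy)  # divides by n - 1  (Bessel)

sigma_mle = math.sqrt(var_mle)
sigma_unbiased = math.sqrt(var_unbiased)

print(f"mu_hat            = {mu_hat:.3f}")
print(f"sigma_hat (n)     = {sigma_mle:.3f}")
print(f"sigma_hat (n - 1) = {sigma_unbiased:.3f}")
# The two variance estimates always differ by the factor (n - 1) / n.
print(f"variance ratio    = {var_mle / var_unbiased:.4f}  "
      f"((n - 1)/n = {(n - 1) / n:.4f})")
```

With n = 6 the gap between the two estimates is noticeable. It shrinks quickly as n grows, since the ratio (n − 1)/n approaches 1.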

ML Applications

1. Residual normality in regression. Linear regression assumes ε ~ N(0, σ²). Residual Q-Q plots and Shapiro-Wilk are standard diagnostics — non-normal residuals may indicate a missing predictor or a wrong functional form.

2. Gaussian Naive Bayes. Estimates P(featureⱼ | class) as a Normal distribution per (feature, class) pair. Works well for continuous features like sensor readings or log-transformed counts.

3. Neural network weight initialization. He initialization: weights ~ N(0, 2/n_in). Xavier (Glorot): N(0, 2/(n_in + n_out)), which reduces to N(0, 1/n_in) when n_in = n_out. The variance is chosen to keep activation and gradient magnitudes stable across layers: too large causes explosion, too small causes vanishing.

4. Gaussian processes. A GP defines a prior distribution over functions using multivariate Normal distributions. The GP posterior (conditioned on observations) is also multivariate Normal — this is the closure property at work.

5. Confidence intervals and hypothesis tests. Nearly all parametric inference (t-tests, z-tests, regression inference) assumes normality of sampling distributions — justified by CLT for large samples, or explicit normality for small samples.
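The weight-initialization item can be illustrated with a toy forward pass. This is a rough sketch, not a training setup. It pushes one random input through a stack of square ReLU layers whose weights are drawn from N(0, c/n_in) and reports how the mean-square activation changes. The layer width, depth, seed, and the three values of c are assumptions made for this example. It tracks forward activations only, while the text speaks of gradients; the same variance argument is usually applied to both.

```python
import math
import random

random.seed(2)  # fixed seed for repeatability


def relu_layer(inputs, weight_sd):
    """One square fully connected ReLU layer with fresh N(0, weight_sd^2) weights."""
    outputs = []
    for _ in range(len(inputs)):
        pre_activation = sum(random.gauss(0.0, weight_sd) * v for v in inputs)
        outputs.append(max(0.0, pre_activation))
    return outputs


def mean_square(values):
    return sum(v * v for v in values) / len(values)


def signal_ratio(variance_scale, width=128, depth=5):
    """Mean-square activation after `depth` layers, relative to the input,
    when weights are drawn from N(0, variance_scale / n_in)."""
    x = [random.gauss(0.0, 1.0) for _ in range(width)]
    start = mean_square(x)
    weight_sd = math.sqrt(variance_scale / width)
    for _ in range(depth):
        x = relu_layer(x, weight_sd)
    return mean_square(x) / start


too_small = signal_ratio(0.5)  # variance 0.5 / n_in: signal shrinks
he_scale = signal_ratio(2.0)   # variance 2 / n_in (He): signal roughly kept
too_large = signal_ratio(8.0)  # variance 8 / n_in: signal grows

print(f"c = 0.5: ratio {too_small:.3g}")
print(f"c = 2.0: ratio {he_scale:.3g}")
print(f"c = 8.0: ratio {too_large:.3g}")
```

Each ReLU layer multiplies the expected mean-square signal by c/2, so c = 2 is the neutral choice. The other two settings shrink or grow it by a factor of 4 per layer.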

```python
from scipy import stats
import numpy as np

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
mu = np.mean(accuracy)
sigma = np.std(accuracy, ddof=1)  # sample standard deviation (n - 1 denominator)
print(f"Fitted: mu={mu:.3f}, sigma={sigma:.3f}")

dist = stats.norm(mu, sigma)  # frozen Normal with the fitted parameters

# Four query types (on the fitted values, not the anchor values 0.85 and 0.048)
print(f"\nP(X <= 0.90) = {dist.cdf(0.90):.3f}")
print(f"P(X > 0.88)  = {dist.sf(0.88):.3f}")  # sf = 1 - cdf
print(f"P(0.82 < X <= 0.90) = {dist.cdf(0.90) - dist.cdf(0.82):.3f}")
print(f"90th percentile = {dist.ppf(0.90):.3f}")  # ppf = inverse cdf

# Normality test
stat_sw, p_sw = stats.shapiro(accuracy)
print(f"\nShapiro-Wilk: W={stat_sw:.2f}, p={p_sw:.2f}")

# f(mu) > 1 demonstration
print(f"\nf(mu) = density at mean = {dist.pdf(mu):.2f}  (> 1 is valid for density)")

# Empirical rule verification
for k, label in [(1, '68.3%'), (2, '95.5%'), (3, '99.7%')]:
    p = dist.cdf(mu + k*sigma) - dist.cdf(mu - k*sigma)
    print(f"P(mu ± {k}sigma): {p:.4f}  (expect ~{label})")
```

```text
Fitted: mu=0.838, sigma=0.051

P(X <= 0.90) = 0.886
P(X > 0.88)  = 0.208
P(0.82 < X <= 0.90) = 0.526
90th percentile = 0.904

Shapiro-Wilk: W=0.95, p=0.74

f(mu) = density at mean = 7.80  (> 1 is valid for density)

P(mu ± 1sigma): 0.6827  (expect ~68.3%)
P(mu ± 2sigma): 0.9545  (expect ~95.5%)
P(mu ± 3sigma): 0.9973  (expect ~99.7%)
```

The script fits μ and σ from the six scores (μ̂ ≈ 0.838, σ̂ ≈ 0.051), so its probabilities differ from the hand calculations, which use the round anchor values μ = 0.85 and σ = 0.048.

A Shapiro-Wilk p-value of about 0.74 gives no evidence against normality. That is expected: with only 6 points the test has very little power to detect violations, so a large p-value is weak support for normality rather than proof of it.

Related Concepts

Central Limit Theorem: the theorem that explains why the Normal appears so widely; it formalizes the convergence behavior sketched in the introduction
Standard Normal / Z-Score: the standardized form Normal(0, 1) and its lookup tables (see the next post)
Beta distribution: bounded [0, 1] alternative for accuracy, probability, or proportion data that is near the boundaries
t-distribution: the Normal distribution's small-sample counterpart; accounts for uncertainty in estimating σ

Limitations

Unbounded support is physically wrong for accuracy scores. Normal(0.85, 0.048²) assigns nonzero probability to accuracy > 1. With μ = 0.85 and σ = 0.048 the boundary sits at z = (1 − 0.85)/0.048 ≈ 3.1, so this probability is small (roughly 0.1%), but for models near perfect accuracy, use a Beta distribution.
No closed-form CDF. Every Normal probability requires numerical evaluation (erf function or tables). This adds implementation complexity compared to exponential or uniform.
Heavy tails in real data. Financial returns, prediction errors for rare events, and residuals from misspecified models often have fatter tails than Normal predicts — use t-distribution or stable distributions for heavy-tail modeling.
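The first limitation can be quantified with a one-line tail probability. In the sketch below the anchor values come from the text, while the second model, Normal(0.98, 0.02²), is a hypothetical near-perfect classifier invented for contrast.

```python
from statistics import NormalDist


def prob_above_one(mu, sigma):
    """P(X > 1) under Normal(mu, sigma^2): mass placed on impossible accuracies."""
    return 1.0 - NormalDist(mu, sigma).cdf(1.0)


anchor = prob_above_one(0.85, 0.048)       # boundary is about 3.1 sigma away
near_perfect = prob_above_one(0.98, 0.02)  # hypothetical: boundary is 1 sigma away

print(f"anchor model:       P(X > 1) = {anchor:.5f}")
print(f"near-perfect model: P(X > 1) = {near_perfect:.4f}")
```

When the boundary is only one standard deviation from the mean, a sizeable share of the fitted distribution lies on impossible values. That is the situation where a bounded alternative such as the Beta distribution becomes the better model.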

Test Your Understanding

The six fold scores are [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]. Compute the z-score for the fold that scored 0.91 and determine what percentile it represents under the fitted Normal model.
A new training run scores 0.75. An engineer says "that's less than μ − 2σ, so it's definitely an outlier." Is this reasoning statistically correct? What additional consideration should be made when making outlier claims?
This post states that MLE uses n (not n−1) in the variance denominator. If you fit μ̂ and σ̂² using MLE, is σ̂² an unbiased estimator of σ²? If not, what is its bias?
Explain why P(X > 1.0) for Normal(0.85, 0.048²) is technically positive but practically ignorable. For what type of model would this probability not be ignorable?
You are told that fold scores across 6 CV folds are independent and each fold's score is a mean over many test-set examples. Invoke the CLT to justify why modeling individual fold scores as Normal is reasonable, and state what assumption is required.
