Central Limit Theorem

Apr 11, 2026 • 12 min read • By Mohammed Vasim

Statistics, Math, Data Science

Almost no real-world data is Normal. Latency is right-skewed. Error counts are often modeled as Poisson. Individual clicks are Bernoulli. Yet most classical statistical tools (t-tests, z-tests, confidence intervals, ANOVA) assume Normality somewhere. The Central Limit Theorem explains why those tools still work: not because the data is Normal, but because the sampling distribution of the mean converges to Normal for almost any underlying distribution, provided the observations are independent and the variance is finite.

That is the theorem's power: it separates the shape of your data from the shape of your estimates.

Formal Statement

For independent, identically distributed (i.i.d.) random variables X₁, X₂, ..., X_n with mean μ and finite variance σ²:

(x̄_n − μ) / (σ/√n) → N(0, 1) in distribution as n → ∞

Equivalently: x̄_n ~ N(μ, σ²/n) approximately for large enough n

Every component unpacked:

x̄_n: the sample mean of n observations — what you actually compute
μ: the true population mean — what you want to estimate
σ/√n: the standard error — how much x̄ varies across samples of size n
The ratio: a Z-score for sample means; how many standard errors x̄ is from μ
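
As a minimal sketch of the ratio above, the helper below standardizes a sample mean. The function name `standardized_mean` and the four example latencies are illustrative assumptions, not part of any library, and μ and σ are treated as known.

```python
import numpy as np


def standardized_mean(sample, mu, sigma):
    """Return (x_bar, standard_error, z) for a sample with known mu and sigma."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    x_bar = sample.mean()        # what you actually compute
    se = sigma / np.sqrt(n)      # how much x_bar varies across samples of size n
    z = (x_bar - mu) / se        # how many standard errors x_bar is from mu
    return x_bar, se, z


# Hypothetical batch of four latencies (ms); mu=45 and sigma=22 are assumed known
x_bar, se, z = standardized_mean([40, 50, 60, 50], mu=45, sigma=22)
print(f"x_bar={x_bar:.1f}  SE={se:.1f}  Z={z:.3f}")
```

With only four observations the standard error is large, so a batch mean 5ms above μ is well under one standard error away.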

The DS/ML Anchor

Model inference latency: right-skewed population with μ=45ms, σ=22ms (Gamma-distributed — most requests are fast, occasional slow outliers).

Applying the CLT to batches of n=30 requests:

x̄ ~ N(45, (22/√30)²) = N(45, 4.0²)

A batch mean of 55ms corresponds to Z = (55−45)/4.0 ≈ 2.5, which gives a one-sided tail probability of p ≈ 0.006 under the Normal approximation. That is a meaningful latency spike.
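
One way to sanity-check this p-value: if individual latencies were exactly Gamma(shape=k, scale=θ), the mean of n of them would be exactly Gamma(shape=nk, scale=θ/n), so the Normal tail can be compared against an exact one. The sketch below assumes the Gamma(4.2, 10.7) parameters used in the Code section. It illustrates the size of the approximation error and makes no claim about real latency data.

```python
import numpy as np
from scipy import stats

shape, scale, n = 4.2, 10.7, 30    # assumed Gamma parameters (mean ≈ 45ms, sd ≈ 22ms)
threshold = 55.0                   # observed batch mean in ms

mu = shape * scale                 # population mean
sigma = np.sqrt(shape) * scale     # population standard deviation
se = sigma / np.sqrt(n)

# CLT approximation: x_bar ~ N(mu, se^2)
p_normal = stats.norm.sf(threshold, loc=mu, scale=se)

# Exact: the mean of n i.i.d. Gamma(shape, scale) draws is Gamma(n*shape, scale/n)
p_exact = stats.gamma.sf(threshold, a=n * shape, scale=scale / n)

print(f"Normal approximation: P(x_bar >= {threshold}) = {p_normal:.4f}")
print(f"Exact Gamma tail:     P(x_bar >= {threshold}) = {p_exact:.4f}")
```

Because batch means of a right-skewed population keep some right skew at n=30, the exact upper-tail probability comes out somewhat larger than the Normal figure, though of the same order of magnitude.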

Standard Deviation vs Standard Error

This distinction is constantly confused:

Standard deviation (σ=22ms): variability of individual requests from the population mean. Individual latency is noisy.
Standard error (SE=σ/√n): variability of sample means from the population mean. With n=30, batch means vary much less.

n	SE = 22/√n	What it means
1	22.0 ms	Single request — all population noise
5	9.8 ms	Batch of 5
30	4.0 ms	Typical monitoring window — CLT applies
100	2.2 ms	Production batch — very stable
400	1.1 ms	Large-scale — near-certain estimate

Key insight: as n quadruples, SE halves. Precision grows as √n, not n. To halve uncertainty you need 4× the data, not 2×.
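
The √n relationship can be inverted to ask how much data a target precision needs. A small sketch follows; the helper names `standard_error` and `required_n` are hypothetical, and σ=22ms is taken from the latency example.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean for n i.i.d. observations."""
    return sigma / math.sqrt(n)


def required_n(sigma, target_se):
    """Smallest n with sigma / sqrt(n) <= target_se.

    Floating-point rounding can push an exact boundary case up by one.
    """
    return math.ceil((sigma / target_se) ** 2)


sigma = 22.0  # ms, from the latency example
for n in [1, 5, 30, 100, 400]:
    print(f"n={n:3d}  SE={standard_error(sigma, n):5.2f} ms")

# Halving the target SE roughly quadruples the sample size needed
n_a = required_n(sigma, target_se=4.0)
n_b = required_n(sigma, target_se=2.0)
print(f"n for SE<=4.0ms: {n_a}   n for SE<=2.0ms: {n_b}")
```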

Convergence — Phase by Phase

Phase 1: Population Distribution (n=1)

Individual latency requests follow a right-skewed distribution. The raw histogram is clearly non-Normal — most requests cluster near 30ms but a long tail extends past 100ms.

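Phase 2: Batch Means (n=5 and n=10)

Averaging even a few requests pulls in the tail. The skewness of x̄ shrinks in proportion to 1/√n, so batch means at n=5 are still visibly right-skewed (skew ≈ 0.42 in the simulation in the Code section) and at n=10 somewhat less so.

Phase 3: Convergence at n=30

At n=30 the histogram of batch means is nearly symmetric and bell-shaped (skew ≈ 0.18), centered on μ=45ms with spread SE ≈ 4.0ms.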

The n≥30 Rule — and Why It's Just a Guideline

n≥30 is a rough rule of thumb for moderately skewed distributions. The truth is:

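Symmetric, light-tailed distributions (e.g. Uniform): n=5–10 is usually enough; if the population is already Normal, x̄ is exactly Normal for any n
Moderately skewed (latency, income): n=30 is usually sufficient
Severely skewed (log-normal, Pareto with finite variance, other heavy tails): may need n=100 or more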
Bernoulli (small p): need np ≥ 10 and n(1−p) ≥ 10
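
The guideline can be probed by simulation. The sketch below tracks the skewness of x̄ for three illustrative populations; the choice of Uniform(0,1), Exponential(1) and log-normal(0,1) is an assumption made for demonstration, not something taken from the latency example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps = 20_000  # number of simulated sample means per setting

# Illustrative populations: symmetric, moderately skewed, strongly skewed
populations = {
    "uniform": lambda size: rng.uniform(0.0, 1.0, size=size),
    "exponential": lambda size: rng.exponential(1.0, size=size),
    "lognormal": lambda size: rng.lognormal(mean=0.0, sigma=1.0, size=size),
}

skew_of_means = {}
for name, draw in populations.items():
    for n in [5, 30, 100]:
        means = draw((reps, n)).mean(axis=1)   # one sample mean per row
        skew_of_means[(name, n)] = stats.skew(means)
        print(f"{name:12s} n={n:3d}  skew of x_bar = {skew_of_means[(name, n)]:6.3f}")
```

In theory the skewness of x̄ equals the population skewness divided by √n, so the more skewed the population, the larger the n needed before the residual skew becomes negligible.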

Why the CLT Enables Statistical Inference

Z-test and t-test: the test statistic Z = (x̄ − μ₀) / (σ/√n) is a standardized sample mean. It approximately follows N(0,1) under H₀ because the CLT makes x̄ approximately Normal, not because latency is Normal. When σ is replaced by the sample standard deviation s, the statistic is referred to a t distribution with n−1 degrees of freedom instead.
Confidence intervals: x̄ ± z* × SE is justified because the sampling distribution of x̄ is approximately Normal. The ± is symmetric because the Normal is symmetric.
A/B testing: the difference in mean CTR or accuracy between two model versions is a difference of two sample means. Each is approximately Normal by the CLT; if the two samples are independent, their difference is approximately Normal too.
Mini-batch SGD: the gradient is an average over mini-batch samples. The CLT suggests the gradient estimate is approximately Normally distributed around its expected value, with noise that shrinks as 1/√(batch size). This is one reason larger batches give more stable updates.
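
To see the confidence-interval claim in action, the following sketch estimates how often a nominal 95% interval covers the true mean when the data are skewed. It assumes the Gamma(4.2, 10.7) latency model from the Code section and uses the sample standard deviation in place of σ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
shape, scale = 4.2, 10.7          # assumed Gamma latency model (right-skewed)
true_mean = shape * scale
n, reps = 30, 10_000
z_star = stats.norm.ppf(0.975)    # ≈ 1.96 for a 95% interval

# Each row is one batch of n requests
batches = rng.gamma(shape, scale, size=(reps, n))
x_bar = batches.mean(axis=1)
se_hat = batches.std(axis=1, ddof=1) / np.sqrt(n)   # SE estimated from each batch

lower = x_bar - z_star * se_hat
upper = x_bar + z_star * se_hat
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Empirical coverage of the nominal 95% interval: {coverage:.3f}")
```

Coverage a little below 95% is to be expected here, since the interval uses z* with an estimated SE and the population is skewed. Swapping in a t quantile with n−1 degrees of freedom narrows the gap.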

Conditions for the CLT

All three must hold for the CLT to apply:

Condition	Requirement	What breaks it	Fix
Independence	X₁,...,X_n are independent	Autocorrelated time series, clustered data	Use block bootstrap; model correlation explicitly
Identical distribution	All from same distribution	Mixed server regions with different latency profiles	Stratify before sampling; separate models per stratum
Finite variance	σ² < ∞	Cauchy distribution, Pareto with shape α ≤ 2	Use median (not mean); use robust statistics
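
The independence row can be illustrated with a simulated autocorrelated series. This is a sketch under assumed conditions: an AR(1) process with a hypothetical lag-1 coefficient of 0.7, chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(2)
phi = 0.7                 # assumed lag-1 autocorrelation (hypothetical value)
n, reps = 200, 2_000

# Simulate `reps` independent AR(1) series: x_t = phi * x_{t-1} + eps_t
x = np.empty((reps, n))
x[:, 0] = rng.normal(scale=1.0 / np.sqrt(1.0 - phi**2), size=reps)  # stationary start
eps = rng.normal(size=(reps, n))
for t in range(1, n):
    x[:, t] = phi * x[:, t - 1] + eps[:, t]

naive_se = x.std(axis=1, ddof=1).mean() / np.sqrt(n)   # pretends observations are independent
actual_se = x.mean(axis=1).std(ddof=1)                 # spread of the means across series
ratio = actual_se / naive_se
print(f"naive SE={naive_se:.3f}  actual SE={actual_se:.3f}  ratio={ratio:.2f}")
```

With positive autocorrelation the naive σ/√n formula understates the real variability of x̄, which is why correlated data call for block methods or an explicit correlation model.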

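When the finite variance condition fails, the CLT does not apply. For Cauchy, the sample mean never stabilizes: the mean of n Cauchy draws has the same distribution as a single draw, so collecting more data does not make the estimate more precise. For Pareto with 1 < α ≤ 2 the mean exists and x̄ still converges to it, but σ is infinite, so the σ/√n standard error is undefined and convergence is slower than the usual 1/√n rate. The Pareto example in the SVG above (α=1.5, infinite variance) illustrates this.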
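
A quick simulation makes the Cauchy case concrete. The sketch below compares the spread of sample means for standard Cauchy and standard Normal data, using the interquartile range as the spread measure because the Cauchy has no finite variance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
reps = 2_000

iqr_of_means = {}
for n in [1, 10, 100, 1_000]:
    cauchy_means = rng.standard_cauchy(size=(reps, n)).mean(axis=1)
    normal_means = rng.standard_normal(size=(reps, n)).mean(axis=1)
    # IQR of the simulated means: shrinks with n only when the CLT applies
    iqr_of_means[("cauchy", n)] = stats.iqr(cauchy_means)
    iqr_of_means[("normal", n)] = stats.iqr(normal_means)
    print(f"n={n:5d}  IQR of means: Cauchy={iqr_of_means[('cauchy', n)]:.3f}  "
          f"Normal={iqr_of_means[('normal', n)]:.3f}")
```

The Normal column shrinks like 1/√n, while the Cauchy column stays near the IQR of a single draw (which is 2 for the standard Cauchy) no matter how large n gets.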

CLT for Sums

The CLT applies to sums too. If S_n = X₁ + X₂ + ... + X_n:

(S_n − nμ) / (σ√n) → N(0, 1)

Equivalently: S_n ~ N(nμ, nσ²) approximately

Poisson anchor example: total errors across n=30 batches, where each batch has Poisson(λ=2) errors. The total is itself exactly Poisson, because a sum of independent Poisson variables is Poisson:

S_30 ~ Poisson(30×2) = Poisson(60)

By CLT (Poisson(60) = sum of 30 independent Poisson(2)):

S_30 ~ N(60, 60) approximately (since mean=variance=60 for Poisson)
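
Since the exact distribution is known here, the quality of the Normal approximation can be checked directly. In the sketch below the threshold of 75 total errors is a hypothetical alerting value, and a continuity correction of 0.5 is applied because the Poisson is integer-valued.

```python
import numpy as np
from scipy import stats

lam, n = 2.0, 30
mean_total = n * lam              # 60
sd_total = np.sqrt(n * lam)       # Poisson variance equals its mean
k = 75                            # hypothetical alert threshold on total errors

# Exact tail: P(S >= k) = P(S > k - 1) for an integer-valued S
p_exact = stats.poisson.sf(k - 1, mu=mean_total)

# Normal approximation with a continuity correction of 0.5
p_normal = stats.norm.sf(k - 0.5, loc=mean_total, scale=sd_total)

print(f"Exact Poisson(60): P(S >= {k}) = {p_exact:.4f}")
print(f"Normal(60, 60):    P(S >= {k}) = {p_normal:.4f}")
```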

Code

python

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Gamma(shape=4.2, scale=10.7) → mean≈45ms, sigma≈22ms, right-skewed
population = rng.gamma(shape=4.2, scale=10.7, size=100_000)

print(f"Population: mean={population.mean():.1f}ms, std={population.std():.1f}ms")
print(f"Population skew: {stats.skew(population):.2f} (right-skewed)")
print()

# Simulate sampling distribution at different n
for n in [1, 5, 30, 100]:
    batch_means = [rng.choice(population, n).mean() for _ in range(5000)]
    se_theory = population.std() / np.sqrt(n)
    se_empirical = np.std(batch_means)
    skewness = stats.skew(batch_means)
    print(f"n={n:3d}: SE_theory={se_theory:.2f}  SE_empirical={se_empirical:.2f}  skew={skewness:.3f}")

# Apply to a monitoring scenario
print()
mu, sigma, n = 45, 22, 30
se = sigma / np.sqrt(n)
x_bar_observed = 55  # spike observed
z = (x_bar_observed - mu) / se
p_tail = stats.norm.sf(z)
print(f"Latency spike: x̄={x_bar_observed}ms, μ={mu}ms, SE={se:.2f}ms")
print(f"Z = ({x_bar_observed}-{mu})/{se:.2f} = {z:.2f}")
print(f"P(x̄ >= {x_bar_observed} | H0) = {p_tail:.4f}")

text

Population: mean=45.0ms, std=22.0ms
Population skew: 1.00 (right-skewed)

n=  1: SE_theory=22.00  SE_empirical=21.97  skew=0.993
n=  5: SE_theory= 9.84  SE_empirical= 9.79  skew=0.421
n= 30: SE_theory= 4.02  SE_empirical= 4.00  skew=0.178
n=100: SE_theory= 2.20  SE_empirical= 2.19  skew=0.063

Latency spike: x̄=55ms, μ=45ms, SE=4.02ms
Z = (55-45)/4.02 = 2.49
P(x̄ >= 55 | H0) = 0.0064

Reference Tables

CLT convergence by sample size:

n	SE (22/√n)	Distribution of x̄	Skew	Normal?
1	22.0 ms	Same as population	~1.0	No
5	9.8 ms	Mildly skewed	~0.42	Borderline
30	4.0 ms	Nearly Normal	~0.18	Yes (typical rule)
100	2.2 ms	Very close to Normal	~0.06	Yes

Conditions checklist:

Condition	Requirement	Violation	Fix
Independence	Observations independent	Correlated time series	Block bootstrap
Identical distribution	Same distribution	Mixed populations	Stratify before sampling
Finite variance	σ² < ∞	Cauchy, Pareto α≤2	Use median instead of mean

Limitations

n≥30 is a guideline, not a law. For heavy-tailed distributions, n=300 may be insufficient. Always verify empirically by checking skew of sample means.
Independence is often approximate. CV fold scores are commonly treated as roughly independent even though their training sets overlap, and batch loss in SGD can be autocorrelated within an epoch.
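The Normal approximation has error. The Berry-Esseen theorem bounds it: sup over x of |F_n(x) − Φ(x)| ≤ C × E[|X−μ|³] / (σ³√n), where F_n is the CDF of the standardized sample mean and C can be taken as 0.4748. The bound requires a finite third absolute moment. More skewed populations require larger n to achieve the same approximation quality.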
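
For a feel of how the Berry-Esseen bound behaves, here is a rough numerical sketch. It assumes the Gamma(4.2, 10.7) latency model from the Code section and estimates the third absolute central moment by Monte Carlo rather than in closed form.

```python
import numpy as np

rng = np.random.default_rng(4)
C = 0.4748   # constant quoted in the text

# Monte Carlo estimate of the moments for the assumed Gamma latency model
x = rng.gamma(4.2, 10.7, size=1_000_000)
mu = x.mean()
sigma = x.std()
rho = np.mean(np.abs(x - mu) ** 3)      # third absolute central moment


def berry_esseen_bound(n):
    """Upper bound on the worst-case CDF error of the Normal approximation."""
    return C * rho / (sigma**3 * np.sqrt(n))


for n in [30, 100, 400, 1_600]:
    print(f"n={n:5d}  bound on sup|F_n - Phi| <= {berry_esseen_bound(n):.3f}")
```

The bound is a worst-case guarantee and is typically much looser than the error actually observed, but it shows the same 1/√n scaling: quadrupling n halves it.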

Test Your Understanding

Model batch loss has μ=0.42 and σ=0.18 across batches of n=25. Compute SE and state the approximate distribution of x̄. If you observe x̄=0.50, what is P(x̄ ≥ 0.50 | μ=0.42)?
You quadruple batch size from 25 to 100. By what factor does SE decrease? What batch size would you need to reduce SE to one-quarter of its original value?
Inference latency follows a Pareto distribution with shape α=1.5 (infinite variance). Can you apply the CLT to batch means of size n=50? What alternative statistic would you use instead of the mean?
Two CV experiments each use k=6 folds. Can you use the CLT to justify treating the difference in mean accuracy as approximately Normal? What assumptions would you need to make?
The CLT says x̄ converges to Normal regardless of the population distribution. Does this mean we can ignore the population distribution entirely in practice? What information from the population distribution (beyond μ and σ) still matters for choosing n?