Central Limit Theorem

Apr 11, 2026 · 7 min read · By mohammed.vasim
Statistics · Math · Data Science

Training a neural network involves computing loss over thousands of mini-batches. Each mini-batch loss is a noisy, non-normal random variable — it fluctuates based on which examples were sampled. Yet the average loss across many mini-batches, the number you actually monitor on a training dashboard, follows a remarkably smooth, predictable pattern that looks almost Normal. This is not a coincidence. It is the Central Limit Theorem at work, and it is the reason every statistical tool built on Normal assumptions actually works for model evaluation, A/B testing, and experiment analysis.

Take any distribution with finite variance (exponential, uniform, Poisson, bimodal) and start averaging independent samples from it. Something fundamental happens to those averages: once centered and scaled, their distribution converges to a Normal, regardless of the original shape. That convergence is what makes confidence intervals, hypothesis tests, and p-values defensible even when your raw data is far from Normal.

The Math Behind It

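If you have $n$ independent, identically distributed random variables $X_1, \dots, X_n$ with mean $\mu$ and variance $\sigma^2$, and you compute their sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$, then:

$$\sqrt{n}\,(\bar{X}_n - \mu) \Rightarrow \mathcal{N}(0, \sigma^2)$$

Or equivalently:

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \Rightarrow \mathcal{N}(0, 1)$$

The double arrow means "converges in distribution to." For large enough $n$, the sample mean is approximately normally distributed with mean $\mu$ and variance $\sigma^2 / n$.

The standard error $\mathrm{SE} = \sigma / \sqrt{n}$ tells you how tightly those sample means cluster around the true mean. Larger samples produce a narrower, taller distribution of means, which means less variability in your estimates.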
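The convergence described above can be watched numerically. The sketch below is an illustration under stated assumptions, not part of the original analysis: it draws from an Exponential(1) distribution (an arbitrary skewed choice with mean 1 and standard deviation 1), standardizes each sample mean by subtracting the true mean and dividing by the standard error, and measures the Kolmogorov-Smirnov distance to a standard Normal. The function names and the seed are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility


def standardized_means(n, n_reps=20_000):
    """Draw n_reps samples of size n from Exponential(1) and standardize each mean."""
    mu, sigma = 1.0, 1.0  # mean and std of Exponential(scale=1)
    samples = rng.exponential(scale=1.0, size=(n_reps, n))
    return (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))


def ks_distance(n):
    """Kolmogorov-Smirnov distance between the standardized means and N(0, 1)."""
    return stats.kstest(standardized_means(n), "norm").statistic


for n in (1, 5, 30, 200):
    print(f"n={n:4d}  KS distance to N(0,1) = {ks_distance(n):.3f}")
```

The distance should shrink as $n$ grows; the exact values depend on the seed and on the number of repetitions, since the KS statistic also picks up sampling noise.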

A Concrete ML Dataset

Suppose a model serves predictions and you log the response latency (in milliseconds) for each request. Latency is heavily right-skewed: most requests take 20–40 ms, but occasional outliers stretch to 500 ms or more. You sample mini-batches of $n$ requests and compute the mean latency per batch. Here is what happens to those batch means:

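| Batch size $n$ | Shape of batch-mean distribution | Standard error ($\sigma/\sqrt{n}$, with $\sigma = 80$ ms) |
| --- | --- | --- |
| 1 | Skewed right (matches raw latency) | 80.0 ms |
| 10 | Mild right skew | 25.3 ms |
| 30 | Near-Normal | 14.6 ms |
| 100 | Essentially Normal | 8.0 ms |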

The raw latency distribution never changes. Only the aggregation level changes. Yet for $n \geq 30$ the batch means are Normal enough that you can apply z-scores, build confidence intervals, and run two-sample comparisons between model versions.

[Figure: distribution of batch means at three batch sizes. n = 1 (raw latency): right-skewed, mean = 60 ms. n = 10: mild skew, mean = 60 ms, SE ≈ 25 ms. n = 30: near-Normal, mean = 60 ms, SE ≈ 14.6 ms.]
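To see the skew fade as the batch size grows, here is a rough simulation sketch. It assumes a gamma distribution as a stand-in for latency, with its shape and scale chosen only so that the mean is 60 ms and the standard deviation is 80 ms; real latency data will have a different shape. The helper name `batch_mean_summary` is hypothetical.

```python
import numpy as np
from scipy import stats

MU, SIGMA = 60.0, 80.0        # ms, matching the latency example
SHAPE = (MU / SIGMA) ** 2     # gamma shape k = mu^2 / sigma^2
SCALE = SIGMA ** 2 / MU       # gamma scale theta = sigma^2 / mu


def batch_mean_summary(n, n_batches=20_000, seed=0):
    """Return (empirical SE, skewness) of the means of n_batches batches of size n."""
    rng = np.random.default_rng(seed)
    latencies = rng.gamma(shape=SHAPE, scale=SCALE, size=(n_batches, n))
    means = latencies.mean(axis=1)
    return means.std(ddof=1), stats.skew(means)


results = {n: batch_mean_summary(n) for n in (1, 10, 30, 100)}
for n, (se, skew) in results.items():
    print(f"n={n:3d}  SE={se:5.1f} ms  theory={SIGMA / np.sqrt(n):5.1f} ms  skewness={skew:.2f}")
```

The empirical standard error should track $\sigma/\sqrt{n}$ at every batch size, while the skewness of the batch means falls toward zero as $n$ increases.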

Calculation Walk-Through

A model's per-request latency has true mean $\mu = 60$ ms and standard deviation $\sigma = 80$ ms (reflecting the skewed distribution). You observe a batch of $n = 100$ requests with mean latency $\bar{x} = 68$ ms. What is the probability of seeing a batch mean this high or higher by chance?

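| Phase | Formula | Values | Result |
| --- | --- | --- | --- |
| Standard error | $\mathrm{SE} = \sigma / \sqrt{n}$ | $80 / \sqrt{100}$ | $8$ ms |
| Z-score | $Z = (\bar{x} - \mu) / \mathrm{SE}$ | $(68 - 60) / 8$ | $1.0$ |
| Upper tail probability | $P(Z \geq z) = 1 - \Phi(z)$ | $1 - \Phi(1.0)$ | $\approx 0.159$ |
| Interpretation | Compare to $\alpha = 0.05$ | $0.159 > 0.05$ | Not unusual |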

A batch mean of 68 ms or higher happens about 16% of the time even when nothing is wrong. You would need $Z \geq 1.96$ (the two-sided 5% critical value), which corresponds to a batch mean above $75.7$ ms, before flagging a statistically significant spike.

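[Figure: Normal curve of batch means centered at μ = 60 ms. The observed mean of 68 ms sits at Z = 1.0, with upper tail P(Z ≥ 1.0) ≈ 16%. The reject region starts at Z = 1.96, or 75.7 ms.]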
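The walk-through can be reproduced in a few lines. This sketch assumes $\mu = 60$ ms, $\sigma = 80$ ms and a batch of $n = 100$ requests, values consistent with the Z = 1.0 and 75.7 ms figures above; the function name `batch_mean_z_test` is hypothetical.

```python
import math
from scipy.stats import norm


def batch_mean_z_test(x_bar, mu, sigma, n):
    """Return (standard error, z-score, upper-tail probability) for a batch mean."""
    se = sigma / math.sqrt(n)
    z = (x_bar - mu) / se
    p_upper = norm.sf(z)  # survival function: P(Z >= z)
    return se, z, p_upper


se, z, p_upper = batch_mean_z_test(x_bar=68.0, mu=60.0, sigma=80.0, n=100)

# Batch mean at which Z reaches the two-sided 5% critical value (about 1.96)
threshold = 60.0 + norm.ppf(0.975) * se

print(f"SE = {se:.1f} ms, Z = {z:.2f}, P(Z >= z) = {p_upper:.3f}")
print(f"Flag threshold = {threshold:.1f} ms")
```

Using `norm.sf` rather than `1 - norm.cdf` is a small numerical nicety: it stays accurate far out in the tail.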

Python Code

```python
import numpy as np

np.random.seed(42)

# Simulate skewed per-request latency: exponential (scale 50 ms) plus a constant 10 ms floor
def sample_latencies(n_requests, n_batches=5000):
    """Return the mean latency of each of n_batches batches of n_requests requests."""
    raw = np.random.exponential(scale=50, size=(n_batches, n_requests)) + 10
    return raw.mean(axis=1)

mu_latency = 60  # ms: exponential mean of 50 plus the 10 ms shift
# Std of the simulated distribution is 50 ms (the shift adds no spread),
# not the 80 ms used in the worked example.
sigma_latency = 50  # ms

for batch_size in [1, 10, 30, 100]:
    batch_means = sample_latencies(batch_size)
    se_theoretical = sigma_latency / np.sqrt(batch_size)
    se_empirical = batch_means.std()
    print(f"n={batch_size:3d}: SE_theoretical={se_theoretical:.1f}ms  SE_empirical={se_empirical:.1f}ms")
```

```
n=  1: SE_theoretical=50.0ms  SE_empirical=49.8ms
n= 10: SE_theoretical=15.8ms  SE_empirical=15.7ms
n= 30: SE_theoretical=9.1ms  SE_empirical=9.1ms
n=100: SE_theoretical=5.0ms  SE_empirical=5.0ms
```

Why Does This Work?

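When you add up many small, independent random effects, positive deviations and negative deviations start to cancel each other out. The characteristic function (Fourier transform of the density) of any finite-variance distribution can be Taylor-expanded to second order, and for the standardized sum the higher-order terms vanish as $n$ increases, leaving exactly the Gaussian characteristic function in the limit. When the third absolute moment is finite, the convergence rate is $O(1/\sqrt{n})$, and the Berry-Esseen theorem makes this precise:

$$\sup_{x} \left| P\!\left( \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \le x \right) - \Phi(x) \right| \le \frac{C \rho}{\sigma^3 \sqrt{n}}$$

where $\rho = \mathbb{E}|X_1 - \mu|^3$, $\Phi$ is the standard Normal CDF, and $C$ is an absolute constant. Quadrupling the sample size cuts this bound in half; doubling it shrinks the bound only by a factor of $\sqrt{2}$.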
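The $1/\sqrt{n}$ rate can be checked exactly in a case where the distribution of the sample mean is known in closed form. The sketch below is an illustration, not a general tool: it uses Bernoulli($p$) data, for which the sum is Binomial, and compares the exact worst-case CDF gap against the Berry-Esseen bound. The constant `C = 0.56` is an assumed, conservative value (the best published constants are somewhat smaller), and the function names are hypothetical.

```python
import numpy as np
from scipy import stats


def kolmogorov_distance_bernoulli_mean(n, p):
    """Exact sup-distance between the CDF of the standardized Bernoulli(p) mean and Phi."""
    sigma = np.sqrt(p * (1 - p))
    k = np.arange(0, n + 1)
    z = (k - n * p) / (sigma * np.sqrt(n))     # standardized jump locations
    phi = stats.norm.cdf(z)
    cdf_at = stats.binom.cdf(k, n, p)          # CDF value at each jump
    cdf_before = stats.binom.cdf(k - 1, n, p)  # CDF value just left of each jump
    # The gap between a step function and a continuous CDF peaks at a jump point.
    return max(np.abs(cdf_at - phi).max(), np.abs(cdf_before - phi).max())


def berry_esseen_bound(n, p, C=0.56):
    """Berry-Esseen bound C * rho / (sigma^3 * sqrt(n)) for Bernoulli(p) data."""
    sigma = np.sqrt(p * (1 - p))
    rho = p * (1 - p) * ((1 - p) ** 2 + p ** 2)  # E|X - p|^3 for Bernoulli(p)
    return C * rho / (sigma ** 3 * np.sqrt(n))


p = 0.2
for n in (25, 100, 400, 1600):
    d = kolmogorov_distance_bernoulli_mean(n, p)
    b = berry_esseen_bound(n, p)
    print(f"n={n:5d}  distance={d:.4f}  bound={b:.4f}  distance*sqrt(n)={d * np.sqrt(n):.3f}")
```

Each row multiplies $n$ by four, so if the $1/\sqrt{n}$ rate holds, the distance should roughly halve from one row to the next while staying below the bound.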

For Sums

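If $S_n = \sum_{i=1}^{n} X_i$, then $\dfrac{S_n - n\mu}{\sigma \sqrt{n}} \Rightarrow \mathcal{N}(0, 1)$, so for large $n$ the sum is approximately $\mathcal{N}(n\mu, n\sigma^2)$. This matters when you are working with total rather than average quantities: total tokens processed, total prediction errors, total revenue.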
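As a small worked example of the sum form, the sketch below approximates the distribution of a total by a Normal with mean $n\mu$ and standard deviation $\sigma\sqrt{n}$. The token numbers are hypothetical (10,000 requests averaging 500 tokens each with standard deviation 300), chosen only to illustrate the arithmetic, and the function names are assumptions.

```python
import math
from scipy.stats import norm


def total_normal_approx(mu, sigma, n):
    """Mean and std of the CLT approximation to the sum of n i.i.d. terms."""
    return n * mu, sigma * math.sqrt(n)


def prob_total_exceeds(threshold, mu, sigma, n):
    """Approximate P(sum > threshold) using the Normal approximation."""
    mean_total, sd_total = total_normal_approx(mu, sigma, n)
    return norm.sf((threshold - mean_total) / sd_total)


# Hypothetical workload: 10,000 requests, 500 tokens on average, std 300 tokens.
mean_total, sd_total = total_normal_approx(mu=500, sigma=300, n=10_000)
p_over = prob_total_exceeds(5_050_000, mu=500, sigma=300, n=10_000)

print(f"Total tokens: mean = {mean_total:,.0f}, std = {sd_total:,.0f}")
print(f"P(total > 5,050,000) = {p_over:.3f}")
```

Note that the standard deviation of the total grows like $\sqrt{n}$ while the mean grows like $n$, so the total becomes relatively more predictable as the workload grows.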

Multivariate Extension

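For i.i.d. random vectors $\mathbf{X}_1, \dots, \mathbf{X}_n$ with mean vector $\boldsymbol{\mu}$, the CLT extends to:

$$\sqrt{n}\,(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \Rightarrow \mathcal{N}(\mathbf{0}, \Sigma)$$

where $\Sigma$ is the covariance matrix. This underpins multivariate hypothesis tests on model output vectors.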
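A quick simulation can make the vector version concrete. The sketch below uses a hypothetical two-dimensional example built from exponentials (so both components are skewed and correlated) and checks two things: that the scaled mean vectors have covariance close to the true covariance matrix, and that their squared Mahalanobis distances behave like a chi-square with 2 degrees of freedom, which is what a Normal limit implies. All names and the seed are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility


def draw_vectors(size):
    """Draw correlated, skewed 2-D vectors (e1, e1 + e2) from independent Exponential(1)s."""
    e1 = rng.exponential(1.0, size=size)
    e2 = rng.exponential(1.0, size=size)
    return np.stack([e1, e1 + e2], axis=-1)


true_mean = np.array([1.0, 2.0])
true_cov = np.array([[1.0, 1.0],
                     [1.0, 2.0]])

n, n_reps = 200, 10_000
samples = draw_vectors((n_reps, n))                       # shape (n_reps, n, 2)
scaled = np.sqrt(n) * (samples.mean(axis=1) - true_mean)  # shape (n_reps, 2)

emp_cov = np.cov(scaled, rowvar=False)

# Squared Mahalanobis distance of each scaled mean vector from the origin
d2 = np.einsum("ij,jk,ik->i", scaled, np.linalg.inv(true_cov), scaled)
frac_outside = np.mean(d2 > stats.chi2.ppf(0.95, df=2))

print("Empirical covariance of scaled means:\n", emp_cov.round(2))
print(f"Fraction outside the 95% chi-square ellipse: {frac_outside:.3f}")
```

If the Normal approximation is working, the fraction outside the 95% ellipse should sit near 0.05.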

Assumptions and Honest Limitations

The CLT carries real requirements:

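Independence: Observations must be independent. If latency measurements are autocorrelated (one slow request causes the next to queue), the classical CLT does not apply directly. Versions of the theorem exist for weakly dependent data, but the effective sample size is smaller than $n$, and with positive autocorrelation the naive standard error $\sigma/\sqrt{n}$ understates the true uncertainty.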

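Finite variance: The Cauchy distribution has no finite variance (or even a mean), so no CLT applies to it. Heavy-tailed distributions with tail index $\alpha < 2$ in a power-law tail also have infinite variance, and their sample means do not converge to a Normal.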

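Sample size: For roughly symmetric distributions, $n \geq 30$ is usually sufficient. For highly skewed distributions like latency, you may need $n$ closer to 100 (where the batch means in the latency example are essentially Normal) before the Normal approximation is reliable.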

The n=30 rule is a rough guideline, not a law. If your raw data looks like a heavily skewed exponential, check the Normal approximation empirically before trusting it.
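One way to run that empirical check is to resample batches from the data you already have and see whether the standardized batch means behave like a standard Normal. The sketch below is a minimal version under stated assumptions: it treats the raw observations as independent (resampling with replacement would hide any autocorrelation), and it uses a shifted exponential as stand-in data. The function name `check_normal_approx` is hypothetical.

```python
import numpy as np
from scipy import stats


def check_normal_approx(raw, n, n_resamples=10_000, seed=0):
    """Resample batches of size n from raw data and summarize how Normal their means look."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(raw), size=(n_resamples, n))
    means = raw[idx].mean(axis=1)
    z = (means - raw.mean()) / (raw.std(ddof=1) / np.sqrt(n))
    return {
        "skewness": float(stats.skew(z)),         # 0 for a Normal
        "upper_tail": float(np.mean(z > 1.96)),   # about 0.025 for a Normal
        "lower_tail": float(np.mean(z < -1.96)),  # about 0.025 for a Normal
    }


# Stand-in data: 50,000 skewed "latencies" (exponential with scale 50 ms, plus 10 ms)
raw = np.random.default_rng(1).exponential(scale=50, size=50_000) + 10

small = check_normal_approx(raw, n=10)
large = check_normal_approx(raw, n=200)
print("n=10 :", small)
print("n=200:", large)
```

With right-skewed data and a small batch size, the upper tail tends to be heavier than 2.5% and the lower tail lighter. If both tails are close to 0.025 and the skewness is near zero, z-based intervals are probably reasonable at that batch size.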

The CLT builds directly on the law of large numbers, which guarantees sample means converge to the true mean; the CLT adds the shape of that convergence. It unlocks the z-test and t-test (next posts in this series), confidence intervals (post 11), and p-values (post 4); most classical inference tools rely on CLT-justified Normality somewhere. It also connects to the delta method for propagating uncertainty through nonlinear functions of means, which matters for comparing model metrics like F1 or AUC across experiments.

Test Your Understanding

  1. A model's per-batch cross-entropy loss has mean 0.4 and standard deviation 0.15 across $n$ batches. What is the standard error of the mean batch loss as a function of $n$, and what shape does the distribution of batch means have?
  2. You quadruple the batch size from 25 to 100. By what factor does the standard error of the mean change?
  3. Your raw prediction errors follow a Poisson distribution with rate $\lambda$. Is the CLT applicable for batch means with batch size $n$? What assumption must you verify?
  4. An engineer argues that because individual prediction errors are non-Normal, you cannot use z-scores to compare mean errors across two model versions with $n$ observations each. Is this argument correct? Why or why not?
  5. Two teams each run 50-batch experiments on different hardware. Team A's batches are independent. Team B's batches share a GPU queue, so slow batches cause subsequent batches to slow down. Which team can safely apply the CLT, and why?
