~/blog

Law of Large Numbers

Jun 21, 2026•11 min read•By Mohammed Vasim

StatisticsMathData Science

After six cross-validation folds, the sample mean of CV accuracy scores is 0.838. You report this as your model's expected performance. But the sample mean is a random variable — run six different folds and you'd get a different number. Why trust it as an estimate of the true population mean μ? The Law of Large Numbers (LLN) is the theorem that answers this question. It proves that as the sample size grows, the sample mean converges to the true mean. Without LLN, every use of x̄ to estimate μ would be unjustified.

Anchor: accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88] — six CV fold scores. For simulations: Bernoulli(p=0.85) draws, representing whether each fold score exceeds a threshold.

What LLN Answers

Two things are true simultaneously and must be held separately:

The sample mean x̄ is random. Draw a different sample of size n and you get a different x̄. The sample mean is a function of random data — it is itself a random variable.
The population mean μ is fixed. It's a property of the distribution, not the sample. It does not change across experiments.

LLN says: the gap |x̄ₙ − μ| shrinks to zero as n → ∞. The sample mean, despite being random, is guaranteed to converge to the fixed target μ. This is what makes statistical estimation valid.

Running means from the anchor (computing after each fold):

Folds observed	Values seen	Running mean x̄
1	[0.82]	0.820
2	[0.82, 0.79]	0.805
3	[0.82, 0.79, 0.91]	0.840
4	[0.82, 0.79, 0.91, 0.85]	0.843
5	[0.82, 0.79, 0.91, 0.85, 0.78]	0.830
6	[0.82, 0.79, 0.91, 0.85, 0.78, 0.88]	0.838

With only 6 folds the running mean still oscillates. LLN guarantees that if we ran 100 or 1000 folds, it would stabilize.

Weak Law of Large Numbers (WLLN)

Formal statement: For any ε > 0:

lim_{n→∞} P(|x̄ₙ − μ| > ε) = 0

In words: the probability that the sample mean deviates from μ by more than any fixed ε goes to zero as n grows.

"Weak" means convergence in probability — for each fixed n, there is still some chance of being far from μ, but this chance vanishes as n increases.

Derivation via Chebyshev's inequality:

Chebyshev's inequality states: for any random variable X with mean μ and variance σ²:

P(|X − μ| ≥ k) ≤ σ²/k²

Apply this to x̄ₙ. The sample mean of n i.i.d. draws has:

Mean: E[x̄ₙ] = μ
Variance: Var(x̄ₙ) = σ²/n

Setting k = ε:

P(|x̄ₙ − μ| ≥ ε) ≤ Var(x̄ₙ) / ε² = σ²/(nε²)

As n → ∞, σ²/(nε²) → 0 for any fixed σ² and ε. Therefore the probability goes to zero. ∎

Key observation: the only requirement is finite variance (σ² < ∞). No normality assumption is needed. LLN is distribution-free.

For the anchor: σ² ≈ 0.00228 (variance of the 6 fold scores). For ε = 0.05 (being within 5 percentage points of μ):

Upper bound on P(|x̄₆ − μ| > 0.05) ≤ 0.00228 / (6 × 0.05²) = 0.00228 / 0.015 ≈ 0.152

With 100 folds: 0.00228 / (100 × 0.0025) = 0.152 / 100 × 6 ≈ 0.009 — a 99.1% guarantee of being within 5pp of μ.

Strong Law of Large Numbers (SLLN)

Formal statement:

P(lim_{n→∞} x̄ₙ = μ) = 1

In words: with probability 1, the sample mean converges to μ. Not just probably — almost surely.

SLLN vs WLLN:

	Weak LLN	Strong LLN
Convergence type	In probability	Almost surely
Statement	P(\|x̄ₙ−μ\| > ε) → 0 as n→∞	P(x̄ₙ → μ) = 1
Requirement	Finite variance σ² < ∞	Finite first moment E[\|X\|] < ∞
Implication	SLLN → WLLN (but not vice versa)	Stronger claim

Intuition for the difference: Consider an infinite sequence of fair coin flips. WLLN says: for any fixed n, the probability that the heads proportion is more than ε away from 0.5 goes to zero as n grows. SLLN says: with probability 1, the entire infinite sequence of heads proportions eventually converges to 0.5 and stays there. SLLN is a statement about the whole trajectory, not just any single n.

Convergence Simulation

0.70 0.80 0.90

μ=0.85

1 20 40 60 80 100 Number of folds (n)

Running mean True mean μ Actual 6 folds

The first 6 amber dots are the actual anchor data running means. The blue line extends to 100 simulated folds — the oscillations shrink and the mean locks onto μ = 0.85.

LLN vs Central Limit Theorem

These theorems are complementary and both required for statistical inference. They answer different questions:

	Law of Large Numbers	Central Limit Theorem
Question	Does x̄ converge to μ?	How is x̄ distributed around μ?
Answer	Yes, x̄ → μ as n → ∞	x̄ ~ Normal(μ, σ²/n) for large n
What it gives	Convergence guarantee	Distribution shape of the estimate
Use case	Justifies using x̄ to estimate μ	Enables confidence intervals + hypothesis tests

LLN tells you the sample mean will eventually land on μ. CLT tells you how the distribution of the sample mean is shaped at any finite n — it's approximately Normal with variance σ²/n.

LLN: Convergence Story One trajectory, running mean over time

μ n → high low x̄ₙ locks onto μ as n grows

CLT: Distribution Story Many samples of size n, distribution of x̄

μ small n large n x̄ ~ Normal(μ, σ²/n) — tighter with larger n

Monte Carlo Simulation and LLN

LLN is the mathematical justification for Monte Carlo estimation. To compute E[f(X)] when the integral is analytically intractable, draw n samples from the distribution and average:

(1/n) Σᵢ f(xᵢ) → E[f(X)] as n → ∞

By LLN, this average converges to the true expectation. The accuracy of the estimate improves with more samples — which is why Monte Carlo is described as "embarrassingly parallelizable" and why doubling samples halves the variance of the estimate.

Concrete example: Estimate P(accuracy > 0.85) from the anchor distribution.

python

import numpy as np

rng = np.random.default_rng(42)
# Simulate accuracy scores: Normal(μ=0.838, σ=0.048)
# (using actual anchor mean and std)
mu, sigma = 0.838, 0.048

# Running Monte Carlo estimate of P(accuracy > 0.85)
samples = rng.normal(loc=mu, scale=sigma, size=10_000)
running_estimate = np.cumsum(samples > 0.85) / np.arange(1, 10_001)

print(f"n=10     estimate: {running_estimate[9]:.3f}")
print(f"n=100    estimate: {running_estimate[99]:.3f}")
print(f"n=1000   estimate: {running_estimate[999]:.3f}")
print(f"n=10000  estimate: {running_estimate[9999]:.3f}")
# Analytical: P(Z > (0.85-0.838)/0.048) = P(Z > 0.25) ≈ 0.401

n=10     estimate: 0.400
n=100    estimate: 0.410
n=1000   estimate: 0.398
n=10000  estimate: 0.401

The estimate converges to ≈0.40. With n=10 it's already close; with n=10,000 it's stable. LLN guarantees this convergence — the more samples, the tighter the bound from Chebyshev.

When LLN Fails

LLN requires: i.i.d. samples (independent and identically distributed) with finite mean. Three practical violations:

1. Infinite variance: Cauchy distribution

The Cauchy distribution has no defined mean or variance. The sample mean of Cauchy-distributed data does NOT converge — it remains a Cauchy distribution regardless of n. The running mean never stabilizes:

python

import numpy as np
rng = np.random.default_rng(0)

cauchy_samples = rng.standard_cauchy(size=2000)
running_mean_cauchy = np.cumsum(cauchy_samples) / np.arange(1, 2001)

# Compare: normal samples converge, Cauchy does not
normal_samples = rng.normal(0, 1, size=2000)
running_mean_normal = np.cumsum(normal_samples) / np.arange(1, 2001)

print(f"Normal running mean at n=2000: {running_mean_normal[-1]:.3f}  (converges to 0)")
print(f"Cauchy running mean at n=2000: {running_mean_cauchy[-1]:.3f}  (erratic, no convergence)")

Normal running mean at n=2000:  0.021  (converges to 0)
Cauchy running mean at n=2000: -7.843  (erratic, no convergence)

Heavy-tailed distributions in finance (returns), network traffic (packet sizes), and natural language (word frequencies) can exhibit Cauchy-like tail behavior. Sample means from such distributions converge slowly or not at all.

2. Non-identically distributed samples (concept drift)

If the distribution changes over time — seasonal user behavior, evolving model performance, shifting data sources — the samples are not identically distributed. The sample mean converges to a weighted combination of the changing distributions, not the target distribution's mean. This is why deploying a model trained six months ago on a product with strong seasonality can fail: the historical mean is not the current mean.

3. Dependent samples (autocorrelation)

LLN requires independence or at most weak dependence. In time series, spatial data, or any measurement where consecutive observations are correlated, the effective sample size is smaller than n. The variance of x̄ₙ does not decrease as fast as σ²/n — it's σ²×(1+2ρ)/(n) approximately (for AR(1) with autocorrelation ρ). LLN still holds under weak dependence, but the convergence is slower than independent sampling would suggest.

ML implications:

Train/test data collected from a changing user population: historical sample means underestimate or overestimate future behavior
Time series forecasting: sample statistics from correlated observations are less efficient than from i.i.d. data; confidence intervals that assume independence are too narrow
Evaluation: if CV folds are not independent (leakage), the variance of x̄ is underestimated and your convergence is slower than you think

The Gambler's Fallacy

LLN is commonly misread to support the Gambler's Fallacy. The fallacy: "I've seen 5 above-average folds in a row; the next must be below average to bring the mean down."

This is wrong.

LLN says: the long-run average converges to μ because n grows large, which makes individual deviations statistically negligible compared to the total. It does NOT say that past above-average values create a compensating pressure for future below-average values.

Each fold in a CV run is independent. A fold with accuracy 0.91 does not "owe" the system a future fold of 0.79. The average converges because the influence of any finite set of early outliers shrinks as n grows — not because the process self-corrects.

The correct intuition: with 1000 folds, 5 early high folds contribute 5/1000 of the total. Their effect is diluted, not canceled.

LLN and the Central Limit Theorem together form the foundation of frequentist statistical inference. LLN says x̄ converges to μ (the estimate is valid); CLT says x̄ is approximately Normal with variance σ²/n (the distribution of the estimate is known). Confidence intervals require both: LLN to justify that x̄ is a useful point estimate, CLT to construct the interval around it. Monte Carlo methods are applied LLN: replacing analytically intractable expectations with sample averages. The Chebyshev derivation of WLLN connects directly to concentration inequalities used in PAC learning theory to bound generalization error.

Honest Limitations

The Chebyshev bound is conservative — the actual probability of deviation is often much smaller than σ²/(nε²) implies. For distributions known to be Normal, Hoeffding's inequality or exact normal distribution calculations give tighter bounds. More importantly, LLN tells you convergence happens but not how fast it happens in practice. With a high-variance distribution, you may need very large n before the sample mean is reliably close to μ. The sample size required for a given precision depends on σ², and σ² must be estimated from data — adding another layer of uncertainty.

Test Your Understanding

The CV anchor has σ² ≈ 0.00228. Using the Chebyshev bound, how many folds are needed to guarantee P(|x̄ₙ − μ| > 0.02) ≤ 0.05? Show the derivation.
Explain in one paragraph why the Gambler's Fallacy is a misreading of LLN. What does LLN actually say about individual observations vs long-run averages?
A company collects user session durations. Session data is highly right-skewed. Can they still apply LLN to justify using the sample mean? What condition must hold, and what would make it fail?
The strong and weak laws differ in their convergence type. Give a concrete scenario where WLLN holds but SLLN would require more conditions — then state what additional conditions SLLN requires.
A Monte Carlo simulation uses 500 samples to estimate E[f(X)]. If you quadruple the sample count to 2000, by what factor does the variance of your estimate decrease? How does this relate to LLN?

Law of Large Numbers

What LLN Answers

Weak Law of Large Numbers (WLLN)

Strong Law of Large Numbers (SLLN)

Convergence Simulation

LLN vs Central Limit Theorem

Monte Carlo Simulation and LLN

When LLN Fails

The Gambler's Fallacy

Honest Limitations

Test Your Understanding

Comments (0)

Leave a comment

Law of Large Numbers

What LLN Answers

Weak Law of Large Numbers (WLLN)

Strong Law of Large Numbers (SLLN)

Convergence Simulation

LLN vs Central Limit Theorem

Monte Carlo Simulation and LLN

When LLN Fails

The Gambler's Fallacy

Related Concepts

Honest Limitations

Test Your Understanding

Comments (0)

Leave a comment