← View series: statistics
~/blog
Law of Large Numbers
After six cross-validation folds, the sample mean of CV accuracy scores is 0.838. You report this as your model's expected performance. But the sample mean is a random variable — run six different folds and you'd get a different number. Why trust it as an estimate of the true population mean μ? The Law of Large Numbers (LLN) is the theorem that answers this question. It proves that as the sample size grows, the sample mean converges to the true mean. Without LLN, every use of x̄ to estimate μ would be unjustified.
Anchor: accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88] — six CV fold scores. For simulations: Bernoulli(p=0.85) draws, representing whether each fold score exceeds a threshold.
What LLN Answers
Two things are true simultaneously and must be held separately:
-
The sample mean x̄ is random. Draw a different sample of size n and you get a different x̄. The sample mean is a function of random data — it is itself a random variable.
-
The population mean μ is fixed. It's a property of the distribution, not the sample. It does not change across experiments.
LLN says: the gap |x̄ₙ − μ| shrinks to zero as n → ∞. The sample mean, despite being random, is guaranteed to converge to the fixed target μ. This is what makes statistical estimation valid.
Running means from the anchor (computing after each fold):
| Folds observed | Values seen | Running mean x̄ |
|---|---|---|
| 1 | [0.82] | 0.820 |
| 2 | [0.82, 0.79] | 0.805 |
| 3 | [0.82, 0.79, 0.91] | 0.840 |
| 4 | [0.82, 0.79, 0.91, 0.85] | 0.843 |
| 5 | [0.82, 0.79, 0.91, 0.85, 0.78] | 0.830 |
| 6 | [0.82, 0.79, 0.91, 0.85, 0.78, 0.88] | 0.838 |
With only 6 folds the running mean still oscillates. LLN guarantees that if we ran 100 or 1000 folds, it would stabilize.
Weak Law of Large Numbers (WLLN)
Formal statement: For any ε > 0:
lim_{n→∞} P(|x̄ₙ − μ| > ε) = 0
In words: the probability that the sample mean deviates from μ by more than any fixed ε goes to zero as n grows.
"Weak" means convergence in probability — for each fixed n, there is still some chance of being far from μ, but this chance vanishes as n increases.
Derivation via Chebyshev's inequality:
Chebyshev's inequality states: for any random variable X with mean μ and variance σ²:
P(|X − μ| ≥ k) ≤ σ²/k²
Apply this to x̄ₙ. The sample mean of n i.i.d. draws has:
- Mean: E[x̄ₙ] = μ
- Variance: Var(x̄ₙ) = σ²/n
Setting k = ε:
P(|x̄ₙ − μ| ≥ ε) ≤ Var(x̄ₙ) / ε² = σ²/(nε²)
As n → ∞, σ²/(nε²) → 0 for any fixed σ² and ε. Therefore the probability goes to zero. ∎
Key observation: the only requirement is finite variance (σ² < ∞). No normality assumption is needed. LLN is distribution-free.
For the anchor: σ² ≈ 0.00228 (variance of the 6 fold scores). For ε = 0.05 (being within 5 percentage points of μ):
Upper bound on P(|x̄₆ − μ| > 0.05) ≤ 0.00228 / (6 × 0.05²) = 0.00228 / 0.015 ≈ 0.152
With 100 folds: 0.00228 / (100 × 0.0025) = 0.152 / 100 × 6 ≈ 0.009 — a 99.1% guarantee of being within 5pp of μ.
Strong Law of Large Numbers (SLLN)
Formal statement:
P(lim_{n→∞} x̄ₙ = μ) = 1
In words: with probability 1, the sample mean converges to μ. Not just probably — almost surely.
SLLN vs WLLN:
| Weak LLN | Strong LLN | |
|---|---|---|
| Convergence type | In probability | Almost surely |
| Statement | P(|x̄ₙ−μ| > ε) → 0 as n→∞ | P(x̄ₙ → μ) = 1 |
| Requirement | Finite variance σ² < ∞ | Finite first moment E[|X|] < ∞ |
| Implication | SLLN → WLLN (but not vice versa) | Stronger claim |
Intuition for the difference: Consider an infinite sequence of fair coin flips. WLLN says: for any fixed n, the probability that the heads proportion is more than ε away from 0.5 goes to zero as n grows. SLLN says: with probability 1, the entire infinite sequence of heads proportions eventually converges to 0.5 and stays there. SLLN is a statement about the whole trajectory, not just any single n.
Convergence Simulation
The first 6 amber dots are the actual anchor data running means. The blue line extends to 100 simulated folds — the oscillations shrink and the mean locks onto μ = 0.85.
LLN vs Central Limit Theorem
These theorems are complementary and both required for statistical inference. They answer different questions:
| Law of Large Numbers | Central Limit Theorem | |
|---|---|---|
| Question | Does x̄ converge to μ? | How is x̄ distributed around μ? |
| Answer | Yes, x̄ → μ as n → ∞ | x̄ ~ Normal(μ, σ²/n) for large n |
| What it gives | Convergence guarantee | Distribution shape of the estimate |
| Use case | Justifies using x̄ to estimate μ | Enables confidence intervals + hypothesis tests |
LLN tells you the sample mean will eventually land on μ. CLT tells you how the distribution of the sample mean is shaped at any finite n — it's approximately Normal with variance σ²/n.
Monte Carlo Simulation and LLN
LLN is the mathematical justification for Monte Carlo estimation. To compute E[f(X)] when the integral is analytically intractable, draw n samples from the distribution and average:
(1/n) Σᵢ f(xᵢ) → E[f(X)] as n → ∞
By LLN, this average converges to the true expectation. The accuracy of the estimate improves with more samples — which is why Monte Carlo is described as "embarrassingly parallelizable" and why doubling samples halves the variance of the estimate.
Concrete example: Estimate P(accuracy > 0.85) from the anchor distribution.
import numpy as np
rng = np.random.default_rng(42)
# Simulate accuracy scores: Normal(μ=0.838, σ=0.048)
# (using actual anchor mean and std)
mu, sigma = 0.838, 0.048
# Running Monte Carlo estimate of P(accuracy > 0.85)
samples = rng.normal(loc=mu, scale=sigma, size=10_000)
running_estimate = np.cumsum(samples > 0.85) / np.arange(1, 10_001)
print(f"n=10 estimate: {running_estimate[9]:.3f}")
print(f"n=100 estimate: {running_estimate[99]:.3f}")
print(f"n=1000 estimate: {running_estimate[999]:.3f}")
print(f"n=10000 estimate: {running_estimate[9999]:.3f}")
# Analytical: P(Z > (0.85-0.838)/0.048) = P(Z > 0.25) ≈ 0.401n=10 estimate: 0.400
n=100 estimate: 0.410
n=1000 estimate: 0.398
n=10000 estimate: 0.401
The estimate converges to ≈0.40. With n=10 it's already close; with n=10,000 it's stable. LLN guarantees this convergence — the more samples, the tighter the bound from Chebyshev.
When LLN Fails
LLN requires: i.i.d. samples (independent and identically distributed) with finite mean. Three practical violations:
1. Infinite variance: Cauchy distribution
The Cauchy distribution has no defined mean or variance. The sample mean of Cauchy-distributed data does NOT converge — it remains a Cauchy distribution regardless of n. The running mean never stabilizes:
import numpy as np
rng = np.random.default_rng(0)
cauchy_samples = rng.standard_cauchy(size=2000)
running_mean_cauchy = np.cumsum(cauchy_samples) / np.arange(1, 2001)
# Compare: normal samples converge, Cauchy does not
normal_samples = rng.normal(0, 1, size=2000)
running_mean_normal = np.cumsum(normal_samples) / np.arange(1, 2001)
print(f"Normal running mean at n=2000: {running_mean_normal[-1]:.3f} (converges to 0)")
print(f"Cauchy running mean at n=2000: {running_mean_cauchy[-1]:.3f} (erratic, no convergence)")Normal running mean at n=2000: 0.021 (converges to 0)
Cauchy running mean at n=2000: -7.843 (erratic, no convergence)
Heavy-tailed distributions in finance (returns), network traffic (packet sizes), and natural language (word frequencies) can exhibit Cauchy-like tail behavior. Sample means from such distributions converge slowly or not at all.
2. Non-identically distributed samples (concept drift)
If the distribution changes over time — seasonal user behavior, evolving model performance, shifting data sources — the samples are not identically distributed. The sample mean converges to a weighted combination of the changing distributions, not the target distribution's mean. This is why deploying a model trained six months ago on a product with strong seasonality can fail: the historical mean is not the current mean.
3. Dependent samples (autocorrelation)
LLN requires independence or at most weak dependence. In time series, spatial data, or any measurement where consecutive observations are correlated, the effective sample size is smaller than n. The variance of x̄ₙ does not decrease as fast as σ²/n — it's σ²×(1+2ρ)/(n) approximately (for AR(1) with autocorrelation ρ). LLN still holds under weak dependence, but the convergence is slower than independent sampling would suggest.
ML implications:
- Train/test data collected from a changing user population: historical sample means underestimate or overestimate future behavior
- Time series forecasting: sample statistics from correlated observations are less efficient than from i.i.d. data; confidence intervals that assume independence are too narrow
- Evaluation: if CV folds are not independent (leakage), the variance of x̄ is underestimated and your convergence is slower than you think
The Gambler's Fallacy
LLN is commonly misread to support the Gambler's Fallacy. The fallacy: "I've seen 5 above-average folds in a row; the next must be below average to bring the mean down."
This is wrong.
LLN says: the long-run average converges to μ because n grows large, which makes individual deviations statistically negligible compared to the total. It does NOT say that past above-average values create a compensating pressure for future below-average values.
Each fold in a CV run is independent. A fold with accuracy 0.91 does not "owe" the system a future fold of 0.79. The average converges because the influence of any finite set of early outliers shrinks as n grows — not because the process self-corrects.
The correct intuition: with 1000 folds, 5 early high folds contribute 5/1000 of the total. Their effect is diluted, not canceled.
Related Concepts
LLN and the Central Limit Theorem together form the foundation of frequentist statistical inference. LLN says x̄ converges to μ (the estimate is valid); CLT says x̄ is approximately Normal with variance σ²/n (the distribution of the estimate is known). Confidence intervals require both: LLN to justify that x̄ is a useful point estimate, CLT to construct the interval around it. Monte Carlo methods are applied LLN: replacing analytically intractable expectations with sample averages. The Chebyshev derivation of WLLN connects directly to concentration inequalities used in PAC learning theory to bound generalization error.
Honest Limitations
The Chebyshev bound is conservative — the actual probability of deviation is often much smaller than σ²/(nε²) implies. For distributions known to be Normal, Hoeffding's inequality or exact normal distribution calculations give tighter bounds. More importantly, LLN tells you convergence happens but not how fast it happens in practice. With a high-variance distribution, you may need very large n before the sample mean is reliably close to μ. The sample size required for a given precision depends on σ², and σ² must be estimated from data — adding another layer of uncertainty.
Test Your Understanding
-
The CV anchor has σ² ≈ 0.00228. Using the Chebyshev bound, how many folds are needed to guarantee P(|x̄ₙ − μ| > 0.02) ≤ 0.05? Show the derivation.
-
Explain in one paragraph why the Gambler's Fallacy is a misreading of LLN. What does LLN actually say about individual observations vs long-run averages?
-
A company collects user session durations. Session data is highly right-skewed. Can they still apply LLN to justify using the sample mean? What condition must hold, and what would make it fail?
-
The strong and weak laws differ in their convergence type. Give a concrete scenario where WLLN holds but SLLN would require more conditions — then state what additional conditions SLLN requires.
-
A Monte Carlo simulation uses 500 samples to estimate E[f(X)]. If you quadruple the sample count to 2000, by what factor does the variance of your estimate decrease? How does this relate to LLN?