← View series: statistics
~/blog
Sampling Distributions
You train a model and get x̄ = 0.855 mean accuracy across 6 CV folds. You use this to estimate the model's true accuracy μ. But if you re-ran the same experiment with different data splits, you would get a different x̄. The sample mean is not a fixed number — it is a random variable that follows its own distribution. That distribution is the sampling distribution, and understanding it is what separates black-box use of p-values from genuine understanding of statistical inference.
The Core Questions
- How much does x̄ vary across different samples of size n?
- What distribution does x̄ follow?
- How does n affect this variability?
Without answering these, confidence intervals and hypothesis tests are formulas you apply without understanding what they measure.
The DS/ML Anchor
Six CV fold accuracy scores: accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
Sample mean: x̄ = 0.855. Sample SD: s ≈ 0.048.
We treat this as one draw from the sampling distribution of the mean. We want to characterize all possible x̄ values we could observe.
Data Distribution vs Sampling Distribution
This distinction is the conceptual core of inferential statistics:
| Data Distribution | Sampling Distribution of x̄ | |
|---|---|---|
| What is plotted | Individual fold scores | All possible sample means |
| One observation | Single fold accuracy (e.g., 0.82) | Mean of 6 folds (e.g., 0.855) |
| Center | μ (true population mean) | μ (same) |
| Spread | σ = 0.048 (population SD) | σ/√n = 0.048/√6 ≈ 0.020 |
| Shape | Whatever the population has | Approaches Normal (CLT) |
| How to get it | Observe data | Repeat the study many times |
The sampling distribution is narrower than the data distribution by a factor of √n. With n=6, the sample mean varies 2.45× less than individual observations.
Simulation — Making the Abstract Concrete
The sampling distribution can be simulated directly. Draw 10,000 samples of size 6 from Normal(0.855, 0.048) and compute the mean of each:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
mu, sigma, n = 0.855, 0.048, 6
# Simulate 10,000 samples of size 6
sample_means = [rng.normal(mu, sigma, size=n).mean() for _ in range(10_000)]
mean_of_means = np.mean(sample_means)
sd_of_means = np.std(sample_means)
print(f"Mean of sample means: {mean_of_means:.4f} (expected μ = {mu})")
print(f"SD of sample means: {sd_of_means:.4f} (expected SE = {sigma/np.sqrt(n):.4f})")Mean of sample means: 0.8550 (expected μ = 0.855)
SD of sample means: 0.0196 (expected SE = 0.0196)
Two results confirmed by simulation:
- Unbiasedness: the mean of sample means ≈ μ. x̄ is an unbiased estimator of μ.
- SE formula: the SD of sample means ≈ σ/√n = 0.0196. This is the standard error.
Standard Error of the Mean
SE = σ/√n (known σ) or SE = s/√n (estimated from sample)
Derivation from first principles:
x̄ = (1/n) × (X₁ + X₂ + ... + X_n)
Var(x̄) = Var[(1/n) × ΣXᵢ] = (1/n²) × Var(ΣXᵢ)
= (1/n²) × n × σ² (using independence of X₁, ..., X_n)
= σ²/n
SE = √(Var(x̄)) = σ/√n
The independence assumption is critical — it fails for autocorrelated time series, clustered data, or repeated measures. In those cases, the formula underestimates the true SE.
Anchor: SE = 0.048/√6 ≈ 0.0196
Effect of sample size on SE:
SE vs SD — a required distinction:
| SD (s or σ) | SE (s/√n) | |
|---|---|---|
| What it measures | Spread of individual observations from the mean | Spread of sample means from the true mean |
| Decreases with more data? | No — converges to σ | Yes — shrinks as 1/√n |
| Answers | "How variable are individual fold scores?" | "How precisely does x̄ estimate μ?" |
Saying "SE = 0.02" does not mean the data has low variability — it means the sample mean is a precise estimate. The data could have SD = 0.048 and SE = 0.020 simultaneously.
Sampling Distribution for Proportions
When you compute p̂ = fraction of folds above a threshold, p̂ is also a sample statistic with its own sampling distribution.
Example: P(fold accuracy > 0.85) = p = 0.60 (true proportion). With n=50 folds:
- E[p̂] = p = 0.60 (unbiased)
- SE(p̂) = √(p(1−p)/n) = √(0.60×0.40/50) = 0.069
- Normal approximation valid when: np = 30 ≥ 10 and n(1−p) = 20 ≥ 10 ✓
Sampling Distribution for the Variance
The sample variance s² does not follow a normal sampling distribution — it follows a scaled chi-square:
(n−1)s² / σ² ~ χ²(n−1)
For the accuracy anchor (n=6, σ=0.048): (6−1)s²/0.048² = 5s²/0.002304 ~ χ²(5).
- E[s²] = σ² (unbiased — this is what Bessel's correction achieves)
- The distribution is right-skewed, not symmetric
- Confidence intervals for σ² are therefore asymmetric
The CLT Connection
Three distinct but related ideas, often confused:
| Concept | What it says | Level |
|---|---|---|
| Sampling distribution | Any statistic has a distribution across repeated samples | Concept |
| CLT | x̄'s sampling distribution is approximately Normal(μ, σ²/n) for large n | Theorem |
| Standard Error | The SD of that sampling distribution is σ/√n | Parameter |
The pipeline: Population (shape unknown, mean μ, SD σ) → draw n observations → compute x̄ → x̄ follows Normal(μ, σ²/n) by CLT, with spread measured by SE = σ/√n.
The CLT says which distribution (Normal). SE says how wide it is. Sampling distribution is the general concept that applies to any statistic (median, variance, ratio), not just the mean.
Code
import numpy as np
from scipy import stats
rng = np.random.default_rng(42)
mu, sigma, n = 0.855, 0.048, 6
# Simulate sampling distribution of x̄
sample_means = [rng.normal(mu, sigma, size=n).mean() for _ in range(10_000)]
print("Sampling distribution of x̄ (n=6):")
print(f" Mean of sample means: {np.mean(sample_means):.4f} (expected {mu})")
print(f" SD of sample means: {np.std(sample_means):.4f} (expected SE={sigma/np.sqrt(n):.4f})")
# SE for three sample sizes
print("\nSE shrinkage with n:")
for n_val in [6, 25, 100]:
se = sigma / np.sqrt(n_val)
print(f" n={n_val:3d}: SE = {sigma}/{np.sqrt(n_val):.4f} = {se:.4f}")
# Sampling distribution for proportion
p_true, n_prop = 0.60, 50
se_prop = np.sqrt(p_true * (1 - p_true) / n_prop)
print(f"\nSampling distribution of p̂ (p={p_true}, n={n_prop}):")
print(f" E[p̂] = {p_true} (unbiased)")
print(f" SE(p̂) = {se_prop:.4f}")
print(f" np={n_prop*p_true:.0f} >= 10: {n_prop*p_true >= 10} → Normal approx valid")
# Sampling distribution of variance: (n-1)s^2/sigma^2 ~ chi2(n-1)
n_var = 6
chi2_df = n_var - 1
dist_var = stats.chi2(df=chi2_df)
print(f"\nSampling distribution of (n-1)s²/σ² ~ chi2({chi2_df}):")
print(f" E[statistic] = {dist_var.mean():.1f} (= df = {chi2_df})")
print(f" Var[statistic] = {dist_var.var():.1f} (= 2*df = {2*chi2_df})")
print(f" Right-skewed: skewness = {2*np.sqrt(2/chi2_df):.3f}")
# Verify unbiasedness of s^2 via simulation
samples = rng.normal(mu, sigma, size=(10_000, n_var))
s2_vals = samples.var(axis=1, ddof=1)
print(f"\n E[s²] ≈ {s2_vals.mean():.6f} (σ² = {sigma**2:.6f}) — unbiased ✓")Sampling distribution of x̄ (n=6):
Mean of sample means: 0.8550 (expected 0.855)
SD of sample means: 0.0196 (expected SE=0.0196)
SE shrinkage with n:
n= 6: SE = 0.048/2.4495 = 0.0196
n= 25: SE = 0.048/5.0000 = 0.0096
n=100: SE = 0.048/10.0000 = 0.0048
Sampling distribution of p̂ (p=0.60, n=50):
E[p̂] = 0.6 (unbiased)
SE(p̂) = 0.0693
np=30 >= 10: True → Normal approx valid
Sampling distribution of (n-1)s²/σ² ~ chi2(5):
E[statistic] = 5.0 (= df = 5)
Var[statistic] = 10.0 (= 2*df = 10)
Right-skewed: skewness = 1.265
E[s²] ≈ 0.002305 (σ² = 0.002304) — unbiased ✓
ML Applications
1 — CV accuracy is a sample mean. The mean accuracy across k CV folds has SE = σ_folds/√k. Comparing two models on k=5 folds gives estimates with SE ≈ 0.02 — a 1% accuracy difference may not be statistically significant. More folds → smaller SE → more power to detect real differences.
2 — A/B test metrics. The difference in CTR between model A and model B is a sample statistic. Its sampling distribution determines whether the observed difference could arise by chance. The SE of the difference is √(SE_A² + SE_B²), which shrinks as the number of users grows.
3 — Bootstrap as an empirical sampling distribution. When the population distribution is unknown, we cannot compute SE analytically. The bootstrap simulates the sampling distribution by resampling from the observed data: draw n samples with replacement 10,000 times, compute x̄ each time, take the SD of the bootstrap means. This SD approximates SE without any parametric assumptions.
Limitations
- Independence required: the derivation SE = σ/√n uses Var(ΣXᵢ) = nσ². This is valid only when X₁, ..., X_n are independent. For time series, spatial data, or hierarchical data, SE is underestimated — leading to anti-conservative inference.
- σ unknown in practice: we use s/√n instead of σ/√n, which introduces additional uncertainty. For small n, this additional uncertainty matters — the t-distribution (not Normal) correctly accounts for it.
- CLT requires large enough n: the sampling distribution of x̄ is only approximately Normal. For highly skewed populations, n may need to be 50+ before the approximation is adequate.
Test Your Understanding
-
The 6-fold accuracy data has x̄=0.855 and s=0.048. Compute SE and state what it measures. Now suppose you increase to n=24 folds. By what factor does SE decrease? What would SE be?
-
Derive Var(x̄) = σ²/n starting from x̄ = (1/n)ΣXᵢ. At what step does the independence of X₁,...,Xₙ enter the derivation? What goes wrong if observations are correlated?
-
A colleague says "SE=0.0196 means the model's accuracy is highly consistent." Is this correct? What does SE=0.0196 actually measure, and what statistic would you use to describe consistency of individual fold scores?
-
You want to estimate P(accuracy > 0.90) from test data with n=30. Compute SE(p̂) if the true proportion is 0.15. Check whether the Normal approximation is valid. What would you use instead if it isn't?
-
Two models are evaluated on k=6 folds each. Model A: x̄_A=0.855, s_A=0.048. Model B: x̄_B=0.870, s_B=0.051. The difference in means is 0.015. Compute the SE of the difference (assuming independence) and state whether you can conclude Model B is better.