Sampling Distributions

Apr 17, 2026 · 10 min read · By Mohammed Vasim
Statistics · Math · Data Science

You train a model and get x̄ = 0.855 mean accuracy across 6 CV folds. You use this to estimate the model's true accuracy μ. But if you re-ran the same experiment with different data splits, you would get a different x̄. The sample mean is not a fixed number — it is a random variable that follows its own distribution. That distribution is the sampling distribution, and understanding it is what separates black-box use of p-values from genuine understanding of statistical inference.

The Core Questions

  • How much does x̄ vary across different samples of size n?
  • What distribution does x̄ follow?
  • How does n affect this variability?

Without answering these, confidence intervals and hypothesis tests are formulas you apply without understanding what they measure.

The DS/ML Anchor

Six CV fold accuracy scores: accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

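Computed directly from these six scores, the sample mean is x̄ ≈ 0.838 and the sample SD is s ≈ 0.051. The rest of this post works with the anchor values x̄ = 0.855 and s ≈ 0.048.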

We treat this as one draw from the sampling distribution of the mean. We want to characterize all possible x̄ values we could observe.

Data Distribution vs Sampling Distribution

This distinction is the conceptual core of inferential statistics:

| | Data Distribution | Sampling Distribution of x̄ |
|---|---|---|
| What is plotted | Individual fold scores | All possible sample means |
| One observation | Single fold accuracy (e.g., 0.82) | Mean of 6 folds (e.g., 0.855) |
| Center | μ (true population mean) | μ (same) |
| Spread | σ = 0.048 (population SD) | σ/√n = 0.048/√6 ≈ 0.020 |
| Shape | Whatever the population has | Approaches Normal (CLT) |
| How to get it | Observe data | Repeat the study many times |

The sampling distribution is narrower than the data distribution by a factor of √n. With n=6, the sample mean varies 2.45× less than individual observations.

[Figure: Data distribution (σ = 0.048) vs sampling distribution of x̄ (SE = 0.020) on the accuracy axis from 0.70 to 1.00. Both are centered at μ = 0.855, but the sampling distribution is narrower by a factor of √6.]

Simulation — Making the Abstract Concrete

The sampling distribution can be simulated directly. Draw 10,000 samples of size 6 from Normal(0.855, 0.048) and compute the mean of each:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n = 0.855, 0.048, 6

# Simulate 10,000 samples of size 6
sample_means = [rng.normal(mu, sigma, size=n).mean() for _ in range(10_000)]

mean_of_means = np.mean(sample_means)
sd_of_means   = np.std(sample_means)

print(f"Mean of sample means: {mean_of_means:.4f}   (expected μ = {mu})")
print(f"SD of sample means:   {sd_of_means:.4f}   (expected SE = {sigma/np.sqrt(n):.4f})")
```

Output:

```
Mean of sample means: 0.8550   (expected μ = 0.855)
SD of sample means:   0.0196   (expected SE = 0.0196)
```

Two results confirmed by simulation:

  1. Unbiasedness: the mean of sample means ≈ μ. x̄ is an unbiased estimator of μ.
  2. SE formula: the SD of sample means ≈ σ/√n = 0.0196. This is the standard error.

Standard Error of the Mean

SE = σ/√n (known σ) or SE = s/√n (estimated from sample)

Derivation from first principles:

x̄ = (1/n) × (X₁ + X₂ + ... + X_n)

Var(x̄) = Var[(1/n) × ΣXᵢ] = (1/n²) × Var(ΣXᵢ)

= (1/n²) × n × σ² (independence of X₁, ..., Xₙ makes the variances add, and each Xᵢ has variance σ²)

= σ²/n

SE = √(Var(x̄)) = σ/√n

The independence assumption is critical. It fails for autocorrelated time series, clustered data, and repeated measures. When observations are positively correlated, which is the usual case in those settings, σ/√n underestimates the true SE.
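To get a feel for how far off the formula can be under dependence, here is a minimal simulation sketch. It assumes an AR(1) series with lag-1 correlation ρ = 0.6, marginal SD 1, and length n = 50; these are illustrative choices, not values from the fold data. It compares the naive σ/√n with the actual SD of x̄ across many simulated series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: AR(1) series with lag-1 correlation rho,
# marginal SD sigma, and series length n.
n, rho, sigma = 50, 0.6, 1.0
n_sims = 20_000

# Innovation SD chosen so that every X_t has marginal variance sigma^2
innov_sd = sigma * np.sqrt(1 - rho**2)

x = np.empty((n_sims, n))
x[:, 0] = rng.normal(0.0, sigma, size=n_sims)
for t in range(1, n):
    x[:, t] = rho * x[:, t - 1] + rng.normal(0.0, innov_sd, size=n_sims)

naive_se = sigma / np.sqrt(n)               # what the formula gives under independence
empirical_se = x.mean(axis=1).std(ddof=1)   # actual spread of the sample mean

print(f"Naive SE (sigma/sqrt(n)): {naive_se:.4f}")
print(f"Empirical SE of the mean: {empirical_se:.4f}")
print(f"Ratio empirical/naive:    {empirical_se / naive_se:.2f}")
```

With these settings the empirical SE comes out at roughly twice the naive value: for an AR(1) process and large n, the variance of x̄ is inflated by about (1+ρ)/(1−ρ), which is 4 when ρ = 0.6.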

Anchor: SE = 0.048/√6 ≈ 0.0196

Effect of sample size on SE:

[Figure: SE shrinks as n grows; quadrupling n halves SE. On the accuracy axis: n=6 gives SE=0.020, n=25 gives SE=0.0096, n=100 gives SE=0.0048.]

SE vs SD — a required distinction:

| | SD (s or σ) | SE (s/√n) |
|---|---|---|
| What it measures | Spread of individual observations from the mean | Spread of sample means from the true mean |
| Decreases with more data? | No (converges to σ) | Yes (shrinks as 1/√n) |
| Answers | "How variable are individual fold scores?" | "How precisely does x̄ estimate μ?" |

Saying "SE = 0.02" does not mean the data has low variability — it means the sample mean is a precise estimate. The data could have SD = 0.048 and SE = 0.020 simultaneously.
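The difference shows up clearly as n grows. The sketch below treats the anchor values μ = 0.855 and σ = 0.048 as the true parameters of a Normal population (an assumption made for illustration) and draws one sample at each of several sizes. The sample SD settles near σ, while SE keeps shrinking.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.855, 0.048   # anchor values, treated here as the true parameters

results = {}
for n in [6, 60, 600, 6000]:
    sample = rng.normal(mu, sigma, size=n)
    sd = sample.std(ddof=1)    # spread of individual observations
    se = sd / np.sqrt(n)       # precision of the sample mean
    results[n] = (sd, se)
    print(f"n={n:5d}  SD={sd:.4f}  SE={se:.5f}")
```

The SD column should hover around 0.048 at every sample size (noisily for n=6), while the SE column drops by roughly √10 ≈ 3.2× with each tenfold increase in n.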

Sampling Distribution for Proportions

When you compute p̂ = fraction of folds above a threshold, p̂ is also a sample statistic with its own sampling distribution.

Example: P(fold accuracy > 0.85) = p = 0.60 (true proportion). With n=50 folds:

  • E[p̂] = p = 0.60 (unbiased)
  • SE(p̂) = √(p(1−p)/n) = √(0.60×0.40/50) = 0.069
  • Normal approximation valid when: np = 30 ≥ 10 and n(1−p) = 20 ≥ 10 ✓

[Figure: Sampling distribution of p̂ (n=50, p=0.60), centered at 0.60 with SE=0.069. The Normal approximation is valid since np=30 ≥ 10 and n(1−p)=20 ≥ 10.]
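
The same kind of simulation used for x̄ works for p̂. This is a minimal sketch assuming each of the n = 50 folds independently exceeds the threshold with probability p = 0.60, so the count of successes is Binomial(n, p). The number of repetitions is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 0.60, 50
n_sims = 100_000   # number of simulated studies (arbitrary)

# Each draw: number of folds (out of n) above the threshold, divided by n
p_hat = rng.binomial(n, p, size=n_sims) / n

theoretical_se = np.sqrt(p * (1 - p) / n)
print(f"Mean of p-hat: {p_hat.mean():.4f}  (expected {p})")
print(f"SD of p-hat:   {p_hat.std(ddof=1):.4f}  (expected SE = {theoretical_se:.4f})")
```

The simulated mean and SD of p̂ should land very close to 0.60 and 0.069, matching the two formulas above.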

Sampling Distribution for the Variance

If the population is Normal, the sample variance s² does not follow a Normal sampling distribution. Instead, a scaled version of it follows a chi-square distribution:

(n−1)s² / σ² ~ χ²(n−1)

For the accuracy anchor (n=6, σ=0.048): (6−1)s²/0.048² = 5s²/0.002304 ~ χ²(5).

  • E[s²] = σ² (unbiased — this is what Bessel's correction achieves)
  • The distribution is right-skewed, not symmetric
  • Confidence intervals for σ² are therefore asymmetric

[Figure: Sampling distribution of (n−1)s²/σ² ~ χ²(5), with mean 5 (= df) and mode 3. The right skew leads to asymmetric CIs for σ².]
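
The asymmetry can be seen by inverting the chi-square pivot into an interval for σ². This is a sketch, assuming the fold scores come from a Normal population and using the anchor values n = 6 and s = 0.048. The 95% level is an illustrative choice.

```python
import numpy as np
from scipy import stats

n, s = 6, 0.048    # anchor values
df = n - 1
alpha = 0.05       # 95% interval (illustrative choice)

# Quantiles of chi2(df): P(q_lo <= (n-1)s^2/sigma^2 <= q_hi) = 1 - alpha
q_lo, q_hi = stats.chi2.ppf([alpha / 2, 1 - alpha / 2], df)

# Invert the pivot to get bounds on sigma^2
var_low = df * s**2 / q_hi
var_high = df * s**2 / q_lo

print(f"s^2 = {s**2:.6f}")
print(f"95% CI for sigma^2: ({var_low:.6f}, {var_high:.6f})")
print(f"95% CI for sigma:   ({np.sqrt(var_low):.4f}, {np.sqrt(var_high):.4f})")
print(f"Distance below s^2: {s**2 - var_low:.6f}, above s^2: {var_high - s**2:.6f}")
```

The upper bound sits much farther from s² than the lower bound does, which is exactly what the right skew of χ²(5) implies.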

The CLT Connection

Three distinct but related ideas, often confused:

| Concept | What it says | Level |
|---|---|---|
| Sampling distribution | Any statistic has a distribution across repeated samples | Concept |
| CLT | x̄'s sampling distribution is approximately Normal(μ, σ²/n) for large n | Theorem |
| Standard Error | The SD of that sampling distribution is σ/√n | Parameter |

The pipeline: Population (shape unknown, mean μ, SD σ) → draw n observations → compute x̄ → for large n, x̄ approximately follows Normal(μ, σ²/n) by the CLT, with spread measured by SE = σ/√n.

The CLT says which distribution (Normal). SE says how wide it is. Sampling distribution is the general concept that applies to any statistic (median, variance, ratio), not just the mean.
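
A quick way to watch the CLT at work is to start from a clearly non-Normal population. The sketch below assumes an Exponential(1) population (mean 1, SD 1, skewness 2), chosen only because it is strongly right-skewed, and tracks the skewness of x̄ as n grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims = 20_000   # simulated samples per sample size (arbitrary)

skews = {}
for n in [2, 10, 50]:
    # Exponential(1) population: mean 1, SD 1, skewness 2
    means = rng.exponential(scale=1.0, size=(n_sims, n)).mean(axis=1)
    skews[n] = stats.skew(means)
    print(f"n={n:3d}  SD of sample means={means.std(ddof=1):.4f}  "
          f"(sigma/sqrt(n)={1 / np.sqrt(n):.4f})  skewness={skews[n]:.3f}")
```

The SE formula holds at every n, but the skewness of x̄ only fades gradually (in theory as 2/√n for this population), which is why "large enough n" depends on how skewed the population is.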

Code

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu, sigma, n = 0.855, 0.048, 6

# Simulate sampling distribution of x̄
sample_means = [rng.normal(mu, sigma, size=n).mean() for _ in range(10_000)]
print("Sampling distribution of x̄ (n=6):")
print(f"  Mean of sample means: {np.mean(sample_means):.4f}  (expected {mu})")
print(f"  SD of sample means:   {np.std(sample_means):.4f}  (expected SE={sigma/np.sqrt(n):.4f})")

# SE for three sample sizes
print("\nSE shrinkage with n:")
for n_val in [6, 25, 100]:
    se = sigma / np.sqrt(n_val)
    print(f"  n={n_val:3d}: SE = {sigma}/{np.sqrt(n_val):.4f} = {se:.4f}")

# Sampling distribution for proportion
p_true, n_prop = 0.60, 50
se_prop = np.sqrt(p_true * (1 - p_true) / n_prop)
print(f"\nSampling distribution of p̂ (p={p_true}, n={n_prop}):")
print(f"  E[p̂] = {p_true}  (unbiased)")
print(f"  SE(p̂) = {se_prop:.4f}")
print(f"  np={n_prop*p_true:.0f} >= 10: {n_prop*p_true >= 10}  → Normal approx valid")

# Sampling distribution of variance: (n-1)s^2/sigma^2 ~ chi2(n-1)
n_var = 6
chi2_df = n_var - 1
dist_var = stats.chi2(df=chi2_df)
print(f"\nSampling distribution of (n-1)s²/σ² ~ chi2({chi2_df}):")
print(f"  E[statistic] = {dist_var.mean():.1f}  (= df = {chi2_df})")
print(f"  Var[statistic] = {dist_var.var():.1f}  (= 2*df = {2*chi2_df})")
print(f"  Right-skewed: skewness = {2*np.sqrt(2/chi2_df):.3f}")

# Verify unbiasedness of s^2 via simulation
samples = rng.normal(mu, sigma, size=(10_000, n_var))
s2_vals = samples.var(axis=1, ddof=1)
print(f"\n  E[s²] ≈ {s2_vals.mean():.6f}  (σ² = {sigma**2:.6f}) — unbiased ✓")
```

Output:

```
Sampling distribution of x̄ (n=6):
  Mean of sample means: 0.8550  (expected 0.855)
  SD of sample means:   0.0196  (expected SE=0.0196)

SE shrinkage with n:
  n=  6: SE = 0.048/2.4495 = 0.0196
  n= 25: SE = 0.048/5.0000 = 0.0096
  n=100: SE = 0.048/10.0000 = 0.0048

Sampling distribution of p̂ (p=0.6, n=50):
  E[p̂] = 0.6  (unbiased)
  SE(p̂) = 0.0693
  np=30 >= 10: True  → Normal approx valid

Sampling distribution of (n-1)s²/σ² ~ chi2(5):
  E[statistic] = 5.0  (= df = 5)
  Var[statistic] = 10.0  (= 2*df = 10)
  Right-skewed: skewness = 1.265

  E[s²] ≈ 0.002305  (σ² = 0.002304) — unbiased ✓
```

ML Applications

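1 — CV accuracy is a sample mean. The mean accuracy across k CV folds has SE ≈ σ_folds/√k. With k=5 folds and a fold-to-fold SD near 0.048, SE ≈ 0.02, so a 1% accuracy difference between two models may not be statistically significant. More folds give a smaller nominal SE and more power to detect real differences, but fold scores share training data and are therefore not independent, so σ_folds/√k can understate the true uncertainty and is best read as a rough guide.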

2 — A/B test metrics. The difference in CTR between model A and model B is a sample statistic. Its sampling distribution determines whether the observed difference could arise by chance. When the two user groups are independent, the SE of the difference is √(SE_A² + SE_B²), which shrinks as the number of users grows.

3 — Bootstrap as an empirical sampling distribution. When a statistic has no convenient SE formula, or you do not want to rely on distributional assumptions, the bootstrap approximates the sampling distribution by resampling from the observed data: draw n observations with replacement, compute x̄, repeat 10,000 times, and take the SD of the bootstrap means. This SD approximates SE without assuming a parametric form for the population, though it still treats the observations as independent draws and can be unreliable for very small n.
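
A minimal sketch of the bootstrap procedure just described, applied to the six fold scores listed in the anchor section. Both the bootstrap SE and the s/√n comparison are computed from those scores themselves; the seed and the 10,000 resamples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
n = len(accuracy)
n_boot = 10_000

# Resample n scores with replacement, n_boot times, and take the mean of each resample
idx = rng.integers(0, n, size=(n_boot, n))
boot_means = accuracy[idx].mean(axis=1)

boot_se = boot_means.std(ddof=1)                  # SD of the bootstrap means
formula_se = accuracy.std(ddof=1) / np.sqrt(n)    # s/sqrt(n) from the same scores

print(f"Bootstrap SE: {boot_se:.4f}")
print(f"s/sqrt(n):    {formula_se:.4f}")
```

With only six observations the bootstrap SE tends to come out slightly below s/√n, because resampling from the data effectively uses the n-denominator variance. Both numbers should be treated as rough at this sample size.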

Limitations

  • Independence required: the derivation SE = σ/√n uses Var(ΣXᵢ) = nσ², which holds only when X₁, ..., Xₙ are independent (or at least uncorrelated). For time series, spatial data, or hierarchical data, where observations are usually positively correlated, the formula underestimates SE and inference becomes anti-conservative.
  • σ unknown in practice: we use s/√n instead of σ/√n, which introduces additional uncertainty. For small n, this additional uncertainty matters — the t-distribution (not Normal) correctly accounts for it.
  • CLT requires large enough n: the sampling distribution of x̄ is only approximately Normal. For highly skewed populations, n may need to be 50+ before the approximation is adequate.
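
The second limitation is easy to quantify. The sketch below uses the anchor values (x̄ = 0.855, s = 0.048, n = 6) and compares a Normal-based 95% interval with a t-based one; it assumes the fold scores are independent draws from a roughly Normal population.

```python
import numpy as np
from scipy import stats

x_bar, s, n = 0.855, 0.048, 6   # anchor values
se = s / np.sqrt(n)

z_crit = stats.norm.ppf(0.975)          # Normal critical value
t_crit = stats.t.ppf(0.975, df=n - 1)   # t critical value with n-1 = 5 df

z_ci = (x_bar - z_crit * se, x_bar + z_crit * se)
t_ci = (x_bar - t_crit * se, x_bar + t_crit * se)

print(f"SE = {se:.4f}")
print(f"z critical = {z_crit:.3f}, 95% CI = ({z_ci[0]:.3f}, {z_ci[1]:.3f})")
print(f"t critical = {t_crit:.3f}, 95% CI = ({t_ci[0]:.3f}, {t_ci[1]:.3f})")
```

With 5 degrees of freedom the t critical value is about 2.57 versus 1.96 for the Normal, so the t interval is roughly 30% wider. That extra width is the price of estimating σ from only six observations.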

Test Your Understanding

  1. The 6-fold accuracy data has x̄=0.855 and s=0.048. Compute SE and state what it measures. Now suppose you increase to n=24 folds. By what factor does SE decrease? What would SE be?

  2. Derive Var(x̄) = σ²/n starting from x̄ = (1/n)ΣXᵢ. At what step does the independence of X₁,...,Xₙ enter the derivation? What goes wrong if observations are correlated?

  3. A colleague says "SE=0.0196 means the model's accuracy is highly consistent." Is this correct? What does SE=0.0196 actually measure, and what statistic would you use to describe consistency of individual fold scores?

  4. You want to estimate P(accuracy > 0.90) from test data with n=30. Compute SE(p̂) if the true proportion is 0.15. Check whether the Normal approximation is valid. What would you use instead if it isn't?

  5. Two models are evaluated on k=6 folds each. Model A: x̄_A=0.855, s_A=0.048. Model B: x̄_B=0.870, s_B=0.051. The difference in means is 0.015. Compute the SE of the difference (assuming independence) and state whether you can conclude Model B is better.
