~/blog

Sampling Distributions

Apr 17, 2026•10 min read•By Mohammed Vasim

StatisticsMathData Science

You train a model and get x̄ = 0.855 mean accuracy across 6 CV folds. You use this to estimate the model's true accuracy μ. But if you re-ran the same experiment with different data splits, you would get a different x̄. The sample mean is not a fixed number — it is a random variable that follows its own distribution. That distribution is the sampling distribution, and understanding it is what separates black-box use of p-values from genuine understanding of statistical inference.

The Core Questions

How much does x̄ vary across different samples of size n?
What distribution does x̄ follow?
How does n affect this variability?

Without answering these, confidence intervals and hypothesis tests are formulas you apply without understanding what they measure.

The DS/ML Anchor

Six CV fold accuracy scores: accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

Sample mean: x̄ = 0.855. Sample SD: s ≈ 0.048.

We treat this as one draw from the sampling distribution of the mean. We want to characterize all possible x̄ values we could observe.

Data Distribution vs Sampling Distribution

This distinction is the conceptual core of inferential statistics:

	Data Distribution	Sampling Distribution of x̄
What is plotted	Individual fold scores	All possible sample means
One observation	Single fold accuracy (e.g., 0.82)	Mean of 6 folds (e.g., 0.855)
Center	μ (true population mean)	μ (same)
Spread	σ = 0.048 (population SD)	σ/√n = 0.048/√6 ≈ 0.020
Shape	Whatever the population has	Approaches Normal (CLT)
How to get it	Observe data	Repeat the study many times

The sampling distribution is narrower than the data distribution by a factor of √n. With n=6, the sample mean varies 2.45× less than individual observations.

Simulation — Making the Abstract Concrete

The sampling distribution can be simulated directly. Draw 10,000 samples of size 6 from Normal(0.855, 0.048) and compute the mean of each:

python

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
mu, sigma, n = 0.855, 0.048, 6

# Simulate 10,000 samples of size 6
sample_means = [rng.normal(mu, sigma, size=n).mean() for _ in range(10_000)]

mean_of_means = np.mean(sample_means)
sd_of_means   = np.std(sample_means)

print(f"Mean of sample means: {mean_of_means:.4f}   (expected μ = {mu})")
print(f"SD of sample means:   {sd_of_means:.4f}   (expected SE = {sigma/np.sqrt(n):.4f})")

Mean of sample means: 0.8550   (expected μ = 0.855)
SD of sample means:   0.0196   (expected SE = 0.0196)

Two results confirmed by simulation:

Unbiasedness: the mean of sample means ≈ μ. x̄ is an unbiased estimator of μ.
SE formula: the SD of sample means ≈ σ/√n = 0.0196. This is the standard error.

Standard Error of the Mean

SE = σ/√n (known σ) or SE = s/√n (estimated from sample)

Derivation from first principles:

x̄ = (1/n) × (X₁ + X₂ + ... + X_n)

Var(x̄) = Var[(1/n) × ΣXᵢ] = (1/n²) × Var(ΣXᵢ)

= (1/n²) × n × σ² (using independence of X₁, ..., X_n)

= σ²/n

SE = √(Var(x̄)) = σ/√n

The independence assumption is critical — it fails for autocorrelated time series, clustered data, or repeated measures. In those cases, the formula underestimates the true SE.

Anchor: SE = 0.048/√6 ≈ 0.0196

Effect of sample size on SE:

SE vs SD — a required distinction:

	SD (s or σ)	SE (s/√n)
What it measures	Spread of individual observations from the mean	Spread of sample means from the true mean
Decreases with more data?	No — converges to σ	Yes — shrinks as 1/√n
Answers	"How variable are individual fold scores?"	"How precisely does x̄ estimate μ?"

Saying "SE = 0.02" does not mean the data has low variability — it means the sample mean is a precise estimate. The data could have SD = 0.048 and SE = 0.020 simultaneously.

Sampling Distribution for Proportions

When you compute p̂ = fraction of folds above a threshold, p̂ is also a sample statistic with its own sampling distribution.

Example: P(fold accuracy > 0.85) = p = 0.60 (true proportion). With n=50 folds:

E[p̂] = p = 0.60 (unbiased)
SE(p̂) = √(p(1−p)/n) = √(0.60×0.40/50) = 0.069
Normal approximation valid when: np = 30 ≥ 10 and n(1−p) = 20 ≥ 10 ✓

Sampling Distribution for the Variance

The sample variance s² does not follow a normal sampling distribution — it follows a scaled chi-square:

(n−1)s² / σ² ~ χ²(n−1)

For the accuracy anchor (n=6, σ=0.048): (6−1)s²/0.048² = 5s²/0.002304 ~ χ²(5).

E[s²] = σ² (unbiased — this is what Bessel's correction achieves)
The distribution is right-skewed, not symmetric
Confidence intervals for σ² are therefore asymmetric

The CLT Connection

Three distinct but related ideas, often confused:

Concept	What it says	Level
Sampling distribution	Any statistic has a distribution across repeated samples	Concept
CLT	x̄'s sampling distribution is approximately Normal(μ, σ²/n) for large n	Theorem
Standard Error	The SD of that sampling distribution is σ/√n	Parameter

The pipeline: Population (shape unknown, mean μ, SD σ) → draw n observations → compute x̄ → x̄ follows Normal(μ, σ²/n) by CLT, with spread measured by SE = σ/√n.

The CLT says which distribution (Normal). SE says how wide it is. Sampling distribution is the general concept that applies to any statistic (median, variance, ratio), not just the mean.

Code

python

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu, sigma, n = 0.855, 0.048, 6

# Simulate sampling distribution of x̄
sample_means = [rng.normal(mu, sigma, size=n).mean() for _ in range(10_000)]
print("Sampling distribution of x̄ (n=6):")
print(f"  Mean of sample means: {np.mean(sample_means):.4f}  (expected {mu})")
print(f"  SD of sample means:   {np.std(sample_means):.4f}  (expected SE={sigma/np.sqrt(n):.4f})")

# SE for three sample sizes
print("\nSE shrinkage with n:")
for n_val in [6, 25, 100]:
    se = sigma / np.sqrt(n_val)
    print(f"  n={n_val:3d}: SE = {sigma}/{np.sqrt(n_val):.4f} = {se:.4f}")

# Sampling distribution for proportion
p_true, n_prop = 0.60, 50
se_prop = np.sqrt(p_true * (1 - p_true) / n_prop)
print(f"\nSampling distribution of p̂ (p={p_true}, n={n_prop}):")
print(f"  E[p̂] = {p_true}  (unbiased)")
print(f"  SE(p̂) = {se_prop:.4f}")
print(f"  np={n_prop*p_true:.0f} >= 10: {n_prop*p_true >= 10}  → Normal approx valid")

# Sampling distribution of variance: (n-1)s^2/sigma^2 ~ chi2(n-1)
n_var = 6
chi2_df = n_var - 1
dist_var = stats.chi2(df=chi2_df)
print(f"\nSampling distribution of (n-1)s²/σ² ~ chi2({chi2_df}):")
print(f"  E[statistic] = {dist_var.mean():.1f}  (= df = {chi2_df})")
print(f"  Var[statistic] = {dist_var.var():.1f}  (= 2*df = {2*chi2_df})")
print(f"  Right-skewed: skewness = {2*np.sqrt(2/chi2_df):.3f}")

# Verify unbiasedness of s^2 via simulation
samples = rng.normal(mu, sigma, size=(10_000, n_var))
s2_vals = samples.var(axis=1, ddof=1)
print(f"\n  E[s²] ≈ {s2_vals.mean():.6f}  (σ² = {sigma**2:.6f}) — unbiased ✓")

Sampling distribution of x̄ (n=6):
  Mean of sample means: 0.8550  (expected 0.855)
  SD of sample means:   0.0196  (expected SE=0.0196)

SE shrinkage with n:
  n=  6: SE = 0.048/2.4495 = 0.0196
  n= 25: SE = 0.048/5.0000 = 0.0096
  n=100: SE = 0.048/10.0000 = 0.0048

Sampling distribution of p̂ (p=0.60, n=50):
  E[p̂] = 0.6  (unbiased)
  SE(p̂) = 0.0693
  np=30 >= 10: True  → Normal approx valid

Sampling distribution of (n-1)s²/σ² ~ chi2(5):
  E[statistic] = 5.0  (= df = 5)
  Var[statistic] = 10.0  (= 2*df = 10)
  Right-skewed: skewness = 1.265

  E[s²] ≈ 0.002305  (σ² = 0.002304) — unbiased ✓

ML Applications

1 — CV accuracy is a sample mean. The mean accuracy across k CV folds has SE = σ_folds/√k. Comparing two models on k=5 folds gives estimates with SE ≈ 0.02 — a 1% accuracy difference may not be statistically significant. More folds → smaller SE → more power to detect real differences.

2 — A/B test metrics. The difference in CTR between model A and model B is a sample statistic. Its sampling distribution determines whether the observed difference could arise by chance. The SE of the difference is √(SE_A² + SE_B²), which shrinks as the number of users grows.

3 — Bootstrap as an empirical sampling distribution. When the population distribution is unknown, we cannot compute SE analytically. The bootstrap simulates the sampling distribution by resampling from the observed data: draw n samples with replacement 10,000 times, compute x̄ each time, take the SD of the bootstrap means. This SD approximates SE without any parametric assumptions.

Limitations

Independence required: the derivation SE = σ/√n uses Var(ΣXᵢ) = nσ². This is valid only when X₁, ..., X_n are independent. For time series, spatial data, or hierarchical data, SE is underestimated — leading to anti-conservative inference.
σ unknown in practice: we use s/√n instead of σ/√n, which introduces additional uncertainty. For small n, this additional uncertainty matters — the t-distribution (not Normal) correctly accounts for it.
CLT requires large enough n: the sampling distribution of x̄ is only approximately Normal. For highly skewed populations, n may need to be 50+ before the approximation is adequate.

Test Your Understanding

The 6-fold accuracy data has x̄=0.855 and s=0.048. Compute SE and state what it measures. Now suppose you increase to n=24 folds. By what factor does SE decrease? What would SE be?
Derive Var(x̄) = σ²/n starting from x̄ = (1/n)ΣXᵢ. At what step does the independence of X₁,...,Xₙ enter the derivation? What goes wrong if observations are correlated?
A colleague says "SE=0.0196 means the model's accuracy is highly consistent." Is this correct? What does SE=0.0196 actually measure, and what statistic would you use to describe consistency of individual fold scores?
You want to estimate P(accuracy > 0.90) from test data with n=30. Compute SE(p̂) if the true proportion is 0.15. Check whether the Normal approximation is valid. What would you use instead if it isn't?
Two models are evaluated on k=6 folds each. Model A: x̄_A=0.855, s_A=0.048. Model B: x̄_B=0.870, s_B=0.051. The difference in means is 0.015. Compute the SE of the difference (assuming independence) and state whether you can conclude Model B is better.