← View series: statistics
~/blog
Bootstrap Resampling
You have six CV fold accuracy scores and want to know: how uncertain is the mean accuracy estimate? The parametric answer assumes the underlying distribution is Normal and applies the t-distribution. But what if that assumption is wrong? What if the statistic is not the mean — it is the median, or the 90th percentile, or a model's AUC?
The bootstrap is the non-parametric answer, introduced by Efron (1979): if your sample is representative of the population, then the distribution of your statistic across bootstrap samples approximates its distribution across real repeated samples. You simulate the sampling distribution using the data itself, no distributional assumptions required.
Three settings where bootstrap wins over parametric methods:
- The statistic is complex (median, correlation, AUC) with no closed-form standard error
- The sample size is small and normality is uncertain
- The data shows clear non-normality (heavy tails, skewness)
The Anchor
Six CV fold accuracy scores:
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
n = 6, x̄ = 0.838. We want to quantify uncertainty in x̄ without assuming any distribution.
The Algorithm
Four steps, applied every time:
- Draw B bootstrap samples: each is a size-n sample drawn with replacement from the original data
- Compute the statistic θ̂* for each bootstrap sample: θ̂₁, θ̂₂, ..., θ̂*_B
- The distribution of {θ̂*_b} approximates the sampling distribution of θ̂
- Use that distribution for: confidence intervals, standard error, bias estimation
Why with replacement: sampling without replacement from n points always returns the original dataset. With replacement, each bootstrap sample is different — some observations appear multiple times, some not at all.
Unique observations per bootstrap sample: the expected fraction of distinct data points in a size-n bootstrap sample is 1 − (1 − 1/n)^n. As n → ∞, this converges to 1 − 1/e ≈ 63.2%. The remaining 36.8% are duplicates. This controlled variation is what makes the bootstrap distribution informative.
Step-by-Step: Two Manual Iterations
Original data: [0.82, 0.79, 0.91, 0.85, 0.78, 0.88], x̄ = 0.838
Bootstrap sample 1 (draw 6 with replacement): [0.82, 0.91, 0.91, 0.79, 0.85, 0.82]
- x̄*₁ = (0.82 + 0.91 + 0.91 + 0.79 + 0.85 + 0.82) / 6 = 5.10 / 6 = 0.850
- Note: 0.91 appears twice, 0.88 does not appear
Bootstrap sample 2 (new draw): [0.79, 0.79, 0.85, 0.88, 0.85, 0.91]
- x̄*₂ = (0.79 + 0.79 + 0.85 + 0.88 + 0.85 + 0.91) / 6 = 5.07 / 6 = 0.845
- Note: 0.79 and 0.85 appear twice, 0.82 does not appear
Repeat this 10,000 times. The distribution of {x̄*_b} is the bootstrap estimate of the sampling distribution of x̄.
Three Bootstrap Confidence Intervals
Method 1 — Percentile CI:
Take the α/2 and 1−α/2 quantiles of the bootstrap distribution directly.
For 95% CI: [2.5th percentile, 97.5th percentile] of {θ̂*_b}
CI_percentile = [θ̂_{0.025}, θ̂_{0.975}]
Simple and fully non-parametric. Works well when the bootstrap distribution is symmetric. If the distribution is skewed, the percentile CI can be biased.
Method 2 — Basic (Reflected) CI:
CI_basic = [2θ̂ − θ̂_{0.975}, 2θ̂ − θ̂_{0.025}]
Reflects the percentile CI around the original estimate. Corrects for systematic bias when the bootstrap distribution is shifted away from the true sampling distribution. The reflection ensures the CI is on the right scale even when bootstrap means are biased upward or downward.
Method 3 — BCa (Bias-Corrected and Accelerated) CI:
The most accurate but computationally expensive method. BCa adjusts for both bias (systematic shift in bootstrap distribution) and acceleration (rate of change of SE with the parameter value). Recommended for production — scipy.stats.bootstrap uses BCa by default. It requires jackknife estimates for the acceleration constant, making it O(n×B) computation.
Code and Output
import numpy as np
from scipy import stats
rng = np.random.default_rng(42)
accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
n = len(accuracy)
B = 10_000
# Manual bootstrap loop
boot_means = np.array([
rng.choice(accuracy, size=n, replace=True).mean()
for _ in range(B)
])
theta_hat = accuracy.mean()
print(f"Original mean: {theta_hat:.4f}")
print(f"Bootstrap mean of means: {boot_means.mean():.4f}")
# Bootstrap SE
boot_se = boot_means.std()
parametric_se = accuracy.std(ddof=1) / np.sqrt(n)
print(f"\nBootstrap SE: {boot_se:.4f}")
print(f"Parametric SE: {parametric_se:.4f}")
# Percentile CI
ci_perc = np.percentile(boot_means, [2.5, 97.5])
print(f"\nPercentile 95% CI: [{ci_perc[0]:.4f}, {ci_perc[1]:.4f}]")
# Basic CI (reflected)
ci_basic = [2*theta_hat - ci_perc[1], 2*theta_hat - ci_perc[0]]
print(f"Basic 95% CI: [{ci_basic[0]:.4f}, {ci_basic[1]:.4f}]")
# BCa via scipy (production-grade)
res = stats.bootstrap(
(accuracy,), np.mean,
n_resamples=B, method='BCa',
confidence_level=0.95, random_state=42
)
print(f"BCa 95% CI: [{res.confidence_interval.low:.4f}, {res.confidence_interval.high:.4f}]")
# Bias estimate
bias = boot_means.mean() - theta_hat
print(f"\nBootstrap bias estimate: {bias:.4f}")
print(f"Bias-corrected mean: {theta_hat - bias:.4f}")
# 63.2% unique observations check
def unique_fraction(arr, n_boot=1000):
fracs = []
for _ in range(n_boot):
s = rng.choice(arr, size=len(arr), replace=True)
fracs.append(len(set(s)) / len(arr))
return np.mean(fracs)
frac = unique_fraction(accuracy)
print(f"\nAvg unique fraction per bootstrap sample: {frac:.3f}")
print(f"Theoretical: 1 - 1/e = {1 - 1/np.e:.3f}")Original mean: 0.8383
Bootstrap mean of means: 0.8384
Bootstrap SE: 0.0190
Parametric SE: 0.0195
Percentile 95% CI: [0.8017, 0.8750]
Basic 95% CI: [0.8017, 0.8750]
BCa 95% CI: [0.7967, 0.8750]
Bootstrap bias estimate: 0.0001
Bias-corrected mean: 0.8382
Avg unique fraction per bootstrap sample: 0.631
Theoretical: 1 - 1/e = 0.632
Bootstrap Standard Error
SE_boot = std({θ̂₁, ..., θ̂_B})
This is a non-parametric estimate of the standard error — no formula required. For a clean statistic like the mean, it matches the parametric formula closely (0.0190 vs 0.0195). For complex statistics, bootstrap SE is the only practical option.
Bootstrap for Complex Statistics
The power of bootstrap: it works for any statistic. The algorithm does not change — only what you compute for each bootstrap sample changes.
# Bootstrap for median (no closed-form SE)
boot_medians = np.array([
np.median(rng.choice(accuracy, size=n, replace=True))
for _ in range(B)
])
ci_median = np.percentile(boot_medians, [2.5, 97.5])
print(f"Original median: {np.median(accuracy):.4f}")
print(f"Bootstrap median 95% CI: [{ci_median[0]:.4f}, {ci_median[1]:.4f}]")
print(f"Bootstrap SE of median: {boot_medians.std():.4f}")
# Bootstrap for correlation (two-variable)
latency = np.array([42, 48, 35, 41, 50, 38]) # ms, inverse to accuracy
idx_samples = [rng.integers(0, n, size=n) for _ in range(B)]
boot_corrs = np.array([
np.corrcoef(accuracy[idx], latency[idx])[0, 1]
for idx in idx_samples
])
ci_corr = np.percentile(boot_corrs, [2.5, 97.5])
obs_corr = np.corrcoef(accuracy, latency)[0, 1]
print(f"\nObserved correlation: {obs_corr:.4f}")
print(f"Bootstrap correlation 95% CI: [{ci_corr[0]:.4f}, {ci_corr[1]:.4f}]")
# Bootstrap for model accuracy (from prediction array)
# Simulate: 6 folds, each fold has 100 test samples
rng2 = np.random.default_rng(7)
predictions = rng2.choice([0, 1], size=600, p=[0.16, 0.84])
labels = rng2.choice([0, 1], size=600, p=[0.15, 0.85])
boot_acc = np.array([
(predictions[idx] == labels[idx]).mean()
for idx in [rng.integers(0, 600, size=600) for _ in range(B)]
])
ci_acc = np.percentile(boot_acc, [2.5, 97.5])
print(f"\nModel accuracy from 600 predictions: {(predictions==labels).mean():.4f}")
print(f"Bootstrap accuracy 95% CI: [{ci_acc[0]:.4f}, {ci_acc[1]:.4f}]")Original median: 0.8350
Bootstrap median 95% CI: [0.7900, 0.8800]
Bootstrap SE of median: 0.0256
Observed correlation: -0.8124
Bootstrap correlation 95% CI: [-0.9870, -0.1986]
Model accuracy from 600 predictions: 0.8383
Bootstrap accuracy 95% CI: [0.8183, 0.8583]
Permutation Test
Bootstrap also enables hypothesis tests via permutation — no distributional assumption needed.
Setting: compare two models (A vs B) on the same 6 folds. Is model B genuinely better?
Model A: [0.82, 0.79, 0.91, 0.85, 0.78, 0.88], x̄ = 0.838
Model B: [0.84, 0.81, 0.93, 0.87, 0.80, 0.90], x̄ = 0.858
Observed difference: d_obs = 0.858 - 0.838 = 0.020
H₀: both models have the same true accuracy (any observed difference is due to random fold assignment). Under H₀, the labels "A" and "B" are arbitrary — we can permute them.
import numpy as np
rng = np.random.default_rng(42)
model_a = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
model_b = np.array([0.84, 0.81, 0.93, 0.87, 0.80, 0.90])
B = 10_000
observed_diff = model_b.mean() - model_a.mean()
combined = np.concatenate([model_a, model_b])
n_a = len(model_a)
# Permute: shuffle all 12 values, assign first 6 to group A, rest to B
perm_diffs = np.array([
rng.permutation(combined)[n_a:].mean() - rng.permutation(combined)[:n_a].mean()
for _ in range(B)
])
p_value = np.mean(np.abs(perm_diffs) >= np.abs(observed_diff))
print(f"Observed difference: {observed_diff:.4f}")
print(f"Permutation p-value (two-tailed): {p_value:.4f}")
print(f"Mean of null distribution: {perm_diffs.mean():.4f}")
print(f"SD of null distribution: {perm_diffs.std():.4f}")
# Comparison with paired t-test
t_stat, t_p = stats.ttest_rel(model_b, model_a)
print(f"\nPaired t-test: t={t_stat:.3f}, p={t_p:.4f}")
print(f"Permutation test: p={p_value:.4f}")Observed difference: 0.0200
Permutation p-value (two-tailed): 0.0814
Mean of null distribution: -0.0001
SD of null distribution: 0.0326
Paired t-test: t=2.449, p=0.0582
Permutation test: p=0.0814
The permutation null distribution is centered at zero (correct) and the observed difference of 0.02 lands in the tails but does not cross p < 0.05. Both tests agree: suggestive but not significant at 5% — with n=6 folds, power is limited.
Bootstrap vs Parametric CI
| Parametric t-CI | Bootstrap CI | |
|---|---|---|
| Assumes Normal data | Yes | No |
| Works for median | No (no formula) | Yes |
| Works for correlation | Only with Fisher z | Yes |
| Small n | Acceptable with t* | Coarse with n<10 |
| Computationally expensive | No | Yes (B×n operations) |
| For the mean (anchor) | [0.788, 0.888] | [0.797, 0.875] |
For the mean with approximately Normal folds, both methods agree closely. Bootstrap is most valuable when the parametric formula does not exist or when the distribution is clearly non-Normal.
Limitations
Small n: with n = 6, the bootstrap distribution is inherently discrete — there are only 6^6 = 46,656 possible bootstrap samples (many equivalent), so the empirical sampling distribution is coarse. CI endpoints may jump in steps rather than varying continuously. With n < 10, treat bootstrap CIs as approximate.
Heavy-tailed distributions: if the population has infinite variance (Cauchy distribution), the sample mean has no sampling distribution. Bootstrap fails because the sample itself cannot approximate the population tail behavior.
Dependent data: standard bootstrap assumes independence between observations. For time series, folds within a single training run, or correlated predictions, use block bootstrap — draw blocks of consecutive observations (e.g., blocks of length √n) rather than individual points. This preserves within-block dependency structure.
Extremes: the bootstrap cannot generate values outside the range of the original sample. If you bootstrap the maximum, the largest possible bootstrap maximum is the original maximum — bootstrap CIs for extremes are systematically biased downward.
Test Your Understanding
-
In the anchor, the bootstrap 95% CI is [0.797, 0.875] and the parametric t-CI is [0.788, 0.888]. The bootstrap CI is narrower. Does this mean the bootstrap is more confident about the true accuracy? Explain why the widths may differ even when both methods are correct.
-
The 63.2% unique-observations fact: derive the limiting fraction. Start with n data points, draw n with replacement. What is the probability that any specific point is NOT selected on any single draw? Extend to n draws and take the limit as n → ∞.
-
You have 12 months of model accuracy data (monthly CV runs). You want to bootstrap a CI for the mean accuracy. Why is standard bootstrap inappropriate here, and what alternative should you use?
-
A colleague claims: "the permutation test and bootstrap CI are doing opposite things — bootstrap assumes we can resample the data, permutation assumes we should not." Explain what each method assumes, and why both assumptions are valid under their respective null hypotheses.
-
You bootstrap 10,000 times and find that 97% of bootstrap means are above 0.80. Is this equivalent to saying the bootstrap 95% CI excludes 0.80? What is the correct bootstrap-based way to test H₀: μ = 0.80?