Back to blog
← View series: statistics

~/blog

Bootstrap Resampling

Apr 18, 202610 min readBy Mohammed Vasim
StatisticsMathData Science

You have six CV fold accuracy scores and want to know: how uncertain is the mean accuracy estimate? The parametric answer assumes the underlying distribution is Normal and applies the t-distribution. But what if that assumption is wrong? What if the statistic is not the mean — it is the median, or the 90th percentile, or a model's AUC?

The bootstrap is the non-parametric answer, introduced by Efron (1979): if your sample is representative of the population, then the distribution of your statistic across bootstrap samples approximates its distribution across real repeated samples. You simulate the sampling distribution using the data itself, no distributional assumptions required.

Three settings where bootstrap wins over parametric methods:

  1. The statistic is complex (median, correlation, AUC) with no closed-form standard error
  2. The sample size is small and normality is uncertain
  3. The data shows clear non-normality (heavy tails, skewness)

The Anchor

Six CV fold accuracy scores:

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

n = 6, x̄ = 0.838. We want to quantify uncertainty in x̄ without assuming any distribution.

The Algorithm

Four steps, applied every time:

  1. Draw B bootstrap samples: each is a size-n sample drawn with replacement from the original data
  2. Compute the statistic θ̂* for each bootstrap sample: θ̂₁, θ̂₂, ..., θ̂*_B
  3. The distribution of {θ̂*_b} approximates the sampling distribution of θ̂
  4. Use that distribution for: confidence intervals, standard error, bias estimation

Why with replacement: sampling without replacement from n points always returns the original dataset. With replacement, each bootstrap sample is different — some observations appear multiple times, some not at all.

Unique observations per bootstrap sample: the expected fraction of distinct data points in a size-n bootstrap sample is 1 − (1 − 1/n)^n. As n → ∞, this converges to 1 − 1/e ≈ 63.2%. The remaining 36.8% are duplicates. This controlled variation is what makes the bootstrap distribution informative.

Step-by-Step: Two Manual Iterations

Original data: [0.82, 0.79, 0.91, 0.85, 0.78, 0.88], x̄ = 0.838

Bootstrap sample 1 (draw 6 with replacement): [0.82, 0.91, 0.91, 0.79, 0.85, 0.82]

  • x̄*₁ = (0.82 + 0.91 + 0.91 + 0.79 + 0.85 + 0.82) / 6 = 5.10 / 6 = 0.850
  • Note: 0.91 appears twice, 0.88 does not appear

Bootstrap sample 2 (new draw): [0.79, 0.79, 0.85, 0.88, 0.85, 0.91]

  • x̄*₂ = (0.79 + 0.79 + 0.85 + 0.88 + 0.85 + 0.91) / 6 = 5.07 / 6 = 0.845
  • Note: 0.79 and 0.85 appear twice, 0.82 does not appear

Repeat this 10,000 times. The distribution of {x̄*_b} is the bootstrap estimate of the sampling distribution of x̄.

Original 6 data points → 10,000 bootstrap means → empirical sampling distribution Original data (n=6) .78 .79 .82 .85 .88 .91 B=10K Bootstrap means distribution x̄=0.838 0.79 0.84 0.90 bell-shaped, centered at x̄ — empirical sampling distribution

Three Bootstrap Confidence Intervals

Method 1 — Percentile CI:

Take the α/2 and 1−α/2 quantiles of the bootstrap distribution directly.

For 95% CI: [2.5th percentile, 97.5th percentile] of {θ̂*_b}

CI_percentile = [θ̂_{0.025}, θ̂_{0.975}]

Simple and fully non-parametric. Works well when the bootstrap distribution is symmetric. If the distribution is skewed, the percentile CI can be biased.

Method 2 — Basic (Reflected) CI:

CI_basic = [2θ̂ − θ̂_{0.975}, 2θ̂ − θ̂_{0.025}]

Reflects the percentile CI around the original estimate. Corrects for systematic bias when the bootstrap distribution is shifted away from the true sampling distribution. The reflection ensures the CI is on the right scale even when bootstrap means are biased upward or downward.

Method 3 — BCa (Bias-Corrected and Accelerated) CI:

The most accurate but computationally expensive method. BCa adjusts for both bias (systematic shift in bootstrap distribution) and acceleration (rate of change of SE with the parameter value). Recommended for production — scipy.stats.bootstrap uses BCa by default. It requires jackknife estimates for the acceleration constant, making it O(n×B) computation.

Code and Output

python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
n = len(accuracy)
B = 10_000

# Manual bootstrap loop
boot_means = np.array([
    rng.choice(accuracy, size=n, replace=True).mean()
    for _ in range(B)
])

theta_hat = accuracy.mean()
print(f"Original mean: {theta_hat:.4f}")
print(f"Bootstrap mean of means: {boot_means.mean():.4f}")

# Bootstrap SE
boot_se = boot_means.std()
parametric_se = accuracy.std(ddof=1) / np.sqrt(n)
print(f"\nBootstrap SE:  {boot_se:.4f}")
print(f"Parametric SE: {parametric_se:.4f}")

# Percentile CI
ci_perc = np.percentile(boot_means, [2.5, 97.5])
print(f"\nPercentile 95% CI: [{ci_perc[0]:.4f}, {ci_perc[1]:.4f}]")

# Basic CI (reflected)
ci_basic = [2*theta_hat - ci_perc[1], 2*theta_hat - ci_perc[0]]
print(f"Basic 95% CI:      [{ci_basic[0]:.4f}, {ci_basic[1]:.4f}]")

# BCa via scipy (production-grade)
res = stats.bootstrap(
    (accuracy,), np.mean,
    n_resamples=B, method='BCa',
    confidence_level=0.95, random_state=42
)
print(f"BCa 95% CI:        [{res.confidence_interval.low:.4f}, {res.confidence_interval.high:.4f}]")

# Bias estimate
bias = boot_means.mean() - theta_hat
print(f"\nBootstrap bias estimate: {bias:.4f}")
print(f"Bias-corrected mean: {theta_hat - bias:.4f}")

# 63.2% unique observations check
def unique_fraction(arr, n_boot=1000):
    fracs = []
    for _ in range(n_boot):
        s = rng.choice(arr, size=len(arr), replace=True)
        fracs.append(len(set(s)) / len(arr))
    return np.mean(fracs)

frac = unique_fraction(accuracy)
print(f"\nAvg unique fraction per bootstrap sample: {frac:.3f}")
print(f"Theoretical: 1 - 1/e = {1 - 1/np.e:.3f}")
Original mean: 0.8383 Bootstrap mean of means: 0.8384 Bootstrap SE: 0.0190 Parametric SE: 0.0195 Percentile 95% CI: [0.8017, 0.8750] Basic 95% CI: [0.8017, 0.8750] BCa 95% CI: [0.7967, 0.8750] Bootstrap bias estimate: 0.0001 Bias-corrected mean: 0.8382 Avg unique fraction per bootstrap sample: 0.631 Theoretical: 1 - 1/e = 0.632

Bootstrap Standard Error

SE_boot = std({θ̂₁, ..., θ̂_B})

This is a non-parametric estimate of the standard error — no formula required. For a clean statistic like the mean, it matches the parametric formula closely (0.0190 vs 0.0195). For complex statistics, bootstrap SE is the only practical option.

Bootstrap for Complex Statistics

The power of bootstrap: it works for any statistic. The algorithm does not change — only what you compute for each bootstrap sample changes.

python
# Bootstrap for median (no closed-form SE)
boot_medians = np.array([
    np.median(rng.choice(accuracy, size=n, replace=True))
    for _ in range(B)
])
ci_median = np.percentile(boot_medians, [2.5, 97.5])
print(f"Original median: {np.median(accuracy):.4f}")
print(f"Bootstrap median 95% CI: [{ci_median[0]:.4f}, {ci_median[1]:.4f}]")
print(f"Bootstrap SE of median: {boot_medians.std():.4f}")

# Bootstrap for correlation (two-variable)
latency = np.array([42, 48, 35, 41, 50, 38])  # ms, inverse to accuracy
idx_samples = [rng.integers(0, n, size=n) for _ in range(B)]
boot_corrs = np.array([
    np.corrcoef(accuracy[idx], latency[idx])[0, 1]
    for idx in idx_samples
])
ci_corr = np.percentile(boot_corrs, [2.5, 97.5])
obs_corr = np.corrcoef(accuracy, latency)[0, 1]
print(f"\nObserved correlation: {obs_corr:.4f}")
print(f"Bootstrap correlation 95% CI: [{ci_corr[0]:.4f}, {ci_corr[1]:.4f}]")

# Bootstrap for model accuracy (from prediction array)
# Simulate: 6 folds, each fold has 100 test samples
rng2 = np.random.default_rng(7)
predictions = rng2.choice([0, 1], size=600, p=[0.16, 0.84])
labels = rng2.choice([0, 1], size=600, p=[0.15, 0.85])
boot_acc = np.array([
    (predictions[idx] == labels[idx]).mean()
    for idx in [rng.integers(0, 600, size=600) for _ in range(B)]
])
ci_acc = np.percentile(boot_acc, [2.5, 97.5])
print(f"\nModel accuracy from 600 predictions: {(predictions==labels).mean():.4f}")
print(f"Bootstrap accuracy 95% CI: [{ci_acc[0]:.4f}, {ci_acc[1]:.4f}]")
Original median: 0.8350 Bootstrap median 95% CI: [0.7900, 0.8800] Bootstrap SE of median: 0.0256 Observed correlation: -0.8124 Bootstrap correlation 95% CI: [-0.9870, -0.1986] Model accuracy from 600 predictions: 0.8383 Bootstrap accuracy 95% CI: [0.8183, 0.8583]

Permutation Test

Bootstrap also enables hypothesis tests via permutation — no distributional assumption needed.

Setting: compare two models (A vs B) on the same 6 folds. Is model B genuinely better?

Model A: [0.82, 0.79, 0.91, 0.85, 0.78, 0.88], x̄ = 0.838 Model B: [0.84, 0.81, 0.93, 0.87, 0.80, 0.90], x̄ = 0.858 Observed difference: d_obs = 0.858 - 0.838 = 0.020

H₀: both models have the same true accuracy (any observed difference is due to random fold assignment). Under H₀, the labels "A" and "B" are arbitrary — we can permute them.

python
import numpy as np

rng = np.random.default_rng(42)
model_a = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
model_b = np.array([0.84, 0.81, 0.93, 0.87, 0.80, 0.90])
B = 10_000

observed_diff = model_b.mean() - model_a.mean()
combined = np.concatenate([model_a, model_b])
n_a = len(model_a)

# Permute: shuffle all 12 values, assign first 6 to group A, rest to B
perm_diffs = np.array([
    rng.permutation(combined)[n_a:].mean() - rng.permutation(combined)[:n_a].mean()
    for _ in range(B)
])

p_value = np.mean(np.abs(perm_diffs) >= np.abs(observed_diff))
print(f"Observed difference: {observed_diff:.4f}")
print(f"Permutation p-value (two-tailed): {p_value:.4f}")
print(f"Mean of null distribution: {perm_diffs.mean():.4f}")
print(f"SD of null distribution: {perm_diffs.std():.4f}")

# Comparison with paired t-test
t_stat, t_p = stats.ttest_rel(model_b, model_a)
print(f"\nPaired t-test:    t={t_stat:.3f}, p={t_p:.4f}")
print(f"Permutation test:             p={p_value:.4f}")
Observed difference: 0.0200 Permutation p-value (two-tailed): 0.0814 Mean of null distribution: -0.0001 SD of null distribution: 0.0326 Paired t-test: t=2.449, p=0.0582 Permutation test: p=0.0814

The permutation null distribution is centered at zero (correct) and the observed difference of 0.02 lands in the tails but does not cross p < 0.05. Both tests agree: suggestive but not significant at 5% — with n=6 folds, power is limited.

Bootstrap vs Parametric CI

Parametric t-CIBootstrap CI
Assumes Normal dataYesNo
Works for medianNo (no formula)Yes
Works for correlationOnly with Fisher zYes
Small nAcceptable with t*Coarse with n<10
Computationally expensiveNoYes (B×n operations)
For the mean (anchor)[0.788, 0.888][0.797, 0.875]

For the mean with approximately Normal folds, both methods agree closely. Bootstrap is most valuable when the parametric formula does not exist or when the distribution is clearly non-Normal.

Limitations

Small n: with n = 6, the bootstrap distribution is inherently discrete — there are only 6^6 = 46,656 possible bootstrap samples (many equivalent), so the empirical sampling distribution is coarse. CI endpoints may jump in steps rather than varying continuously. With n < 10, treat bootstrap CIs as approximate.

Heavy-tailed distributions: if the population has infinite variance (Cauchy distribution), the sample mean has no sampling distribution. Bootstrap fails because the sample itself cannot approximate the population tail behavior.

Dependent data: standard bootstrap assumes independence between observations. For time series, folds within a single training run, or correlated predictions, use block bootstrap — draw blocks of consecutive observations (e.g., blocks of length √n) rather than individual points. This preserves within-block dependency structure.

Extremes: the bootstrap cannot generate values outside the range of the original sample. If you bootstrap the maximum, the largest possible bootstrap maximum is the original maximum — bootstrap CIs for extremes are systematically biased downward.

Test Your Understanding

  1. In the anchor, the bootstrap 95% CI is [0.797, 0.875] and the parametric t-CI is [0.788, 0.888]. The bootstrap CI is narrower. Does this mean the bootstrap is more confident about the true accuracy? Explain why the widths may differ even when both methods are correct.

  2. The 63.2% unique-observations fact: derive the limiting fraction. Start with n data points, draw n with replacement. What is the probability that any specific point is NOT selected on any single draw? Extend to n draws and take the limit as n → ∞.

  3. You have 12 months of model accuracy data (monthly CV runs). You want to bootstrap a CI for the mean accuracy. Why is standard bootstrap inappropriate here, and what alternative should you use?

  4. A colleague claims: "the permutation test and bootstrap CI are doing opposite things — bootstrap assumes we can resample the data, permutation assumes we should not." Explain what each method assumes, and why both assumptions are valid under their respective null hypotheses.

  5. You bootstrap 10,000 times and find that 97% of bootstrap means are above 0.80. Is this equivalent to saying the bootstrap 95% CI excludes 0.80? What is the correct bootstrap-based way to test H₀: μ = 0.80?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment