
Confidence Intervals

Apr 11, 2026 • 11 min read • By Mohammed Vasim

Statistics • Math • Data Science

A model achieves x̄ = 0.838 mean accuracy across 6 CV folds. Is the true accuracy near 0.838, or could it be anywhere from 0.75 to 0.90? A point estimate cannot answer this. A confidence interval can — it converts the single number into a range of plausible values, making uncertainty explicit and actionable.

The DS/ML Anchor

Six CV fold accuracy scores:

text

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

Sample mean: x̄ = 0.838. Sample SD (n − 1 denominator): s = 0.0512. n = 6. True μ is unknown.

Why a point estimate alone is insufficient: x̄ = 0.838 is our best single guess. But x̄ is a random variable — a different train/test split would give a different value. The interval tells you how much x̄ could vary across such experiments.

Standard Error

SE = s / √n (for means with unknown σ)

SE = 0.0512 / √6 = 0.0512 / 2.449 = 0.0209

SE is not the SD of the data. It is the SD of the sampling distribution: how much the sample mean would vary across repeated size-6 experiments. The SD of 0.0512 says individual fold scores typically deviate from the mean by about 5 percentage points. The SE of 0.0209 says the sample mean itself typically varies by about 2 percentage points.
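As a minimal sketch, the standard error for the anchor scores can be computed with Python's standard library. The variable names are illustrative, and `statistics.stdev` uses the n − 1 denominator, which matches the sample SD s in the formula above.

```python
import math
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

n = len(accuracy)
x_bar = statistics.mean(accuracy)   # point estimate of μ
s = statistics.stdev(accuracy)      # sample SD (n - 1 denominator)
se = s / math.sqrt(n)               # SD of the sampling distribution of x̄

print(f"x̄ = {x_bar:.4f}, s = {s:.4f}, SE = {se:.4f}")
```

Note that `statistics.pstdev` (n denominator) would give a smaller value; the t-based interval assumes the n − 1 version.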

Critical Value: z* vs t*

z* (large n or known σ): use the standard Normal. For a 95% CI, z* = 1.96. This comes from Φ(1.96) = 0.975, so 95% of the standard Normal falls within ±1.96.

t* (small n and unknown σ): use the t-distribution with n−1 = 5 degrees of freedom. For a 95% CI, t*(5) = 2.571.

Why t* > z*? Estimating σ from a small sample introduces extra uncertainty. The t-distribution has heavier tails to account for this — critical values are larger, giving wider intervals as a form of honesty about what we don't know.

For n=6, use t*. For n ≥ 30, z* is a good approximation.
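The gap between z* and t* can be seen empirically. The sketch below assumes a Normal population (the mean and SD are arbitrary), draws many samples of size 6, and standardizes each sample mean using its own estimated SE. The seed and trial count are illustrative choices, not part of the method.

```python
import math
import random
import statistics

random.seed(0)            # fixed seed so the sketch is repeatable
n, trials = 6, 20_000
mu, sigma = 0.0, 1.0      # assumed Normal population; values are arbitrary

t_stats = []
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    se = statistics.stdev(sample) / math.sqrt(n)    # σ is estimated from the sample
    t_stats.append(abs(statistics.mean(sample) - mu) / se)

t_stats.sort()
empirical_crit = t_stats[int(0.95 * trials)]        # 95% of |t| values fall below this
z_star = statistics.NormalDist().inv_cdf(0.975)

print(f"z* = {z_star:.3f}, simulated critical value for n=6: {empirical_crit:.2f}")
```

The simulated cutoff should land near the tabulated t*(5) and clearly above z*, because dividing by an estimated SE makes extreme standardized values more common.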

Constructing the Interval

Formula: CI = x̄ ± t* × SE

For 95% CI with anchor (n=6, t*(5)=2.571):

Step	Formula	Values	Result
Point estimate	x̄ = Σxᵢ/n	(0.82+...+0.88)/6	0.838
Standard error	s/√n	0.0512/√6	0.0209
Critical value	t*(5), α/2 = 0.025	from t-table	2.571
Margin of error	t* × SE	2.571 × 0.0209	0.0537
Lower bound	x̄ − ME	0.8383 − 0.0537	0.785
Upper bound	x̄ + ME	0.8383 + 0.0537	0.892

95% CI: [0.785, 0.892]
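The steps in the table can be wrapped in a small helper. This is a sketch using only the standard library; the function name `t_confidence_interval` is hypothetical, and the critical value is passed in by hand (here the t-table value for 5 degrees of freedom quoted above) rather than looked up.

```python
import math
import statistics

def t_confidence_interval(scores, t_star):
    """Return (lower, upper) for x̄ ± t* × s/√n."""
    n = len(scores)
    x_bar = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(n)   # sample SD uses n - 1
    margin = t_star * se
    return x_bar - margin, x_bar + margin

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
T_STAR_95_DF5 = 2.571   # table value for a 95% CI with 5 degrees of freedom

lower, upper = t_confidence_interval(accuracy, T_STAR_95_DF5)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

Passing the critical value explicitly keeps the helper dependency-free; with SciPy available, `stats.t.ppf(0.975, df=n-1)` would supply it instead.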

Correct Interpretation

WRONG (commonly stated): "There is a 95% probability that the true accuracy μ lies between 0.785 and 0.892."

In the frequentist framework this is wrong: μ is a fixed unknown constant, so once the interval is computed it either contains μ or it does not. There is no randomness left to attach a probability to.

CORRECT: "If we repeated this 6-fold CV experiment many times, approximately 95% of the resulting confidence intervals would contain the true population accuracy μ."

The randomness is in the interval (which changes with each sample), not in μ. The 95% is a property of the procedure, not of this specific interval.

Coverage Visualization
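
The coverage claim can be checked by simulation. The sketch below assumes a Normal population with a known "true" mean and SD (both values are made up for the demo), repeats a 6-score experiment many times, and counts how often the t-interval captures the true mean.

```python
import math
import random
import statistics

random.seed(1)
TRUE_MU, TRUE_SIGMA = 0.84, 0.05   # assumed "true" values, chosen only for the demo
n, trials = 6, 20_000
T_STAR = 2.571                     # 95% critical value for 5 degrees of freedom

hits = 0
for _ in range(trials):
    folds = [random.gauss(TRUE_MU, TRUE_SIGMA) for _ in range(n)]
    x_bar = statistics.mean(folds)
    margin = T_STAR * statistics.stdev(folds) / math.sqrt(n)
    if x_bar - margin <= TRUE_MU <= x_bar + margin:
        hits += 1                  # this interval captured the true mean

coverage = hits / trials
print(f"Fraction of intervals containing μ: {coverage:.3f}")
```

Each simulated interval either contains μ or misses it; the fraction that contain it should come out close to 0.95, which is the sense in which 95% describes the procedure.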

Width Trade-offs

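Wider interval = more confidence but less precision. The three CIs for the anchor at different confidence levels:

Confidence level	t*(5)	Margin of error	Interval
90%	2.015	±0.0421	[0.796, 0.880]
95%	2.571	±0.0537	[0.785, 0.892]
99%	4.032	±0.0842	[0.754, 0.923]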

What reduces width: more data (SE ∝ 1/√n — quadruple n to halve the width) or less variance in fold scores (smaller s).
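The 1/√n scaling is easy to verify numerically. In this sketch the SD and the critical value are held fixed at illustrative values, so only the effect of n is visible.

```python
import math

def ci_width(s, n, crit):
    """Full width of the interval x̄ ± crit × s/√n."""
    return 2 * crit * s / math.sqrt(n)

s, crit = 0.05, 1.96   # illustrative values; crit is held fixed for simplicity

for n in (6, 24, 96):
    print(f"n={n:>3}: width = {ci_width(s, n, crit):.4f}")

ratio = ci_width(s, 6, crit) / ci_width(s, 24, crit)   # effect of 4x the data
print(f"width(n=6) / width(n=24) = {ratio:.2f}")
```

With a t-based interval the critical value also shrinks as n grows, so in practice the width falls slightly faster than this fixed-critical-value sketch suggests.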

Connection to Hypothesis Testing

CIs and hypothesis tests are mathematically equivalent — two views of the same inference.

Duality: a 95% CI for μ excludes μ₀ if and only if the two-sided hypothesis test (H₀: μ = μ₀) rejects at α = 0.05.

Applied to the anchor, with 95% CI = [0.785, 0.892]:

μ₀ = 0.78 falls outside the CI → the test rejects H₀: μ = 0.78 at α = 0.05. We have evidence the true accuracy exceeds 0.78.
μ₀ = 0.80 falls inside the CI → the test does not reject H₀: μ = 0.80. We cannot rule out that the true accuracy is as low as 0.80.

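Why CIs are more informative than p-values alone: a p-value measures the evidence against H₀ but says nothing about the size of the difference. The CI shows both whether μ₀ is rejected and the range of plausible values for μ, so you can judge how far μ is likely to be from μ₀. A tight CI that sits well above a baseline tells you the effect is real and gives a lower bound on how large it is.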
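The duality stated above can be checked directly: for any μ₀, "outside the 95% CI" and "|t| exceeds t*" should give the same answer. This is a sketch using the standard library; the helper names and the list of null values are hypothetical, and t* is the table value for 5 degrees of freedom.

```python
import math
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
T_STAR = 2.571   # 95% two-sided critical value, 5 degrees of freedom

n = len(accuracy)
x_bar = statistics.mean(accuracy)
se = statistics.stdev(accuracy) / math.sqrt(n)
lower, upper = x_bar - T_STAR * se, x_bar + T_STAR * se

def outside_ci(mu0):
    """True if mu0 lies outside the 95% CI."""
    return not (lower <= mu0 <= upper)

def t_test_rejects(mu0):
    """True if the two-sided one-sample t-test rejects H0: mu = mu0 at 0.05."""
    return abs(x_bar - mu0) / se > T_STAR

for mu0 in (0.75, 0.82, 0.85, 0.90):   # hypothetical null values
    print(f"μ₀={mu0:.2f}: outside CI={outside_ci(mu0)}, test rejects={t_test_rejects(mu0)}")
```

The two columns agree for every μ₀ because both conditions rearrange to the same inequality, |x̄ − μ₀| > t* × SE.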

Bootstrap CI

When you cannot assume Normality or when the statistic is complex (median, correlation, AUC), use bootstrap:

Resample n values with replacement many times (B = 10,000 in the code below)
Compute the statistic for each resample
Take the 2.5th and 97.5th percentiles → 95% CI

python

import numpy as np
from scipy import stats

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
n = len(accuracy)
rng = np.random.default_rng(42)

# Parametric t-CI
x_bar = accuracy.mean()
s = accuracy.std(ddof=1)
se = s / np.sqrt(n)
t_star = stats.t.ppf(0.975, df=n-1)
ci_param = (x_bar - t_star * se, x_bar + t_star * se)
print(f"Parametric 95% t-CI: [{ci_param[0]:.4f}, {ci_param[1]:.4f}]")

# Bootstrap 95% CI
B = 10_000
boot_means = [rng.choice(accuracy, n, replace=True).mean() for _ in range(B)]
ci_boot = (np.percentile(boot_means, 2.5), np.percentile(boot_means, 97.5))
print(f"Bootstrap 95% CI:    [{ci_boot[0]:.4f}, {ci_boot[1]:.4f}]")

# All three confidence levels
print("\nParametric CIs at three confidence levels:")
for conf, alpha in [(0.90, 0.10), (0.95, 0.05), (0.99, 0.01)]:
    t_c = stats.t.ppf(1 - alpha/2, df=n-1)
    me = t_c * se
    print(f"  {int(conf*100)}% CI: t*={t_c:.3f}, ME=±{me:.4f}, [{x_bar-me:.4f}, {x_bar+me:.4f}]")

# CI and hypothesis test duality (the header prints the computed interval)
print(f"\nCI-test duality (95% CI = [{ci_param[0]:.3f}, {ci_param[1]:.3f}]):")
for mu0 in [0.78, 0.80, 0.85]:
    inside = ci_param[0] <= mu0 <= ci_param[1]
    print(f"  μ₀={mu0:.2f}: {'inside CI → fail to reject H₀' if inside else 'outside CI → reject H₀'}")

# Prediction interval (wider than CI): half-width is t* × s × √(1 + 1/n)
t_star_pred = stats.t.ppf(0.975, df=n-1)
pi_half_width = t_star_pred * s * np.sqrt(1 + 1/n)
pi = (x_bar - pi_half_width, x_bar + pi_half_width)
print(f"\n95% CI for μ:               [{ci_param[0]:.4f}, {ci_param[1]:.4f}]  width={ci_param[1]-ci_param[0]:.4f}")
print(f"95% Prediction interval:    [{pi[0]:.4f}, {pi[1]:.4f}]  width={pi[1]-pi[0]:.4f}")

text

Parametric 95% t-CI: [0.7847, 0.8920]
Bootstrap 95% CI:    [0.7983, 0.8867]

Parametric CIs at three confidence levels:
  90% CI: t*=2.015, ME=±0.0421, [0.7963, 0.8804]
  95% CI: t*=2.571, ME=±0.0537, [0.7847, 0.8920]
  99% CI: t*=4.032, ME=±0.0842, [0.7541, 0.9225]

CI-test duality (95% CI = [0.785, 0.892]):
  μ₀=0.78: outside CI → reject H₀
  μ₀=0.80: inside CI → fail to reject H₀
  μ₀=0.85: inside CI → fail to reject H₀

95% CI for μ:               [0.7847, 0.8920]  width=0.1074
95% Prediction interval:    [0.6963, 0.9804]  width=0.2841

Sample Size Calculation

Formula: n = (t* × s / ME)² where ME is the desired margin of error.

Example: how many CV folds to estimate μ within ±0.02 at 95% confidence, using s = 0.048?

Using z* = 1.96 (approximation for large n): n = (1.96 × 0.048 / 0.02)² = (4.704)² = 22.1 → n ≥ 23 folds

Halving the margin of error from ±0.02 to ±0.01 requires about 4× as many folds: n = (1.96 × 0.048 / 0.01)² ≈ 88.5 → n ≥ 89.
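The sample size arithmetic can be sketched as a small function. It uses the Normal approximation (z* in place of t*), as in the example above; the name `required_n` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist

def required_n(s, margin, confidence=0.95):
    """Smallest n with z* × s/√n <= margin (Normal approximation)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z_star * s / margin) ** 2)

print(required_n(s=0.048, margin=0.02))   # target of ±0.02
print(required_n(s=0.048, margin=0.01))   # target of ±0.01
```

Because t* depends on n, an exact t-based answer needs iteration; the z* version is a reasonable first estimate when the resulting n is not small.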

Prediction Interval vs Confidence Interval

A distinction practitioners confuse constantly:

	CI for μ	Prediction Interval for new observation
Captures	The true mean μ	A single new fold score
Width	Shrinks with n (∝ 1/√n)	Never shrinks below ±1.96σ
Formula	x̄ ± t* × s/√n	x̄ ± t* × s × √(1+1/n)
Anchor	[0.785, 0.892]	[0.696, 0.980]
Use when	"What is true avg accuracy?"	"What range should I expect next fold?"

The extra 1 under the square root in the PI accounts for the variability of the new observation — you cannot average out individual fold noise by collecting more folds.
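The different limiting behavior of the two intervals shows up when n grows with s held fixed. This is a sketch with illustrative values; the critical value is fixed at z* = 1.96 for simplicity rather than recomputed for each n.

```python
import math

def ci_half_width(s, n, crit):
    return crit * s / math.sqrt(n)            # shrinks toward 0 as n grows

def pi_half_width(s, n, crit):
    return crit * s * math.sqrt(1 + 1 / n)    # approaches crit × s as n grows

s, crit = 0.05, 1.96   # illustrative values; crit fixed at z* for simplicity

for n in (6, 60, 600, 6000):
    print(f"n={n:>4}: CI ±{ci_half_width(s, n, crit):.4f}   PI ±{pi_half_width(s, n, crit):.4f}")
```

The CI column keeps shrinking, while the PI column levels off just above 1.96 × s: more folds sharpen the estimate of the mean but do nothing to the spread of a single new fold score.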

Limitations

t-CI assumes Normality of fold scores. For n=6, this matters. If fold scores are skewed, the bootstrap CI is more reliable.
Independence of folds required. In k-fold CV the training sets overlap, so fold scores are positively correlated rather than independent, and the t-CI tends to be narrower than it should be. Reusing the same validation set across experiments violates independence further.
95% is a long-run guarantee. For any specific experiment, the interval either contains μ or it does not. The 95% refers to what would happen across many experiments — not this one.

Test Your Understanding

The 6-fold anchor gives 95% CI = [0.785, 0.892]. A colleague says "the probability that the true accuracy is at least 0.80 is about 95%." Correct this statement using the proper frequentist interpretation.
With n=6 folds and SE = 0.0209, how many folds would you need to achieve a margin of error of ±0.01 at 95% confidence? Show your algebra.
Explain why the 95% CI for μ is narrower than the 95% prediction interval for a single new fold. What physical quantity does each capture, and why does the PI width not converge to zero as n → ∞?
The CI [0.785, 0.892] corresponds to t*(5)=2.571. If you instead used z*=1.96, the CI would be [0.797, 0.879]. Why would using z* here be incorrect? When would using z* instead of t* be acceptable?
You test H₀: μ = 0.78 using a two-tailed t-test on the anchor data and reject at α=0.05. Your colleague computes the 95% CI and says "I cannot use the CI to determine the p-value." Are they correct? Explain the duality relationship and show how to determine whether p > 0.05 or p < 0.05 from the CI without computing the test statistic.