Confidence Intervals

Apr 11, 2026 · 8 min read · By mohammed.vasim
Statistics, Math, Data Science

Your model achieved 87% F1 on a 500-example validation set. Someone asks: how confident are you in that number? A point estimate alone cannot answer this. If you used 50 examples, 87% F1 might easily be noise. If you used 50,000 examples, it is a tight estimate. A confidence interval makes the uncertainty explicit — and that uncertainty is the difference between a result worth acting on and one that needs more data.

The Dataset

Throughout this post:

  • Model evaluated on $n = 500$ validation examples
  • Observed F1: $\hat{p} = 0.87$ (87% of predictions correct on the positive class)
  • Population parameter: true F1 $p$ (unknown)

The Core Idea

A confidence interval is a range of plausible values for a population parameter. The "confidence" refers to the procedure's reliability over repeated experiments, not the probability for any single interval.

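The general form:

$$\text{estimate} \pm \text{critical value} \times \text{standard error}$$

For a proportion (large $n$):

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

For a mean with unknown variance:

$$\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$$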
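As a minimal sketch of the general form, the helper below builds estimate ± critical value × standard error for any estimate whose sampling distribution is roughly Normal. The name `normal_ci` is hypothetical (it is not part of the post's own code), and the critical value comes from the standard library's `statistics.NormalDist` rather than SciPy, purely to keep the example dependency-free.

```python
from math import sqrt
from statistics import NormalDist


def normal_ci(estimate, se, level=0.95):
    """Return (lower, upper) for estimate +/- z * se under a Normal approximation."""
    # Two-sided critical value: leaves (1 - level) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * se
    return estimate - margin, estimate + margin


# Proportion case with this post's numbers: p_hat = 0.87, n = 500.
p_hat, n = 0.87, 500
se = sqrt(p_hat * (1 - p_hat) / n)
lower, upper = normal_ci(p_hat, se)
print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
```

Separating the standard error from the interval construction is a design choice: the same helper works for any estimator once you can supply its standard error.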

Phase 1: Compute the Standard Error

The standard error measures how much the sample estimate would vary if you repeated the experiment.

$$SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.87 \times 0.13}{500}} = \sqrt{\frac{0.1131}{500}} = \sqrt{0.0002262} \approx 0.01504$$
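To make the dependence on sample size concrete, the sketch below recomputes the standard error for the three sample sizes mentioned in the introduction. It assumes the same observed score of 0.87 in every case, which is an illustration rather than a claim about real evaluations; `standard_error` is a hypothetical helper name.

```python
from math import sqrt

p_hat = 0.87  # observed score, held fixed for illustration


def standard_error(p, n):
    """Standard error of a sample proportion p measured on n examples."""
    return sqrt(p * (1 - p) / n)


# Sample sizes from the introduction: 50, 500 and 50,000 examples.
ses = {n: standard_error(p_hat, n) for n in (50, 500, 50_000)}
for n, se in ses.items():
    print(f"n = {n:>6}: SE = {se:.5f}")
```

Each hundredfold increase in sample size shrinks the standard error by a factor of ten, which is why the 50-example score is so much noisier than the 50,000-example one.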

Phase 2: Find the Critical Value

For a 95% confidence interval, the critical value is the z-score that cuts off 2.5% in each tail of the standard normal:

$$z_{\alpha/2} = z_{0.025} = 1.96$$

For other confidence levels:

| Level | z-value |
|-------|---------|
| 90%   | 1.645   |
| 95%   | 1.96    |
| 99%   | 2.576   |

[Figure: standard normal curve with the central 95% of the distribution between -1.96 and +1.96, and 2.5% in each tail]
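The tabulated values can be reproduced from the inverse CDF of the standard normal. This is a small sketch using the standard library's `statistics.NormalDist`; `z_critical` is a hypothetical helper name.

```python
from statistics import NormalDist


def z_critical(level):
    """Two-sided critical value for a given confidence level, e.g. 0.95."""
    alpha = 1 - level
    # Put alpha / 2 in each tail, so look up the (1 - alpha / 2) quantile.
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
```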

Phase 3: Build the Bounds

The 95% confidence interval is $0.87 \pm 1.96 \times 0.01504 = 0.87 \pm 0.0295$, that is, $[0.8405, 0.8995]$.

[Figure: 95% CI for model F1 (n=500), running from 84.0% to 90.0% around the point estimate of 87.0%]

Phase 4: Visualize Coverage

The confidence interval procedure's reliability is its defining property. The diagram below shows 10 repeated 95% CIs built from different random validation samples of size 500 drawn from the same true model performance ($p = 0.87$). In the long run 95% of such intervals capture the true value, so with only 10 draws you will most often see 9 or 10 of them do so.

[Figure: 10 repeated 95% CIs from validation samples of the same model, true F1 = 0.87. Interval 4 misses the true F1; the other nine contain it.]

9 out of 10 intervals contain the true F1, which is consistent with what 95% confidence predicts. The red interval is not a mistake; about 5% of intervals are expected to miss.
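Ten intervals are too few to see the 95% rate clearly, so a simulation with many more repetitions helps. The sketch below assumes each validation example is an independent correct/incorrect outcome with true success rate 0.87, and counts how often the Normal-approximation interval contains that value. The seed and the number of trials are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)  # arbitrary seed so the run is repeatable

TRUE_P, N, TRIALS = 0.87, 500, 2000
z = NormalDist().inv_cdf(0.975)  # 95% two-sided critical value

hits = 0
for _ in range(TRIALS):
    # Simulate one validation set: N independent correct/incorrect outcomes.
    correct = sum(random.random() < TRUE_P for _ in range(N))
    p_hat = correct / N
    se = sqrt(p_hat * (1 - p_hat) / N)
    # Count the interval as a hit if it contains the true value.
    if p_hat - z * se <= TRUE_P <= p_hat + z * se:
        hits += 1

coverage = hits / TRIALS
print(f"Coverage over {TRIALS} simulated intervals: {coverage:.3f}")
```

The printed coverage should land close to 0.95; small deviations reflect both simulation noise and the fact that the Normal approximation is not exact for a discrete count.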

Full Trace Table

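| Phase | Formula | Values | Result |
|-------|---------|--------|--------|
| Point estimate | $\hat{p} = \text{correct} / n$ | $435 / 500$ | $0.87$ |
| Standard error | $SE = \sqrt{\hat{p}(1-\hat{p})/n}$ | $\sqrt{0.87 \times 0.13 / 500}$ | $0.01504$ |
| Critical value | $z_{\alpha/2}$ | 95% CI | $1.96$ |
| Margin of error | $ME = z_{\alpha/2} \times SE$ | $1.96 \times 0.01504$ | $0.02948$ |
| Lower bound | $\hat{p} - ME$ | $0.87 - 0.02948$ | $0.8405$ |
| Upper bound | $\hat{p} + ME$ | $0.87 + 0.02948$ | $0.8995$ |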

Python Code

```python
import numpy as np
from scipy import stats

# Model evaluation: 500 validation examples, 87% correct
n = 500
n_correct = 435
p_hat = n_correct / n

# Standard error of a proportion
se = np.sqrt(p_hat * (1 - p_hat) / n)

# Critical value for a two-sided 95% interval, then margin of error
z_crit = stats.norm.ppf(0.975)
me = z_crit * se

ci_lower = p_hat - me
ci_upper = p_hat + me

print(f"Point estimate: {p_hat:.4f}")
print(f"Standard error: {se:.5f}")
print(f"Critical value (95%): {z_crit:.4f}")
print(f"Margin of error: {me:.5f}")
print(f"95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")

# Sample size for a target margin of error.
# Uses the worst case p = 0.5, where p * (1 - p) = 0.25 is largest,
# so n = z^2 * 0.25 / ME^2 = (z / (2 * ME))^2.
target_me = 0.01  # want +/- 1 percentage point
n_required = (z_crit / (2 * target_me)) ** 2
print(f"\nFor ME={target_me}: n required = {int(np.ceil(n_required))}")

# Bootstrap CI for comparison: resample the 500 outcomes with replacement
np.random.seed(42)
correct_indicators = np.array([1] * n_correct + [0] * (n - n_correct))
bootstrap_means = [np.random.choice(correct_indicators, n, replace=True).mean()
                   for _ in range(10000)]
# Percentile bootstrap: take the middle 95% of the resampled means
bs_lower = np.percentile(bootstrap_means, 2.5)
bs_upper = np.percentile(bootstrap_means, 97.5)
print(f"Bootstrap 95% CI: [{bs_lower:.4f}, {bs_upper:.4f}]")
```

Output:

```
Point estimate: 0.8700
Standard error: 0.01504
Critical value (95%): 1.9600
Margin of error: 0.02948
95% CI: [0.8405, 0.8995]

For ME=0.01: n required = 9604
Bootstrap 95% CI: [0.8420, 0.8980]
```
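The sample-size line in the code follows from solving the margin-of-error formula for $n$. A short derivation, assuming the worst-case proportion $p = 0.5$ (which is where the factor of 2 in the code comes from):

```latex
\begin{aligned}
ME &= z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}
\quad\Longrightarrow\quad
n = \frac{z_{\alpha/2}^{2}\,p(1-p)}{ME^{2}} \\[4pt]
p = 0.5:\quad
n &= \left(\frac{z_{\alpha/2}}{2\,ME}\right)^{2}
   = \left(\frac{1.96}{0.02}\right)^{2}
   = 98^{2} = 9604
\end{aligned}
```

The worst case is deliberately conservative. Plugging in the observed $\hat{p} = 0.87$ instead gives $1.96^{2} \times 0.1131 / 0.01^{2} \approx 4345$, but that figure relies on the estimate you are still trying to pin down.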

The Interpretation That Trips People Up

A 95% confidence interval does NOT mean there is a 95% probability the true F1 is in $[0.8405, 0.8995]$. The true F1 is a fixed, unknown number, so probability does not apply to it in the frequentist framework. The 95% refers to the procedure: if you built CIs this way over many experiments, 95% of the intervals would contain the true F1.

If you want the interpretation "there is a 95% probability the true parameter is in this range," you want a Bayesian credible interval, which requires specifying a prior (post 10).

What Affects Interval Width

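Sample size $n$: Width $\propto 1/\sqrt{n}$. Quadrupling $n$ halves the width.

Confidence level: Going from 95% to 99% multiplies the width by $2.576 / 1.96 \approx 1.31$, about 31% wider.

Variability: More variable data produces wider intervals. For a proportion the variance term $\hat{p}(1-\hat{p})$ is largest at 0.5, so a weaker classifier scoring closer to 0.5 gets a wider interval than one scoring near 0.9 on the same $n$.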
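These effects on width are easy to check numerically. The sketch below assumes the Normal-approximation interval for a proportion and uses a hypothetical helper, `ci_width`, to compare the baseline against a quadrupled sample and a 99% level.

```python
from math import sqrt
from statistics import NormalDist


def ci_width(p_hat, n, level=0.95):
    """Full width (upper minus lower) of the Normal-approximation CI."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return 2 * z * sqrt(p_hat * (1 - p_hat) / n)


base = ci_width(0.87, 500)                     # baseline: n = 500, 95%
quadrupled = ci_width(0.87, 2000)              # four times the data
wider_level = ci_width(0.87, 500, level=0.99)  # same data, 99% level

print(f"n=500,  95%: width = {base:.4f}")
print(f"n=2000, 95%: width = {quadrupled:.4f} (ratio {quadrupled / base:.2f})")
print(f"n=500,  99%: width = {wider_level:.4f} (ratio {wider_level / base:.2f})")
```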

Relationship to Hypothesis Testing

A confidence interval contains all values that would NOT be rejected by a two-sided test at level $\alpha = 1 - \text{confidence level}$ (so $\alpha = 0.05$ for a 95% CI):

  • If some threshold (e.g., F1 = 0.80) is not in the CI → the test rejects $H_0$
  • If 0.80 is in the CI → fail to reject $H_0$

Since $0.80 < 0.8405$, the lower bound, we reject $H_0: p = 0.80$ at $\alpha = 0.05$. The model is significantly above 0.80.
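The duality can be checked numerically. The sketch below runs a two-sided z-test against the 0.80 threshold using the same estimated standard error as the interval (a Wald-style test). That choice is an assumption made here so the two procedures match exactly; a test that computes the standard error under the null value can disagree slightly when the threshold sits near an interval endpoint.

```python
from math import sqrt
from statistics import NormalDist

p_hat, n, p0, alpha = 0.87, 500, 0.80, 0.05
std_normal = NormalDist()

# 95% interval, built exactly as in the earlier phases.
se = sqrt(p_hat * (1 - p_hat) / n)
z_crit = std_normal.inv_cdf(1 - alpha / 2)
ci_lower, ci_upper = p_hat - z_crit * se, p_hat + z_crit * se

# Two-sided Wald test of H0: p = p0, using the same SE as the interval.
z_stat = (p_hat - p0) / se
p_value = 2 * (1 - std_normal.cdf(abs(z_stat)))

outside_ci = not (ci_lower <= p0 <= ci_upper)
rejects = p_value < alpha
print(f"z = {z_stat:.2f}, p-value = {p_value:.2g}")
print(f"0.80 outside CI: {outside_ci}, test rejects: {rejects}")
```

Both flags agree by construction: the threshold falls outside the interval exactly when the absolute z-statistic exceeds the critical value.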

Confidence intervals are the estimation counterpart to hypothesis tests (post 3). They build on the CLT (post 1) for their Normal approximation and the standard error formulas from estimation theory (post 2). For small samples or non-Normal data, bootstrap CIs (shown above) replace the Normal approximation without requiring distributional assumptions. The t-distribution (post 6) enters when estimating a mean with unknown variance: the same procedure, but with $t_{\alpha/2,\,n-1}$ replacing $z_{\alpha/2}$. Confidence intervals for differences between groups connect to the ANOVA and chi-square tests in later posts.

Honest Limitations

Confidence intervals assume random sampling and independent observations. A CI built on a validation set with data leakage gives false precision. The binomial standard error used here treats the score as a simple proportion of correct predictions; F1 combines precision and recall, so for F1 proper the bootstrap is the safer choice. For extremely imbalanced classes, F1 score confidence intervals built on the Normal approximation can extend below 0 or above 1 for small $n$; use the Wilson interval or exact methods instead. And CIs only capture sampling uncertainty, not model uncertainty from the training process itself.
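For reference, here is a minimal sketch of the Wilson score interval next to the Normal-approximation (Wald) interval. The 19-out-of-20 case is a hypothetical small sample chosen to show the Wald bound overshooting 1; `wilson_ci` and `wald_ci` are illustrative names, not library functions.

```python
from math import sqrt
from statistics import NormalDist


def wald_ci(successes, n, level=0.95):
    """Normal-approximation interval: p_hat +/- z * SE."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def wilson_ci(successes, n, level=0.95):
    """Wilson score interval, which always stays inside [0, 1]."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical small-sample case: 19 correct out of 20.
wald = wald_ci(19, 20)
wilson = wilson_ci(19, 20)
print(f"Wald:   [{wald[0]:.4f}, {wald[1]:.4f}]")
print(f"Wilson: [{wilson[0]:.4f}, {wilson[1]:.4f}]")
```

The Wilson interval is pulled toward 0.5 and is asymmetric around the point estimate, which is what keeps it within the valid range.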

Test Your Understanding

  1. Your model is re-evaluated on a new 200-example validation set and achieves F1 = 0.84. Construct a 95% CI. Does it overlap with the original CI $[0.8405, 0.8995]$? What does this imply about whether the two estimates are different?
  2. A stakeholder asks: "Is the model better than 85%?" Use the confidence interval to answer without running a separate hypothesis test.
  3. You want to reduce the margin of error from approximately 3 percentage points to 1 percentage point. By what factor must you increase the sample size?
  4. Explain in plain language to a product manager why "95% CI = $[0.8405, 0.8995]$" does not mean "we are 95% sure the true F1 is in this range."
  5. A competitor reports model accuracy of 91% with no confidence interval on a test set of 50 examples. Compute the 95% CI for their claim. What does this reveal about the reliability of their benchmark?
