Confidence Intervals
Your model achieved 87% F1 on a 500-example validation set. Someone asks: how confident are you in that number? A point estimate alone cannot answer this. If you used 50 examples, 87% F1 might easily be noise. If you used 50,000 examples, it is a tight estimate. A confidence interval makes the uncertainty explicit — and that uncertainty is the difference between a result worth acting on and one that needs more data.
The Dataset
Throughout this post:
- Model evaluated on $n = 500$ validation examples
- Observed F1: $\hat{p} = 0.87$ (87% of predictions correct on the positive class)
- Population parameter: true F1 $p$ (unknown)
The Core Idea
A confidence interval is a range of plausible values for a population parameter. The "confidence" refers to the procedure's reliability over repeated experiments, not the probability for any single interval.
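The general form:
$$\text{estimate} \pm (\text{critical value}) \times (\text{standard error})$$
For a proportion (large $n$):
$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
For a mean with unknown variance:
$$\bar{x} \pm t_{\alpha/2,\, n-1} \frac{s}{\sqrt{n}}$$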
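The mean-with-unknown-variance case can be sketched as follows. The five fold scores are hypothetical numbers chosen only for illustration (they are not part of this post's dataset), and `fold_scores` is an assumed name.

```python
import math
import statistics
from scipy import stats

# Hypothetical per-fold scores, for illustration only (not from the post's dataset).
fold_scores = [0.85, 0.88, 0.86, 0.89, 0.87]

n_folds = len(fold_scores)
mean = statistics.mean(fold_scores)
s = statistics.stdev(fold_scores)   # sample standard deviation (n - 1 denominator)
se = s / math.sqrt(n_folds)

# The t critical value with n - 1 degrees of freedom takes the place of the z-score.
t_crit = stats.t.ppf(0.975, df=n_folds - 1)

t_lower, t_upper = mean - t_crit * se, mean + t_crit * se
print(f"mean = {mean:.4f}, t_crit = {t_crit:.3f}")
print(f"95% t-interval: [{t_lower:.4f}, {t_upper:.4f}]")
```

With only five observations the t critical value is noticeably larger than 1.96, which is the price paid for estimating the variance from the same small sample.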
Phase 1: Compute the Standard Error
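The standard error measures how much the sample estimate would vary if you repeated the experiment. For the observed proportion:
$$SE = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = \sqrt{\frac{0.87 \times 0.13}{500}} \approx 0.01504$$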
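As a rough illustration of how the standard error responds to sample size, the sketch below evaluates the proportion formula $\sqrt{\hat{p}(1-\hat{p})/n}$ at the three sample sizes mentioned in the introduction. The helper name `standard_error` is illustrative, and the same 0.87 score is assumed at every $n$.

```python
import math

def standard_error(p_hat: float, n: int) -> float:
    """Standard error of a sample proportion (Normal approximation)."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

# Same observed score, three different validation-set sizes.
for n in (50, 500, 50_000):
    se = standard_error(0.87, n)
    print(f"n={n:>6}: SE = {se:.5f}")
```

Each hundredfold increase in $n$ shrinks the standard error by a factor of ten, which is why the 50-example estimate is so much noisier than the 50,000-example one.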
Phase 2: Find the Critical Value
For a 95% confidence interval, the critical value is the z-score that cuts off 2.5% in each tail of the standard normal:
$$z_{\alpha/2} = z_{0.025} \approx 1.96$$
For other confidence levels:
| Level | z-value |
|---|---|
| 90% | 1.645 |
| 95% | 1.96 |
| 99% | 2.576 |
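The table values can be reproduced from the inverse CDF of the standard normal. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `z_critical` is an assumption for illustration.

```python
from statistics import NormalDist

def z_critical(confidence: float) -> float:
    """Two-sided critical value: leaves (1 - confidence) / 2 in each tail."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
```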
Phase 3: Build the Bounds
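The margin of error is the critical value times the standard error:
$$ME = z_{\alpha/2} \times SE = 1.96 \times 0.01504 \approx 0.02948$$
The bounds are the point estimate minus and plus the margin of error:
$$\hat{p} - ME = 0.8700 - 0.02948 \approx 0.8405, \qquad \hat{p} + ME = 0.8700 + 0.02948 \approx 0.8995$$
The 95% confidence interval is $[0.8405, 0.8995]$.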
Phase 4: Visualize Coverage
The confidence interval procedure's reliability is its defining property. The diagram below shows 10 repeated 95% CIs built from different random validation samples of size 500 drawn from the same true model performance $p$. On average, 9.5 out of 10 such intervals capture the true value.
9 out of 10 intervals contain the true F1, which is consistent with what 95% confidence predicts. The red interval is not a mistake; it reflects the expected 5% miss rate.
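Ten intervals are too few to judge coverage, so a quick simulation with many more repetitions can help. The sketch below assumes a true score of 0.87 (an assumed value for the simulation, not something known in practice) and treats each validation example as an independent success or failure; `coverage` is an illustrative helper name.

```python
import math
import random
from statistics import NormalDist

def coverage(true_p: float, n: int, n_experiments: int, seed: int = 0) -> float:
    """Fraction of simulated 95% Normal-approximation CIs that contain true_p."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)
    hits = 0
    for _ in range(n_experiments):
        # Draw one validation set: each example is correct with probability true_p.
        correct = sum(rng.random() < true_p for _ in range(n))
        p_hat = correct / n
        me = z * math.sqrt(p_hat * (1 - p_hat) / n)
        hits += (p_hat - me) <= true_p <= (p_hat + me)
    return hits / n_experiments

print(f"Empirical coverage: {coverage(0.87, 500, 2000):.3f}")
```

The printed fraction should land close to 0.95, though not exactly on it: the Normal approximation is only approximate, and the simulation itself has sampling noise.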
Full Trace Table
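| Phase | Formula | Values | Result |
|---|---|---|---|
| Point estimate | $\hat{p}$ = correct / $n$ | 435 / 500 | 0.8700 |
| Standard error | $\sqrt{\hat{p}(1 - \hat{p}) / n}$ | $\sqrt{0.87 \times 0.13 / 500}$ | 0.01504 |
| Critical value | $z_{\alpha/2}$ | 95% CI, $\alpha = 0.05$ | 1.96 |
| Margin of error | $z_{\alpha/2} \times SE$ | 1.96 × 0.01504 | 0.02948 |
| Lower bound | $\hat{p} - ME$ | 0.8700 − 0.02948 | 0.8405 |
| Upper bound | $\hat{p} + ME$ | 0.8700 + 0.02948 | 0.8995 |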
Python Code
import numpy as np
from scipy import stats
# Model evaluation: 500 validation examples, 87% correct
n = 500
n_correct = 435
p_hat = n_correct / n
# Standard error
se = np.sqrt(p_hat * (1 - p_hat) / n)
# Critical value
z_crit = stats.norm.ppf(0.975)
me = z_crit * se
ci_lower = p_hat - me
ci_upper = p_hat + me
print(f"Point estimate: {p_hat:.4f}")
print(f"Standard error: {se:.5f}")
print(f"Critical value (95%): {z_crit:.4f}")
print(f"Margin of error: {me:.5f}")
print(f"95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
# Sample size for target margin of error
# (conservative: uses the worst case p = 0.5, where p * (1 - p) = 0.25)
target_me = 0.01 # want +/- 1 percentage point
n_required = (z_crit / (2 * target_me)) ** 2
print(f"\nFor ME={target_me}: n required = {int(np.ceil(n_required))}")
# Bootstrap CI for comparison
np.random.seed(42)
correct_indicators = np.array([1]*435 + [0]*65)
bootstrap_means = [np.random.choice(correct_indicators, n, replace=True).mean()
                   for _ in range(10000)]
bs_lower = np.percentile(bootstrap_means, 2.5)
bs_upper = np.percentile(bootstrap_means, 97.5)
print(f"Bootstrap 95% CI: [{bs_lower:.4f}, {bs_upper:.4f}]")

Output:
Point estimate: 0.8700
Standard error: 0.01504
Critical value (95%): 1.9600
Margin of error: 0.02948
95% CI: [0.8405, 0.8995]
For ME=0.01: n required = 9604
Bootstrap 95% CI: [0.8420, 0.8980]
The Interpretation That Trips People Up
A 95% confidence interval does NOT mean there is a 95% probability the true F1 is in $[0.8405, 0.8995]$. The true F1 is a fixed, unknown number, so probability does not apply to it in the frequentist framework. The 95% refers to the procedure: if you built CIs this way over many experiments, 95% of the intervals would contain the true F1.
If you want the interpretation "there is a 95% probability the true parameter is in this range," you want a Bayesian credible interval, which requires specifying a prior (post 10).
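For contrast, here is a minimal sketch of one such credible interval. It assumes a uniform Beta(1, 1) prior, which is a choice made here for illustration and not something this post prescribes, and it treats the 435 correct predictions as binomial data.

```python
from scipy import stats

n, n_correct = 500, 435

# Assumption: uniform Beta(1, 1) prior on the true score.
prior_a, prior_b = 1, 1

# A Beta prior combined with binomial data gives a Beta posterior (conjugacy).
post_a = prior_a + n_correct
post_b = prior_b + (n - n_correct)

# Equal-tailed 95% credible interval from the posterior quantiles.
cred_lower, cred_upper = stats.beta.ppf([0.025, 0.975], post_a, post_b)
print(f"95% credible interval: [{cred_lower:.4f}, {cred_upper:.4f}]")
```

With this much data and a flat prior, the credible interval lands close to the confidence interval numerically. The difference is in what the 95% is allowed to mean.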
What Affects Interval Width
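Sample size $n$: Width $\propto 1/\sqrt{n}$. Quadrupling $n$ halves the width.
Confidence level: Going from 95% to 99% multiplies the width by $2.576 / 1.96 \approx 1.31$, which is 31% wider.
Variability: More variable data produces wider intervals. For a proportion, $\hat{p}(1 - \hat{p})$ is largest at $\hat{p} = 0.5$, so a weaker classifier scoring closer to 0.5 gets a wider interval at the same $n$.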
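These three effects can be checked numerically. The sketch below uses a hypothetical helper, `ci_width`, that returns the full width of the Normal-approximation interval, and compares each change against the baseline of $\hat{p} = 0.87$, $n = 500$, 95% confidence.

```python
import math
from statistics import NormalDist

def ci_width(p_hat: float, n: int, confidence: float = 0.95) -> float:
    """Full width (upper minus lower) of the Normal-approximation CI."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * math.sqrt(p_hat * (1 - p_hat) / n)

base = ci_width(0.87, 500)
print(f"n=500,  95%: width = {base:.4f}")
print(f"n=2000, 95%: ratio = {ci_width(0.87, 2000) / base:.2f}")       # quadruple n
print(f"n=500,  99%: ratio = {ci_width(0.87, 500, 0.99) / base:.2f}")  # raise confidence level
print(f"p=0.50, 95%: ratio = {ci_width(0.50, 500) / base:.2f}")        # maximum variability
```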
Relationship to Hypothesis Testing
A confidence interval contains all values that would NOT be rejected by a two-sided test at level $\alpha = 1 - \text{confidence level}$:
- If some threshold (e.g., F1 = 0.80) is not in the CI → the test rejects $H_0$
- If 0.80 is in the CI → fail to reject $H_0$
Since $0.80 < 0.8405$, the lower bound, we reject $H_0: p = 0.80$ at $\alpha = 0.05$. The model is significantly above 0.80.
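The correspondence can be verified directly. The sketch below runs a two-sided z-test of $H_0: p = 0.80$ using the same standard error as the interval (a Wald-style test; a test that plugs the null value into the standard error can disagree slightly near the boundary) and compares its decision with the interval check.

```python
import math
from statistics import NormalDist

n, p_hat, p0, alpha = 500, 0.87, 0.80, 0.05
std_normal = NormalDist()

se = math.sqrt(p_hat * (1 - p_hat) / n)   # same SE the interval uses
z_crit = std_normal.inv_cdf(1 - alpha / 2)

# Confidence-interval check: is the threshold inside the interval?
ci_lower, ci_upper = p_hat - z_crit * se, p_hat + z_crit * se
in_ci = ci_lower <= p0 <= ci_upper

# Two-sided z-test of H0: p = p0 using the same standard error.
z_stat = (p_hat - p0) / se
p_value = 2 * (1 - std_normal.cdf(abs(z_stat)))
rejects = p_value < alpha

print(f"0.80 in CI: {in_ci}; z = {z_stat:.2f}; reject H0: {rejects}")
```

The two decisions are always opposite: the threshold is outside the interval exactly when the test rejects.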
Related Concepts
Confidence intervals are the estimation counterpart to hypothesis tests (post 3). They build on the CLT (post 1) for their Normal approximation and the standard error formulas from estimation theory (post 2). For small samples or non-Normal data, bootstrap CIs (shown above) replace the Normal approximation without requiring distributional assumptions. The t-distribution (post 6) enters when estimating a mean with unknown variance: the same procedure, but with $t_{\alpha/2,\, n-1}$ replacing $z_{\alpha/2}$. Confidence intervals for differences between groups connect to the ANOVA and chi-square tests in later posts.
Honest Limitations
Confidence intervals assume random sampling and independent observations. A CI built on a validation set with data leakage gives false precision. For extremely imbalanced classes, F1 score confidence intervals built on the Normal approximation can extend below 0 or above 1 for small $n$; use the Wilson interval or exact methods instead. And CIs only capture sampling uncertainty, not model uncertainty from the training process itself.
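A minimal sketch of the Wilson interval, written out from its standard formula, shows how it stays inside $[0, 1]$ where the Normal-approximation bound does not. The 19-out-of-20 sample is a hypothetical example chosen to trigger the problem, and `wilson_interval` is an illustrative helper name.

```python
import math
from statistics import NormalDist

def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion; stays within [0, 1]."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half_width, center + half_width

# Hypothetical small sample: 19 correct out of 20.
lo, hi = wilson_interval(19, 20)
wald_hi = 0.95 + 1.96 * math.sqrt(0.95 * 0.05 / 20)
print(f"Wilson: [{lo:.3f}, {hi:.3f}]   Normal-approximation upper bound: {wald_hi:.3f}")
```

The Normal-approximation upper bound exceeds 1 here, while the Wilson interval is pulled toward 0.5 and remains a valid range for a proportion.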
Test Your Understanding
- Your model is re-evaluated on a new 200-example validation set and achieves F1 = 0.84. Construct a 95% CI. Does it overlap with the original CI $[0.8405, 0.8995]$? What does this imply about whether the two estimates are different?
- A stakeholder asks: "Is the model better than 85%?" Use the confidence interval to answer without running a separate hypothesis test.
- You want to reduce the margin of error from approximately 3 percentage points to 1 percentage point. By what factor must you increase the sample size?
- Explain in plain language to a product manager why "95% CI = $[0.8405, 0.8995]$" does not mean "we are 95% sure the true F1 is in this range."
- A competitor reports model accuracy of 91% with no confidence interval on a test set of 50 examples. Compute the 95% CI for their claim. What does this reveal about the reliability of their benchmark?