Estimation

Apr 11, 2026 · 8 min read · By mohammed.vasim
Statistics · Math · Data Science

You trained a sentiment classifier on 1,000 labeled examples and measured 87% validation accuracy. That number is an estimate — a single value computed from a finite sample, standing in for some unknowable true accuracy over all possible inputs your model will ever see. Whether that estimate is trustworthy, how much uncertainty surrounds it, and how you would tighten it with more data are all questions about estimation theory. Every benchmark result, every A/B test conclusion, and every model comparison report runs on this machinery.

Point Estimates

A point estimate is a single number computed from sample data that estimates a population parameter. For model evaluation, the most common estimates are:

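| Parameter | Estimator | Formula |
|---|---|---|
| Mean accuracy | Sample mean | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Variance of errors | Sample variance | $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ |
| True positive rate | Sample proportion | $\hat{p} = \frac{\text{successes}}{n}$ |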

Notice the $n-1$ in the variance formula. Dividing by $n$ produces a biased estimate of the true variance. The formula with $n-1$ is unbiased. This denominator choice is called Bessel's correction and matters more in small validation sets than large ones.
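To see the size of this effect, here is a small simulation sketch. The population (Normal with variance 4), the sample size of 10, and the variable names are assumptions chosen for illustration; none of them come from a real validation set.

```python
import numpy as np

rng = np.random.default_rng(0)

true_variance = 4.0   # assumed population variance for this illustration
n = 10                # deliberately small sample
n_trials = 50_000     # number of repeated samples

# Each row is one small sample drawn from a Normal population.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_variance), size=(n_trials, n))

# ddof=0 divides by n; ddof=1 divides by n - 1 (Bessel's correction).
var_divide_by_n = samples.var(axis=1, ddof=0)
var_divide_by_n_minus_1 = samples.var(axis=1, ddof=1)

print(f"True variance:              {true_variance:.3f}")
print(f"Average when dividing by n:   {var_divide_by_n.mean():.3f}")
print(f"Average when dividing by n-1: {var_divide_by_n_minus_1.mean():.3f}")
```

Dividing by $n$ underestimates the variance by a factor of $(n-1)/n$ on average, which is 10% at $n = 10$ but only 0.1% at $n = 1000$.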

What Makes a Good Estimator?

Not all estimates are created equal. For ML evaluation, the properties that matter most are:

Unbiasedness means the estimator gets it right on average:

$$E[\hat{\theta}] = \theta$$

If you repeated your validation experiment across many different random splits, an unbiased estimator's average would equal the true parameter. Sample mean accuracy is unbiased for true mean accuracy.

Consistency means the estimator improves as you get more data:

$$\hat{\theta}_n \xrightarrow{p} \theta \quad \text{as } n \to \infty$$

A consistent estimator eventually gets arbitrarily close to the truth. If your validation set grows from 100 to 10,000 examples, your accuracy estimate should converge. Any estimator that does not improve with more data is suspect.

Efficiency means minimum variance among unbiased options. Between two unbiased estimators, prefer the one that varies less across different samples.

Sufficiency means the statistic captures all the information in the sample about the parameter. The sample mean is sufficient for the population mean under i.i.d. Normal data with known variance.
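Efficiency is easiest to appreciate by comparing two estimators of the same quantity. The sketch below assumes standard Normal data and compares the sample mean with the sample median as estimators of the center. The setup and names are illustrative assumptions, not part of any evaluation pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

true_center = 0.0
n = 100            # examples per sample
n_trials = 20_000  # number of repeated samples

samples = rng.normal(loc=true_center, scale=1.0, size=(n_trials, n))

# Two estimators of the same parameter, computed on every sample.
mean_estimates = samples.mean(axis=1)
median_estimates = np.median(samples, axis=1)

# Both center on the truth, so both are (approximately) unbiased here.
print(f"Average of sample means:    {mean_estimates.mean():+.4f}")
print(f"Average of sample medians:  {median_estimates.mean():+.4f}")

# The more efficient estimator has the smaller variance across samples.
print(f"Variance of sample means:   {mean_estimates.var():.5f}")
print(f"Variance of sample medians: {median_estimates.var():.5f}")
```

For Normal data the mean varies less than the median, so it is the more efficient choice. With heavy-tailed data the ranking can flip, which is why efficiency is always stated relative to a model.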

Interval Estimates: Confidence Intervals

Point estimates tell you nothing about uncertainty. If you report 87% accuracy, does that mean the true accuracy is definitely near 87%, or could it be anywhere from 75% to 99%? A confidence interval answers that.

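The general form:

$$\text{estimate} \pm (\text{critical value}) \times (\text{standard error})$$

For accuracy (a proportion) with large $n$:

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

For a mean with unknown variance:

$$\bar{x} \pm t_{\alpha/2,\, n-1} \frac{s}{\sqrt{n}}$$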

The interpretation that trips people up: a 95% confidence interval does not mean there is a 95% probability the true parameter is in the interval. The true parameter is fixed — we just do not know it. What "95% confidence" means is that if you repeated this sampling process many times and built an interval each time, 95% of those intervals would contain the true parameter.

[Figure: 95% confidence intervals from 10 validation folds, plotted against the true accuracy. 9 of 10 intervals capture the true model accuracy; one misses.]
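The repeated-sampling interpretation can be checked directly by simulation. The sketch below assumes a known true accuracy of 0.87 and validation sets of 500 examples, draws many such sets, builds a Normal-approximation interval from each, and counts how often the interval contains the truth. In practice the true accuracy is unknown; fixing it is what makes the check possible.

```python
import numpy as np

rng = np.random.default_rng(2)

true_accuracy = 0.87  # assumed known only for the purpose of this simulation
n_val = 500
n_repeats = 5_000
z = 1.96              # critical value for a 95% interval

# Number of correct predictions in each simulated validation set.
correct_counts = rng.binomial(n_val, true_accuracy, size=n_repeats)
p_hat = correct_counts / n_val

# Normal-approximation interval for each simulated validation set.
standard_error = np.sqrt(p_hat * (1 - p_hat) / n_val)
lower = p_hat - z * standard_error
upper = p_hat + z * standard_error

# Fraction of intervals that contain the true accuracy.
covered = (lower <= true_accuracy) & (true_accuracy <= upper)
coverage = covered.mean()
print(f"Share of intervals containing the true accuracy: {coverage:.3f}")
```

The share should land close to 0.95. Any single interval either contains the true value or it does not; the 95% describes the procedure, not one interval.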

Working Through the Classifier Example

Your sentiment classifier achieved 87% accuracy on a held-out validation set of $n = 500$ examples. What is the 95% confidence interval for the true accuracy?

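| Phase | Formula | Values | Result |
|---|---|---|---|
| Point estimate | $\hat{p}$ = correct / total | $435 / 500$ | $0.87$ |
| Standard error | $SE = \sqrt{\hat{p}(1-\hat{p})/n}$ | $\sqrt{0.87 \times 0.13 / 500}$ | $0.0150$ |
| Critical value | $z_{\alpha/2}$ for 95% CI | Standard normal table | $1.96$ |
| Margin of error | $z_{\alpha/2} \times SE$ | $1.96 \times 0.0150$ | $0.0295$ |
| Confidence interval | $\hat{p} \pm$ margin | $0.87 \pm 0.0295$ | $[0.841, 0.899]$ |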

The 95% interval for the true accuracy runs from about 84.1% to 89.9%. If a competitor claims 90% accuracy, that figure sits right at the upper edge of your interval: your model may be nearly as good or several points worse, and 500 examples cannot narrow it down further.

[Figure: 95% CI for classifier accuracy (n=500). Point estimate 87.0%, interval from 84.1% to 89.9%.]
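The arithmetic in this example can be reproduced with the standard library alone. The function name `proportion_ci` is a hypothetical helper, and the count of 435 correct predictions is inferred from 87% of 500 rather than stated in the example.

```python
import math
from statistics import NormalDist

def proportion_ci(n_correct, n_total, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = n_correct / n_total
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n_total)
    margin = z * standard_error
    return p_hat, standard_error, margin, (p_hat - margin, p_hat + margin)

# 435 correct out of 500 is assumed from the reported 87% accuracy.
p_hat, se, margin, (lower, upper) = proportion_ci(n_correct=435, n_total=500)

print(f"Point estimate:  {p_hat:.4f}")
print(f"Standard error:  {se:.4f}")
print(f"Margin of error: {margin:.4f}")
print(f"95% CI:          [{lower:.4f}, {upper:.4f}]")
```

Changing `n_total` while holding the accuracy fixed is a quick way to see how slowly the interval tightens as the validation set grows.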

Methods of Estimation

Method of Moments equates sample moments to population moments and solves. Simple, but not always the most efficient approach.

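Maximum Likelihood Estimation (MLE) finds parameter values that maximize the probability of observing your data:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \prod_{i=1}^{n} f(x_i \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log f(x_i \mid \theta)$$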

MLE has strong asymptotic properties and is the basis for most model training objectives. For logistic regression, the cross-entropy loss is exactly the negative log-likelihood under a Bernoulli model.
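As a minimal illustration of MLE, consider the Bernoulli model behind an accuracy estimate. The sketch below evaluates the log-likelihood on a grid of candidate accuracies and picks the maximizer. The counts (435 correct out of 500) are assumed for illustration, and the grid search stands in for the closed-form answer so the maximization is visible.

```python
import numpy as np

# Assumed data: 435 correct predictions out of 500.
n_correct, n_total = 435, 500

# Candidate values for the true accuracy p, avoiding 0 and 1 where log diverges.
candidate_p = np.linspace(0.001, 0.999, 999)

# Bernoulli log-likelihood: each correct prediction contributes log(p),
# each incorrect one contributes log(1 - p).
log_likelihood = (n_correct * np.log(candidate_p)
                  + (n_total - n_correct) * np.log(1 - candidate_p))

p_mle = candidate_p[np.argmax(log_likelihood)]
print(f"Grid-search MLE:   {p_mle:.3f}")
print(f"Sample proportion: {n_correct / n_total:.3f}")
```

The maximizer coincides with the sample proportion, which is why reporting raw accuracy is already a maximum likelihood estimate under this model.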

Least Squares minimizes the sum of squared deviations between predictions and observations. It is equivalent to MLE when the errors are independent and Normal with constant variance.

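Bayesian Estimation combines prior beliefs with observed data:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} \propto p(x \mid \theta)\, p(\theta)$$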

This gives a full posterior distribution rather than a point estimate — more informative but requires specifying priors.
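For an accuracy estimate, one common way to make this concrete is a Beta prior with a Binomial likelihood, where the posterior has a closed form. The sketch below assumes a flat Beta(1, 1) prior and 435 correct predictions out of 500. Both are illustrative choices, and a different prior would give a different posterior.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed flat prior over accuracy: Beta(1, 1) is uniform on [0, 1].
prior_alpha, prior_beta = 1.0, 1.0

# Assumed data: 435 correct predictions out of 500.
n_correct, n_total = 435, 500

# Conjugate update: add correct counts to alpha, incorrect counts to beta.
post_alpha = prior_alpha + n_correct
post_beta = prior_beta + (n_total - n_correct)
posterior_mean = post_alpha / (post_alpha + post_beta)

# Summarize the posterior with draws rather than a closed-form quantile.
draws = rng.beta(post_alpha, post_beta, size=100_000)
lower, upper = np.percentile(draws, [2.5, 97.5])

print(f"Posterior: Beta({post_alpha:.0f}, {post_beta:.0f})")
print(f"Posterior mean: {posterior_mean:.4f}")
print(f"95% credible interval: [{lower:.4f}, {upper:.4f}]")
```

Unlike a confidence interval, the credible interval is a direct probability statement about the parameter, conditional on the prior and the model.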

The Bias-Variance Tradeoff

Mean Squared Error decomposes as:

$$\text{MSE}(\hat{\theta}) = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})$$

You can often reduce variance by introducing some bias. Ridge regression does exactly this: it shrinks coefficient estimates toward zero, accepting some bias in exchange for lower variance. This tradeoff is not unique to ML — it shows up whenever you choose between estimators.
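The decomposition can be verified numerically. The sketch below is not ridge regression. It uses a simpler stand-in with the same flavor: multiplying the sample mean by a shrinkage factor of 0.7. The true mean, noise level, and shrinkage factor are all assumptions picked so that the tradeoff is visible.

```python
import numpy as np

rng = np.random.default_rng(4)

true_mean = 1.0    # assumed parameter being estimated
noise_sd = 3.0     # assumed noise level (large relative to the mean)
n = 10             # observations per sample
n_trials = 50_000  # number of repeated samples
shrinkage = 0.7    # hypothetical shrinkage factor toward zero

samples = rng.normal(true_mean, noise_sd, size=(n_trials, n))
plain = samples.mean(axis=1)   # unbiased estimator
shrunk = shrinkage * plain     # biased, but lower variance

def decompose(estimates, truth):
    """Return (bias, variance, mse) of an estimator across repeated samples."""
    bias = estimates.mean() - truth
    variance = estimates.var()  # ddof=0 so mse == bias**2 + variance exactly
    mse = np.mean((estimates - truth) ** 2)
    return bias, variance, mse

for name, estimates in [("sample mean", plain), ("shrunk mean", shrunk)]:
    bias, variance, mse = decompose(estimates, true_mean)
    print(f"{name}: bias^2={bias**2:.3f}  variance={variance:.3f}  MSE={mse:.3f}")
```

In this noisy, small-sample setting the shrunk estimator has the lower MSE despite being biased. With less noise or more data the unbiased estimator wins, so the right amount of shrinkage depends on the problem.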

Bootstrap: Let the Data Speak

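Bootstrap estimation resamples your validation set with replacement, computes the statistic for each resample, and uses the resulting distribution to estimate uncertainty. It makes no parametric assumptions about the shape of the sampling distribution, although it still assumes the validation examples are independent draws from the population you care about.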

```python
import numpy as np

np.random.seed(42)  # fixed seed so the numbers below are reproducible

# Simulated binary predictions: 1 = correct, 0 = incorrect
# Drawn with a true correct rate of 87% on 500 examples
n_val = 500
correct_rate = 0.87
predictions_correct = np.random.binomial(1, correct_rate, n_val)

def bootstrap_ci(data, statistic_func, n_bootstrap=10000, confidence=0.95):
    """Percentile bootstrap confidence interval for statistic_func(data)."""
    n = len(data)
    bootstrap_stats = []
    for _ in range(n_bootstrap):
        # Resample with replacement, keeping the original sample size
        resample = np.random.choice(data, size=n, replace=True)
        bootstrap_stats.append(statistic_func(resample))
    # Keep the middle `confidence` share of the bootstrap distribution
    alpha = 1 - confidence
    lower = np.percentile(bootstrap_stats, 100 * alpha / 2)
    upper = np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))
    return lower, upper

lower, upper = bootstrap_ci(predictions_correct, np.mean)
print(f"Point estimate: {predictions_correct.mean():.4f}")
print(f"95% Bootstrap CI: [{lower:.4f}, {upper:.4f}]")
```

```
Point estimate: 0.8740
95% Bootstrap CI: [0.8440, 0.9020]
```

Sample Size for a Target Margin of Error

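You want your accuracy estimate to have a margin of error $E$ no larger than 1 percentage point at 95% confidence. How many validation examples do you need? Using the conservative $\hat{p} = 0.5$:

$$n = \frac{z_{\alpha/2}^2\, \hat{p}(1-\hat{p})}{E^2} = \frac{1.96^2 \times 0.25}{0.01^2} = 9604$$

The relationship is $n \propto 1/E^2$: halving the margin of error quadruples the required sample size.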
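The sample-size calculation is easy to wrap in a small helper. The name `required_sample_size` and its parameters are hypothetical, and the function is a sketch of the Normal-approximation formula only, so it inherits that formula's limitations.

```python
import math
from statistics import NormalDist

def required_sample_size(margin, confidence=0.95, p=0.5):
    """Smallest n whose Normal-approximation margin of error is at most `margin`.

    p=0.5 is the conservative choice because p * (1 - p) is largest there.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

n_two_points = required_sample_size(0.02)  # margin of 2 percentage points
n_one_point = required_sample_size(0.01)   # margin of 1 percentage point

print(f"Margin 2 points: n = {n_two_points}")
print(f"Margin 1 point:  n = {n_one_point}")
print(f"Ratio: {n_one_point / n_two_points:.2f}")
```

If you already have a rough idea of the accuracy, passing it as `p` gives a smaller requirement than the conservative default.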

When Estimates Can Mislead

Estimates rest on random sampling and independence of observations. A validation set drawn from a different time period than your training data produces biased accuracy estimates — even if the formula is applied correctly. A confidence interval built on biased data is still biased, dressed up in statistical notation.

Always think about where your validation data came from before trusting any estimate. Distribution shift, data leakage, and label noise all corrupt the estimate before any math is applied.

Estimation builds directly on the Central Limit Theorem (post 1), which guarantees that sample means and proportions are approximately Normally distributed for large enough samples. That is what justifies the z-based confidence intervals above. Estimation is the prerequisite for hypothesis testing (post 3): every test statistic is a standardized estimate. It also connects forward to confidence intervals (post 11), where the mechanics of margin-of-error computation are developed in more detail, and to power analysis, where you learn how sample size controls the precision of future estimates before you collect data.

Honest Limitations

Point estimates and confidence intervals are only as good as the sampling procedure. They cannot fix selection bias, data leakage, or label noise. With very small validation sets (n < 30), or when accuracy is close to 0% or 100%, the Normal approximation for proportions breaks down; use exact binomial intervals (Clopper-Pearson) instead. And confidence intervals say nothing about practical significance: a statistically tight estimate of a tiny effect is still a tiny effect.
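For reference, a Clopper-Pearson interval can be computed from exact binomial tail probabilities without any statistics library. The sketch below finds each bound by bisection. The helper names and the example of 18 correct out of 20 are assumptions for illustration, and a library routine based on Beta quantiles would be the usual choice in practice.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def solve_decreasing(func, target, iters=100):
    """Bisection on [0, 1] for a function that decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if func(mid) > target:
            lo = mid   # func is still too large, so the root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, confidence=0.95):
    """Exact binomial confidence interval for k successes in n trials."""
    alpha = 1 - confidence
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

# Assumed small validation set: 18 correct out of 20.
k, n = 18, 20
exact_lower, exact_upper = clopper_pearson(k, n)

# Normal-approximation interval for comparison.
p_hat = k / n
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)

print(f"Clopper-Pearson: [{exact_lower:.3f}, {exact_upper:.3f}]")
print(f"Normal approx.:  [{p_hat - margin:.3f}, {p_hat + margin:.3f}]")
```

On this example the Normal-approximation upper bound exceeds 1, which is impossible for an accuracy, while the exact interval stays inside the valid range.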

Test Your Understanding

  1. Your model achieves 92% accuracy on 200 validation examples. Compute the 95% confidence interval for the true accuracy using the Normal approximation.
  2. A colleague reports 88% ± 3% accuracy but does not say what confidence level was used. What additional information do you need to interpret this interval?
  3. You want to estimate model accuracy within ±0.5 percentage points at 99% confidence. How many validation examples are required?
  4. Two models are tested on the same 1000-example validation set. Model A: 85%, Model B: 87%. Their 95% CIs overlap substantially. What does this tell you about whether the true accuracy difference is meaningful?
  5. Explain why the sample variance formula divides by $n-1$ rather than $n$. What goes wrong with ML evaluation if you always divide by $n$?
