Estimates and Estimators

Apr 11, 2026•11 min read•By Mohammed Vasim

Tags: Statistics, Math, Data Science

You want to know a model's true generalization accuracy — its expected performance on all possible test inputs. You cannot observe this directly. You have six CV fold scores. From these six numbers you must make the best possible inference about the unknown population mean μ. That is the estimation problem, and it has two branches: finding a single best guess (point estimation) and finding a plausible range (interval estimation).

The DS/ML Anchor

Six CV fold accuracy scores:

text

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

μ (true generalization accuracy) is unknown. These 6 folds are our sample. We want to estimate μ.

Point Estimators for μ

A point estimator is a function of the sample that produces a single-number guess for a population parameter.

Four common estimators, all computed on the anchor data:

Estimator	Formula	Anchor Value	When to Use
Sample mean (x̄)	Σxᵢ/n	(0.82+0.79+0.91+0.85+0.78+0.88)/6 = 0.838	Default for symmetric distributions
Sample median	Middle value (sorted)	sorted=[0.78,0.79,0.82,0.85,0.88,0.91]; median=(0.82+0.85)/2 = 0.835	When outliers are present
Trimmed mean (20%)	Drop the bottom/top 20% (⌊0.2×6⌋ = 1 score from each end), average the rest	Drop 0.78 and 0.91; mean=(0.79+0.82+0.85+0.88)/4 = 0.835	When some folds may be corrupted
Midrange	(max+min)/2	(0.91+0.78)/2 = 0.845	Rarely used — very sensitive to outliers

The four estimators give three distinct values (the median and trimmed mean happen to coincide at 0.835 here). There is no single "correct" estimator; the right choice depends on assumptions about the data.

Properties of Good Estimators

1. Unbiasedness

An estimator θ̂ is unbiased if E[θ̂] = θ — its expected value equals the true parameter.

Proof that x̄ is unbiased for μ:

E[x̄] = E[(1/n) × (X₁ + X₂ + ... + X_n)]

= (1/n) × Σ E[Xᵢ] (linearity of expectation)

= (1/n) × n × μ = μ ✓

Biased example — naive variance (dividing by n):

σ̂² = Σ(xᵢ−x̄)²/n underestimates σ² because x̄ is derived from the same sample (it minimizes the sum of squared deviations by construction). The corrected estimator s² = Σ(xᵢ−x̄)²/(n−1) is unbiased. See the Bessel's correction post for the full proof.

Biased-but-useful trade-off: the MLE of σ² (divides by n) is biased but has lower variance than s². Under Normal data its MSE is lower than that of s² for every n ≥ 2, and the gap is largest at small n. This is the estimation analogue of the ML bias-variance trade-off.
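The downward bias of the ÷n estimator can be checked by simulation. The sketch below assumes Normal data with an illustrative population SD of 0.05 (a made-up value, not estimated from the anchor) and the same n=6 as the CV folds. If the argument above holds, the ÷n estimator should average about (n−1)/n × σ², while the ÷(n−1) estimator should average about σ².

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                  # same sample size as the six CV folds
true_sigma = 0.05      # assumed population SD, chosen for illustration
n_sims = 200_000

# Each row is one simulated sample of n fold scores
samples = rng.normal(loc=0.85, scale=true_sigma, size=(n_sims, n))

var_div_n = samples.var(axis=1, ddof=0)           # divides by n
var_div_n_minus_1 = samples.var(axis=1, ddof=1)   # divides by n-1

true_var = true_sigma**2
mean_div_n = var_div_n.mean()
mean_div_n_minus_1 = var_div_n_minus_1.mean()

print(f"true sigma^2:              {true_var:.6f}")
print(f"average of /n estimate:    {mean_div_n:.6f}  (theory: {(n - 1) / n * true_var:.6f})")
print(f"average of /(n-1) estimate: {mean_div_n_minus_1:.6f}")
```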

2. Consistency

An estimator θ̂_n is consistent if θ̂_n converges to θ in probability as n → ∞.

The sample mean is consistent: by the Law of Large Numbers, x̄ → μ in probability as n → ∞.

Practical meaning for CV: with n=6 folds and the anchor's sample SD s ≈ 0.051 standing in for σ, SE = s/√6 ≈ 0.021, so your accuracy estimate has substantial uncertainty. With n=100 folds and the same spread, SE = s/√100 ≈ 0.005, about four times more precise.
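To see how the standard error shrinks with n, here is a small sketch that plugs the anchor's sample SD in for the unknown σ. It treats the observations as independent, which is an assumption (real CV folds share training data), and the helper name `standard_error` is only for illustration.

```python
import numpy as np

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
s = accuracy.std(ddof=1)   # sample SD, used as a plug-in for the unknown sigma

def standard_error(sd, n):
    """Standard error of the mean for n independent observations."""
    return sd / np.sqrt(n)

for n in (6, 25, 100, 400):
    print(f"n={n:>3}: SE = {standard_error(s, n):.4f}")

se_6 = standard_error(s, 6)
se_100 = standard_error(s, 100)
ratio = se_6 / se_100      # equals sqrt(100 / 6)
print(f"SE ratio (n=6 vs n=100): {ratio:.2f}")
```

Because SE scales with 1/√n, halving the uncertainty takes four times as many observations.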

3. Efficiency

Among all unbiased estimators, the most efficient one has the smallest variance.

The Cramér-Rao Lower Bound (CRLB) gives the minimum variance any unbiased estimator can achieve:

Var(θ̂) ≥ 1 / I(θ)

where I(θ) is the Fisher Information — a measure of how much information the data carries about θ.

For estimating μ in a Normal(μ, σ²) population from n independent observations, the Fisher information is I(μ) = n/σ², so the CRLB is σ²/n. The sample mean x̄ attains it exactly, with Var(x̄) = σ²/n, so no other unbiased estimator can do better. This is what makes x̄ the natural default: it is the Minimum-Variance Unbiased Estimator (MVUE) for Normal μ.

Practical implication: if you have two unbiased estimators (e.g., mean and median for symmetric data), choose the one with smaller variance — you get more information per observation.
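A simulation makes the comparison concrete. The sketch below assumes Normal data with illustrative values μ = 0.85 and σ = 0.05 (not estimated from the anchor) and compares the sampling variance of the mean and the median against the bound σ²/n.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
mu, sigma = 0.85, 0.05     # assumed population values, for illustration only
n_sims = 200_000

samples = rng.normal(mu, sigma, size=(n_sims, n))

var_of_mean = samples.mean(axis=1).var()
var_of_median = np.median(samples, axis=1).var()
crlb = sigma**2 / n        # 1 / I(mu), with I(mu) = n / sigma^2 for Normal data

print(f"CRLB (sigma^2/n):   {crlb:.7f}")
print(f"Var(sample mean):   {var_of_mean:.7f}")
print(f"Var(sample median): {var_of_median:.7f}")
print(f"Median/mean variance ratio: {var_of_median / var_of_mean:.3f}")
```

Both estimators are centered on μ for symmetric data, but under these assumptions the median should show a visibly larger variance, which is the sense in which the mean is more efficient.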

4. Sufficiency

A sufficient statistic captures all the information in the data about the parameter. For a Normal population with known σ², x̄ (together with n) is sufficient for μ: once you know x̄ and n, the individual fold scores tell you nothing more about μ.

For the accuracy anchor: when reporting to a colleague, providing x̄=0.838 and s=0.051 (with n=6) is sufficient for inference about μ and σ² under a Normal model; you do not need to share all six fold scores.

Sufficiency is why aggregated statistics (mean, SD) are standard reporting formats: not because we are hiding data, but because, under the assumed model, they contain the same information for the questions we are asking.
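One way to see sufficiency numerically: take two different samples with the same n and the same mean, and compare their Normal log-likelihoods as functions of μ. The sketch below assumes σ is known (0.05, an illustrative value), and the second set of fold scores is hypothetical, constructed only to share the anchor's mean. If x̄ is sufficient, the two log-likelihood curves should differ by a constant that does not depend on μ.

```python
import numpy as np

def normal_loglik(mu, data, sigma):
    """Normal log-likelihood of `data` at mean `mu`, with sigma treated as known."""
    n = len(data)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum((data - mu) ** 2) / (2 * sigma**2)

sigma = 0.05   # assumed known SD, for illustration
folds_a = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])   # anchor scores
folds_b = np.array([0.80, 0.81, 0.90, 0.86, 0.77, 0.89])   # hypothetical scores, same n and mean

mu_grid = np.linspace(0.70, 0.95, 26)
diff = np.array([
    normal_loglik(m, folds_a, sigma) - normal_loglik(m, folds_b, sigma)
    for m in mu_grid
])
spread = diff.max() - diff.min()   # ~0 if the difference is constant in mu

print(f"means: {folds_a.mean():.4f} vs {folds_b.mean():.4f}")
print(f"spread of log-likelihood difference across mu: {spread:.2e}")
```

A constant offset means both samples rank every candidate μ identically, so they support exactly the same inference about μ.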

Bias-Variance Tradeoff in Estimation

MSE(θ̂) = Bias(θ̂)² + Var(θ̂)

A biased estimator with low variance can achieve lower MSE than an unbiased estimator with high variance. Two concrete examples:

Example 1 — MLE vs unbiased σ²:

Estimator	Bias	Variance	MSE
MLE: Σ(xᵢ−x̄)²/n	−σ²/n	2(n−1)σ⁴/n²	(2n−1)σ⁴/n²
Unbiased: Σ(xᵢ−x̄)²/(n−1)	0	2σ⁴/(n−1)	2σ⁴/(n−1)

These formulas assume Normal data. For large n, both converge. For small n (like n=6), the unbiased estimator has higher variance and higher MSE despite zero bias: 2σ⁴/5 = 0.400σ⁴ versus 11σ⁴/36 ≈ 0.306σ⁴ for the MLE.
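The decomposition MSE = Bias² + Var and the small-n comparison can both be checked empirically. The following sketch assumes Normal data with an illustrative σ = 0.05 and n=6. It compares simulated MSEs with the exact Normal-theory values (2n−1)σ⁴/n² for the ÷n estimator and 2σ⁴/(n−1) for the ÷(n−1) estimator. The helper `summarize` is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
sigma = 0.05               # assumed population SD, for illustration
true_var = sigma**2
n_sims = 400_000

samples = rng.normal(0.85, sigma, size=(n_sims, n))

def summarize(estimates, truth):
    """Return (bias, variance, mse) of simulated estimates around `truth`."""
    bias = estimates.mean() - truth
    variance = estimates.var()
    mse = np.mean((estimates - truth) ** 2)
    return bias, variance, mse

bias_mle, var_mle, mse_mle = summarize(samples.var(axis=1, ddof=0), true_var)
bias_unb, var_unb, mse_unb = summarize(samples.var(axis=1, ddof=1), true_var)

# Closed forms under Normal data
mse_mle_theory = (2 * n - 1) * true_var**2 / n**2
mse_unb_theory = 2 * true_var**2 / (n - 1)

print(f"MLE (/n):     bias^2 + var = {bias_mle**2 + var_mle:.3e}, MSE = {mse_mle:.3e}, theory = {mse_mle_theory:.3e}")
print(f"Unbiased:     bias^2 + var = {bias_unb**2 + var_unb:.3e}, MSE = {mse_unb:.3e}, theory = {mse_unb_theory:.3e}")
```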

Example 2 — Ridge regression: Ridge introduces bias into coefficients (shrinks toward zero) in exchange for much lower variance. The biased regularized estimator often achieves lower MSE on new data — this is the ML bias-variance trade-off expressed in estimation theory.
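A toy simulation can illustrate the ridge trade-off. Everything in this sketch is an assumption made for illustration: two nearly collinear features, true coefficients (1, 1), unit-variance noise, and an arbitrarily chosen penalty of 1.0. It measures the mean squared error of the coefficient estimates themselves, not prediction error on new data, using the closed-form OLS and ridge solutions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_sims = 20, 2_000
beta_true = np.array([1.0, 1.0])   # assumed true coefficients, for illustration
lam = 1.0                          # ridge penalty, chosen arbitrarily

ols_sq_err, ridge_sq_err = [], []
for _ in range(n_sims):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)      # nearly collinear with x1
    X = np.column_stack([x1, x2])
    y = X @ beta_true + rng.normal(size=n)   # unit-variance noise

    gram = X.T @ X
    beta_ols = np.linalg.solve(gram, X.T @ y)                       # unbiased, high variance
    beta_ridge = np.linalg.solve(gram + lam * np.eye(2), X.T @ y)   # shrunk, lower variance

    ols_sq_err.append(np.sum((beta_ols - beta_true) ** 2))
    ridge_sq_err.append(np.sum((beta_ridge - beta_true) ** 2))

ols_mse = np.mean(ols_sq_err)
ridge_mse = np.mean(ridge_sq_err)
print(f"OLS coefficient MSE:   {ols_mse:.3f}")
print(f"Ridge coefficient MSE: {ridge_mse:.3f}")
```

With features this correlated, OLS coefficients swing widely from sample to sample, so a small amount of shrinkage bias buys a large reduction in variance. With well-conditioned features the gap would be much smaller.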

Maximum Likelihood Estimation (MLE)

Principle: choose the parameter value that makes the observed data most probable.

Likelihood: L(μ, σ² | data) = Π f(xᵢ | μ, σ²), the joint density of the observed data viewed as a function of the parameters. (For continuous data this is a density, not a probability.)

Log-likelihood (products → sums, easier to maximize):

ℓ(μ) = −n/2 × log(2πσ²) − Σ(xᵢ−μ)² / (2σ²)

Derive μ̂_MLE on the anchor:

∂ℓ/∂μ = Σ(xᵢ−μ) / σ² = 0

→ Σ(xᵢ−μ) = 0 → Σxᵢ = nμ → μ̂_MLE = x̄ = 0.838

For Normal distributions, the MLE of μ is the sample mean. This is not a coincidence — it is the formal justification for x̄ as the natural point estimate.

MLE of σ²: differentiating with respect to σ² and setting the derivative to zero gives σ̂²_MLE = Σ(xᵢ−x̄)²/n, which divides by n, not n−1. This is biased (it underestimates σ² on average), but it is what the maximum likelihood criterion selects. On the anchor, σ̂²_MLE ≈ 0.00218.
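The closed-form results can be cross-checked by maximizing the log-likelihood numerically. This is a minimal sketch, assuming a Normal model and using SciPy's general-purpose `minimize` with the Nelder-Mead method. Optimizing over log σ rather than σ is a convenience chosen here to keep σ positive, and the starting point is deliberately rough.

```python
import numpy as np
from scipy.optimize import minimize

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
n = len(accuracy)

def neg_log_likelihood(params):
    """Negative Normal log-likelihood; params = (mu, log_sigma)."""
    mu, log_sigma = params
    sigma_sq = np.exp(2 * log_sigma)
    return 0.5 * n * np.log(2 * np.pi * sigma_sq) + np.sum((accuracy - mu) ** 2) / (2 * sigma_sq)

result = minimize(
    neg_log_likelihood,
    x0=[0.5, np.log(0.2)],   # rough starting guess, far from the answer
    method="Nelder-Mead",
    options={"xatol": 1e-8, "fatol": 1e-10, "maxiter": 5000, "maxfev": 5000},
)
mu_hat, log_sigma_hat = result.x
sigma_sq_hat = np.exp(2 * log_sigma_hat)

print(f"numerical mu_hat:      {mu_hat:.6f}   (sample mean: {accuracy.mean():.6f})")
print(f"numerical sigma^2_hat: {sigma_sq_hat:.6f}   (/n variance: {accuracy.var(ddof=0):.6f})")
```

The optimizer should land on the sample mean and the ÷n variance, matching the derivation.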

Method of Moments (MoM)

Principle: match sample moments (mean, variance) to theoretical population moments and solve for parameters.

For Normal(μ, σ²):

First moment: E[X] = μ → set x̄ = μ → μ̂_MoM = x̄ = 0.838 (same as MLE)
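Second moment: E[X²] = σ² + μ² → set (1/n)Σxᵢ² = σ² + x̄² → σ̂²_MoM = (1/n)Σxᵢ² − x̄² = Σ(xᵢ−x̄)²/n ≈ 0.00218 (the ÷n version, same as the MLE; s² differs only by the factor n/(n−1))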
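Working directly from the raw sample moments shows which variance estimate the moment-matching equations produce. A short sketch on the anchor data:

```python
import numpy as np

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])

m1 = np.mean(accuracy)       # first sample moment
m2 = np.mean(accuracy**2)    # second sample moment (raw, not centered)

mu_mom = m1
sigma_sq_mom = m2 - m1**2    # solves E[X^2] = sigma^2 + mu^2 for sigma^2

print(f"mu_MoM      = {mu_mom:.4f}")
print(f"sigma^2_MoM = {sigma_sq_mom:.6f}")
print(f"/n variance = {accuracy.var(ddof=0):.6f}")
print(f"s^2 (/n-1)  = {accuracy.var(ddof=1):.6f}")
```

Matching raw moments gives the ÷n variance, so for the Normal model the MoM and MLE estimates coincide for both parameters.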

When MoM ≠ MLE:

Gamma distribution: MLE requires numerical optimization; MoM gives closed-form estimates from (x̄, s²) that are slightly less efficient but easy to compute.
Beta distribution: MoM solves a 2×2 system from (x̄, s²) immediately; MLE needs numerical iteration.

Use MoM when: MLE is analytically intractable or computationally expensive, and approximate efficiency is acceptable.
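As a sketch of the Gamma case, the snippet below simulates data from a Gamma distribution with assumed shape 2 and scale 3 (illustrative values), then estimates both parameters two ways: closed-form MoM from (x̄, s²), using mean = kθ and variance = kθ², and numerical MLE via `scipy.stats.gamma.fit` with the location fixed at 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
true_shape, true_scale = 2.0, 3.0   # assumed parameters, for illustration
data = rng.gamma(shape=true_shape, scale=true_scale, size=20_000)

# Method of moments: mean = k * theta, variance = k * theta^2
x_bar = data.mean()
s_sq = data.var(ddof=1)
shape_mom = x_bar**2 / s_sq
scale_mom = s_sq / x_bar

# MLE: numerical fit with the location parameter fixed at 0
shape_mle, _, scale_mle = stats.gamma.fit(data, floc=0)

print(f"MoM: shape = {shape_mom:.3f}, scale = {scale_mom:.3f}")
print(f"MLE: shape = {shape_mle:.3f}, scale = {scale_mle:.3f}")
```

Both approaches should land near the true values at this sample size. The MoM estimates need only two summary statistics, while the MLE runs an iterative optimizer.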

Estimator Properties Table

Property	Sample Mean (x̄)	Naive Variance (÷n)	Sample Variance (÷n−1)
Unbiased	Yes	No	Yes
Consistent	Yes	Yes	Yes
Efficient (among unbiased)	Yes (Normal μ)	n/a (biased)	Yes (MVUE for Normal σ², though it does not attain the CRLB)
MSE	σ²/n	(2n−1)σ⁴/n² (Normal data)	2σ⁴/(n−1) (Normal data)

Bias-Variance Trace Table

Estimator	Bias	Variance	MSE	Best when
Sample mean (x̄)	0	σ²/n	σ²/n	Default for μ
MLE of σ² (÷n)	−σ²/n	2(n−1)σ⁴/n²	(2n−1)σ⁴/n²	Low MSE matters more than unbiasedness
Sample variance (÷n−1)	0	2σ⁴/(n−1)	2σ⁴/(n−1)	Default for σ²
Ridge regr. coefficients	non-zero	low	often lower total	High-dim, correlated features

Code

python

import numpy as np
from scipy import stats

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
n = len(accuracy)

# Four point estimators
x_bar = accuracy.mean()
median = np.median(accuracy)
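# proportiontocut=0.2 drops int(0.2 * 6) = 1 score from each end;
# 0.1 would drop int(0.6) = 0 scores and simply return the plain mean
trim_mean = stats.trim_mean(accuracy, proportiontocut=0.2)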
midrange = (accuracy.max() + accuracy.min()) / 2

print("Point estimators:")
print(f"  Sample mean:   {x_bar:.4f}")
print(f"  Sample median: {median:.4f}")
print(f"  Trimmed mean:  {trim_mean:.4f}")
print(f"  Midrange:      {midrange:.4f}")

# Biased vs unbiased variance
var_biased   = np.var(accuracy, ddof=0)   # MLE: divides by n
var_unbiased = np.var(accuracy, ddof=1)   # sample variance: divides by n-1

print(f"\nVariance estimators (n={n}):")
print(f"  MLE (÷n):     {var_biased:.6f}  (biased downward)")
print(f"  Unbiased (÷n-1): {var_unbiased:.6f}")
print(f"  Ratio: {var_unbiased/var_biased:.4f}  (= n/(n-1) = {n/(n-1):.4f})")

# Unbiasedness verification by simulation
rng = np.random.default_rng(42)
true_mu, true_sigma = 0.85, 0.048
means = [rng.normal(true_mu, true_sigma, n).mean() for _ in range(10_000)]
print(f"\nUnbiasedness check (10k simulations):")
print(f"  E[x̄] = {np.mean(means):.4f}  (true μ = {true_mu})")

# MLE derivation
print(f"\nMLE for μ (Normal assumption):")
print(f"  ∂ℓ/∂μ = 0  →  μ̂_MLE = x̄ = {x_bar:.4f}")
print(f"  σ̂²_MLE (÷n) = {var_biased:.6f}")
print(f"  σ̂² unbiased (÷n-1) = {var_unbiased:.6f}")

# MSE comparison
sigma_sq = true_sigma**2
mle_bias = -sigma_sq / n
mle_var = 2 * (n - 1) * sigma_sq**2 / n**2   # exact under Normal data: Var((n-1)/n * s²)
mle_mse = mle_bias**2 + mle_var
unbiased_var = 2 * sigma_sq**2 / (n - 1)

print(f"\nMSE comparison (true σ²={sigma_sq:.6f}):")
print(f"  MLE σ̂²:      MSE = {mle_mse:.8f}")
print(f"  Unbiased s²: MSE = {unbiased_var:.8f}")
print(f"  Winner for n={n}: {'MLE' if mle_mse < unbiased_var else 'Unbiased'}")