
Bessel's Correction

Apr 11, 2026 | 7 min read | By mohammed.vasim
Tags: Statistics, Math, Data Science

When you compute variance in NumPy, you pass ddof=1 for sample variance and ddof=0 (NumPy's default) for population variance. Most people do this mechanically. But understanding why the denominator changes from $n$ to $n-1$ is one of those insights that permanently improves how you think about estimation from data.

The short version: when you use the sample mean to estimate variance, you introduce a systematic bias. Dividing by $n-1$ instead of $n$ corrects that bias. Without the correction, your variance estimate is too small on average, and small variance estimates make models look more stable than they are.

The Anchor Dataset

Throughout this post, every example uses six cross-validation accuracy scores from a classifier:

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

Mean:

$$\bar{x} = \frac{0.82 + 0.79 + 0.91 + 0.85 + 0.78 + 0.88}{6} = \frac{5.03}{6} \approx 0.838$$

Why the Bias Exists: The Core Intuition

Here is the fundamental problem. You want to measure how much your CV scores vary around the true population mean $\mu$, the model's real generalization accuracy. But you do not know $\mu$. You only know $\bar{x}$, the sample mean computed from those same six folds.

The sample mean is calculated from the data. By construction, it sits at least as close to the data, in the squared-deviation sense, as $\mu$ does. The sum of squared deviations from $\bar{x}$ is never larger than the sum of squared deviations from $\mu$.

Why? Because $\bar{x}$ is the value that minimizes the sum of squared deviations from the data points. $\mu$ does not have this property: it is the true center of the population, which may or may not be close to your specific sample. So when you compute:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$

you get a number that is never larger, and almost always smaller, than:

$$\sum_{i=1}^{n} (x_i - \mu)^2$$

Dividing the smaller sum by $n$ gives a variance estimate that is systematically too small. Bessel's correction compensates by shrinking the denominator to $n-1$, which inflates the estimate by just enough.
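The minimizing property can be checked numerically. The sketch below compares the sum of squared deviations around the sample mean with the sum around a few candidate values for the true mean. The candidate values are hypothetical and chosen only for illustration, since the real population mean is unknown. It also prints the identity that explains the gap: the sum around any other center equals the sum around the sample mean plus $n$ times the squared distance between the two centers.

```python
import numpy as np

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
x_bar = accuracy.mean()


def sum_sq_dev(data, center):
    """Sum of squared deviations of `data` around `center`."""
    return float(np.sum((data - center) ** 2))


ss_mean = sum_sq_dev(accuracy, x_bar)
print(f"SS around sample mean: {ss_mean:.6f}")

# Hypothetical candidates for the unknown true mean (illustration only).
for mu_guess in (0.80, 0.83, 0.85, 0.90):
    ss_mu = sum_sq_dev(accuracy, mu_guess)
    # Identity: SS around mu = SS around x_bar + n * (x_bar - mu)^2
    gap = len(accuracy) * (x_bar - mu_guess) ** 2
    print(f"mu={mu_guess:.2f}  SS(mu)={ss_mu:.6f}  SS(x_bar)+gap={ss_mean + gap:.6f}")
```

Every candidate gives a sum at least as large as the one around the sample mean, because the extra term is a square and cannot be negative.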

A Concrete Example with the CV Scores

With our six folds, the sum of squared deviations from $\bar{x}$ is:

$$\sum_{i=1}^{6} (x_i - \bar{x})^2 \approx 0.013083$$

| Denominator | Calculation | Result | Note |
|---|---|---|---|
| $n = 6$ | 0.013083 / 6 | 0.002181 | Biased, too small |
| $n - 1 = 5$ | 0.013083 / 5 | 0.002617 | Unbiased estimate |

Dividing by $n$ gives 0.002181. Dividing by $n-1$ gives 0.002617. The uncorrected estimate is about 17% lower. With only 6 folds, that is a meaningful gap.
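The same arithmetic can be reproduced by hand in a few lines. This is a minimal sketch using only the standard library. It cross-checks the manual results against `statistics.pvariance` (which divides by the number of observations) and `statistics.variance` (which divides by one fewer).

```python
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
n = len(accuracy)

x_bar = sum(accuracy) / n
ss = sum((x - x_bar) ** 2 for x in accuracy)  # sum of squared deviations

biased = ss / n          # divide by n
unbiased = ss / (n - 1)  # divide by n - 1 (Bessel's correction)

print(f"sum of squares: {ss:.6f}")
print(f"divide by n:    {biased:.6f}")
print(f"divide by n-1:  {unbiased:.6f}")

# Standard-library equivalents of the two manual results.
print(f"pvariance:      {statistics.pvariance(accuracy):.6f}")
print(f"variance:       {statistics.variance(accuracy):.6f}")
```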

The Degrees of Freedom Explanation

There is a second, equivalent way to understand $n-1$: degrees of freedom.

You have $n = 6$ data points. Once you have computed the sample mean $\bar{x}$, a constraint is imposed on your data:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

This is always exactly zero — the deviations above the mean and below the mean cancel out perfectly. This means: if you know the mean and any five of the six deviations, the sixth deviation is completely determined. You have no freedom to choose it.

So out of six deviations, only five are "free to vary" independently. You are dividing by the number of independent pieces of information, which is $n - 1 = 5$, not $n = 6$.

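Six deviations from the mean ($\bar{x} \approx 0.838$). Five are free; the sixth is determined.

| Fold | Deviation | Status |
|---|---|---|
| 1 | -0.018 | free |
| 2 | -0.048 | free |
| 3 | +0.072 | free |
| 4 | +0.012 | free |
| 5 | -0.058 | free |
| 6 | +0.042 | forced |

The constraint that all six sum to zero forces the last deviation. Check: -0.018 - 0.048 + 0.072 + 0.012 - 0.058 + 0.042 = 0.002 ≈ 0. The small residual comes from rounding the mean to 0.838.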
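The constraint is easy to verify with the unrounded mean. The sketch below computes the six deviations, confirms that they sum to zero up to floating-point error, and then recovers the sixth deviation from the first five alone.

```python
import numpy as np

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
deviations = accuracy - accuracy.mean()

# With the unrounded mean, the deviations sum to zero up to float error.
total = deviations.sum()

# Knowing any five deviations pins down the sixth: it must cancel their sum.
first_five = deviations[:5]
forced_sixth = -first_five.sum()

print(f"sum of deviations:        {total:.2e}")
print(f"sixth deviation (actual): {deviations[5]:+.6f}")
print(f"sixth deviation (forced): {forced_sixth:+.6f}")
```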

More formally: degrees of freedom is the number of independent pieces of information available after estimating a parameter. Estimating variance required first estimating the mean. That consumed one degree of freedom, leaving $n - 1$ for variance estimation.

The Math: Why n-1 Gives an Unbiased Estimate

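The expected value of the sample variance with denominator $n-1$ equals the population variance:

$$E\left[\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \sigma^2$$

With denominator $n$ instead:

$$E\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2$$

The factor $\frac{n-1}{n}$ is always less than 1. Dividing by $n$ gives you, on average, $\frac{n-1}{n}$ of the true variance, which is an underestimate for every finite $n$.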
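For readers who want the missing step, here is a short derivation sketch. It assumes the observations are independent and identically distributed with mean $\mu$ and finite variance $\sigma^2$, an assumption that is stated here rather than taken from the data.

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} (x_i - \mu)^2 - n(\bar{x} - \mu)^2 \\[4pt]
E\left[\sum_{i=1}^{n} (x_i - \mu)^2\right]
  &= n\sigma^2 \\[4pt]
E\left[n(\bar{x} - \mu)^2\right]
  &= n \operatorname{Var}(\bar{x}) = n \cdot \frac{\sigma^2}{n} = \sigma^2 \\[4pt]
E\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]
  &= n\sigma^2 - \sigma^2 = (n-1)\sigma^2
\end{aligned}
```

The expected sum of squares is $(n-1)\sigma^2$, so dividing by $n-1$ recovers $\sigma^2$, while dividing by $n$ leaves the factor $\frac{n-1}{n}$.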

Does This Actually Matter in Practice?

For our 6-fold CV scenario:

$$\frac{n-1}{n} = \frac{5}{6} \approx 0.833$$

So dividing by $n$ gives you, on average, only 83% of the true variance: a 17% underestimate. The uncorrected standard deviation would be 0.0467 instead of 0.0512.

| $n$ | $\frac{n-1}{n}$ | Underestimate |
|---|---|---|
| 6 | 0.833 | 17% |
| 10 | 0.900 | 10% |
| 20 | 0.950 | 5% |
| 50 | 0.980 | 2% |
| 100 | 0.990 | 1% |

With large CV fold counts, the correction barely matters. With 5–10 folds — which is typical — it matters enough to report.
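The table follows directly from the bias factor, so it can be regenerated for any fold count. A minimal sketch follows; the helper name `bias_factor` is introduced here for illustration only.

```python
def bias_factor(n: int) -> float:
    """Expected ratio of the divide-by-n variance to the true variance."""
    return (n - 1) / n


for n in (6, 10, 20, 50, 100):
    factor = bias_factor(n)
    print(f"n={n:>3}  (n-1)/n={factor:.3f}  underestimate={1 - factor:.1%}")
```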

Python Demonstration

```python
import numpy as np

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])

var_n = np.var(accuracy, ddof=0)
var_n1 = np.var(accuracy, ddof=1)
std_n = np.std(accuracy, ddof=0)
std_n1 = np.std(accuracy, ddof=1)

print(f"Variance (ddof=0, divide by n):   {var_n:.6f}")
print(f"Variance (ddof=1, divide by n-1): {var_n1:.6f}")
print(f"Ratio:                            {var_n / var_n1:.4f}")
print(f"Expected ratio (n-1)/n:           {5/6:.4f}")
print()
print(f"Std dev (ddof=0): {std_n:.4f}")
print(f"Std dev (ddof=1): {std_n1:.4f}")
```

```
Variance (ddof=0, divide by n):   0.002181
Variance (ddof=1, divide by n-1): 0.002617
Ratio:                            0.8333
Expected ratio (n-1)/n:           0.8333

Std dev (ddof=0): 0.0467
Std dev (ddof=1): 0.0512
```

The ratio of the two variances is exactly $\frac{n-1}{n} = \frac{5}{6}$. This confirms that dividing by $n$ gives exactly $\frac{n-1}{n}$ of the unbiased estimate, every time.
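The fixed ratio between the two estimates is algebra. The claim that the divide-by-$n$ version falls short of the true variance on average can be checked by simulation. The sketch below assumes a normal population with mean 0.84 and standard deviation 0.05. Both values are made up for illustration, and real accuracy scores are bounded and need not be normal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 6, 200_000
true_var = 0.05 ** 2  # assumed population variance, for illustration only

# Each row is one simulated set of six scores from the assumed population.
samples = rng.normal(loc=0.84, scale=0.05, size=(trials, n))

# Average each estimator over all simulated samples.
mean_biased = samples.var(axis=1, ddof=0).mean()
mean_unbiased = samples.var(axis=1, ddof=1).mean()

print(f"divide by n:   {mean_biased / true_var:.3f} of true variance")
print(f"divide by n-1: {mean_unbiased / true_var:.3f} of true variance")
```

Under these assumptions the first ratio should land near 5/6 and the second near 1, up to simulation noise.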

Calculation Trace

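| Phase | Formula | Values | Result |
|---|---|---|---|
| Sum of squared deviations | $\sum (x_i - \bar{x})^2$ | Six squared deviations | ≈ 0.01308 |
| Population variance (divide by n) | $\frac{1}{n}\sum (x_i - \bar{x})^2$ | 0.01308 / 6 | ≈ 0.00218 (biased estimate) |
| Sample variance (divide by n-1) | $\frac{1}{n-1}\sum (x_i - \bar{x})^2$ | 0.01308 / 5 | ≈ 0.00262 (unbiased estimate) |
| Bias factor | $\frac{n-1}{n}$ | 5 / 6 | ≈ 0.833 |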

When Dividing by n Is Correct

If you truly have data for the entire population, not a sample, divide by $n$. This is rare in practice. If you have accuracy scores for every possible test-set split you will ever evaluate (not just a sample of folds), those are population values and division by $n$ is correct.

In ML, you are almost always working with samples. Use $n-1$.

This post explains a correction that appeared in the dispersion post: why np.var(..., ddof=1) is the right call for sample data. The next post on standard deviation builds directly on this: the sample standard deviation is $s = \sqrt{s^2}$ with $s^2$ computed using $n-1$. From here, degrees of freedom become a concept you will encounter repeatedly: in t-tests (where the t-distribution has $n-1$ degrees of freedom), in chi-squared tests, and in ANOVA. The intuition is always the same: each estimated parameter consumes a degree of freedom.

When This Framework Breaks Down

Bessel's correction gives an unbiased estimate of variance, but "unbiased" does not mean "accurate for small samples." With only a handful of CV folds, even the corrected variance estimate is noisy enough to be nearly meaningless. The standard error of the variance estimator itself is large when $n$ is small. With fewer than 10 folds, bootstrap resampling gives a more honest picture of variance uncertainty than the formula alone.

Also, Bessel's correction assumes the observations are independent. In repeated cross-validation with overlapping folds, this assumption is violated, and the formula underestimates variance.
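One way to see how uncertain the variance estimate is with six folds is a percentile bootstrap. The sketch below is a rough illustration, not a recommended procedure. The replicate count and the 95% level are arbitrary choices. The bootstrap also treats the folds as independent, so it does not address the independence problem. With only six points the interval itself is crude.

```python
import numpy as np

rng = np.random.default_rng(0)
accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
n_boot = 10_000  # number of bootstrap replicates (arbitrary choice)

# Resample the six scores with replacement, one row per bootstrap replicate.
idx = rng.integers(0, len(accuracy), size=(n_boot, len(accuracy)))
boot_vars = accuracy[idx].var(axis=1, ddof=1)

# Percentile interval for the sample variance.
lo, hi = np.percentile(boot_vars, [2.5, 97.5])

print(f"sample variance (ddof=1): {accuracy.var(ddof=1):.6f}")
print(f"95% bootstrap interval:   [{lo:.6f}, {hi:.6f}]")
```

A wide interval relative to the point estimate is the practical meaning of "noisy" here.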

Test Your Understanding

  1. You have accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]. Compute the sample variance manually by dividing the sum of squared deviations by both $n$ and $n-1$. What is the percentage difference?

  2. Why can you not compute a meaningful sample variance from a single observation ($n = 1$)? What does the formula tell you about this case?

  3. A colleague argues: "With 100 CV folds, the correction barely matters. 99/100 is nearly 1." They propose using ddof=0 for simplicity. At what sample size does the bias from using $n$ instead of $n-1$ drop below 1%? Is the colleague's argument valid for the 100-fold case they describe?

  4. The sample mean $\bar{x}$ is said to be an "unbiased estimator" of $\mu$. Is the sample standard deviation $s$ an unbiased estimator of $\sigma$? (Hint: taking the square root of an unbiased estimator does not give an unbiased estimator.)


Once this clicks, you might want to explore Standard Deviation — which takes the variance's squared units and turns them back into something interpretable.

