Bessel's Correction
When you compute variance in NumPy, you pass `ddof=1` for sample variance and `ddof=0` for population variance. Most people do this mechanically. But understanding why the denominator changes from $n$ to $n-1$ is one of those insights that permanently improves how you think about estimation from data.
The short version: when you use the sample mean to estimate variance, you introduce a systematic bias. Dividing by $n-1$ instead of $n$ corrects that bias. Without the correction, your variance estimate is too small on average, and small variance estimates make models look more stable than they are.
The Anchor Dataset
Throughout this post, every example uses six cross-validation accuracy scores from a classifier:
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
Mean: $\bar{x} = 5.03 / 6 \approx 0.8383$
Why the Bias Exists: The Core Intuition
Here is the fundamental problem. You want to measure how much your CV scores vary around the true population mean $\mu$, the model's real generalization accuracy. But you do not know $\mu$. You only know $\bar{x}$, the sample mean computed from those same six folds.
The sample mean is calculated from the data. By construction, it sits at least as close to the data as $\mu$ does: the sum of squared deviations from $\bar{x}$ is never larger than the sum of squared deviations from $\mu$.
Why? Because $\bar{x}$ is the value that minimizes the sum of squared deviations from the data points. $\mu$ does not have this property: it is the true center of the population, which may or may not be close to your specific sample. So when you compute:
$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$
you get a number no larger than:
$$\sum_{i=1}^{n} (x_i - \mu)^2$$
Dividing the smaller sum by $n$ gives a variance estimate that is systematically too small. Bessel's correction compensates by shrinking the denominator to $n-1$, which inflates the estimate.
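The minimizing property can be checked numerically. The sketch below is illustrative only: the candidate values for $\mu$ are hypothetical (the true mean is unknown), and `sum_sq_dev` is a helper name introduced here, not something from NumPy.

```python
import numpy as np

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
x_bar = accuracy.mean()

def sum_sq_dev(data, center):
    """Sum of squared deviations of data around an arbitrary center."""
    return float(np.sum((data - center) ** 2))

ss_around_mean = sum_sq_dev(accuracy, x_bar)

# Hypothetical candidates for the unknown true mean mu (illustrative only).
candidate_mus = [0.80, 0.83, 0.85, 0.90]
for mu in candidate_mus:
    ss_around_mu = sum_sq_dev(accuracy, mu)
    # Identity: SS(mu) = SS(x_bar) + n * (x_bar - mu)^2, so SS(mu) >= SS(x_bar).
    gap = len(accuracy) * (x_bar - mu) ** 2
    print(f"mu={mu:.2f}  SS(mu)={ss_around_mu:.6f}  "
          f"SS(x_bar)={ss_around_mean:.6f}  gap={gap:.6f}")
```

Whatever candidate you try, the sum around $\bar{x}$ is the smallest, and the gap grows with the squared distance between $\bar{x}$ and the candidate.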
A Concrete Example with the CV Scores
With our six folds, the sum of squared deviations from $\bar{x}$ is:
$$\sum_{i=1}^{6} (x_i - \bar{x})^2 \approx 0.013083$$
Dividing by $n = 6$ gives 0.002181. Dividing by $n - 1 = 5$ gives 0.002617. The uncorrected estimate is about 17% lower. With only 6 folds, that is a meaningful underestimate of the model's true variance.
The Degrees of Freedom Explanation
There is a second, equivalent way to understand $n-1$: degrees of freedom.
You have $n = 6$ data points. Once you have computed the sample mean $\bar{x}$, a constraint is imposed on your data:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
This is always exactly zero — the deviations above the mean and below the mean cancel out perfectly. This means: if you know the mean and any five of the six deviations, the sixth deviation is completely determined. You have no freedom to choose it.
So out of six deviations, only five are "free to vary" independently. You are dividing by the number of independent pieces of information, which is $n - 1 = 5$, not $n = 6$.
More formally: degrees of freedom is the number of independent pieces of information available after estimating a parameter. Estimating variance required first estimating the mean, which consumed one degree of freedom, leaving $n - 1$ for variance estimation.
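The constraint is easy to verify on the anchor dataset. This is a small sketch; the variable names are chosen here for illustration.

```python
import numpy as np

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
deviations = accuracy - accuracy.mean()

# The deviations sum to zero (up to floating-point rounding).
total = deviations.sum()

# Given any five deviations, the sixth is forced to be minus their sum.
first_five = deviations[:5]
implied_sixth = -first_five.sum()

print(f"Sum of deviations:      {total:.2e}")
print(f"Actual sixth deviation: {deviations[5]:.6f}")
print(f"Implied from the rest:  {implied_sixth:.6f}")
```

The last two printed values agree, which is the "no freedom to choose it" statement in numerical form.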
The Math: Why n-1 Gives an Unbiased Estimate
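The expected value of the sample variance with denominator $n-1$ equals the population variance:
$$E\left[\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \sigma^2$$
With denominator $n$ instead:
$$E\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2$$
The factor $\frac{n-1}{n}$ is always less than 1. Dividing by $n$ gives you, on average, $\frac{n-1}{n}$ of the true variance: a systematic underestimate.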
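One way to see where the factor comes from is the short derivation sketched below. It assumes the $x_i$ are independent draws from a population with mean $\mu$ and finite variance $\sigma^2$; independence is what gives $\mathrm{Var}(\bar{x}) = \sigma^2 / n$.

$$
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2 &= \sum_{i=1}^{n} (x_i - \mu)^2 - n(\bar{x} - \mu)^2 \\
E\left[\sum_{i=1}^{n} (x_i - \mu)^2\right] &= n\sigma^2 \\
E\left[n(\bar{x} - \mu)^2\right] &= n \cdot \frac{\sigma^2}{n} = \sigma^2 \\
E\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right] &= n\sigma^2 - \sigma^2 = (n-1)\,\sigma^2
\end{aligned}
$$

Dividing the last line by $n-1$ gives exactly $\sigma^2$, while dividing by $n$ gives $\frac{n-1}{n}\sigma^2$.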
Does This Actually Matter in Practice?
For our 6-fold CV scenario:
$$\frac{n-1}{n} = \frac{5}{6} \approx 0.833$$
So dividing by $n$ gives you only 83% of the true variance, a 17% underestimate. The uncorrected standard deviation would be $0.0467$ instead of $0.0512$.
| $n$ | $\frac{n-1}{n}$ | Underestimate |
|---|---|---|
| 6 | 0.833 | 17% |
| 10 | 0.900 | 10% |
| 20 | 0.950 | 5% |
| 50 | 0.980 | 2% |
| 100 | 0.990 | 1% |
With large CV fold counts, the correction barely matters. With 5–10 folds — which is typical — it matters enough to report.
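The table can be reproduced by simulation. The sketch below makes assumptions that are not in the post: scores are drawn from a normal distribution with a made-up mean of 0.84 and standard deviation of 0.05, and 50,000 simulated samples are used per sample size. The bias factor itself does not depend on the normal assumption; that choice just makes sampling easy.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

true_sigma = 0.05               # assumed population std dev (illustrative)
true_var = true_sigma ** 2
n_trials = 50_000               # simulated samples per sample size

ratios = {}
for n in [6, 10, 20, 50, 100]:
    # Each row is one simulated sample of n scores.
    samples = rng.normal(loc=0.84, scale=true_sigma, size=(n_trials, n))
    mean_biased = np.var(samples, axis=1, ddof=0).mean()
    mean_unbiased = np.var(samples, axis=1, ddof=1).mean()
    ratios[n] = (mean_biased / true_var, mean_unbiased / true_var)
    print(f"n={n:3d}  biased/true={ratios[n][0]:.3f}  "
          f"unbiased/true={ratios[n][1]:.3f}  (n-1)/n={(n - 1) / n:.3f}")
```

Averaged over many samples, the `ddof=0` column should track $(n-1)/n$ and the `ddof=1` column should sit near 1, up to simulation noise.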
Python Demonstration
```python
import numpy as np

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])

# ddof=0 divides by n; ddof=1 divides by n-1 (Bessel's correction).
var_n = np.var(accuracy, ddof=0)
var_n1 = np.var(accuracy, ddof=1)
std_n = np.std(accuracy, ddof=0)
std_n1 = np.std(accuracy, ddof=1)

print(f"Variance (ddof=0, divide by n): {var_n:.6f}")
print(f"Variance (ddof=1, divide by n-1): {var_n1:.6f}")
print(f"Ratio: {var_n / var_n1:.4f}")
print(f"Expected ratio (n-1)/n: {5/6:.4f}")
print()
print(f"Std dev (ddof=0): {std_n:.4f}")
print(f"Std dev (ddof=1): {std_n1:.4f}")
```
```
Variance (ddof=0, divide by n): 0.002181
Variance (ddof=1, divide by n-1): 0.002617
Ratio: 0.8333
Expected ratio (n-1)/n: 0.8333

Std dev (ddof=0): 0.0467
Std dev (ddof=1): 0.0512
```
The ratio of the two variances is exactly $\frac{n-1}{n} = \frac{5}{6}$. This confirms that dividing by $n$ gives exactly $\frac{n-1}{n}$ of the Bessel-corrected estimate, every time.
Calculation Trace
| Phase | Formula | Values | Result |
|---|---|---|---|
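| Sum of squared deviations | $\sum_{i=1}^{n} (x_i - \bar{x})^2$ | Six squared deviations | $0.013083$ |
| Population variance (divide by n) | $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ | Biased estimate | $0.013083 / 6 \approx 0.002181$ |
| Sample variance (divide by n-1) | $\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ | Unbiased estimate | $0.013083 / 5 \approx 0.002617$ |
| Bias factor | $\frac{n-1}{n}$ | $5/6$ | $0.8333$ |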
When Dividing by n Is Correct
If you truly have data for the entire population (not a sample), divide by $n$. This is rare in practice. If you have accuracy scores for every possible test-set split you will ever evaluate (not just a sample of folds), those are population values and division by $n$ is correct.
In ML, you are almost always working with samples. Use $n-1$.
Related Concepts
This post explains a correction that appeared in the dispersion post: why `np.var(..., ddof=1)` is the right call for sample data. The next post on standard deviation builds directly on this: the sample standard deviation is $s = \sqrt{s^2}$ with $s^2$ computed using $n-1$. From here, degrees of freedom become a concept you will encounter repeatedly: in t-tests (where the t-distribution has $n-1$ degrees of freedom), in chi-squared tests, and in ANOVA. The intuition is always the same: each estimated parameter consumes a degree of freedom.
When This Framework Breaks Down
Bessel's correction gives an unbiased estimate of variance, but "unbiased" does not mean "accurate for small samples." With only a handful of CV folds, even the corrected variance estimate is noisy enough to be nearly meaningless. The standard error of the variance estimator itself is large when $n$ is small. With fewer than 10 folds, bootstrap resampling gives a more honest picture of variance uncertainty than the formula alone.
Also, Bessel's correction assumes the observations are independent. In repeated cross-validation with overlapping folds, this assumption is violated, and the formula underestimates variance.
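As a rough sketch of the bootstrap idea mentioned above, the code below resamples the six anchor scores with replacement and looks at how much the corrected variance moves. The percentile interval, the 10,000 resamples, and the seed are choices made here for illustration, not a prescribed method, and the bootstrap itself treats the folds as independent.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
n_boot = 10_000                  # number of bootstrap resamples (arbitrary)

# Draw n_boot resamples of size n with replacement; one resample per row.
idx = rng.integers(0, len(accuracy), size=(n_boot, len(accuracy)))
boot_vars = np.var(accuracy[idx], axis=1, ddof=1)

point_estimate = np.var(accuracy, ddof=1)
lower, upper = np.percentile(boot_vars, [2.5, 97.5])

print(f"Sample variance (ddof=1): {point_estimate:.6f}")
print(f"95% percentile interval:  [{lower:.6f}, {upper:.6f}]")
```

If the interval is wide relative to the point estimate, that is the small-sample noise described above made visible.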
Test Your Understanding
- You have `accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]`. Compute the sample variance manually by dividing the sum of squared deviations by both $n$ and $n-1$. What is the percentage difference?
- Why can you not compute a meaningful sample variance from a single observation ($n = 1$)? What does the formula tell you about this case?
- A colleague argues: "With 100 CV folds, the correction barely matters — 99/100 is nearly 1." They propose using `ddof=0` for simplicity. At what sample size does the bias from using $n$ instead of $n-1$ drop below 1%? Is the colleague's argument valid at their fold count?
- The sample mean $\bar{x}$ is said to be an "unbiased estimator" of $\mu$. Is the sample standard deviation $s$ an unbiased estimator of $\sigma$? (Hint: taking the square root of an unbiased estimator does not give an unbiased estimator.)
Once this clicks, you might want to explore Standard Deviation — which takes the variance's squared units and turns them back into something interpretable.