Bessel's Correction

Apr 11, 2026 • 9 min read • By Mohammed Vasim

Statistics • Math • Data Science

When you compute variance in NumPy, you pass ddof=1 for sample variance and ddof=0 for population variance. Most people do this mechanically. But understanding why the denominator changes from $n$ to $n - 1$ is one of those insights that permanently improves how you think about estimation from data.

The short version: when you use the sample mean to estimate variance, you introduce a systematic bias. Dividing by $n - 1$ instead of $n$ corrects that bias. Without the correction, your variance estimate is too small on average, and small variance estimates make models look more stable than they are.

The Anchor Dataset

Throughout this post, every example uses six cross-validation accuracy scores from a classifier:

```python
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
```

Mean: $\bar{x} = 0.838$

Why the Bias Exists: The Core Intuition

Here is the fundamental problem. You want to measure how much your CV scores vary around the true population mean $μ$, the model's real generalization accuracy. But you do not know $μ$. You only know $\bar{x}$, the sample mean computed from those same six folds.

The sample mean $\bar{x}$ is calculated from the data, so it sits at the center of this particular sample. As a result, the sum of squared deviations from $\bar{x}$ is never larger than the sum of squared deviations from $μ$, and it is strictly smaller whenever $\bar{x} \neq μ$. (Individual deviations can go either way; the statement is about the sum.)

Why? Because $\bar{x}$ is the value that minimizes the sum of squared deviations from the data points. $μ$ does not have this property: it is the true center of the population, which may or may not be close to the center of your specific sample.

Concrete demonstration: suppose the true population mean is μ = 0.84 (a reasonable guess for this model's true generalization accuracy). The sample mean from our 6 folds is x̄ = 0.838.

$\sum (x_{i} - μ)^{2} = (0.82 - 0.84)^{2} + (0.79 - 0.84)^{2} + (0.91 - 0.84)^{2} + (0.85 - 0.84)^{2} + (0.78 - 0.84)^{2} + (0.88 - 0.84)^{2}$ $= 0.0004 + 0.0025 + 0.0049 + 0.0001 + 0.0036 + 0.0016 = 0.013100$

$\sum (x_{i} - \bar{x})^{2} = 0.013084$ (worked out term by term in the next section)

Σ(xᵢ − x̄)² = 0.013084 < Σ(xᵢ − μ)² = 0.013100. The sample mean produces a smaller sum of squared deviations than the true mean, by construction. Dividing 0.013084 by n therefore tends to underestimate the true variance. Bessel's correction compensates by shrinking the denominator from n to n−1, which inflates the estimate.
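
The minimizing property can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper `sum_sq_dev`, the grid of candidate centers, and the value `mu_assumed = 0.84` are illustrative choices taken from the example above, not part of any library.

```python
import numpy as np

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])


def sum_sq_dev(center):
    """Sum of squared deviations of the scores from a given center."""
    return float(np.sum((accuracy - center) ** 2))


x_bar = accuracy.mean()
mu_assumed = 0.84  # hypothetical "true" mean, as assumed in the text

# Scan a grid of candidate centers and find the one with the smallest sum.
candidates = np.linspace(0.78, 0.91, 1301)
sums = np.array([sum_sq_dev(c) for c in candidates])
best = candidates[np.argmin(sums)]

print(f"SS around x_bar = {x_bar:.4f}: {sum_sq_dev(x_bar):.6f}")
print(f"SS around mu    = {mu_assumed:.4f}: {sum_sq_dev(mu_assumed):.6f}")
print(f"Grid minimizer:   {best:.4f}")
```

The grid minimizer should land next to the sample mean, not next to the assumed $μ$. With the unrounded mean the sum comes out near 0.013083; the 0.013084 quoted in the text comes from using the rounded mean 0.838.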

A Concrete Example with the CV Scores

With our six folds, the sum of squared deviations from $\bar{x} = 0.838$ is:

$\sum_{i = 1}^{6} (x_{i} - 0.838)^{2} = (- 0.018)^{2} + (- 0.048)^{2} + (0.072)^{2} + (0.012)^{2} + (- 0.058)^{2} + (0.042)^{2}$

$= 0.000324 + 0.002304 + 0.005184 + 0.000144 + 0.003364 + 0.001764 = 0.013084$

Dividing by $n = 6$ gives 0.002181. Dividing by $n - 1 = 5$ gives 0.002617. The uncorrected estimate is about 17% lower. With only 6 folds, that is a meaningful underestimate of the model's true variance.
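
The same arithmetic can be reproduced in plain Python. This is a small sketch with illustrative variable names; it uses the unrounded mean, so the last digit of the sum may differ slightly from the hand calculation.

```python
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
n = len(accuracy)

x_bar = sum(accuracy) / n
ss = sum((x - x_bar) ** 2 for x in accuracy)  # sum of squared deviations

var_biased = ss / n          # divide by n
var_unbiased = ss / (n - 1)  # divide by n - 1 (Bessel's correction)

# Relative shortfall of the uncorrected estimate; algebraically this is 1/n.
shortfall = 1 - var_biased / var_unbiased

print(f"Sum of squared deviations: {ss:.6f}")
print(f"Divide by n:     {var_biased:.6f}")
print(f"Divide by n - 1: {var_unbiased:.6f}")
print(f"Uncorrected estimate is {shortfall:.1%} lower")
```

Because both estimates share the same numerator, the shortfall depends only on $n$, not on the scores themselves.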

The Degrees of Freedom Explanation

There is a second, equivalent way to understand $n - 1$ : degrees of freedom.

You have $n = 6$ data points. Once you have computed the sample mean $\bar{x}$, a constraint is imposed on your data:

$\sum_{i = 1}^{n} (x_{i} - \bar{x}) = 0$

This is always exactly zero — the deviations above the mean and below the mean cancel out perfectly. This means: if you know the mean and any five of the six deviations, the sixth deviation is completely determined. You have no freedom to choose it.

So out of six deviations, only five are "free to vary" independently. You are dividing by the number of independent pieces of information — which is $n - 1 = 5$ , not $n = 6$ .

More formally: degrees of freedom is the number of independent pieces of information available after estimating a parameter. Estimating variance required first estimating the mean — that consumed one degree of freedom, leaving $n - 1$ for variance estimation.
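
The constraint is easy to see on the anchor dataset. The sketch below, assuming NumPy, hides the sixth deviation and recovers it from the other five; the variable names are illustrative.

```python
import numpy as np

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
deviations = accuracy - accuracy.mean()

# The deviations sum to zero (up to floating-point rounding).
print(f"Sum of deviations: {deviations.sum():.2e}")

# Hide the last deviation and recover it from the other five.
first_five = deviations[:-1]
recovered_sixth = -first_five.sum()

print(f"Actual sixth:    {deviations[-1]:.6f}")
print(f"Recovered sixth: {recovered_sixth:.6f}")
```

Since the sixth value carries no new information once the other five are known, only five deviations contribute independent evidence about spread.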

The Math: Why n-1 Gives an Unbiased Estimate

The algebraic proof starts from a trick: add and subtract μ inside the squared deviation.

$\sum (x_{i} - \bar{x})^{2} = \sum (x_{i} - μ + μ - \bar{x})^{2} = \sum [(x_{i} - μ) - (\bar{x} - μ)]^{2}$

Expand the square:

$= \sum (x_{i} - μ)^{2} - 2 (\bar{x} - μ) \sum (x_{i} - μ) + n (\bar{x} - μ)^{2}$

The middle term simplifies: $\sum (x_{i} - μ) = n (\bar{x} - μ)$, so the cross term becomes $- 2 n (\bar{x} - μ)^{2}$. Combined:

$= \sum (x_{i} - μ)^{2} - n (\bar{x} - μ)^{2}$

Taking expectations (and using $E [(\bar{x} - μ)^{2}] = σ^{2} / n$, which holds when the observations are independent with variance $σ^{2}$):

$E [\sum (x_{i} - \bar{x})^{2}] = n σ^{2} - n \cdot \frac{σ^{2}}{n} = (n - 1) σ^{2}$

Therefore:

$E [\frac{\sum (x_{i} - \bar{x})^{2}}{n - 1}] = σ^{2}$ ✓

With denominator $n$ instead, the result is $(n - 1) σ^{2} / n$ — always an underestimate by the factor $(n - 1) / n$ .
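
The expectation result can also be checked by simulation. The sketch below is a Monte Carlo illustration under stated assumptions, not a proof. It draws many independent samples of size 6 from a normal distribution with a hypothetical mean of 0.84 and a hypothetical true variance of 0.0025, then averages the two estimators.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 6
sigma2 = 0.0025    # assumed true variance (hypothetical)
trials = 200_000   # number of simulated samples

# Each row is one simulated sample of n independent observations.
samples = rng.normal(loc=0.84, scale=np.sqrt(sigma2), size=(trials, n))

mean_biased = np.var(samples, axis=1, ddof=0).mean()
mean_unbiased = np.var(samples, axis=1, ddof=1).mean()

print(f"True variance:               {sigma2:.6f}")
print(f"Average of /n estimates:     {mean_biased:.6f}")
print(f"Average of /(n-1) estimates: {mean_unbiased:.6f}")
print(f"Ratio biased/true: {mean_biased / sigma2:.3f} (theory: {(n - 1) / n:.3f})")
```

The average of the $n - 1$ estimates should sit close to the assumed variance, while the average of the $n$ estimates should sit near $(n - 1)/n$ of it. The normal distribution is only a convenient choice here; the derivation above needs independence and a finite variance, not normality.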

Does This Actually Matter in Practice?

For our 6-fold CV scenario:

$\frac{n - 1}{n} = \frac{5}{6} = 0.833$

So dividing by $n$ gives you, on average, only 83% of the true variance, a 17% underestimate. The uncorrected standard deviation would be $\sqrt{0.002181} \approx 0.047$ instead of $\sqrt{0.002617} \approx 0.051$.

| n | n−1 | Bias factor (n−1)/n | Correction n/(n−1) | % inflation |
|---|---|---|---|---|
| 6 | 5 | 0.833 | 1.200 | 20% |
| 10 | 9 | 0.900 | 1.111 | 11% |
| 30 | 29 | 0.967 | 1.034 | 3.4% |
| 100 | 99 | 0.990 | 1.010 | 1.0% |

With large CV fold counts, the correction barely matters. With 5–10 folds — which is typical — it matters enough to report.
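
The table values follow directly from $n$. A short sketch that regenerates them for any fold counts is shown here; `bias_table` is a hypothetical helper name used only for this illustration.

```python
def bias_table(fold_counts):
    """Return (n, bias factor, correction, relative inflation) for each n."""
    rows = []
    for n in fold_counts:
        bias_factor = (n - 1) / n   # expected fraction of the true variance
        correction = n / (n - 1)    # multiplier that undoes the bias
        inflation = correction - 1  # relative increase from the correction
        rows.append((n, bias_factor, correction, inflation))
    return rows


rows = bias_table((6, 10, 30, 100))
for n, bias_factor, correction, inflation in rows:
    print(f"n={n:>3}  bias={bias_factor:.3f}  "
          f"correction={correction:.3f}  inflation={inflation:.1%}")
```

Passing other values, such as `bias_table((3, 5, 20))`, shows how quickly the correction grows as the fold count drops.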

Python Demonstration

```python
import numpy as np

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
n = accuracy.size

# ddof is subtracted from n in the denominator:
# ddof=0 divides by n, ddof=1 divides by n - 1.
var_n = np.var(accuracy, ddof=0)
var_n1 = np.var(accuracy, ddof=1)
std_n = np.std(accuracy, ddof=0)
std_n1 = np.std(accuracy, ddof=1)

print(f"Variance (ddof=0, divide by n):   {var_n:.6f}")
print(f"Variance (ddof=1, divide by n-1): {var_n1:.6f}")
print(f"Ratio:                            {var_n / var_n1:.4f}")
print(f"Expected ratio (n-1)/n:           {(n - 1) / n:.4f}")
print()
print(f"Std dev (ddof=0): {std_n:.4f}")
print(f"Std dev (ddof=1): {std_n1:.4f}")
```

```text
Variance (ddof=0, divide by n):   0.002181
Variance (ddof=1, divide by n-1): 0.002617
Ratio:                            0.8333
Expected ratio (n-1)/n:           0.8333

Std dev (ddof=0): 0.0467
Std dev (ddof=1): 0.0512
```

The ratio of the two variances is exactly $(n - 1) / n = 5/6 = 0.8333$. This is an algebraic identity, not a property of this dataset: both estimates share the same numerator, so dividing by $n$ always gives exactly $(n - 1) / n$ of the unbiased estimate.

Calculation Trace

| Phase | Formula | Values | Result |
|---|---|---|---|
| Sum of squared deviations | $\sum (x_{i} - \bar{x})^{2}$ | Six squared deviations | $0.013084$ |
| Population variance (divide by n) | $0.013084/6$ | Biased estimate | $0.002181$ |
| Sample variance (divide by n-1) | $0.013084/5$ | Unbiased estimate | $0.002617$ |
| Bias factor | $(n - 1) / n$ | $5/6$ | $0.833$ |

When to Use n vs n−1

| Situation | Formula | ddof | Python |
|---|---|---|---|
| Describing your entire dataset (no inference) | ÷n (population) | 0 | `np.var(x, ddof=0)` |
| Estimating population variance from a sample | ÷(n−1) (sample) | 1 | `np.var(x, ddof=1)` |
| pandas default | ÷(n−1) | 1 | `pd.Series(x).var()` |
| NumPy default | ÷n | 0 | `np.var(x)` |
| Batch normalization (normalizing the current batch) | ÷n (describing the batch) | 0 | Use population formula |
| CV fold variance (estimating model spread) | ÷(n−1) (estimating true spread) | 1 | Use sample formula |

If you have the entire population — not a sample — divide by N. In ML, this happens when you compute feature normalization statistics from the training set. You are describing that specific training dataset, not estimating a broader population. Use ddof=0. But when your data is a sample (CV folds, a test split), you are estimating the model's true variance — use ddof=1.
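
Because library defaults differ, it can help to confirm which divisor a call actually uses. The sketch below compares NumPy with the standard-library `statistics` module. pandas is left out to keep the example free of extra dependencies, so the pandas row of the table is not exercised here.

```python
import statistics

import numpy as np

x = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

np_default = np.var(x)              # NumPy default is ddof=0: divide by n
np_sample = np.var(x, ddof=1)       # explicit sample variance: divide by n - 1
py_pop = statistics.pvariance(x)    # standard library: divide by n
py_sample = statistics.variance(x)  # standard library: divide by n - 1

print(f"np.var(x)               = {np_default:.6f}")
print(f"statistics.pvariance(x) = {py_pop:.6f}")
print(f"np.var(x, ddof=1)       = {np_sample:.6f}")
print(f"statistics.variance(x)  = {py_sample:.6f}")
```

Passing `ddof` explicitly in every NumPy call is one way to avoid relying on a default that readers may not remember.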

Related Concepts

This post explains a correction that appeared in the dispersion post: why np.var(..., ddof=1) is the right call for sample data. The next post on standard deviation builds directly on this: the sample standard deviation is $s = \sqrt{s^{2}}$, with $s^{2}$ computed using $n - 1$. From here, degrees of freedom become a concept you will encounter repeatedly: in t-tests (where the one-sample t-distribution has $n - 1$ degrees of freedom), in chi-squared tests, and in ANOVA. The intuition is always the same: each estimated parameter consumes a degree of freedom.

When This Framework Breaks Down

Bessel's correction gives an unbiased estimate of variance, but "unbiased" does not mean "accurate for small samples." With $n = 3$ or $n = 4$ CV folds, even the corrected variance estimate is noisy enough to be nearly meaningless, because the standard error of the variance estimator itself is large when $n$ is small. With fewer than 10 folds, bootstrap resampling can give a more honest picture of variance uncertainty than the $n - 1$ formula alone.

Also, Bessel's correction assumes the observations are independent. In repeated cross-validation with overlapping folds, this assumption is violated, and the $n - 1$ formula tends to underestimate variance.
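
One way to see how uncertain the variance estimate is with six folds is a percentile bootstrap. The sketch below is a rough illustration under assumptions: the number of resamples and the 95% level are arbitrary choices, the bootstrap also treats the scores as independent, and with only six values the interval is itself crude.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for a repeatable sketch
accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
n_boot = 10_000  # number of bootstrap resamples (an arbitrary choice)

# Resample the six scores with replacement, one resample per row,
# and recompute the sample variance (ddof=1) for each resample.
idx = rng.integers(0, accuracy.size, size=(n_boot, accuracy.size))
boot_vars = np.var(accuracy[idx], axis=1, ddof=1)

low, high = np.percentile(boot_vars, [2.5, 97.5])
point = np.var(accuracy, ddof=1)

print(f"Point estimate (ddof=1): {point:.6f}")
print(f"95% percentile interval: [{low:.6f}, {high:.6f}]")
```

A wide interval relative to the point estimate is the practical signal that the single $n - 1$ number should not be over-interpreted.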

Test Your Understanding

You have accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]. Compute the sample variance manually by dividing the sum of squared deviations by both $n = 6$ and $n - 1 = 5$ . What is the percentage difference?
Why can you not compute a meaningful sample variance from a single observation ( $n = 1$ )? What does the formula $1/ (n - 1)$ tell you about this case?
A colleague argues: "With 100 CV folds, the correction barely matters — 99/100 is nearly 1." They propose using ddof=0 for simplicity. At what sample size does the bias from using $n$ instead of $n - 1$ drop below 1%? Is the colleague's argument valid at $n = 20$ ?
The sample mean $\bar{x}$ is said to be an "unbiased estimator" of $μ$. Is the sample standard deviation $s$ an unbiased estimator of $σ$? (Hint: taking the square root of an unbiased estimator does not give an unbiased estimator.)
