← View series: statistics
~/blog
Standard Deviation
Variance tells you how spread out your data is, but it does so in squared units — accuracy-squared, loss-squared, dollar-squared — which are not interpretable on their own. Standard deviation fixes this by taking the square root of variance, returning to the original units. A standard deviation of 0.051 in accuracy units means the typical CV fold is about 5 percentage points away from the mean. You can picture that. You cannot picture 0.002617 accuracy-squared.
The Anchor Dataset
Throughout this post, every calculation uses six cross-validation accuracy scores from a classifier:
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
Mean: , Sample variance:
The Core Formula
Step-by-Step Calculation
Step 1 — Compute the mean:
Step 2 — Compute each deviation and square it:
| 0.82 | -0.018 | 0.000324 |
| 0.79 | -0.048 | 0.002304 |
| 0.91 | +0.072 | 0.005184 |
| 0.85 | +0.012 | 0.000144 |
| 0.78 | -0.058 | 0.003364 |
| 0.88 | +0.042 | 0.001764 |
| Sum | 0.013084 |
Step 3 — Divide by n-1 to get sample variance, then take square root:
A standard deviation of 0.051 means: across these six CV folds, the typical fold deviates from the mean accuracy by about ±5 percentage points.
The 68-95-99.7 Rule
For normally distributed data, standard deviation has a powerful interpretation:
- About 68% of values fall within 1 standard deviation of the mean
- About 95% of values fall within 2 standard deviations
- About 99.7% of values fall within 3 standard deviations
For our model: mean = 0.838, s = 0.051. If accuracy were normally distributed:
- 68% of folds would fall in [0.787, 0.889]
- 95% of folds would fall in [0.736, 0.940]
With only six folds, normality is a rough assumption. But the rule gives you an intuition for how much variance to expect.
Two Models, Same Mean, Different Spread
This is the scenario where standard deviation matters most. Compare Model A (accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88], std = 0.051) with Model B (accuracy_b = [0.84, 0.83, 0.84, 0.83, 0.84, 0.84], std = 0.005):
Same mean, completely different reliability profile. Model B is the obvious deployment choice if consistency matters — and it almost always does.
Coefficient of Variation: Comparing Spread Across Scales
When comparing variability across datasets with different scales or means, normalize by the mean:
For Model A:
If you also measured precision scores with mean 0.73 and std 0.051, the raw standard deviations look the same. But the CV for precision would be — slightly more variable relative to its scale. The CV makes this comparison possible.
MAD as a Robust Alternative
Standard deviation is sensitive to outliers because it squares deviations. One fold returning 0.40 instead of ~0.83 inflates the standard deviation severely. The Median Absolute Deviation (MAD) is resistant to this:
For our clean dataset, MAD = 0.045. If a corrupted fold giving 0.40 were included, MAD would shift modestly while standard deviation would roughly double. When you suspect data quality issues in your CV setup, report MAD alongside standard deviation.
Python Example
import numpy as np
from scipy import stats
accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
std_sample = np.std(accuracy, ddof=1)
std_pop = np.std(accuracy, ddof=0)
cv = (std_sample / np.mean(accuracy)) * 100
mad = np.median(np.abs(accuracy - np.median(accuracy)))
print(f"Sample std dev (ddof=1): {std_sample:.4f}")
print(f"Population std dev (ddof=0): {std_pop:.4f}")
print(f"Coefficient of Variation: {cv:.2f}%")
print(f"Median Absolute Deviation: {mad:.4f}")
mean_acc = np.mean(accuracy)
print(f"\n68% range: [{mean_acc - std_sample:.3f}, {mean_acc + std_sample:.3f}]")
print(f"95% range: [{mean_acc - 2*std_sample:.3f}, {mean_acc + 2*std_sample:.3f}]")Sample std dev (ddof=1): 0.0512
Population std dev (ddof=0): 0.0467
Coefficient of Variation: 6.11%
Median Absolute Deviation: 0.0450
68% range: [0.787, 0.889]
95% range: [0.736, 0.940]
Calculation Trace
| Phase | Formula | Values | Result |
|---|---|---|---|
| Mean | |||
| Sum of squared deviations | Six squared differences | ||
| Sample variance | |||
| Sample std dev | Square root | ||
| CV | Normalize by mean | ||
| MAD | Median of abs deviations from 0.835 | Sort and take middle |
Related Concepts
This post completes the four-way picture of dispersion: range, IQR, variance, standard deviation. The previous posts established why we divide by (Bessel's correction) and why the mean is used as the reference point. Standard deviation is also the building block for z-scores — which measure how many standard deviations a value is from the mean — and for confidence intervals: . From here, the series moves to histograms, which let you see whether the normal distribution assumption embedded in the 68-95-99.7 rule is actually reasonable for your data.
When This Breaks Down
The standard deviation is only a good summary of spread for unimodal, roughly symmetric distributions. If Model A's CV folds split into two clusters — three folds at 0.91 and three at 0.78 — the standard deviation would be 0.065, which would suggest "moderate variability." But the actual situation is bimodal: the model works very differently on two types of data splits. Histograms would reveal this; a single standard deviation would not. For non-normal data with fewer than 15 observations, bootstrap confidence intervals on the standard deviation give a better sense of estimation uncertainty than relying on the point estimate alone.
Test Your Understanding
-
Compute the sample standard deviation of
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]step by step. Verify your answer against the Python output above. -
A new model achieves
accuracy_c = [0.70, 0.71, 0.70, 0.70, 0.71, 0.70]. This model has lower mean accuracy than Model A (0.838 vs 0.704) but much lower standard deviation. Under what business conditions would you prefer Model C over Model A? -
Two models have identical standard deviations of 0.05. Model X has mean accuracy 0.92. Model Y has mean accuracy 0.60. Compute the coefficient of variation for each. Which model is more relatively variable, and why does this matter?
-
If you add a seventh fold to
accuracywith value 0.50 (a corrupted fold), how does the standard deviation change? Now compute the MAD for the seven-value dataset. Which measure better represents the variability of the six good folds?
How do histograms help us visualize this distribution?
Previous: Why Sample Variance Is Divided By n-1? | Next: Histograms