Mean, Median, and Mode
When you evaluate a model across multiple cross-validation folds, the first question is: what single number should represent its performance? That question is not as simple as it sounds. The wrong choice of central tendency measure can make a mediocre model look strong, or a strong model look unstable.
Central tendency is about finding a single number that represents what is typical in your dataset. But "typical" is not always obvious, and the choice of how you measure it affects conclusions in ways that are not always intuitive.
The Anchor Dataset
Throughout this post, every calculation uses six cross-validation accuracy scores from a classifier:
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
These represent six folds of cross-validation on a binary classifier. Each value is the fraction of test examples classified correctly in that fold.
The Mean (Arithmetic Average)
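The arithmetic mean adds everything up and divides by the count:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Step 1: Sum the accuracy scores:
$$\sum_{i=1}^{6} x_i = 0.82 + 0.79 + 0.91 + 0.85 + 0.78 + 0.88 = 5.03$$
Step 2: Divide by n = 6:
$$\bar{x} = \frac{5.03}{6} \approx 0.838$$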
The arithmetic mean is the balancing point: the total deviation of the scores above it equals the total deviation of the scores below it. This makes it sensitive to outliers. If the 0.88 fold instead returned 0.40 because of a data bug, the mean would drop to 0.758, which is not representative of the other five folds.
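The balancing-point property and the outlier sensitivity can be checked in a few lines. This is a minimal sketch using only the standard library; the `buggy` list is an assumption that the 0.88 fold is the one replaced by 0.40.

```python
from statistics import mean

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

# Arithmetic mean: sum divided by count.
arith_mean = sum(accuracy) / len(accuracy)

# Balancing point: total deviation above the mean equals total deviation below.
deviations = [x - arith_mean for x in accuracy]
above = sum(d for d in deviations if d > 0)
below = -sum(d for d in deviations if d < 0)

# Assumed scenario: the 0.88 fold comes back as 0.40 because of a data bug.
buggy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.40]
buggy_mean = mean(buggy)

print(f"mean:                   {arith_mean:.3f}")
print(f"deviation above/below:  {above:.3f} / {below:.3f}")
print(f"mean with one bad fold: {buggy_mean:.3f}")
```

A single bad fold moves the mean by about 0.08, even though five of the six scores are unchanged.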
The Median (Middle Value)
Sort the scores, then take the middle value:
Sorted: [0.78, 0.79, 0.82, 0.85, 0.88, 0.91]
With six values (even count), the median is the average of positions 3 and 4:
$$\text{median} = \frac{0.82 + 0.85}{2} = 0.835$$
The mean (0.838) and median (0.835) are very close, which tells you the distribution is nearly symmetric — no single fold is dramatically pulling the average.
Mean vs Median on Skewed Data
Now imagine a scenario where the model fails catastrophically on one fold — perhaps a fold happened to contain mostly rare-class examples:
accuracy_skewed = [0.82, 0.79, 0.91, 0.85, 0.78, 0.40]
Mean = (0.82 + 0.79 + 0.91 + 0.85 + 0.78 + 0.40) / 6 = 0.758
Median = (0.79 + 0.82) / 2 = 0.805
The mean (0.758) is now being dragged toward the outlier fold. The median (0.805) better represents the five typical folds. This is why median CV accuracy is sometimes preferred when you suspect occasional bad folds from data quality issues.
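The comparison can be reproduced with the standard library. The sketch below computes both measures on the clean and skewed score lists from this post and reports how far each one moves; the variable names `mean_shift` and `median_shift` are illustrative.

```python
from statistics import mean, median

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
accuracy_skewed = [0.82, 0.79, 0.91, 0.85, 0.78, 0.40]

for name, scores in [("clean", accuracy), ("skewed", accuracy_skewed)]:
    print(f"{name:>6}: mean={mean(scores):.3f}  median={median(scores):.3f}")

# How far does each measure move when one fold collapses?
mean_shift = mean(accuracy) - mean(accuracy_skewed)
median_shift = median(accuracy) - median(accuracy_skewed)
print(f"mean shift:   {mean_shift:.3f}")
print(f"median shift: {median_shift:.3f}")
```

The median still moves, because the bad fold changes which two values sit in the middle, but it moves much less than the mean.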
The Mode
The mode is the value that appears most often. For continuous CV scores like ours, no values repeat, so the mode is meaningless here. The mode is most useful for categorical data — for example, finding the most frequent prediction class or the most common error type.
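A short sketch of the categorical case, using `collections.Counter`. The `error_types` labels are hypothetical and invented purely for illustration; they are not results from the classifier in this post.

```python
from collections import Counter

# Hypothetical error labels, made up for this example.
error_types = [
    "false_positive", "false_negative", "false_positive",
    "timeout", "false_positive", "false_negative",
]

counts = Counter(error_types)
mode_label, mode_count = counts.most_common(1)[0]
print(f"most common error: {mode_label} ({mode_count} of {len(error_types)})")

# For the continuous CV scores no value repeats, so there is no useful mode.
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
repeats = [value for value, n in Counter(accuracy).items() if n > 1]
print(f"repeated accuracy values: {repeats}")
```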
Weighted Mean
Sometimes different observations should contribute unequally. In a multi-class classification problem, you might weight each class's score by its size (its support) so that the summary reflects the actual class mix; giving every class equal weight instead keeps the majority class from dominating.
Consider three CV folds of different sizes, each weighted by its share of the total examples:
| Fold | Accuracy | Fold-size weight |
|---|---|---|
| 1 | 0.82 | 0.4 (large fold) |
| 2 | 0.79 | 0.2 (small fold) |
| 3 | 0.88 | 0.4 (large fold) |
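The weighted mean multiplies each score by its weight, sums the products, and divides by the total weight. A minimal sketch for the three rows in the table, with the unweighted mean shown for comparison (the `folds` structure is just one convenient way to hold the pairs):

```python
# (accuracy, weight) pairs taken from the table above.
folds = [(0.82, 0.4), (0.79, 0.2), (0.88, 0.4)]

total_weight = sum(weight for _, weight in folds)
weighted_mean = sum(acc * weight for acc, weight in folds) / total_weight

# Plain mean: every fold counts the same regardless of size.
unweighted_mean = sum(acc for acc, _ in folds) / len(folds)

print(f"weighted mean:   {weighted_mean:.3f}")
print(f"unweighted mean: {unweighted_mean:.3f}")
```

The weighted result is higher here because the small fold, which has the lowest accuracy, counts for only 20% of the total.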
This is the principle behind macro vs. weighted-average F1 scores. Macro-average treats every class equally. Weighted-average weights each class by its support count.
Geometric Mean
The arithmetic mean assumes you are adding quantities. When you are multiplying rates or compounding improvements, the geometric mean is the correct measure.
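For our six accuracy scores:
$$\text{GM} = \left(\prod_{i=1}^{n} x_i\right)^{1/n} = (0.82 \times 0.79 \times 0.91 \times 0.85 \times 0.78 \times 0.88)^{1/6} \approx 0.837$$
The geometric mean (0.837) is slightly lower than the arithmetic mean (0.838). This always holds for positive values: the geometric mean is never larger than the arithmetic mean, and the gap grows when values vary more.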
DS use case — epoch-over-epoch accuracy improvement: If your model's accuracy improves by factors of 1.05, 1.03, 1.02, 0.99, 1.04, 1.01 across six training phases, the average improvement rate is the geometric mean of those factors, not their arithmetic mean. The arithmetic mean would overestimate compound growth.
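The compounding argument can be verified directly. In this sketch (standard library only, Python 3.8+ for `math.prod` and `statistics.geometric_mean`), applying the geometric mean six times reproduces the total growth, while applying the arithmetic mean six times overshoots it.

```python
from math import prod
from statistics import geometric_mean, mean

factors = [1.05, 1.03, 1.02, 0.99, 1.04, 1.01]

total_growth = prod(factors)   # actual compound growth over six phases
geo = geometric_mean(factors)  # per-phase rate that compounds to total_growth
arith = mean(factors)          # per-phase rate if the factors were additive

print(f"total growth:        {total_growth:.4f}")
print(f"geometric mean ** 6:  {geo ** 6:.4f}")
print(f"arithmetic mean ** 6: {arith ** 6:.4f}")
```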
Harmonic Mean
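The harmonic mean is the correct average for rates or ratios that share a common numerator, such as speeds over equal distances, or precision and recall, which share the same true positives.
For our six accuracy scores:
$$\text{HM} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} = \frac{6}{\frac{1}{0.82} + \frac{1}{0.79} + \frac{1}{0.91} + \frac{1}{0.85} + \frac{1}{0.78} + \frac{1}{0.88}} \approx \frac{6}{7.179} \approx 0.836$$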
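The F1 score IS the harmonic mean of precision and recall. This is not a coincidence: it is why F1 was defined that way.
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
Wait, why multiply by 2? Because the harmonic mean of two values $a$ and $b$ is $\frac{2}{\frac{1}{a} + \frac{1}{b}} = \frac{2ab}{a + b}$, and that is the standard F1 formula. The factor of 2 is simply n = 2 from the harmonic mean, and it normalizes the result so that F1 = 1 when both precision and recall are 1.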
The harmonic mean penalizes imbalance between precision and recall more than the arithmetic mean would. A model with precision = 1.0 and recall = 0.0 has an F1 of 0, not 0.5. This is the feature you want: a model that entirely sacrifices recall to achieve perfect precision should not score 50%.
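The penalty is easy to see side by side. The helper names `f1_score` and `arithmetic_average` below are illustrative, not a library API; the zero guard is an assumption about how to handle the undefined 0/0 case.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_average(precision: float, recall: float) -> float:
    return (precision + recall) / 2


for p, r in [(1.0, 1.0), (0.88, 0.76), (1.0, 0.0)]:
    print(
        f"precision={p:.2f} recall={r:.2f}  "
        f"F1={f1_score(p, r):.3f}  arithmetic={arithmetic_average(p, r):.3f}"
    )
```

When precision and recall are close, the two averages nearly agree; the further apart they are, the more F1 is pulled toward the smaller value.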
Trimmed Mean
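A trimmed mean removes a fixed fraction of the values from each end before averaging. With only six scores, a 10% trim per side would remove less than one value, so the smallest useful trim drops one score from each end (about 17% per side). Removing the lowest (0.78) and highest (0.91) and averaging the remaining four:
$$\frac{0.79 + 0.82 + 0.85 + 0.88}{4} = \frac{3.34}{4} = 0.835$$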
This is useful when you suspect a few CV folds were corrupted by data quality issues but you do not want to throw away data arbitrarily. The trimmed mean is a middle ground between the mean (uses everything) and the median (uses only the center).
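How many values a trimmed mean actually drops depends on the sample size. The sketch below is a hand-written `trimmed_mean` (a hypothetical helper, not a library function) that truncates the cut count to an integer; as far as I know, `scipy.stats.trim_mean` truncates the same way. With six scores, a 10% cut therefore removes nothing.

```python
def trimmed_mean(values, proportion_to_cut):
    """Mean after dropping int(proportion_to_cut * n) values from each end."""
    ordered = sorted(values)
    k = int(proportion_to_cut * len(ordered))  # truncates, so it can be 0
    if 2 * k >= len(ordered):
        raise ValueError("proportion_to_cut removes every value")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

no_trim = trimmed_mean(accuracy, 0.10)   # int(0.6) == 0: nothing is dropped
one_each = trimmed_mean(accuracy, 0.20)  # int(1.2) == 1: drops 0.78 and 0.91

print(f"10% cut: {no_trim:.3f}")
print(f"20% cut: {one_each:.3f}")
```

With small fold counts it is worth checking how many values are really removed before reporting a trimmed mean.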
When to Use Which
| Measure | Best For | Outlier Resistant? |
|---|---|---|
| Arithmetic mean | Symmetric, continuous data | No |
| Median | Skewed data, suspected outliers | Yes |
| Geometric mean | Rates, compounding improvements | Moderate |
| Harmonic mean | Averaging ratios (precision/recall) | Against large values only; small values dominate |
| Trimmed mean | Data with occasional bad folds | Mostly |
| Mode | Categorical data (error types, classes) | N/A |
Python Example
```python
import numpy as np
from scipy import stats
from scipy.stats import hmean

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

arith_mean = np.mean(accuracy)
median = np.median(accuracy)
geo_mean = stats.gmean(accuracy)
harm_mean = hmean(accuracy)
# trim_mean drops int(proportiontocut * n) values per side. With n = 6,
# 0.1 would drop nothing, so 0.2 is used to drop one score from each end.
trimmed = stats.trim_mean(accuracy, 0.2)

print(f"Arithmetic mean: {arith_mean:.3f}")
print(f"Median: {median:.3f}")
print(f"Geometric mean: {geo_mean:.3f}")
print(f"Harmonic mean: {harm_mean:.3f}")
print(f"Trimmed mean: {trimmed:.3f}")

# F1 is exactly the harmonic mean of precision and recall (no extra factor of 2).
precision, recall = 0.88, 0.76
f1 = 2 * precision * recall / (precision + recall)
f1_via_hmean = hmean([precision, recall])

print(f"\nF1 score: {f1:.3f}")
print(f"F1 via hmean: {f1_via_hmean:.3f}")
```

Output:

```
Arithmetic mean: 0.838
Median: 0.835
Geometric mean: 0.837
Harmonic mean: 0.836
Trimmed mean: 0.835

F1 score: 0.816
F1 via hmean: 0.816
```
Calculation Trace
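| Measure | Formula | Values | Result |
|---|---|---|---|
| Arithmetic mean | $\frac{1}{n}\sum x_i$ | $5.03 / 6$ | 0.838 |
| Median | Middle two values | $(0.82 + 0.85) / 2$ | 0.835 |
| Geometric mean | $\left(\prod x_i\right)^{1/n}$ | $0.3439^{1/6}$ | 0.837 |
| Harmonic mean | $n / \sum \frac{1}{x_i}$ | $6 / 7.179$ | 0.836 |
| Trimmed mean (one score per end) | Drop extremes, average rest | $(0.79 + 0.82 + 0.85 + 0.88) / 4$ | 0.835 |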
Related Concepts
The previous posts established how to distinguish a population parameter from a sample statistic. The measures here — mean, median, geometric mean, harmonic mean — are descriptive statistics that summarize samples. The next post covers dispersion: knowing the center of your CV scores is only half the picture. A model with mean accuracy 0.838 and range 0.13 is very different from one with mean 0.838 and range 0.02. From here, the path leads to variance, standard deviation, and eventually to confidence intervals and hypothesis tests for comparing model performance.
When This Breaks Down
The arithmetic mean faithfully represents "typical" only when the distribution is roughly symmetric. For CV accuracy distributions, this usually holds. But if your dataset has severe class imbalance and you are not using stratified folds, one fold may return near-chance accuracy — pulling the mean far below what most folds show. In that case, report median accuracy alongside the mean, and always investigate outlier folds rather than averaging past them. With fewer than 5 observations, no single central tendency measure is reliable; report all fold values individually.
Test Your Understanding
- For accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88], the arithmetic mean is 0.838 and the geometric mean is 0.837. Why is the geometric mean always less than or equal to the arithmetic mean? Under what condition would they be equal?
- A model achieves precision = 0.95 on class A but recall = 0.20. Compute the F1 score. Now compute the arithmetic mean of precision and recall instead. Why is the arithmetic mean misleading here?
- You run 6-fold CV and get accuracy = [0.84, 0.83, 0.85, 0.82, 0.84, 0.23]. The last fold clearly failed (data leakage was discovered). Compute the trimmed mean that drops one score from each end. Does it appropriately handle this bad fold?
- Why does the harmonic mean penalize imbalance between precision and recall more severely than the arithmetic mean? Construct an example where precision = 1.0 and recall = 0.01, and show that the arithmetic mean would give a misleading answer.
- Epoch-over-epoch accuracy improvements are: 1.05×, 0.98×, 1.06×, 1.02×, 0.99×, 1.04×. Should you use the arithmetic or geometric mean to find the average improvement rate? What goes wrong if you use the arithmetic mean?
The next topic is Measure of Dispersion — because knowing the center of your data is not enough; you need to know how spread out it is.