Mean, Median, and Mode

Apr 11, 2026 · 10 min read · By mohammed.vasim
Tags: Statistics, Math, Data Science

When you evaluate a model across multiple cross-validation folds, the first question is: what single number should represent its performance? That question is not as simple as it sounds. The wrong choice of central tendency measure can make a mediocre model look strong, or a strong model look unstable.

Central tendency is about finding a single number that represents what is typical in your dataset. But "typical" is not always obvious, and the choice of how you measure it affects conclusions in ways that are not always intuitive.

The Anchor Dataset

Throughout this post, every calculation uses six cross-validation accuracy scores from a classifier:

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

These represent six folds of cross-validation on a binary classifier. Each value is the fraction of test examples classified correctly in that fold.

The Mean (Arithmetic Average)

The arithmetic mean adds everything up and divides by the count:

Step 1 — Sum the accuracy scores:

Sum = 0.82 + 0.79 + 0.91 + 0.85 + 0.78 + 0.88 = 5.03 (all six values)

Step 2 — Divide by n = 6:

Mean = Sum / n = 5.03 / 6 ≈ 0.838 (arithmetic mean of the six CV accuracy scores)

The arithmetic mean is the balancing point: the total deviation of the values above it equals the total deviation of the values below it. This makes it sensitive to outliers. If one fold, say the one that scored 0.88, instead returns 0.40 due to a data bug, the mean drops to 0.758, which is lower than every one of the other five folds.
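The two properties described above can be checked in a few lines. This is a minimal sketch using only the standard library; the variable names, and the choice of the 0.88 fold as the corrupted one, are assumptions made for illustration.

```python
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

# Arithmetic mean: sum divided by count.
mean = sum(accuracy) / len(accuracy)

# Balancing-point property: deviations from the mean cancel out.
deviations = [x - mean for x in accuracy]
total_deviation = sum(deviations)  # zero up to floating-point rounding

# Assumption for illustration: the last fold (0.88) is the one hit by a data bug.
corrupted = accuracy[:-1] + [0.40]
corrupted_mean = sum(corrupted) / len(corrupted)

print(f"mean:           {mean:.3f}")
print(f"corrupted mean: {corrupted_mean:.3f}")
print(f"deviations cancel: {abs(total_deviation) < 1e-12}")
```

A single bad value moves the mean by (0.88 - 0.40) / 6 = 0.08, because every observation carries the same weight of 1/n.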

The Median (Middle Value)

Sort the scores, then take the middle value:

Sorted: [0.78, 0.79, 0.82, 0.85, 0.88, 0.91]

With six values (even count), the median is the average of positions 3 and 4:

Median = (0.82 + 0.85) / 2 = 0.835

The mean (0.838) and median (0.835) are very close, which tells you the distribution is nearly symmetric — no single fold is dramatically pulling the average.
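The sort-then-pick-the-middle procedure can be sketched as a small function. The helper name `median` is a choice made here for illustration; in practice `statistics.median` or `np.median` does the same job.

```python
def median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: positions 3 and 4 (1-based) are indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
print(f"median: {median(accuracy):.3f}")
```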

Mean vs Median on Skewed Data

Now imagine a scenario where the model fails catastrophically on one fold — perhaps a fold happened to contain mostly rare-class examples:

accuracy_skewed = [0.82, 0.79, 0.91, 0.85, 0.78, 0.40]

Mean = (0.82 + 0.79 + 0.91 + 0.85 + 0.78 + 0.40) / 6 = 0.758

Median = (0.79 + 0.82) / 2 = 0.805

[Figure: the skewed fold scores on a number line from 0.40 to 0.91, with mean = 0.758 and median = 0.805 marked. One bad fold pulls the mean left; the median stays near the cluster.]

The mean (0.758) is now being dragged toward the outlier fold. The median (0.805) better represents the five typical folds. This is why median CV accuracy is sometimes preferred when you suspect occasional bad folds from data quality issues.
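A quick way to see the effect is to compute both measures for the clean and the skewed scores side by side. This sketch uses the standard-library `statistics` module; the "gap" column is an illustrative diagnostic added here, not something the original analysis defines.

```python
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
accuracy_skewed = [0.82, 0.79, 0.91, 0.85, 0.78, 0.40]

for name, scores in [("clean", accuracy), ("skewed", accuracy_skewed)]:
    mean = statistics.mean(scores)
    median = statistics.median(scores)
    # A large mean-median gap hints that a few folds are pulling the average.
    print(f"{name:>6}: mean={mean:.3f} median={median:.3f} gap={mean - median:+.3f}")
```

A gap near zero suggests the folds are roughly symmetric around the center; a clearly negative gap suggests one or more unusually low folds worth inspecting.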

The Mode

The mode is the value that appears most often. For continuous CV scores like ours, no values repeat, so the mode is meaningless here. The mode is most useful for categorical data — for example, finding the most frequent prediction class or the most common error type.
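As a rough illustration of the categorical case, the sketch below counts labels with `collections.Counter`. The `predictions` list is hypothetical data invented for this example.

```python
from collections import Counter

# Hypothetical predicted labels, invented for illustration.
predictions = ["cat", "dog", "cat", "bird", "cat", "dog"]

counts = Counter(predictions)
mode_label, mode_count = counts.most_common(1)[0]
print(f"mode: {mode_label} (appears {mode_count} times)")

# For the CV scores, every value occurs exactly once, so no mode stands out.
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
max_repeats = max(Counter(accuracy).values())
print(f"most repeats among CV scores: {max_repeats}")
```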

Weighted Mean

Sometimes different observations should contribute unequally. In a multi-class classification problem, you might weight each class's score by the class's size so that the metric reflects how often each class actually occurs. Weighting every class equally instead keeps the majority class from dominating the metric.

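Consider three CV folds of different sizes, with each weight reflecting the fold's share of the data:

| Fold | Accuracy | Class-size weight |
|------|----------|-------------------|
| 1 | 0.82 | 0.4 (large fold) |
| 2 | 0.79 | 0.2 (small fold) |
| 3 | 0.88 | 0.4 (large fold) |

Weighted mean = 0.4 × 0.82 + 0.2 × 0.79 + 0.4 × 0.88 = 0.328 + 0.158 + 0.352 = 0.838

The unweighted mean of the same three scores is 0.830; the weighted mean is higher because the small, lower-scoring fold counts for less.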

This is the principle behind macro vs. weighted-average F1 scores. Macro-average treats every class equally. Weighted-average weights each class by its support count.
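The weighted mean for the three folds in the table above can be sketched as follows. The helper `weighted_mean` is a name introduced here for illustration; `numpy.average(values, weights=...)` computes the same quantity.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Values taken from the table above.
fold_accuracy = [0.82, 0.79, 0.88]
weights = [0.4, 0.2, 0.4]

weighted = weighted_mean(fold_accuracy, weights)
unweighted = sum(fold_accuracy) / len(fold_accuracy)

print(f"weighted mean:   {weighted:.3f}")
print(f"unweighted mean: {unweighted:.3f}")
```

Dividing by the sum of the weights means the weights do not have to add up to 1; raw fold or class sizes work just as well.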

Geometric Mean

The arithmetic mean assumes you are adding quantities. When you are multiplying rates or compounding improvements, the geometric mean is the correct measure.

For our six accuracy scores:

GM = (0.82 × 0.79 × 0.91 × 0.85 × 0.78 × 0.88)^(1/6) = 0.3439^(1/6) ≈ 0.837

The geometric mean (0.837) is slightly lower than the arithmetic mean (0.838). This always holds for positive values: the geometric mean is never larger than the arithmetic mean, and the gap grows as the values vary more.

DS use case — epoch-over-epoch accuracy improvement: If your model's accuracy improves by factors of 1.05, 1.03, 1.02, 0.99, 1.04, 1.01 across six training phases, the average improvement rate is the geometric mean of those factors, not their arithmetic mean. The arithmetic mean would overestimate compound growth.
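The compounding claim can be checked directly with the six factors from the paragraph above. This is a minimal sketch using the standard library; the variable names are choices made here.

```python
import math

factors = [1.05, 1.03, 1.02, 0.99, 1.04, 1.01]
n = len(factors)

# Total growth is the product of the per-phase factors.
total_growth = math.prod(factors)

geo = total_growth ** (1 / n)   # geometric mean of the factors
arith = sum(factors) / n        # arithmetic mean of the factors

# Compounding the geometric mean n times reproduces the total growth;
# compounding the arithmetic mean n times overshoots it.
print(f"total growth:         {total_growth:.4f}")
print(f"geometric mean ** n:  {geo ** n:.4f}")
print(f"arithmetic mean ** n: {arith ** n:.4f}")
```

The geometric mean is, by construction, the single constant factor that yields the same overall growth when applied in every phase.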

[Figure: the six fold scores on a number line from 0.78 to 0.91, with GM = 0.837 and AM = 0.838 marked. GM is always ≤ AM; the gap reflects variability across folds.]

Harmonic Mean

The harmonic mean is the correct average when you are averaging rates or ratios.

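For our six accuracy scores:

HM = 6 / (1/0.82 + 1/0.79 + 1/0.91 + 1/0.85 + 1/0.78 + 1/0.88) = 6 / 7.179 ≈ 0.836

The F1 score is the harmonic mean of precision and recall. This is not a coincidence; it is why F1 was defined that way:

F1 = 2 × precision × recall / (precision + recall)

Why multiply by 2? Because the harmonic mean of two values a and b is 2 / (1/a + 1/b) = 2ab / (a + b), which is exactly the standard F1 formula. The factor of 2 normalizes the result so that F1 = 1 when both precision and recall are 1.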

The harmonic mean penalizes imbalance between precision and recall more than the arithmetic mean would. A model with precision = 1.0 and recall = 0.0 has an F1 of 0, not 0.5. This is the feature you want: a model that entirely sacrifices recall to achieve perfect precision should not score 50%.
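The contrast between the two averages can be sketched in a few lines. The function names are chosen here for illustration, and returning 0 when precision and recall are both 0 is a common convention assumed by this sketch rather than something the text specifies.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    return (precision + recall) / 2


for p, r in [(1.0, 0.0), (0.9, 0.9), (0.88, 0.76)]:
    print(f"P={p:.2f} R={r:.2f}  F1={f1_score(p, r):.3f}  "
          f"arithmetic={arithmetic_mean(p, r):.3f}")
```

When precision and recall are equal, the two averages agree; the further apart they are, the more F1 falls below the arithmetic mean.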

Trimmed Mean

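A trimmed mean removes a fixed fraction of the lowest and highest values before averaging. With only six scores, 10% of the data is 0.6 of a value, which common implementations (including scipy.stats.trim_mean) round down to zero, so a strict 10% trim removes nothing. Dropping one score from each end means trimming 1/6 ≈ 16.7% per side. Removing the lowest (0.78) and highest (0.91) and averaging the remaining four:

Trimmed mean = (0.79 + 0.82 + 0.85 + 0.88) / 4 = 3.34 / 4 = 0.835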

This is useful when you suspect a few CV folds were corrupted by data quality issues but you do not want to throw away data arbitrarily. The trimmed mean is a middle ground between the mean (uses everything) and the median (uses only the center).
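A hand-rolled version makes the trimming rule explicit. This sketch assumes the round-down convention (drop `int(proportion * n)` values per end), which matches how `scipy.stats.trim_mean` behaves; the function name `trimmed_mean` is introduced here for illustration.

```python
def trimmed_mean(values, proportion):
    """Drop int(proportion * n) values from each end of the sorted data, then average."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))  # round down: trims less, never more
    if 2 * k >= len(ordered):
        raise ValueError("proportion too large: nothing left to average")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

# 0.1 * 6 = 0.6 rounds down to 0 values per end: identical to the plain mean.
print(f"10% trim: {trimmed_mean(accuracy, 0.1):.3f}")
# 0.2 * 6 = 1.2 rounds down to 1 value per end: drops 0.78 and 0.91.
print(f"20% trim: {trimmed_mean(accuracy, 0.2):.3f}")
```

With small fold counts it is worth checking how many values a given proportion actually removes before trusting the result.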

When to Use Which

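| Measure | Best For | Outlier Resistant? |
|---------|----------|--------------------|
| Arithmetic mean | Symmetric, continuous data | No |
| Median | Skewed data, suspected outliers | Yes |
| Geometric mean | Rates, compounding improvements | Moderate |
| Harmonic mean | Averaging ratios (precision/recall) | Partly (resists large values, sensitive to small ones) |
| Trimmed mean | Data with occasional bad folds | Mostly |
| Mode | Categorical data (error types, classes) | N/A |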

Python Example

```python
import numpy as np
from scipy import stats
from scipy.stats import hmean

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

arith_mean = np.mean(accuracy)
median = np.median(accuracy)
geo_mean = stats.gmean(accuracy)
harm_mean = hmean(accuracy)
# trim_mean cuts int(proportion * n) values per end. With n = 6, 0.1 truncates
# to 0 values, so 0.2 is used to drop exactly one score from each end.
trimmed = stats.trim_mean(accuracy, 0.2)

print(f"Arithmetic mean: {arith_mean:.3f}")
print(f"Median:          {median:.3f}")
print(f"Geometric mean:  {geo_mean:.3f}")
print(f"Harmonic mean:   {harm_mean:.3f}")
print(f"Trimmed mean:    {trimmed:.3f}")

precision, recall = 0.88, 0.76
f1 = 2 * precision * recall / (precision + recall)
# hmean of two values is already 2ab / (a + b), so no extra factor of 2 is needed.
f1_via_hmean = hmean([precision, recall])
print(f"\nF1 score:        {f1:.3f}")
print(f"F1 via hmean:    {f1_via_hmean:.3f}")
```

Output:

```
Arithmetic mean: 0.838
Median:          0.835
Geometric mean:  0.837
Harmonic mean:   0.836
Trimmed mean:    0.835

F1 score:        0.816
F1 via hmean:    0.816
```

Calculation Trace

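| Phase | Formula | Values | Result |
|-------|---------|--------|--------|
| Arithmetic mean | Σx / n | 5.03 / 6 | 0.838 |
| Median | Middle two values | (0.82 + 0.85) / 2 | 0.835 |
| Geometric mean | (Πx)^(1/n) | 0.3439^(1/6) | 0.837 |
| Harmonic mean | n / Σ(1/x) | 6 / 7.179 | 0.836 |
| Trimmed mean (one value cut per end) | Drop extremes, average rest | (0.79 + 0.82 + 0.85 + 0.88) / 4 | 0.835 |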

The previous posts established how to distinguish a population parameter from a sample statistic. The measures here — mean, median, geometric mean, harmonic mean — are descriptive statistics that summarize samples. The next post covers dispersion: knowing the center of your CV scores is only half the picture. A model with mean accuracy 0.838 and range 0.13 is very different from one with mean 0.838 and range 0.02. From here, the path leads to variance, standard deviation, and eventually to confidence intervals and hypothesis tests for comparing model performance.

When This Breaks Down

The arithmetic mean faithfully represents "typical" only when the distribution is roughly symmetric. For CV accuracy distributions, this usually holds. But if your dataset has severe class imbalance and you are not using stratified folds, one fold may return near-chance accuracy — pulling the mean far below what most folds show. In that case, report median accuracy alongside the mean, and always investigate outlier folds rather than averaging past them. With fewer than 5 observations, no single central tendency measure is reliable; report all fold values individually.
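One way to follow this advice is to report several numbers together rather than a single average. The helper below is a hypothetical sketch: the name `summarize_folds` and the returned keys are inventions for illustration, and the `min_folds=5` default simply mirrors the "fewer than 5 observations" rule of thumb above.

```python
import statistics


def summarize_folds(scores, min_folds=5):
    """Report mean and median together; include raw scores when folds are few."""
    summary = {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
    }
    if len(scores) < min_folds:
        # Too few folds for any single summary to be trustworthy.
        summary["all_scores"] = sorted(scores)
    return summary


accuracy_skewed = [0.82, 0.79, 0.91, 0.85, 0.78, 0.40]
for key, value in summarize_folds(accuracy_skewed).items():
    print(f"{key}: {value:.3f}")
```

Seeing the minimum next to the mean and median makes a failed fold visible immediately, which is the cue to investigate it rather than average past it.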

Test Your Understanding

  1. For accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88], the arithmetic mean is 0.838 and the geometric mean is 0.837. Why is the geometric mean always less than or equal to the arithmetic mean? Under what condition would they be equal?

  2. A model achieves precision = 0.95 on class A but recall = 0.20. Compute the F1 score. Now compute what would happen if you used the arithmetic mean of precision and recall instead. Why is the arithmetic mean misleading here?

  3. You run 6-fold CV and get accuracy = [0.84, 0.83, 0.85, 0.82, 0.84, 0.23]. The last fold clearly failed (data leakage was discovered). Compute the 10% trimmed mean. Does it appropriately handle this bad fold?

  4. Why does the harmonic mean penalize imbalance between precision and recall more severely than the arithmetic mean? Construct an example where precision = 1.0 and recall = 0.01, and show the arithmetic mean would give a misleading answer.

  5. Epoch-over-epoch accuracy improvements are: 1.05×, 0.98×, 1.06×, 1.02×, 0.99×, 1.04×. Should you use the arithmetic or geometric mean to find the average improvement rate? What goes wrong if you use the arithmetic mean?


The next topic is Measure of Dispersion — because knowing the center of your data is not enough; you need to know how spread out it is.

