Central Tendency

Apr 11, 2026•17 min read•By Mohammed Vasim

Statistics, Math, Data Science

When you evaluate a model across multiple cross-validation folds, the first question is: what single number should represent its performance? That question is not as simple as it sounds. The wrong choice of central tendency measure can make a mediocre model look strong, or a strong model look unstable.

Central tendency is about finding a single number that represents what is typical in your dataset. But "typical" is not always obvious, and the choice of how you measure it affects conclusions in ways that are not always intuitive.

The Anchor Dataset

Throughout this post, every calculation uses six cross-validation accuracy scores from a classifier:

python

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
print(accuracy)

text

[0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

These represent six folds of cross-validation on a binary classifier. Each value is the fraction of test examples classified correctly in that fold.

The Mean (Arithmetic Average)

The arithmetic mean adds everything up and divides by the count:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Step 1 — Sum the accuracy scores:

$0.82 + 0.79 + 0.91 + 0.85 + 0.78 + 0.88 = 5.03$

Step 2 — Divide by n = 6:

$\bar{x} = \frac{5.03}{6} \approx 0.838$

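The arithmetic mean is the balancing point: the deviations above it and the deviations below it add up to the same total. That is also why it is sensitive to outliers, since every value pulls on it in proportion to its distance. If one fold, say the 0.88, returns 0.40 because of a data bug, the mean drops to about 0.758, which is not representative of the other five folds.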
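The arithmetic above can be checked with a minimal standard-library sketch. The variable names are illustrative, and the 0.40 fold is the hypothetical data-bug scenario, here assumed to replace the 0.88 fold.

```python
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

mean = statistics.fmean(accuracy)  # sum of the scores divided by their count

# Balancing point: deviations from the mean cancel out (up to float rounding).
deviation_sum = sum(x - mean for x in accuracy)

# Hypothetical data bug: the 0.88 fold comes back as 0.40 instead.
with_bug = [0.82, 0.79, 0.91, 0.85, 0.78, 0.40]
mean_with_bug = statistics.fmean(with_bug)

print(f"Mean:                {mean:.3f}")
print(f"Sum of deviations:   {deviation_sum:.1e}")
print(f"Mean with 0.40 fold: {mean_with_bug:.3f}")
```

The sum of deviations prints as a value on the order of floating-point rounding error rather than an exact zero.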

The Median (Middle Value)

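Sort the scores, then pick the middle. The median depends only on rank, not magnitude, which makes it robust to outliers. If the 0.88 fold returned 0.01 instead, it would move from position 5 to position 1 in the sorted order, and the median would shift only from 0.835 to 0.805. It lands on that same 0.805 whether the bad fold scores 0.40, 0.01, or anything else below 0.79.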

Even count (6 values): Sorted: [0.78, 0.79, 0.82, 0.85, 0.88, 0.91]

With six values, there is no single middle. Average positions 3 and 4:

$\text{Median} = \frac{0.82 + 0.85}{2} = 0.835$

Odd count (5 values): Drop the lowest-accuracy fold (0.78 — assume it was excluded from a rerun): [0.79, 0.82, 0.85, 0.88, 0.91]

With five values, position 3 is the single middle:

$\text{Median} = 0.85$

The mean (0.838) and median (0.835) are very close on the original six scores, which tells you the distribution is nearly symmetric — no single fold is dramatically pulling the average.
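The even-count and odd-count cases can be reproduced with `statistics.median`, which sorts internally and averages the two middle values when the count is even. This is a small sketch; `five_folds` is an illustrative name for the rerun that excludes the 0.78 fold.

```python
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

# Even count: the two middle values (0.82 and 0.85) are averaged.
median_even = statistics.median(accuracy)

# Odd count: drop the 0.78 fold, leaving a single middle value.
five_folds = [x for x in accuracy if x != 0.78]
median_odd = statistics.median(five_folds)

print(sorted(accuracy))
print(f"Median of six folds:  {median_even:.3f}")
print(f"Median of five folds: {median_odd:.3f}")
```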

Mean vs Median on Skewed Data

Now imagine a scenario where the model fails catastrophically on one fold — perhaps a fold happened to contain mostly rare-class examples:

accuracy_skewed = [0.82, 0.79, 0.91, 0.85, 0.78, 0.40]

Mean = (0.82 + 0.79 + 0.91 + 0.85 + 0.78 + 0.40) / 6 = 0.758

Median = (0.79 + 0.82) / 2 = 0.805

The mean (0.758) is now being dragged toward the outlier fold. The median (0.805) better represents the five typical folds. This is why median CV accuracy is sometimes preferred when you suspect occasional bad folds from data quality issues.
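To make the comparison concrete, the sketch below measures how far each summary moves when the 0.88 fold is replaced by 0.40. The helper name `summarize` is hypothetical and exists only for this illustration.

```python
import statistics

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
accuracy_skewed = [0.82, 0.79, 0.91, 0.85, 0.78, 0.40]

def summarize(scores):
    """Return (mean, median) for a list of fold scores."""
    return statistics.fmean(scores), statistics.median(scores)

mean_ok, median_ok = summarize(accuracy)
mean_bad, median_bad = summarize(accuracy_skewed)

# How far each measure moved because of the single bad fold.
mean_shift = mean_ok - mean_bad
median_shift = median_ok - median_bad

print(f"Mean:   {mean_ok:.3f} -> {mean_bad:.3f} (shift {mean_shift:.3f})")
print(f"Median: {median_ok:.3f} -> {median_bad:.3f} (shift {median_shift:.3f})")
```

One bad fold moves the mean by 0.08 but the median by only 0.03, and the median would not move any further if the bad fold were even worse.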

The Mode

The mode is the value that appears most often. For continuous CV scores like ours, no values repeat — mode is undefined on that data. Mode becomes meaningful with categorical data or discrete counts. To show it, we need a sub-dataset where repetition exists.

Anchor exception — model-serving error types: To show mode, we need categorical data. Here are error types from a model-serving log (10 events):

python

error_types = ["timeout", "null_ptr", "timeout", "oom", "timeout",
               "null_ptr", "null_ptr", "timeout", "null_ptr", "timeout"]

Frequency count:

Error type	Count
timeout	5
null_ptr	4
oom	1

The mode is timeout — it occurs more than any other value.

Three cases to know:

Unimodal — one clear mode (our example: timeout). This is the normal case; there is one dominant category.

Bimodal — two categories tie for highest frequency. If timeout and null_ptr both appeared 5 times, both are modes and the dataset is bimodal. Bimodal data often signals two distinct failure populations — for example, two different server configurations producing different error patterns. Reporting a single mode would hide this.

No mode — all values appear exactly once. If each of the 10 errors were a different type, mode is undefined — not zero, not all of them, just undefined.

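Why statistics.multimode() instead of statistics.mode(): In Python < 3.8, statistics.mode() raises StatisticsError when two or more values tie. In Python 3.8+, it returns the first one encountered, which silently hides the tie. statistics.multimode() (added in 3.8) always returns a list of every value tied for the highest count, so it is the right tool when you need to detect bimodality. One caveat: when no value repeats, every value ties at a count of one, so multimode() returns all of them. Treat that result as "no mode".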

python

import statistics

error_types = ["timeout", "null_ptr", "timeout", "oom", "timeout",
               "null_ptr", "null_ptr", "timeout", "null_ptr", "timeout"]

modes = statistics.multimode(error_types)
print(f"Mode(s): {modes}")

bimodal = ["timeout", "null_ptr", "timeout", "null_ptr", "oom"]
print(f"Bimodal case: {statistics.multimode(bimodal)}")

no_mode = ["timeout", "null_ptr", "oom", "rate_limit", "memory"]
print(f"No mode case: {statistics.multimode(no_mode)}")

text

Mode(s): ['timeout']
Bimodal case: ['timeout', 'null_ptr']
No mode case: ['timeout', 'null_ptr', 'oom', 'rate_limit', 'memory']
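Because `multimode()` returns every value when nothing repeats, separating the three cases takes one extra check on the counts. The helper below is a hypothetical sketch (the name `describe_modes` and its labels are not part of any library), built on `collections.Counter`, under the convention that a sample with no repeated value has no mode.

```python
from collections import Counter

def describe_modes(values):
    """Classify a sample as 'unimodal', 'multimodal', or 'no mode'.

    Convention assumed here: if every value appears exactly once
    (or the sample is empty), the mode is treated as undefined.
    """
    counts = Counter(values)
    if not counts or max(counts.values()) == 1:
        return "no mode", []
    top = max(counts.values())
    # Counter keeps first-seen order, so ties come back in input order.
    modes = [value for value, count in counts.items() if count == top]
    label = "unimodal" if len(modes) == 1 else "multimodal"
    return label, modes

error_types = ["timeout", "null_ptr", "timeout", "oom", "timeout",
               "null_ptr", "null_ptr", "timeout", "null_ptr", "timeout"]

print(describe_modes(error_types))
print(describe_modes(["timeout", "null_ptr", "timeout", "null_ptr", "oom"]))
print(describe_modes(["timeout", "null_ptr", "oom", "rate_limit", "memory"]))
```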

Mode is most useful for: the most frequent prediction class in a multi-class classifier, the dominant error type from a model-serving log, the most common hyperparameter value discovered by a random search.

Weighted Mean

Sometimes different observations should contribute unequally. In a multi-class classification problem, weighting each class's score by the class's size makes the summary reflect the data as it is actually distributed, at the cost of letting the majority class dominate the metric.

$\bar{x}_{w} = \frac{\sum w_i \cdot x_i}{\sum w_i}$

Consider three CV folds of different sizes, each weighted by its share of the data:

Fold	Accuracy	Fold-size weight
1	0.82	0.4 (large fold)
2	0.79	0.2 (small fold)
3	0.88	0.4 (large fold)

$\bar{x}_{w} = \frac{(0.4 \times 0.82) + (0.2 \times 0.79) + (0.4 \times 0.88)}{0.4 + 0.2 + 0.4} = \frac{0.328 + 0.158 + 0.352}{1.0} = 0.838$

This is the principle behind macro vs. weighted-average F1 scores. Macro-average treats every class equally — it IS the arithmetic mean of per-class F1 scores. Weighted-average weights each class by its support count — it IS the weighted mean of per-class F1 scores. The two numbers can diverge significantly on imbalanced datasets.
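A short NumPy sketch of both ideas follows. The fold accuracies and weights come from the table above; the per-class F1 scores and support counts are hypothetical numbers chosen only to show how far macro and weighted averages can drift apart on imbalanced data.

```python
import numpy as np

# Fold-level example from the table above.
fold_accuracy = np.array([0.82, 0.79, 0.88])
fold_weight = np.array([0.4, 0.2, 0.4])
weighted_accuracy = np.average(fold_accuracy, weights=fold_weight)

# Hypothetical per-class F1 scores and support counts (illustration only).
per_class_f1 = np.array([0.95, 0.60, 0.40])
support = np.array([900, 80, 20])

macro_f1 = per_class_f1.mean()                           # every class counts equally
weighted_f1 = np.average(per_class_f1, weights=support)  # classes count by support

print(f"Weighted fold accuracy: {weighted_accuracy:.3f}")
print(f"Macro F1:               {macro_f1:.3f}")
print(f"Weighted F1:            {weighted_f1:.3f}")
```

With these made-up numbers the macro average is 0.650 while the weighted average is 0.911, because the large, easy class carries 90% of the weight.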

Geometric Mean

The arithmetic mean assumes you are adding quantities. When you are multiplying rates or compounding improvements, the geometric mean is the correct measure.

$GM = \left(\prod_{i=1}^{n} x_i\right)^{1/n} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$

For our six accuracy scores:

Step 1 — Multiply all six values:

$0.82 \times 0.79 \times 0.91 \times 0.85 \times 0.78 \times 0.88 \approx 0.3439$

Step 2 — Take the sixth root:

$GM = (0.3439)^{1/6} \approx 0.837$

The geometric mean (0.837) is slightly lower than the arithmetic mean (0.838). For non-negative values, the geometric mean is never larger than the arithmetic mean. The intuition: when values vary, their product shrinks relative to their sum. When two values are equal, say both 0.85, you get 0.85 × 0.85 = 0.7225 and the square root gives exactly 0.85, matching the arithmetic mean. But replace one with 0.70 and the other with 1.00 (same arithmetic mean of 0.85): the product is 0.70, and its square root is 0.837, already below 0.85. The further apart the values, the further the geometric mean falls below the arithmetic mean. Equal values give GM = AM; any variability makes GM < AM.
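For two non-negative values $a$ and $b$, the inequality follows from the fact that a square is never negative. A short derivation:

$(\sqrt{a} - \sqrt{b})^{2} \geq 0 \;\Rightarrow\; a - 2\sqrt{ab} + b \geq 0 \;\Rightarrow\; \frac{a + b}{2} \geq \sqrt{ab}$

Equality holds only when $\sqrt{a} = \sqrt{b}$, that is, when $a = b$. The same result holds for any number of non-negative values.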

DS use case — epoch-over-epoch accuracy improvement: If your model's accuracy improves by factors of 1.05, 1.03, 1.02, 0.99, 1.04, 1.01 across six training phases, the average improvement rate is the geometric mean of those factors, not their arithmetic mean. The arithmetic mean would overestimate compound growth.
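The claim can be checked directly: a constant per-phase factor equal to the geometric mean compounds to the true total, while the arithmetic mean compounds to something larger. This is a minimal sketch using the six factors quoted above and standard-library functions available in Python 3.8+.

```python
import math
from statistics import fmean, geometric_mean

factors = [1.05, 1.03, 1.02, 0.99, 1.04, 1.01]
n = len(factors)

total_growth = math.prod(factors)  # actual compound change over six phases
gm = geometric_mean(factors)       # constant factor that reproduces the total
am = fmean(factors)                # arithmetic mean, for comparison

print(f"Total growth:    {total_growth:.4f}")
print(f"Geometric mean:  {gm:.4f} -> compounds to {gm ** n:.4f}")
print(f"Arithmetic mean: {am:.4f} -> compounds to {am ** n:.4f}")
```

The gap is small here because the factors are close together; it widens as the factors become more variable.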

Harmonic Mean

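The harmonic mean is the correct average for rates and ratios that share the same numerator quantity, such as speeds measured over equal distances. It is the reciprocal of the arithmetic mean of the reciprocals: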

$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$

For our six accuracy scores:

$HM = \frac{6}{\frac{1}{0.82} + \frac{1}{0.79} + \frac{1}{0.91} + \frac{1}{0.85} + \frac{1}{0.78} + \frac{1}{0.88}}$

$HM = \frac{6}{1.220 + 1.266 + 1.099 + 1.176 + 1.282 + 1.136} = \frac{6}{7.179} \approx 0.836$

The F1 score IS the harmonic mean of precision and recall. This is not a coincidence — it is why F1 was defined that way.

$F_1 = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}} = HM(\text{Precision}, \text{Recall})$

No extra factor needed. For two values $p$ and $r$, the harmonic mean formula gives $\frac{2}{1/p + 1/r} = \frac{2pr}{p + r}$, which is exactly F1. When both are 1, $HM(1, 1) = 1$.

The harmonic mean penalizes imbalance between precision and recall more than the arithmetic mean would. Consider a model that finds almost nothing but never makes a mistake: precision = 1.0, recall = 0.01.

Arithmetic mean: $(1.0 + 0.01) /2 = 0.505$ — looks like a decent model
Harmonic mean: $2 \times 1.0 \times 0.01/ (1.0 + 0.01) = 0.02/1.01 \approx 0.0198$ — near zero

The HM is near zero because recall is near zero, which is the correct signal: a model that sacrifices recall almost entirely to achieve perfect precision is nearly useless. The arithmetic mean's 0.505 would hide that. This is the feature you want from F1.
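The same comparison in code, as a minimal sketch. The helper `f1_from_pr` is a hypothetical name for this illustration (it is not a library function), and returning 0.0 when both inputs are zero is an assumed convention.

```python
from statistics import harmonic_mean

def f1_from_pr(precision, recall):
    """F1 as the harmonic mean of precision and recall.

    Returns 0.0 when both are zero (assumed convention to avoid 0/0).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 1.0, 0.01

arithmetic = (precision + recall) / 2
f1 = f1_from_pr(precision, recall)
hm = harmonic_mean([precision, recall])  # same value via the standard library

print(f"Arithmetic mean: {arithmetic:.3f}")
print(f"F1:              {f1:.4f}")
print(f"Harmonic mean:   {hm:.4f}")
```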

Trimmed Mean

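A trimmed mean removes a fixed fraction of the values from each end of the sorted data before averaging. With only six scores, a 10% trim would remove 0.6 of a value per side, which implementations such as scipy.stats.trim_mean round down to zero, leaving the plain mean. The smallest trim that removes anything here is one value per side (about 17%). Removing the lowest (0.78) and highest (0.91) and averaging the remaining four:

$\bar{x}_{\text{trim}} = \frac{0.79 + 0.82 + 0.85 + 0.88}{4} = \frac{3.34}{4} = 0.835$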

This is useful when you suspect a few CV folds were corrupted by data quality issues but you do not want to throw away data arbitrarily. The trimmed mean is a middle ground between the mean (uses everything) and the median (uses only the center). Use it when you can articulate a percentage to trim rather than hard-excluding specific folds.
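How many values a given percentage removes depends on the sample size, which matters a lot with only six folds. The sketch below is a hand-rolled trimmed mean, not a library function. It assumes the round-down convention that `scipy.stats.trim_mean` documents, and it also returns the number of values cut per side so the effect is visible.

```python
def trimmed_mean(scores, proportion):
    """Return (mean, k) after dropping k = int(proportion * n) values per side."""
    ordered = sorted(scores)
    k = int(proportion * len(ordered))  # values cut from each end, rounded down
    if 2 * k >= len(ordered):
        raise ValueError("proportion too large: nothing left to average")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept), k

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

for proportion in (0.10, 0.20):
    value, cut = trimmed_mean(accuracy, proportion)
    print(f"trim {proportion:.0%}: cut {cut} per side, mean = {value:.3f}")
```

With six values, a 10% trim cuts nothing and returns the plain mean; a 20% trim cuts one value per side and gives 0.835.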

When to Use Which

Measure	Best For	Outlier Resistant?	DS / ML Use Case
Arithmetic mean	Symmetric, continuous data	No	Average CV accuracy, average loss per epoch
Median	Skewed data, suspected outliers	Yes	Median CV accuracy when one fold is corrupted
Mode	Categorical or discrete data	N/A	Most frequent prediction class, dominant error type from serving logs
Weighted mean	Observations with unequal importance	No	Weighted-average F1 weighted by class support
Geometric mean	Rates, compounding improvements	Moderate	Average epoch-over-epoch improvement factor
Harmonic mean	Averaging rates and ratios	Against large values only; very sensitive to values near zero	F1 score (harmonic mean of precision and recall)
Trimmed mean	Data with occasional bad folds	Mostly	CV accuracy when a fixed % of folds may be corrupted

Connection to Loss Functions

Central tendency measures are not arbitrary summaries — each one is the minimizer of a specific loss function. This is a mathematical identity, not an analogy, and it explains why the choice of loss function implicitly decides what your model predicts.

Mean minimizes squared error (MSE). The value of $c$ that minimizes $\sum (x_i - c)^2$ is the arithmetic mean. Take the derivative with respect to $c$ and set it to zero: $-2 \sum (x_i - c) = 0$, so $c = \sum x_i / n = \bar{x}$. A regression model trained on MSE is therefore estimating the conditional mean of $y$ given $x$.

To verify: for the accuracy anchor, the mean is 0.838. Compare the sum of squared deviations at three candidate values:

$c$	$\sum (x_i - c)^2$
0.830	0.013500
0.838	0.013084
0.845	0.013350

The mean (0.838) produces the smallest sum of squared deviations.

Median minimizes absolute error (MAE). The value of $c$ that minimizes $\sum |x_i - c|$ is the median. The intuition: if you move $c$ away from the median, more values get further away than get closer, because the median splits the count evenly. With an even count, every value between the two middle points ties for the minimum, and the median is the conventional midpoint. A regression model trained on MAE is estimating the conditional median, which is why MAE-trained models are more robust to outlier targets.

Mode minimizes 0-1 loss. The value of $c$ that minimizes $\sum \mathbf{1}(x_i \neq c)$, the count of mismatches, is the mode. The mode is the value that is "wrong" least often. A multi-class classifier trained on cross-entropy estimates class probabilities; predicting the most probable class (the argmax) is predicting the conditional mode, which is the optimal decision under 0-1 loss.

The upshot: when you choose a loss function for a regression problem, you are not just picking an optimization target — you are deciding which measure of central tendency your model will estimate. MSE → mean. MAE → median. This is why switching from MSE to MAE makes regression models less sensitive to outlier labels: you are asking the model to track the median instead of the mean.
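The two regression identities can be checked numerically with a brute-force grid search over candidate values of $c$. The five-score sample below is an assumption made for this sketch: four of the post's folds plus one hypothetical bad fold, with an odd count so the median is a unique minimizer.

```python
import numpy as np

# Illustrative sample: four typical folds plus one hypothetical bad fold.
scores = np.array([0.79, 0.82, 0.85, 0.88, 0.40])

# Grid of candidate values for c, spaced 0.001 apart.
candidates = np.linspace(0.30, 1.00, 701)

# Rows are candidates, columns are data points.
residuals = scores[None, :] - candidates[:, None]
squared_loss = (residuals ** 2).sum(axis=1)
absolute_loss = np.abs(residuals).sum(axis=1)

best_squared = candidates[squared_loss.argmin()]
best_absolute = candidates[absolute_loss.argmin()]

print(f"Squared-loss minimizer:  {best_squared:.3f} (mean   = {scores.mean():.3f})")
print(f"Absolute-loss minimizer: {best_absolute:.3f} (median = {np.median(scores):.3f})")
```

The squared-loss minimizer lands on the mean, which the bad fold has dragged down, while the absolute-loss minimizer stays at the median among the typical folds.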

Python Example

python

import numpy as np
from scipy import stats

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

arith_mean = np.mean(accuracy)
median = np.median(accuracy)
geo_mean = stats.gmean(accuracy)
harm_mean = stats.hmean(accuracy)
# trim_mean cuts int(proportion * n) values per side; with n = 6,
# 0.1 would cut none, so 0.2 is used to drop one value from each end.
trimmed = stats.trim_mean(accuracy, 0.2)

print(f"Arithmetic mean: {arith_mean:.3f}")
print(f"Median:          {median:.3f}")
print(f"Geometric mean:  {geo_mean:.3f}")
print(f"Harmonic mean:   {harm_mean:.3f}")
print(f"Trimmed mean:    {trimmed:.3f}")

# F1 is the harmonic mean of precision and recall.
precision, recall = 0.88, 0.76
f1 = 2 * precision * recall / (precision + recall)
f1_via_hmean = stats.hmean([precision, recall])
print(f"\nF1 score:        {f1:.3f}")
print(f"F1 via HM:       {f1_via_hmean:.3f}")

text

Arithmetic mean: 0.838
Median:          0.835
Geometric mean:  0.837
Harmonic mean:   0.836
Trimmed mean:    0.835

F1 score:        0.816
F1 via HM:       0.816

Calculation Trace

Phase	Formula	Values	Result
Arithmetic mean	$\sum x_{i} / n$	$5.03/6$	$0.838$
Median	Middle two values	$(0.82 + 0.85) /2$	$0.835$
Mode	Most frequent value	timeout appears 5 times	timeout
Weighted mean	$\sum w_{i} x_{i} / \sum w_{i}$	$(0.4 \times 0.82 + 0.2 \times 0.79 + 0.4 \times 0.88) /1.0$	$0.838$
Geometric mean	$(x_{1} \dots x_{n})^{1/n}$	$(0.3439)^{1/6}$	$0.837$
Harmonic mean	$n / \sum (1/ x_{i})$	$6/7.179$	$0.836$
Trimmed mean (one value per side)	Drop extremes, average rest	$(0.79 + 0.82 + 0.85 + 0.88) /4$	$0.835$

Related Concepts

The previous posts established how to distinguish a population parameter from a sample statistic. The measures here (mean, median, geometric mean, harmonic mean) are descriptive statistics that summarize samples. The next post covers dispersion: knowing the center of your CV scores is only half the picture. A model with mean accuracy 0.838 and range 0.13 is very different from one with mean 0.838 and range 0.02. From here, the path leads to variance, standard deviation, and eventually to confidence intervals and hypothesis tests for comparing model performance.

When This Breaks Down

The arithmetic mean faithfully represents "typical" only when the distribution is roughly symmetric. For CV accuracy distributions, this usually holds. But if your dataset has severe class imbalance and you are not using stratified folds, one fold may return near-chance accuracy — pulling the mean far below what most folds show. In that case, report median accuracy alongside the mean, and always investigate outlier folds rather than averaging past them. With fewer than 5 observations, no single central tendency measure is reliable; report all fold values individually.
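One way to put this advice into practice is a small reporting helper that always returns the raw fold values, and adds the mean, the median, and the gap between them only when there are enough folds. The function name `summarize_folds` and the `min_folds=5` default are assumptions for this sketch, with the default taken from the rule of thumb above.

```python
import statistics

def summarize_folds(scores, min_folds=5):
    """Summarize CV scores; with too few folds, return only the raw values."""
    summary = {"folds": sorted(scores)}
    if len(scores) >= min_folds:
        mean = statistics.fmean(scores)
        median = statistics.median(scores)
        # A large gap between mean and median hints at an outlier fold.
        summary.update(mean=mean, median=median, gap=mean - median)
    return summary

healthy = summarize_folds([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
one_bad = summarize_folds([0.82, 0.79, 0.91, 0.85, 0.78, 0.40])
too_few = summarize_folds([0.82, 0.79, 0.91])

print(f"Healthy gap: {healthy['gap']:+.3f}")
print(f"One bad gap: {one_bad['gap']:+.3f}")
print(f"Too few:     {too_few}")
```

A gap near zero is consistent with a roughly symmetric set of folds; a clearly negative gap is a prompt to look at the individual folds rather than a verdict on its own.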

Test Your Understanding

For accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88], the arithmetic mean is 0.838 and the geometric mean is 0.837. Why is the geometric mean always less than or equal to the arithmetic mean? Under what condition would they be equal?
A model achieves precision = 0.95 on class A but recall = 0.20. Compute the F1 score. Now compute what would happen if you used the arithmetic mean of precision and recall instead. Why is the arithmetic mean misleading here?
You run 6-fold CV and get accuracy = [0.84, 0.83, 0.85, 0.82, 0.84, 0.23]. The last fold clearly failed (a data bug was discovered). Compute the trimmed mean that drops one value from each end. Does it appropriately handle this bad fold?
Why does the harmonic mean penalize imbalance between precision and recall more severely than the arithmetic mean? Construct an example where precision = 1.0 and recall = 0.01, and show the arithmetic mean would give a misleading answer.
Epoch-over-epoch accuracy improvements are: 1.05×, 0.98×, 1.06×, 1.02×, 0.99×, 1.04×. Should you use the arithmetic or geometric mean to find the average improvement rate? What goes wrong if you use the arithmetic mean?
