Percentiles and Quartiles

Jun 14, 2026 · 15 min read · By Mohammed Vasim
Tags: Statistics, Math, Data Science

The mean tells you where a distribution is centered. Standard deviation tells you how spread out it is. But neither answers the question you usually care most about: where does this specific value rank?

A fold accuracy of 0.91 might be impressive or merely average — you cannot tell from the number alone. Percentiles answer exactly this: they turn raw values into positions, letting you see any data point relative to the rest of the distribution. Quartiles apply the same idea at three fixed positions (25%, 50%, 75%) to give the distribution's skeletal shape at a glance.

The Anchor Dataset

Every calculation in this post uses the same six cross-validation accuracy scores from a binary classifier:

python
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

Sorted ascending: [0.78, 0.79, 0.82, 0.85, 0.88, 0.91]

Six folds, six values. Small enough to trace by hand, realistic enough to mean something.

Phase 1 — Sort and Assign Positions

Percentiles are rank-based statistics. Before computing any percentile, you need the values arranged from smallest to largest. The raw order in which folds were evaluated is irrelevant — what matters is each value's rank in the sorted sequence.

Positions are zero-indexed (0 through 5) because that is how the interpolation formula works. Position 0 holds the lowest value; position 5 holds the highest.

pos 0: 0.78 | pos 1: 0.79 | pos 2: 0.82 | pos 3: 0.85 | pos 4: 0.88 | pos 5: 0.91

The rank-based nature of percentiles is what makes them robust. A corrupted fold returning 0.50 instead of 0.82 would drag the mean from 0.838 down to 0.785, but the 25th percentile only registers that 0.50 now sits at position 0: Q1 moves from 0.7975 to 0.7825, a small shift rather than a distortion that ripples through every other calculation.
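To make the robustness claim concrete, here is a minimal sketch that corrupts the 0.82 fold and compares how far the mean and Q1 move. It assumes the standard library's `statistics.quantiles` with `method="inclusive"`, which applies the same (n − 1)-based linear interpolation used throughout this post; the names `clean`, `corrupted`, and `q1` are illustrative only.

```python
import statistics

clean = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
# Hypothetical corruption: the 0.82 fold comes back as 0.50.
corrupted = [0.50 if v == 0.82 else v for v in clean]

def q1(values):
    # quantiles() sorts internally; index 0 of the n=4 cut points is Q1.
    return statistics.quantiles(values, n=4, method="inclusive")[0]

mean_shift = statistics.mean(clean) - statistics.mean(corrupted)
q1_shift = q1(clean) - q1(corrupted)

print(f"mean shift: {mean_shift:.4f}")
print(f"Q1 shift:   {q1_shift:.4f}")
```

The mean absorbs the full size of the error divided by n, while Q1 only moves as far as the gap between neighboring sorted values.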

Phase 2 — Computing a Percentile by Interpolation

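The standard method (NumPy's default, method='linear') computes a real-valued position L within the sorted array x of n values, then interpolates between the values at the two nearest integer positions:

$$L = \frac{p}{100}\,(n - 1), \qquad P_p = x_{\lfloor L \rfloor} + \bigl(L - \lfloor L \rfloor\bigr)\,\bigl(x_{\lfloor L \rfloor + 1} - x_{\lfloor L \rfloor}\bigr)$$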

The fractional part of L tells you how far to travel between the two surrounding values. If L lands exactly on an integer, no interpolation is needed — you return that value directly.

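Worked example: the 75th percentile.

$$L = 0.75 \times (6 - 1) = 3.75$$

The floor is 3, the fractional part is 0.75. Read off the values at positions 3 and 4:

$$P_{75} = 0.85 + 0.75 \times (0.88 - 0.85) = 0.8725$$

The 75th percentile of these six folds is 0.8725. It is an interpolated cut point, not an observed score: four of the six folds fall below it, and with more data the share below approaches 75%.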

[Figure: the sorted folds on a number line, with P75 = 0.8725 marked between position 3 (0.85, the floor) and position 4 (0.88).]
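The worked example can be traced in code. This is a small sketch of the zero-indexed linear method described above, not NumPy's actual implementation; `percentile_trace` is a hypothetical helper that returns the intermediate quantities so you can compare them with the hand calculation.

```python
import math

def percentile_trace(sorted_values, p):
    """Return (position, floor index, fraction, value) for the p-th percentile."""
    n = len(sorted_values)
    position = (p / 100) * (n - 1)      # real-valued position L
    lo = math.floor(position)           # index of the value at or below L
    frac = position - lo                # how far to travel toward the next value
    hi = min(lo + 1, n - 1)             # clamp so p = 100 stays in range
    value = sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])
    return position, lo, frac, value

folds = sorted([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
position, lo, frac, p75 = percentile_trace(folds, 75)
print(f"L={position}, floor={lo}, frac={frac}, P75={p75:.4f}")
```

Returning the intermediate values is a deliberate choice here: it makes it easy to see that a wrong answer came from the position, the floor, or the interpolation step.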

There is also an older textbook formula that uses L = p/100 × (n + 1) with 1-based indexing. Both are valid interpolation strategies — they just disagree on how to handle positions near the edges. NumPy, pandas, and SciPy all default to the zero-indexed form above. Use that when working in Python.
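You can see the two conventions disagree without leaving the standard library. In Python's `statistics.quantiles`, `method="inclusive"` follows the (n − 1) rule used in this post, while `method="exclusive"` (that function's default) follows the (n + 1) textbook rule. A short sketch on the anchor dataset:

```python
import statistics

folds = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

# n=4 asks for the three cut points that split the data into quarters.
inclusive = statistics.quantiles(folds, n=4, method="inclusive")  # (n - 1) rule
exclusive = statistics.quantiles(folds, n=4, method="exclusive")  # (n + 1) rule

for name, (q1, q2, q3) in [("inclusive", inclusive), ("exclusive", exclusive)]:
    print(f"{name:9s}  Q1={q1:.4f}  Q2={q2:.4f}  Q3={q3:.4f}")
```

The median agrees under both rules; Q1 and Q3 are pushed further toward the edges by the exclusive rule, which is exactly the edge-handling difference described above.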

Phase 3 — Quartiles Q1, Q2, Q3

Quartiles are just percentiles at three specific values: Q1 = P25, Q2 = P50, Q3 = P75. Nothing new mechanically — the same position formula applies.

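Q1 (25th percentile):

$$L = 0.25 \times 5 = 1.25, \qquad Q_1 = 0.79 + 0.25 \times (0.82 - 0.79) = 0.7975$$

Q2 (50th percentile, the median):

$$L = 0.50 \times 5 = 2.5, \qquad Q_2 = 0.82 + 0.5 \times (0.85 - 0.82) = 0.835$$

Q3 (75th percentile):

$$L = 0.75 \times 5 = 3.75, \qquad Q_3 = 0.85 + 0.75 \times (0.88 - 0.85) = 0.8725$$

Q1 = 0.7975 is the cut point below which roughly a quarter of the distribution lies. Q2 = 0.835 is the median. Q3 = 0.8725 is the cut point below which roughly three quarters lies. With only six folds the fractions are approximate (two of six scores sit below Q1, four of six below Q3), so these three numbers divide the sorted distribution into four roughly equal slices.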

[Figure: the three quartiles (Q1 = 0.7975, Q2 = 0.835, Q3 = 0.8725) divide the sorted distribution into four equal parts.]
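To check how evenly the quartiles actually split six values, the sketch below uses the three quartiles as cut points and drops each fold into its slice. The quartile values are hard-coded from the worked examples in this post, and `slices` is an illustrative name.

```python
from bisect import bisect_left

folds = sorted([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
quartiles = [0.7975, 0.835, 0.8725]  # Q1, Q2, Q3 from the hand calculation

# bisect_left returns how many cut points lie strictly below the value,
# which is the index (0 to 3) of the slice the value belongs to.
slices = [[] for _ in range(4)]
for value in folds:
    slices[bisect_left(quartiles, value)].append(value)

for i, members in enumerate(slices, start=1):
    print(f"slice {i}: {members}")
```

With six values the slices cannot hold 1.5 folds each, so the split comes out uneven; the "four equal parts" description becomes accurate only as the sample grows.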

Phase 4 — IQR and the Box-Plot Model

The interquartile range is the distance between Q3 and Q1:

$$\text{IQR} = Q_3 - Q_1 = 0.8725 - 0.7975 = 0.075$$

It measures the spread of the middle 50% of the data. By dropping the bottom 25% and top 25%, IQR ignores the extremes entirely. Whether the lowest fold scored 0.78 or 0.50, the IQR is unchanged as long as the quartile positions stay the same. That is the key distinction from standard deviation, which squares all deviations and gives extra weight to extreme values.

For the six accuracy folds, IQR = 0.075 means the central half of fold scores spans 7.5 percentage points. That is a narrow window — this model is fairly consistent across folds.
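The contrast with standard deviation is easy to verify. This sketch replaces the lowest fold with 0.50 and compares both spread measures, again assuming `statistics.quantiles(..., method="inclusive")` as a stand-in for the linear method; `iqr` and `low_fold` are illustrative names.

```python
import statistics

def iqr(values):
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

original = [0.78, 0.79, 0.82, 0.85, 0.88, 0.91]
low_fold = [0.50, 0.79, 0.82, 0.85, 0.88, 0.91]  # lowest fold replaced with 0.50

for name, values in [("original", original), ("low fold", low_fold)]:
    print(f"{name:9s}  IQR={iqr(values):.4f}  stdev={statistics.stdev(values):.4f}")
```

The IQR is identical in both cases because Q1 is interpolated between positions 1 and 2, which the change at position 0 never touches. The sample standard deviation grows several times over.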

The box-and-whisker diagram makes this visual. The box spans Q1 to Q3 (the IQR). The median line sits inside the box. The whiskers extend outward to the most extreme values that still fall within the outlier bounds (computed in the next section). Any point beyond the whiskers is plotted individually as an outlier dot.

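[Figure: box plot of the six folds. Whiskers reach 0.78 and 0.91; the box spans Q1 = 0.7975 to Q3 = 0.8725 (IQR = 0.075) with the median line at Q2 = 0.835.]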

Phase 5 — Outlier Detection with Tukey Fences

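The 1.5 × IQR rule (proposed by statistician John Tukey in 1977) defines "fences" beyond which values are flagged as potential outliers:

$$\text{Lower fence} = Q_1 - 1.5 \times \text{IQR} = 0.7975 - 0.1125 = 0.685$$

$$\text{Upper fence} = Q_3 + 1.5 \times \text{IQR} = 0.8725 + 0.1125 = 0.985$$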

Every value in the dataset: 0.78, 0.79, 0.82, 0.85, 0.88, 0.91 — all fall within [0.685, 0.985]. No outliers.

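Now suppose a seventh fold produced 0.50, perhaps because a data-loading bug caused a bad train/test split. That value falls well below 0.685. Recomputing the quartiles on all seven values moves the lower fence only slightly, to 0.665 (Q1 = 0.785, Q3 = 0.865, IQR = 0.08), so 0.50 is still flagged. The IQR-based fence caught it without needing to assume any distribution shape.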
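Here is a minimal sketch of the fence check, run on both the six-fold and the seven-fold dataset. `tukey_fences` and `flag_outliers` are hypothetical helpers, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, assumed here to match the linear method.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond Q1 and Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def flag_outliers(values, k=1.5):
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]

six = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
seven = six + [0.50]  # the suspicious seventh fold

print("six folds:  ", tukey_fences(six), flag_outliers(six))
print("seven folds:", tukey_fences(seven), flag_outliers(seven))
```

Note that the fences are recomputed from whatever data you pass in, so the suspect value itself influences them. Here the influence is small because the quartiles barely move.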

The 1.5 multiplier is a convention, not a mathematical law. For roughly normal data it flags about 0.7% of values as outliers. Some analysts use 3.0 × IQR for "extreme" outliers while treating values between 1.5× and 3.0× as merely suspicious.
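The 0.7% figure can be reproduced from the standard normal distribution itself. This sketch uses `statistics.NormalDist` from the standard library to place the fences on a theoretical normal curve and measure the probability mass outside them; it is a check of the stated figure, not part of the outlier rule.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Probability of landing below the lower fence or above the upper fence.
flagged_fraction = z.cdf(lower) + (1 - z.cdf(upper))
print(f"fences at {lower:.3f} and {upper:.3f}")
print(f"fraction flagged: {flagged_fraction:.4%}")
```

The fences land at roughly 2.7 standard deviations from the mean, which is why the rule behaves like a slightly stricter cousin of a 3-sigma cutoff on normal data.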

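[Figure: number line with the fences at Lower = 0.685 and Upper = 0.985. The six folds (0.78 to 0.91) sit inside the fences; 0.50 falls below the lower fence and is marked as an outlier.]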

Phase 6 — Percentile Rank

Percentile rank answers the reverse question: given a value, what percentage of the dataset falls below it?

$$\text{PR}(x) = \frac{\text{number of values strictly below } x}{n} \times 100$$

The "strictly below" convention means the value itself is not counted. This gives you its rank as a percentage.

For the fold with accuracy = 0.91, five other values fall below it:

$$\text{PR}(0.91) = \frac{5}{6} \times 100 \approx 83.3\%$$

The 0.91 fold sits at the 83rd percentile of this six-fold run. If you compared this model against a benchmark of 50 other models and its accuracy landed at the 83rd percentile of that broader distribution, you would know it outperforms 83% of the competition.

Here are the percentile ranks for all six folds:

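| Accuracy | Values below | Percentile rank |
|----------|--------------|-----------------|
| 0.78     | 0            | 0.0%            |
| 0.79     | 1            | 16.7%           |
| 0.82     | 2            | 33.3%           |
| 0.85     | 3            | 50.0%           |
| 0.88     | 4            | 66.7%           |
| 0.91     | 5            | 83.3%           |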
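The strictly-below count translates directly into a few lines of Python. This is a sketch of that one convention only (other conventions count ties differently), and `percentile_rank` is an illustrative name.

```python
def percentile_rank(values, x):
    """Percentage of values strictly below x; x itself is not counted."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

folds = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
ranks = {v: percentile_rank(folds, v) for v in sorted(folds)}

for value, rank in ranks.items():
    print(f"acc={value:.2f}  rank={rank:.1f}%")
```

Because the function takes any `x`, you can also rank a value that is not in the dataset, for example a new model's accuracy against these six folds.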

A Note on Calculation Methods

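Recent versions of NumPy accept more than a dozen values for the method argument of its percentile function: nine estimators from the statistics literature (the same nine that R offers) plus the simpler lower, higher, nearest, and midpoint variants. The differences are most visible with small datasets: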

python
import math

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

data = sorted(accuracy)
n = len(data)

def pctile_linear(p):
    # Interpolate between the two values surrounding position L
    L = (p / 100) * (n - 1)
    lo = math.floor(L)
    frac = L - lo
    if lo + 1 < n:
        return data[lo] + frac * (data[lo + 1] - data[lo])
    return data[lo]

def pctile_lower(p):
    # Return the observed value at or just below position L
    L = (p / 100) * (n - 1)
    return data[math.floor(L)]

def pctile_midpoint(p):
    # Average the two surrounding values; if L is an integer, both are data[L]
    L = (p / 100) * (n - 1)
    return (data[math.floor(L)] + data[math.ceil(L)]) / 2

for p in [25, 75]:
    lin = pctile_linear(p)
    low = pctile_lower(p)
    mid = pctile_midpoint(p)
    print(f"P{p:2d} — linear: {lin:.4f}  lower: {low:.4f}  midpoint: {mid:.4f}")
text
P25 — linear: 0.7975  lower: 0.7900  midpoint: 0.8050
P75 — linear: 0.8725  lower: 0.8500  midpoint: 0.8650

Three methods, three answers. The linear method interpolates between neighbors and is the NumPy/pandas default. The lower method returns the actual data value just below the target position. The midpoint method averages the two surrounding values. For regulatory reporting or clinical thresholds, always document which method you used.

Python: Full Computation

python
import math

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

sorted_acc = sorted(accuracy)
n = len(sorted_acc)

def pctile(arr, p):
    """Linear-interpolation percentile of a sorted list (zero-indexed position)."""
    L = (p / 100) * (len(arr) - 1)
    lo = math.floor(L)
    frac = L - lo
    if lo + 1 < len(arr):
        return arr[lo] + frac * (arr[lo + 1] - arr[lo])
    return arr[lo]

q1 = pctile(sorted_acc, 25)
q2 = pctile(sorted_acc, 50)
q3 = pctile(sorted_acc, 75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in accuracy if v < lower_fence or v > upper_fence]

print(f"Q1 (25th pct):   {q1:.4f}")
print(f"Q2 / Median:     {q2:.4f}")
print(f"Q3 (75th pct):   {q3:.4f}")
print(f"IQR:             {iqr:.4f}")
print(f"Lower fence:     {lower_fence:.4f}")
print(f"Upper fence:     {upper_fence:.4f}")
print(f"Outliers:        {outliers}")

print("\nPercentile ranks:")
for v in sorted_acc:
    rank = sum(1 for x in accuracy if x < v) / n * 100
    print(f"  acc={v:.2f}  rank={rank:.1f}%")
text
Q1 (25th pct):   0.7975
Q2 / Median:     0.8350
Q3 (75th pct):   0.8725
IQR:             0.0750
Lower fence:     0.6850
Upper fence:     0.9850
Outliers:        []

Percentile ranks:
  acc=0.78  rank=0.0%
  acc=0.79  rank=16.7%
  acc=0.82  rank=33.3%
  acc=0.85  rank=50.0%
  acc=0.88  rank=66.7%
  acc=0.91  rank=83.3%

Calculation Trace

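| Phase | Formula | Values | Result |
|-------|---------|--------|--------|
| Q1 position | L = (p/100) × (n − 1) | 0.25 × 5 = 1.25 | floor=1, frac=0.25 |
| Q1 value | x[1] + frac × (x[2] − x[1]) | 0.79 + 0.25 × 0.03 | 0.7975 |
| Q2 position | L = (p/100) × (n − 1) | 0.50 × 5 = 2.5 | floor=2, frac=0.5 |
| Q2 value | x[2] + frac × (x[3] − x[2]) | 0.82 + 0.5 × 0.03 | 0.835 |
| Q3 value | x[3] + frac × (x[4] − x[3]) | 0.85 + 0.75 × 0.03 | 0.8725 |
| IQR | Q3 − Q1 | 0.8725 − 0.7975 | 0.075 |
| Lower fence | Q1 − 1.5 × IQR | 0.7975 − 0.1125 | 0.685 |
| Upper fence | Q3 + 1.5 × IQR | 0.8725 + 0.1125 | 0.985 |
| Percentile rank (0.91) | (values below / n) × 100 | 5 / 6 × 100 | 83.3% |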

When to Use Which

Percentile rank vs raw value: Report raw values when the audience understands the scale and can interpret the magnitude directly (e.g., accuracy = 0.91 to your ML team). Report percentile rank when comparing across different scales, tasks, or datasets — saying a model is at the 83rd percentile means the same thing regardless of whether accuracy runs from 0–1 or F1 runs from 0–100. Percentile rank is also preferable when presenting to non-technical stakeholders who need relative position, not absolute numbers.

IQR vs standard deviation for spread and outlier detection: IQR makes no distributional assumption — it only uses the ranks of Q1 and Q3. Standard deviation assumes that deviations from the mean are symmetric and that squaring them is meaningful. When a distribution is skewed or has heavy tails (latency, income, error counts), standard deviation amplifies the contribution of the extreme values and gives a misleading picture of typical spread. Use IQR when robustness matters.

Tukey 1.5 × IQR fences vs z-score outlier detection: The z-score method (flag values more than 2 or 3 standard deviations from the mean) assumes a normal distribution. On right-skewed data, the upper fence computed via z-scores is far too generous — many genuine outliers escape. IQR-based fences work on any distribution shape because they are constructed from quantiles, not moments. Prefer Tukey fences whenever you cannot confidently assert normality.
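A related weakness of z-scores shows up even on the small fold dataset: the outlier inflates the very mean and standard deviation used to judge it. The sketch below compares both rules on the seven-fold data with the 0.50 fold included. The threshold of 3 is one of the two common choices mentioned above, and the variable names are illustrative.

```python
import statistics

folds = [0.50, 0.78, 0.79, 0.82, 0.85, 0.88, 0.91]

# z-score rule: flag values more than 3 sample standard deviations from the mean.
mean = statistics.mean(folds)
sd = statistics.stdev(folds)
z_scores = {v: (v - mean) / sd for v in folds}
z_flagged = [v for v, z in z_scores.items() if abs(z) > 3]

# Tukey rule: flag values beyond 1.5 * IQR from the quartiles.
q1, _, q3 = statistics.quantiles(folds, n=4, method="inclusive")
spread = q3 - q1
tukey_flagged = [v for v in folds if v < q1 - 1.5 * spread or v > q3 + 1.5 * spread]

print(f"z-score of 0.50: {z_scores[0.50]:.2f}")
print(f"flagged by z-score rule: {z_flagged}")
print(f"flagged by Tukey fences: {tukey_flagged}")
```

The 0.50 fold never reaches a z-score of 3 in magnitude because it drags the mean toward itself and widens the standard deviation, while the quartile-based fences are barely affected by it.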

Percentiles and quartiles extend the toolkit built in the dispersion post. Variance and standard deviation measure spread but are sensitive to outliers and assume all deviations are symmetric around the mean. The IQR provides the same kind of spread measure without those assumptions: it is the natural complement to the median, just as standard deviation is the natural complement to the mean. Box plots visualize quartiles directly: once you understand Q1, Q2, Q3, and the Tukey fences, a box plot is just that information drawn on a number line. Percentiles are also the building block for probability distributions. The 25th percentile of the standard normal is approximately −0.674, and the quantile function (inverse CDF) is how you convert probability statements back into data-scale values. Confidence intervals, critical values, and p-values all ultimately rely on percentile-style reasoning applied to theoretical distributions.
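The quantile function is available in Python's standard library as `NormalDist.inv_cdf`. A brief sketch of the round trip between probabilities and data-scale values on the standard normal:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Quantile function (inverse CDF): probability -> value on the z scale.
q25 = standard_normal.inv_cdf(0.25)
# CDF goes the other way: value -> probability of falling below it.
round_trip = standard_normal.cdf(q25)
# The familiar two-sided 95% critical value is the 97.5th percentile.
z975 = standard_normal.inv_cdf(0.975)

print(f"25th percentile:   {q25:.4f}")
print(f"CDF at that value: {round_trip:.4f}")
print(f"97.5th percentile: {z975:.4f}")
```

The same pattern applies to any distribution with a known inverse CDF; only the distribution object changes.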

When This Breaks Down

Percentile estimates are unstable with small samples. With six folds, Q1 is computed by interpolating between two data points. A different random train/test split could produce a Q1 anywhere in the 0.78–0.82 range. The estimated quartiles are noisy — they are not reliable estimates of the true population quartiles until you have at least 20–30 observations, and for the tails (P5, P95) you typically need hundreds.

The 1.5 × IQR outlier rule is widely used but not universal. For right-skewed distributions — latencies, error counts, feature importances — the upper fence can be too tight, flagging many legitimate values as outliers. For heavy-tailed distributions, 1.5 × IQR may be too lenient. Treat the flagged values as a prompt for investigation, not a verdict. If your dataset is small or skewed, bootstrap resampling of the quartiles gives a more honest picture of how much uncertainty surrounds the estimate.
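Bootstrap resampling of a quartile takes only a few lines. The sketch below resamples the six folds with replacement and reports a rough 95% percentile interval for Q1. The resample count, the seed, and the helper name `bootstrap_q1` are arbitrary choices for illustration, and with only six observations the interval is itself a coarse estimate.

```python
import random
import statistics

def bootstrap_q1(values, n_resamples=2000, seed=0):
    """Rough 95% percentile interval for Q1 via bootstrap resampling."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(values, k=len(values))  # draw with replacement
        q1 = statistics.quantiles(resample, n=4, method="inclusive")[0]
        estimates.append(q1)
    estimates.sort()
    # Take the 2.5th and 97.5th percentiles of the bootstrap estimates.
    low = estimates[int(0.025 * n_resamples)]
    high = estimates[int(0.975 * n_resamples) - 1]
    return low, high

folds = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
q1_low, q1_high = bootstrap_q1(folds)
print(f"Q1 bootstrap interval: [{q1_low:.4f}, {q1_high:.4f}]")
```

A wide interval relative to the IQR is a signal that the quartile, and any fence built from it, should be read as a rough guide rather than a precise threshold.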

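Software disagreement on quartile values is also real. R offers nine different methods (types 1 through 9) and defaults to type 7, which is the same rule as NumPy's default linear method. Excel ships two functions: QUARTILE.INC follows the same (n − 1) rule, while QUARTILE.EXC uses the (n + 1) rule and returns different values on small samples. If you share a quartile estimate with a colleague working in a different tool, or with a different method selected, expect small differences even on the same data.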

Test Your Understanding

  1. Using the anchor dataset [0.78, 0.79, 0.82, 0.85, 0.88, 0.91], compute Q1 and Q3 by hand with the formula L = (p/100) × (n − 1). Then compute IQR and the two Tukey fences. Would a fold accuracy of 0.65 be flagged as an outlier?

  2. A new fold returns accuracy 0.72. Add it to the dataset (now seven values), re-sort, and recompute Q1, Q3, IQR, and both fences. Does adding 0.72 change which other values (if any) get flagged?

  3. Your model's Q1 accuracy across 100 evaluation runs is 0.81. What does this tell you about reliability that the mean accuracy (0.85) does not?

  4. Two models are evaluated on 30 folds each. Model A has IQR = 0.04; Model B has IQR = 0.12. Both have the same mean accuracy. What does the difference in IQR tell you about deployment risk?

  5. Explain why percentile rank uses "strictly below" in its count. If a dataset has repeated values — say three folds all returning 0.85 — what percentile rank does each 0.85 receive, and why might that feel unintuitive?
