Pareto Distribution

Apr 11, 2026 · 7 min read · By mohammed.vasim
Tags: Statistics, Math, Data Science

In most ML systems, a small fraction of bugs produce the majority of user-facing errors. A small fraction of features contribute the majority of model accuracy. A small fraction of training examples drive the majority of gradient updates. This pattern — where a small minority accounts for a disproportionate majority — is the Pareto Principle, and it has a precise mathematical form: the Pareto distribution. Named after Vilfredo Pareto, who observed in 1906 that 20% of Italy's population owned 80% of its land, the Pareto distribution is the canonical parametric power law. Understanding its mathematics tells you not just where the 80-20 pattern comes from, but when it's a good model and when to expect the distribution to have no finite mean or variance.

The DS/ML anchor

Throughout this post we'll work with bug severity scores from a production ML system. Each bug is scored by its user impact: a composite score with a minimum of x_m = 1.0 and no upper bound. Across the 500 bugs filed in the past year, the severity scores follow a Pareto distribution with shape parameter α = 2.1 and minimum value x_m = 1.0.

The Mathematics

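The Pareto distribution has two parameters: x_m (the minimum value, a scale parameter) and α (the shape parameter, also called the Pareto index). Its PDF is:

$$f(x) = \frac{\alpha\, x_m^{\alpha}}{x^{\alpha+1}}, \qquad x \ge x_m$$

The survival function (complementary CDF):

$$S(x) = P(X > x) = \left(\frac{x_m}{x}\right)^{\alpha}, \qquad x \ge x_m$$

The CDF:

$$F(x) = 1 - \left(\frac{x_m}{x}\right)^{\alpha}, \qquad x \ge x_m$$

With actual values (α = 2.1, x_m = 1.0):

$$f(x) = 2.1\, x^{-3.1}, \qquad S(x) = x^{-2.1}, \qquad F(x) = 1 - x^{-2.1}$$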
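As a quick sanity check on the PDF, survival function, and CDF, here is a minimal pure-Python sketch. The function names (`pareto_pdf`, `pareto_sf`, `pareto_cdf`) are illustrative choices rather than any library's API, and the defaults are simply this post's running parameters.

```python
def pareto_pdf(x, alpha=2.1, xm=1.0):
    """Density alpha * xm**alpha / x**(alpha + 1) for x >= xm, else 0."""
    if x < xm:
        return 0.0
    return alpha * xm**alpha / x ** (alpha + 1)


def pareto_sf(x, alpha=2.1, xm=1.0):
    """Survival function P(X > x) = (xm / x)**alpha for x >= xm, else 1."""
    if x < xm:
        return 1.0
    return (xm / x) ** alpha


def pareto_cdf(x, alpha=2.1, xm=1.0):
    """CDF, computed as the complement of the survival function."""
    return 1.0 - pareto_sf(x, alpha, xm)


# Evaluate all three at a few severity levels
for x in (1.0, 2.0, 5.0):
    print(f"x={x}: pdf={pareto_pdf(x):.4f}  sf={pareto_sf(x):.4f}  cdf={pareto_cdf(x):.4f}")
```

Working with the survival function directly tends to be more convenient for heavy tails, since tail probabilities are the quantity of interest and the CDF is just its complement.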

Moments and Their Surprises

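The mean exists only when α > 1:

$$E[X] = \frac{\alpha\, x_m}{\alpha - 1} = \frac{2.1 \times 1}{1.1} \approx 1.909$$

The variance exists only when α > 2:

$$\mathrm{Var}(X) = \frac{x_m^{2}\, \alpha}{(\alpha - 1)^{2}(\alpha - 2)} = \frac{2.1}{1.1^{2} \times 0.1} \approx 17.36$$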

Standard deviation ≈ 4.17 — large relative to the mean of 1.91, confirming the heavy-tailed character.

When α ≤ 1, the mean is infinite. When α ≤ 2, the variance is infinite. For our bug severity with α = 2.1, variance barely exists — the distribution sits just at the edge of having a finite spread. These aren't mathematical curiosities: infinite variance means the sample variance doesn't converge as you collect more data. The sample mean converges (since α > 1), but much more slowly than for light-tailed distributions.
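The existence conditions can be made explicit in code. The sketch below returns `math.inf` whenever a moment does not exist; the helper names `pareto_mean` and `pareto_variance` are made up for illustration.

```python
import math


def pareto_mean(alpha, xm=1.0):
    """Mean alpha * xm / (alpha - 1); infinite when alpha <= 1."""
    if alpha <= 1:
        return math.inf
    return alpha * xm / (alpha - 1)


def pareto_variance(alpha, xm=1.0):
    """Variance xm^2 * alpha / ((alpha - 1)^2 (alpha - 2)); infinite when alpha <= 2."""
    if alpha <= 2:
        return math.inf
    return xm**2 * alpha / ((alpha - 1) ** 2 * (alpha - 2))


# Walk across the two thresholds alpha = 1 and alpha = 2
for alpha in (0.9, 1.5, 2.1, 3.0):
    mean = pareto_mean(alpha)
    std = math.sqrt(pareto_variance(alpha))
    print(f"alpha={alpha}: mean={mean:.3f}  std={std:.3f}")
```

Note how quickly the standard deviation falls between α = 2.1 and α = 3.0: the (α − 2) factor in the denominator dominates near the threshold.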

PDF

[Figure: PDF of bug severity ~ Pareto(α=2.1, x_m=1), marking x_m = 1, median ≈ 1.39, and mean ≈ 1.91. Median < mean, with a heavy right tail.]

CDF

[Figure: CDF of bug severity ~ Pareto(α=2.1, x_m=1). F(5) ≈ 0.966, meaning about 97% of bugs have severity below 5.]

Trace Table: Bug Severity Calculations

With severity ~ Pareto(α = 2.1, x_m = 1.0):

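| Quantity | Formula | Values | Result |
|---|---|---|---|
| P(severity > 2) | (x_m / x)^α | (1/2)^2.1 = 0.5^2.1 | 0.233 |
| P(severity > 5) | (x_m / x)^α | (1/5)^2.1 = 0.2^2.1 | 0.034 |
| E[severity] | α x_m / (α − 1) | 2.1 × 1 / 1.1 | 1.909 |
| Median | x_m × 2^(1/α) | 1 × 2^(1/2.1) | 1.391 |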

About 23% of bugs have severity above 2, and roughly 3.4% exceed severity 5. These "severity 5+" bugs, though rare, are the ones that cause production outages.
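The median row is a special case of the quantile function, obtained by solving F(x) = p for x, which gives x_p = x_m (1 − p)^(−1/α). A short sketch follows; the name `pareto_quantile` is an assumption for illustration.

```python
def pareto_quantile(p, alpha=2.1, xm=1.0):
    """Inverse CDF: the severity below which a fraction p of bugs fall."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return xm * (1.0 - p) ** (-1.0 / alpha)


# Median and two upper percentiles for the running example
for p in (0.5, 0.95, 0.99):
    print(f"p={p}: severity {pareto_quantile(p):.3f}")
```

The same function doubles as an inverse-transform sampler: feeding it uniform random numbers on [0, 1) yields Pareto-distributed draws.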

The 80-20 Connection

The 80-20 rule corresponds to α = log(5) / log(4) ≈ 1.16. Here's the derivation:

If the top 20% of bugs cause 80% of issues, then the 80th percentile of bugs (the threshold separating "top 20%" from "bottom 80%") has P(X > x_80) = 0.2.

From the survival function: (x_m / x_80)^α = 0.2, giving x_80 = x_m × (1/0.2)^(1/α) = x_m × 5^(1/α).

For α > 1, the share of total severity carried by bugs above x_80 is the tail integral of x·f(x) divided by the mean, which simplifies to (x_m / x_80)^(α − 1) = 0.2^(1 − 1/α). Setting this equal to 0.8 gives 1/α = log(4)/log(5), that is, α = log(5)/log(4) ≈ 1.16. For our bugs with α = 2.1, the same formula gives 0.2^(1 − 1/2.1) ≈ 0.43, so the top 20% of bugs carry about 43% of total severity. The split is less extreme than 80-20 because larger α means lighter tails.
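The derivation can be checked numerically. The sketch below relies on the standard result that, for a Pareto distribution with α > 1, the top fraction p of items holds a share p^(1 − 1/α) of the total; the helper names `top_share` and `alpha_for_split` are invented for this example.

```python
import math


def top_share(p, alpha):
    """Share of the total held by the top fraction p (requires alpha > 1)."""
    if alpha <= 1:
        raise ValueError("the mean is infinite for alpha <= 1, so shares are undefined")
    return p ** (1.0 - 1.0 / alpha)


def alpha_for_split(p, share):
    """Shape parameter at which the top fraction p holds `share` of the total.

    Assumes 0 < p < share < 1, e.g. p=0.2 and share=0.8 for the 80-20 rule.
    """
    return 1.0 / (1.0 - math.log(share) / math.log(p))


print(f"alpha for an 80-20 split : {alpha_for_split(0.2, 0.8):.3f}")  # 1.161
print(f"log(5) / log(4)          : {math.log(5) / math.log(4):.3f}")  # 1.161
print(f"top-20% share, alpha=2.1 : {top_share(0.2, 2.1):.3f}")        # 0.430
```

Inverting the relationship this way is handy when a team reports an observed split (say, "the top 10% of bugs cause half the impact") and you want the α it would imply under a Pareto assumption.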

Python Implementation

python
from scipy import stats
import numpy as np

# Frozen Pareto distribution; scipy names the shape parameter `b`
alpha, xmin = 2.1, 1.0
pareto_rv = stats.pareto(b=alpha, scale=xmin)

# Closed-form summary statistics
mean_sev   = pareto_rv.mean()
median_sev = pareto_rv.median()
std_sev    = pareto_rv.std()

print(f"Mean severity    : {mean_sev:.3f}")
print(f"Median severity  : {median_sev:.3f}")
print(f"Std deviation    : {std_sev:.3f}")

# Tail probabilities via the survival function, plus upper percentiles
print(f"\nP(severity > 2) = {pareto_rv.sf(2):.4f}")
print(f"P(severity > 5) = {pareto_rv.sf(5):.4f}")
print(f"95th percentile  = {pareto_rv.ppf(0.95):.3f}")
print(f"99th percentile  = {pareto_rv.ppf(0.99):.3f}")

# Simulate 500 bugs and measure the share of total severity in the top 20%
bugs = pareto_rv.rvs(size=500, random_state=42)
threshold_80 = np.percentile(bugs, 80)
top_20_impact  = bugs[bugs >= threshold_80].sum()
total_impact   = bugs.sum()
print(f"\nTop 20% of bugs account for {top_20_impact/total_impact*100:.1f}% of total severity")

def estimate_pareto_alpha(data, xmin):
    # Maximum-likelihood estimate of alpha when the minimum xmin is known
    return len(data) / np.sum(np.log(data / xmin))

alpha_hat = estimate_pareto_alpha(bugs, xmin)
print(f"Estimated α = {alpha_hat:.3f}  (true α = {alpha})")
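Output:

Mean severity    : 1.909
Median severity  : 1.391
Std deviation    : 4.166

P(severity > 2) = 0.2333
P(severity > 5) = 0.0341
95th percentile  = 4.164
99th percentile  = 8.962

The last two printed lines (the top-20% share and the estimated α) depend on the simulated sample. For α = 2.1 the theoretical top-20% share is 0.2^(1 − 1/2.1) ≈ 43%, and the estimate of α from 500 draws should land close to 2.1.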

Relationship to Power Law and Zipf

The Pareto distribution is the canonical parametric power law: a power law with an explicit minimum value x_m and closed-form expressions for every moment that exists. Zipf's law, which describes word frequencies and website ranks, is its discrete, rank-based counterpart: the k-th most frequent item has frequency proportional to k^(−1). The link is that if item sizes follow a Pareto distribution with index α, the size of the k-th largest item scales roughly as k^(−1/α), so classic Zipf behavior corresponds to α ≈ 1. This explains why word frequencies, city populations, and software module sizes all show similar rank-frequency patterns.

Pareto is the parametric form of the power law distribution from the previous post, with closed-form expressions for the CDF, quantiles, and moments. The connection to the 80-20 rule makes it the standard model for inequality measurement: for Pareto data, the Lorenz curve and the Gini coefficient are closed-form functions of α. In software engineering, the Pareto distribution of bug severity motivates the practice of fixing the most severe bugs first. When α is small (heavy tail), a few critical bugs contribute a disproportionate fraction of all user impact, so prioritizing by severity has outsized payoff. Pareto tails are also a building block of extreme value theory and of Value-at-Risk calculations in risk management.
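To make the inequality connection concrete: for a Pareto distribution with α > 1, the standard closed forms are a Lorenz curve L(F) = 1 − (1 − F)^(1 − 1/α) and a Gini coefficient of 1 / (2α − 1). The sketch below cross-checks the Gini formula by numerically integrating the Lorenz curve; the function names and the midpoint-rule step count are arbitrary choices for illustration.

```python
def pareto_lorenz(F, alpha):
    """Share of the total held by the bottom fraction F (requires alpha > 1)."""
    return 1.0 - (1.0 - F) ** (1.0 - 1.0 / alpha)


def pareto_gini(alpha):
    """Closed-form Gini coefficient for a Pareto distribution with alpha > 1."""
    return 1.0 / (2.0 * alpha - 1.0)


def gini_from_lorenz(alpha, steps=100_000):
    """Gini = 1 - 2 * (area under the Lorenz curve), via the midpoint rule."""
    area = sum(pareto_lorenz((i + 0.5) / steps, alpha) for i in range(steps)) / steps
    return 1.0 - 2.0 * area


for alpha in (1.161, 2.1, 3.0):
    print(f"alpha={alpha}: Gini closed-form={pareto_gini(alpha):.4f}  "
          f"numeric={gini_from_lorenz(alpha):.4f}")
```

For the bug-severity example (α = 2.1) the closed form gives 1 / 3.2 = 0.3125, a moderate level of concentration compared with the much higher value implied by an 80-20 split.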

Honest Limitations

The Pareto distribution is a power law, so it inherits the limitations of power-law models. Infinite moments at small α make traditional statistics unreliable: when α ≤ 2, the sample variance doesn't stabilize as you add more data, and when α ≤ 1, even the sample mean fails to settle.
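One way to see this instability is to track the sample variance as the sample grows. The sketch below uses only the standard library (`random.paretovariate` draws from a Pareto with x_m = 1); the checkpoint sizes and seed are arbitrary assumptions, and the printed numbers will differ for other seeds.

```python
import math
import random


def sample_variance(xs):
    """Unbiased sample variance, using fsum for numerical stability."""
    m = math.fsum(xs) / len(xs)
    return math.fsum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def running_variance(alpha, checkpoints=(1_000, 10_000, 100_000), seed=0):
    """Sample variance of the first n draws, for each n in checkpoints."""
    rng = random.Random(seed)
    draws = [rng.paretovariate(alpha) for _ in range(max(checkpoints))]
    return {n: sample_variance(draws[:n]) for n in checkpoints}


# alpha = 1.5 has infinite variance; alpha = 3.0 has variance 0.75
for alpha in (1.5, 3.0):
    print(f"alpha={alpha}: {running_variance(alpha)}")
```

With α = 3.0 the estimates would be expected to hover around the true variance of 0.75, while with α = 1.5 there is no finite value to converge to, so the estimate tends to jump whenever a new extreme draw arrives.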

The 80-20 rule is a rough guideline, not a universal truth. Under an exact Pareto distribution the top 20% account for a share 0.2^(1 − 1/α) of the total: about 58% when α = 1.5 and about 34% when α = 3. Only α ≈ 1.16 gives 80%. The specific split depends on α, and assuming 80-20 without checking α can lead to badly miscalibrated expectations.

Real bug severity distributions are rarely perfect Pareto from minimum value to infinity. They often have a lower tail that deviates (many bugs cluster at low severity without the full Pareto structure) and an upper cutoff where severity is capped. Fitting Pareto to such data without checking the fit over the full range will produce misleading parameter estimates.
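A simple way to check the fit over the full range is to compare the empirical CDF with the fitted Pareto CDF and look at the largest gap (the Kolmogorov-Smirnov distance). The sketch below is a rough diagnostic under stated assumptions, not a formal test: the cap at severity 3 is a hypothetical example of an upper cutoff, the helper names are invented, and no p-value is computed because α is estimated from the same data.

```python
import math
import random


def fit_alpha(data, xm=1.0):
    """Maximum-likelihood estimate of alpha for a known minimum xm."""
    return len(data) / math.fsum(math.log(x / xm) for x in data)


def ks_distance(data, alpha, xm=1.0):
    """Largest gap between the empirical CDF and the fitted Pareto CDF."""
    xs = sorted(data)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        fitted = 1.0 - (xm / x) ** alpha
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides
        gap = max(gap, abs(fitted - i / n), abs((i + 1) / n - fitted))
    return gap


rng = random.Random(0)
pure = [rng.paretovariate(2.1) for _ in range(500)]
capped = [min(x, 3.0) for x in pure]  # hypothetical severity cap at 3

for name, sample in (("pure", pure), ("capped", capped)):
    a_hat = fit_alpha(sample)
    print(f"{name:>6}: alpha_hat={a_hat:.3f}  KS distance={ks_distance(sample, a_hat):.3f}")
```

One would expect the capped sample to yield both a larger estimated α and a larger gap, concentrated at the cap where the empirical CDF jumps to 1. Plotting the empirical survival function on log-log axes is a useful complement, since a true Pareto tail appears as a straight line.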

Test Your Understanding

  1. For bug severity ~ Pareto(α = 2.1, x_m = 1), calculate P(severity > 3) and the 90th percentile severity score. Show the full calculation using the survival function formula.

  2. A team wants to allocate 80% of their bug-fixing effort to the bugs accounting for 80% of total user impact. Under Pareto(α = 2.1), what severity threshold corresponds to the top 20% of bugs? (Find the 80th percentile of the distribution.)

  3. If α = 1.5, does the variance of the Pareto distribution exist? What does it mean for a severity distribution to have infinite variance, in practical terms for a bug-tracking system?

  4. Compare two bug distributions: Pareto(α = 1.5, x_m = 1) and Pareto(α = 3.0, x_m = 1). Which has a heavier tail? For which is the mean closer to the median? Compute both means and medians.

  5. With x_m known, the maximum-likelihood estimator for α is α̂ = n / Σ ln(x_i / x_m); the Hill estimator has the same form, with a tail threshold in place of x_m. For a sample of 10 bugs with severity scores {5.2, 7.1, 3.4, 8.9, 4.6, 12.3, 6.8, 4.1, 9.7, 5.5} and x_m = 1, estimate α. Does this α imply finite mean? Finite variance?
