Types of Probability Distributions

Apr 11, 2026 · 7 min read · By mohammed.vasim
Statistics · Math · Data Science

Every time you choose a model in ML (a loss function, a likelihood, a prior), you're implicitly choosing a distribution. A squared-error loss corresponds to assuming normally distributed residuals. A binary cross-entropy loss assumes Bernoulli outputs. A Poisson regression assumes counts with mean equal to variance. Making that choice explicitly, rather than accepting defaults, is what separates a competent practitioner from one who debugs blindly when things go wrong.

The first and most important distinction is whether you're counting things or measuring them.

The DS/ML anchor

Throughout this post we'll use two running examples from the same ML team:

  • Bug counts per two-week sprint: discrete — you count bugs, integers only
  • Model accuracy scores across experiments: continuous — measured as a real number between 0 and 1

These two quantities represent the two fundamental families, and every decision in this post maps back to one of them.

Discrete vs Continuous

When data comes in whole, separate units — bug counts, user clicks, query failures, defective items in a batch — you're dealing with a discrete distribution. Each countable outcome gets its own probability, and P(X = x) is genuinely the probability of that exact integer value.

Continuous distributions apply when you're measuring something that can take any value in a range — accuracy, latency, prediction error. The probability of any single exact point is technically zero. You work with probability density instead, and probability lives in intervals.

| Aspect | Discrete | Continuous |
|---|---|---|
| Values | Separate, countable integers | Any value in an interval |
| Function | PMF (probability mass) | PDF (probability density) |
| Point probability | P(X = x) can be > 0 | P(X = x) = 0 always |
| Example | Bugs per sprint | Accuracy per experiment |
[Figure: decision tree. Counting or measuring? Counting leads to Discrete (Bernoulli, Binomial, Poisson); measuring leads to Continuous (Normal, Uniform, Exponential, Log-Normal).] The counting-vs-measuring question determines which family to start with.
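To make the PMF/PDF distinction concrete, here is a minimal sketch using `scipy.stats`. The parameter values (λ = 2.9 for bug counts, mean 0.847 and standard deviation 0.031 for accuracy) are the ones used in this post's Python Implementation section; the interval [0.84, 0.86] is an arbitrary choice for illustration.

```python
from scipy import stats

# Discrete: the PMF value is itself a probability, so it lies in [0, 1].
p_three_bugs = stats.poisson.pmf(3, 2.9)

# Continuous: the PDF value is a density, not a probability, and can exceed 1.
density_at_mean = stats.norm.pdf(0.847, loc=0.847, scale=0.031)

# Probability for a continuous variable comes from an interval (a CDF difference).
p_interval = (stats.norm.cdf(0.86, loc=0.847, scale=0.031)
              - stats.norm.cdf(0.84, loc=0.847, scale=0.031))

print(f"P(bugs = 3)              = {p_three_bugs:.4f}")
print(f"density at accuracy mean = {density_at_mean:.2f}")
print(f"P(0.84 <= acc <= 0.86)   = {p_interval:.4f}")
```

The density at the mean comes out well above 1, which is a useful reminder that a PDF value should never be read as a probability on its own.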

The Workhorses: Common Discrete Distributions

When modeling counts of binary outcomes, the Bernoulli distribution is the starting point. One trial, two outcomes — a test passes or fails, a deployment succeeds or fails.

When you have multiple independent Bernoulli trials — how many of ten A/B test variants produced a statistically significant lift — the Binomial distribution gives you the probability of exactly k successes.
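As a rough sketch of the Binomial case, assume ten variants that each independently show a significant lift with probability 0.2. Both numbers are hypothetical and chosen only for illustration; the names `n_variants` and `p_lift` are likewise made up here.

```python
from scipy import stats

n_variants = 10   # hypothetical number of independent trials
p_lift = 0.2      # hypothetical per-trial success probability

# P(exactly k successes) for the first few values of k
for k in range(4):
    print(f"P(successes={k}) = {stats.binom.pmf(k, n_variants, p_lift):.4f}")

# P(at least 3 successes) = 1 - P(at most 2), via the survival function
p_at_least_3 = stats.binom.sf(2, n_variants, p_lift)
print(f"P(successes >= 3) = {p_at_least_3:.4f}")

# Sanity check: the PMF sums to 1 over the full support 0..n
total = sum(stats.binom.pmf(k, n_variants, p_lift) for k in range(n_variants + 1))
```

Note that the Binomial model needs every trial to share the same success probability; if the variants differ a lot, this is only an approximation.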

The Poisson distribution applies when counting events at a constant average rate: bugs filed per sprint, model retrain triggers per week, alerts per hour. You only need one parameter — λ, the average rate. For our sprint bug count, if the team historically sees an average of 3.2 bugs per sprint, that's your λ.
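A small sketch of what a single-parameter Poisson model gives you, using the λ = 3.2 from the paragraph above. The threshold of six bugs is an arbitrary example, not something from the team's data.

```python
from scipy import stats

lam = 3.2  # average bugs per sprint, from the example above

# Probability of a completely clean sprint; equals exp(-lam)
p_zero = stats.poisson.pmf(0, lam)

# Probability of six or more bugs: survival function evaluated at 5
p_six_or_more = stats.poisson.sf(5, lam)

# The one parameter fixes both the mean and the variance
mean, var = stats.poisson.stats(lam, moments="mv")

print(f"P(0 bugs)   = {p_zero:.4f}")
print(f"P(>=6 bugs) = {p_six_or_more:.4f}")
print(f"mean = {float(mean):.1f}, variance = {float(var):.1f}")
```

The last line is the property that makes Poisson restrictive: once λ is set, the spread is set too.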

Geometric and Negative Binomial distributions extend these ideas to "how many trials until something happens" — useful for modeling how many experiments until you find a model that meets accuracy requirements.
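For the "trials until success" idea, a minimal Geometric sketch follows. It assumes each experiment independently meets the accuracy requirement with probability 0.15, a hypothetical value. In `scipy.stats.geom` the variable counts the trial on which the first success occurs, so its support starts at 1.

```python
from scipy import stats

p_success = 0.15  # hypothetical chance that one experiment meets the bar

# Expected number of experiments until the first success: 1 / p
expected_trials = stats.geom.mean(p_success)

# Probability that the first success arrives within the first 5 experiments
p_within_5 = stats.geom.cdf(5, p_success)

print(f"expected experiments until first success = {expected_trials:.2f}")
print(f"P(first success within 5 experiments)    = {p_within_5:.4f}")
```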

The Classics: Common Continuous Distributions

The Normal (Gaussian) distribution is the default choice for continuous data affected by many small independent factors. Accuracy scores across many randomly-seeded training runs often look approximately normal because random initialization, data shuffling, and other small sources of variation add together.
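The "many small independent factors add together" argument can be checked with a quick simulation. This is a sketch under made-up assumptions: each run's accuracy is a base of 0.85 plus 30 independent perturbations drawn uniformly from [-0.01, 0.01]. None of these numbers come from real experiments.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical model: 5000 runs, each with 30 small independent effects
effects = rng.uniform(-0.01, 0.01, size=(5000, 30))
accuracy = 0.85 + effects.sum(axis=1)

# For a Normal distribution, skewness and excess kurtosis are both 0
skewness = stats.skew(accuracy)
excess_kurtosis = stats.kurtosis(accuracy)

print(f"mean = {accuracy.mean():.4f}, std = {accuracy.std(ddof=1):.4f}")
print(f"skewness = {skewness:.3f}, excess kurtosis = {excess_kurtosis:.3f}")
```

Even though each individual effect is uniform rather than bell-shaped, the sum ends up with skewness and excess kurtosis close to zero.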

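The Uniform distribution treats all values in an interval as equally likely. In ML practice, uniform distributions appear as priors in hyperparameter search: you might sample learning rates from [1e-5, 1e-1] uniformly on a log scale, meaning the exponent is uniform on [-5, -1] and the learning rate itself is log-uniform, so each decade is equally likely.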
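Sampling "uniformly on a log scale" is easy to get wrong, so here is a short sketch with NumPy. It draws the exponent uniformly and then exponentiates, and contrasts that with drawing the learning rate itself uniformly. The sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Uniform on a log scale: draw the base-10 exponent uniformly from [-5, -1)
exponents = rng.uniform(-5, -1, size=n_samples)
learning_rates = 10.0 ** exponents

# For contrast: uniform on the raw interval [1e-5, 1e-1)
linear_rates = rng.uniform(1e-5, 1e-1, size=n_samples)

# Share of samples in the smallest decade, [1e-5, 1e-4)
frac_log = np.mean(learning_rates < 1e-4)
frac_linear = np.mean(linear_rates < 1e-4)

print(f"log-scale sampling: {frac_log:.3f} of samples below 1e-4")
print(f"linear sampling:    {frac_linear:.4f} of samples below 1e-4")
```

With log-scale sampling roughly a quarter of the draws land in each of the four decades, while linear sampling almost never tries a learning rate below 1e-4.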

The Exponential distribution describes waiting times: how long until the next model failure, the next retraining trigger, the next bug report. It's memoryless — the probability of waiting another hour doesn't depend on how long you've already waited.
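Memorylessness can be verified numerically. The sketch below assumes a mean waiting time of 4 hours (a hypothetical value) and compares the chance of waiting one more hour after already waiting three with the chance of waiting one hour from scratch.

```python
from scipy import stats

mean_wait_hours = 4.0           # hypothetical mean time between events
already_waited, extra = 3.0, 1.0

# sf(t) = P(T > t); scipy parameterizes the Exponential by scale = mean
p_conditional = (stats.expon.sf(already_waited + extra, scale=mean_wait_hours)
                 / stats.expon.sf(already_waited, scale=mean_wait_hours))
p_fresh = stats.expon.sf(extra, scale=mean_wait_hours)

print(f"P(T > 4h | T > 3h) = {p_conditional:.4f}")
print(f"P(T > 1h)          = {p_fresh:.4f}")
```

The two numbers match. If your real waiting times show "overdue" behavior, where a long wait makes the event more likely soon, the Exponential is the wrong model.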

The Log-Normal distribution emerges when a variable is the product of many independent factors rather than their sum. Training time across experiments often looks log-normal: most runs finish in 2–4 hours, but a few pathological cases run for 20 hours.
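The product-of-factors mechanism can also be simulated. This sketch assumes a 3-hour baseline multiplied by 40 independent slowdown or speedup factors drawn uniformly from [0.8, 1.3]. All of these values are invented for illustration and are not fitted to any real training logs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical model: training time = baseline * product of 40 random factors
factors = rng.uniform(0.8, 1.3, size=(5000, 40))
hours = 3.0 * factors.prod(axis=1)

# The raw times are right-skewed; their logarithm is a sum, so it looks Normal
skew_raw = stats.skew(hours)
skew_log = stats.skew(np.log(hours))

print(f"skewness of hours      = {skew_raw:.2f}")
print(f"skewness of log(hours) = {skew_log:.2f}")
```

Taking the log turns the product into a sum, which is why the skewness collapses toward zero on the log scale.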

Trace Table: Choosing a Distribution for Bug Counts

Suppose the team tracked 10 sprints: bug counts were {2, 4, 1, 3, 5, 2, 3, 4, 3, 2}.

| Decision Step | Question | Answer | Implication |
|---|---|---|---|
| Count or measure? | Are bugs integers? | Yes, whole numbers only | Use discrete distribution |
| Fixed trials or rate? | Do we know max possible bugs? | No upper bound | Poisson is appropriate |
| Estimate rate | What's the sample mean? | (2+4+1+3+5+2+3+4+3+2)/10 = 2.9 | λ = 2.9 bugs/sprint |
| Check equidispersion | Sample variance ≈ mean? | Variance ≈ 1.4, mean = 2.9 | Possible underdispersion, worth checking |
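The arithmetic in the last two rows can be reproduced in a few lines. The variance-to-mean ratio used here is a common informal check rather than a formal test, and with only ten sprints it is a noisy estimate.

```python
import numpy as np

bug_counts = np.array([2, 4, 1, 3, 5, 2, 3, 4, 3, 2])

lam_hat = bug_counts.mean()           # estimate of the Poisson rate
sample_var = bug_counts.var(ddof=1)   # unbiased sample variance (n - 1 divisor)
dispersion = sample_var / lam_hat     # about 1 for Poisson-like data

print(f"mean = {lam_hat:.2f}")               # mean = 2.90
print(f"variance = {sample_var:.2f}")        # variance = 1.43
print(f"variance/mean = {dispersion:.2f}")   # variance/mean = 0.49
```

A ratio well below 1 points toward underdispersion and a ratio well above 1 toward overdispersion. Either way, it is a prompt to look more closely rather than a verdict.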

Choosing the Right Distribution

First question: discrete or continuous? Counting bugs or clicks suggests Binomial or Poisson. Measuring accuracy or latency suggests Normal, Exponential, or Log-Normal.

Second question: what's the shape? Data bounded between 0 and 1 with no natural structure often suggests Uniform or Beta. Data that clusters near zero with a long right tail suggests Poisson or Exponential. Symmetric bell-shaped data suggests Normal. Right-skewed continuous data that must be positive suggests Log-Normal.

Third question: what generates the data? Independent binary trials → Binomial. Rare events in a fixed interval → Poisson. Waiting time between independent events → Exponential. Products of many factors → Log-Normal.

For our two anchors: bug counts per sprint map to Poisson(λ ≈ 2.9), and accuracy scores across experiments — if they're clustered symmetrically around a mean — map to Normal(μ, σ²).

How Distributions Connect

Binomial approaches Normal when n is large and p is not too close to 0 or 1. Poisson approaches Normal when λ is large. Log-Normal is what you get when you exponentiate a Normal variable. These connections matter practically: when n is large enough, the simpler approximation works, and you get closed-form calculations instead of sums over binomial coefficients.
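As an illustration of the first connection, the sketch below compares an exact Binomial tail with its Normal approximation. The values n = 1000, p = 0.3, and the cutoff 310 are arbitrary, and the 0.5 shift is the usual continuity correction for approximating a discrete variable with a continuous one.

```python
import numpy as np
from scipy import stats

n, p = 1000, 0.3   # hypothetical trial count and success probability
cutoff = 310

# Exact: P(X <= 310) under Binomial(n, p)
exact = stats.binom.cdf(cutoff, n, p)

# Approximation: Normal with matching mean and standard deviation,
# evaluated at cutoff + 0.5 (continuity correction)
mu = n * p
sigma = np.sqrt(n * p * (1 - p))
approx = stats.norm.cdf(cutoff + 0.5, loc=mu, scale=sigma)

print(f"exact Binomial       = {exact:.4f}")
print(f"Normal approximation = {approx:.4f}")
```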

Binomial and Poisson connect when n is large, p is small, and np is roughly constant — this is the Poisson approximation to Binomial, useful when you know the rate but not the exact trial count.
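The Poisson approximation can be checked the same way. This sketch uses n = 2000 and p = 0.001, hypothetical values picked so that n is large and p is small, and compares the two PMFs point by point over the first several counts.

```python
import numpy as np
from scipy import stats

n, p = 2000, 0.001   # hypothetical: many trials, small success probability
lam = n * p          # matching Poisson rate

k = np.arange(0, 11)
binom_pmf = stats.binom.pmf(k, n, p)
poisson_pmf = stats.poisson.pmf(k, lam)

# Largest pointwise gap between the two PMFs over k = 0..10
max_gap = np.max(np.abs(binom_pmf - poisson_pmf))

for ki, b, q in zip(k[:5], binom_pmf[:5], poisson_pmf[:5]):
    print(f"k={ki}: binomial={b:.5f}, poisson={q:.5f}")
print(f"max gap over k=0..10: {max_gap:.2e}")
```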

Python Implementation

```python
from scipy import stats

# Discrete: bug counts per sprint, Poisson(lambda=2.9)
lambda_bugs = 2.9
for k in range(7):
    prob = stats.poisson.pmf(k, lambda_bugs)
    print(f"P(bugs={k}) = {prob:.4f}")

# Continuous: accuracy scores, Normal(mu=0.847, sigma=0.031)
accuracy_mu, accuracy_sigma = 0.847, 0.031
# sf is the survival function, 1 - cdf: P(accuracy > 0.88)
p_above_threshold = stats.norm.sf(0.88, accuracy_mu, accuracy_sigma)
print(f"\nP(accuracy > 0.88) = {p_above_threshold:.4f}")
# ppf is the inverse CDF: the accuracy value below which 90% of runs fall
print(f"90th percentile accuracy = {stats.norm.ppf(0.90, accuracy_mu, accuracy_sigma):.4f}")
```

```
P(bugs=0) = 0.0550
P(bugs=1) = 0.1596
P(bugs=2) = 0.2314
P(bugs=3) = 0.2237
P(bugs=4) = 0.1622
P(bugs=5) = 0.0940
P(bugs=6) = 0.0455

P(accuracy > 0.88) = 0.1435
90th percentile accuracy = 0.8867
```

Understanding discrete vs continuous distributions depends on the PMF, PDF, and CDF concepts from the previous post — those are the tools you use once you've chosen a family. The posts that follow each examine a single distribution in depth: Bernoulli, Binomial, and Poisson for the discrete family; Normal, Uniform, Log-Normal, and Pareto for the continuous family. The choice you make at this branching point determines which assumptions are baked into your model, which approximations are valid, and which diagnostics you should run to check the fit.

Honest Limitations

All of this assumes your data fits a single, standard distribution. Real data is messier: multiple peaks, heavy tails, boundary effects, or skewness that doesn't match any standard family. A dataset of accuracy scores from early training runs might be bimodal (some models converged, some didn't). Bug counts might be zero-inflated (more sprints with no bugs at all than a Poisson with the same mean would predict).
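One quick way to spot zero-inflation is to compare the observed share of zeros with the share a Poisson with the same mean would predict. The counts below are invented for illustration and are not the team's data.

```python
import numpy as np
from scipy import stats

# Hypothetical bug counts with many empty sprints
counts = np.array([0, 0, 0, 0, 5, 0, 6, 0, 4, 0, 7, 0])

observed_zero_frac = np.mean(counts == 0)
# Under Poisson(mean), P(X = 0) = exp(-mean)
expected_zero_frac = stats.poisson.pmf(0, counts.mean())

print(f"observed share of zeros         = {observed_zero_frac:.3f}")
print(f"Poisson-predicted share of zeros = {expected_zero_frac:.3f}")
```

A large gap between the two suggests a plain Poisson is missing something, for example a separate process that decides whether any bugs are filed at all.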

Before fitting any distribution, plot your data. A histogram and empirical CDF will tell you far more than theoretical arguments.
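The empirical CDF is simple to compute by hand, which makes it easy to set next to a fitted model. This sketch uses the ten sprint counts from the trace table and a Poisson with the sample mean as its rate. It only prints the numbers, and the same arrays could be passed to a plotting library of your choice.

```python
import numpy as np
from scipy import stats

bug_counts = np.array([2, 4, 1, 3, 5, 2, 3, 4, 3, 2])

# Empirical CDF: fraction of observations <= each distinct value
values, freq = np.unique(bug_counts, return_counts=True)
ecdf = np.cumsum(freq) / freq.sum()

# Fitted CDF at the same points, using the sample mean as the Poisson rate
fitted_cdf = stats.poisson.cdf(values, bug_counts.mean())

for v, e, f in zip(values, ecdf, fitted_cdf):
    print(f"x={v}: empirical={e:.2f}, Poisson fit={f:.2f}")

# Largest vertical gap between the empirical and fitted CDFs
max_gap = np.max(np.abs(ecdf - fitted_cdf))
print(f"largest gap = {max_gap:.3f}")
```

With so few observations the gap is only a rough signal, but a consistent pattern in where the two curves disagree often says more than a single summary statistic.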

Test Your Understanding

  1. Your team tracks model inference failures per hour. You observe values {0, 0, 1, 0, 3, 0, 1, 2, 0, 1}. Which distribution family does this suggest, and what parameter would you estimate from this data?

  2. Accuracy scores from 50 training runs have mean 0.832 and standard deviation 0.028. What distribution would you first try to fit? What diagnostic would you run to check if the fit is reasonable?

  3. A colleague says "our response time is Normally distributed because it's a continuous variable." What's wrong with this reasoning, and what would you actually check?

  4. The Binomial(n=1000, p=0.003) should behave like Poisson(λ) for some λ. What is λ? Under what conditions does this approximation work well?

  5. Training time for a new architecture is always positive and right-skewed — most runs take 3–5 hours but some take 20+. Which continuous distribution is worth trying first, and why?
