Back to blog
← View series: statistics

~/blog

Gamma Distribution

Apr 14, 202611 min readBy Mohammed Vasim
StatisticsMathData Science

The Exponential distribution models the time until the first event. When you need the time until the k-th event — the total duration of k consecutive training jobs, or the wait for k errors to accumulate — you need the Gamma distribution. It is the direct generalization of the Exponential and the foundation for the Chi-squared and Erlang distributions.

Motivation: Sum of Exponentials

Suppose each training job takes a random amount of time T_i ~ Exponential(λ=0.5 jobs/hour) — meaning an average of 1/λ = 2 hours per job. You want to know the total time for 5 consecutive jobs:

T = T₁ + T₂ + T₃ + T₄ + T₅

Each T_i independently follows the Exponential distribution. Their sum follows the Gamma distribution:

T ~ Gamma(k=5, λ=0.5)

More generally: if T₁, T₂, ..., T_k ~ Exponential(λ) independently, then T₁+...+T_k ~ Gamma(k, λ).

5 training jobs → total time ~ Gamma(5, 0.5) T₁ T₂ T₃ T₄ T₅ T = T₁+T₂+T₃+T₄+T₅ ~ Gamma(5, λ=0.5) E[T]=10h ~Exp(0.5)

Parameter Conventions

The Gamma distribution has two common parameterizations. Always know which is in use:

ConventionParametersMeanVarianceUsed in
Shape-Ratek, λ (rate)k/λk/λ²Bayesian statistics, Poisson conjugacy
Shape-Scaleα, θ=1/λ (scale)αθαθ²Engineering, scipy.stats.gamma

scipy uses Shape-Scale: scipy.stats.gamma(a=k, scale=θ) where θ = 1/λ.

Anchor: k=5, λ=0.5, so θ = 1/0.5 = 2 hours per job. Use scipy.stats.gamma(a=5, scale=2).

The Gamma Function

The PDF involves Γ(k) — the Gamma function, which generalizes factorial to non-integers:

Γ(k+1) = k × Γ(k) (recurrence relation)

Γ(n) = (n−1)! for positive integers

Γ(1/2) = √π (special value)

kΓ(k)
1Γ(1) = 1 = 0!
2Γ(2) = 1 = 1!
3Γ(3) = 2 = 2!
4Γ(4) = 6 = 3!
5Γ(5) = 24 = 4!

For the anchor: Γ(5) = 24. This is the normalization constant in the PDF denominator.

PDF

f(x; k, λ) = λ^k × x^{k−1} × e^{−λx} / Γ(k) for x ≥ 0

Unpacking each component:

  • λ^k / Γ(k): normalization constant ensuring ∫₀^∞ f(x)dx = 1
  • x^{k−1}: polynomial factor — at k=1 this is x⁰=1 (no polynomial), at k=5 it is x⁴ (creates a peak)
  • e^{−λx}: exponential decay ensuring the integral is finite

Connection to Exponential: when k=1, x^{k−1}=1 and Γ(1)=1, so f(x) = λe^{−λx} — exactly the Exponential PDF. Gamma(1, λ) = Exponential(λ).

Compute f(10) on anchor (k=5, λ=0.5):

f(10) = 0.5⁵ × 10⁴ × e^{−5} / 24 = 0.03125 × 10000 × 0.00674 / 24 = 312.5 × 0.00674 / 24 ≈ 0.0878

Compute f(15):

f(15) = 0.03125 × 15⁴ × e^{−7.5} / 24 = 0.03125 × 50625 × 0.000553 / 24 ≈ 0.0366

Gamma PDF — convergence from Exponential to near-Normal f(x) x (h) 0 10 20 30 40 50 0.5 0 k=1 (Exp) k=2 k=5 (anchor, peak≈8h) k=20 (≈Normal) f(10)=0.088 As k increases, shape becomes symmetric — CLT (sum of k IID terms)

CDF

No closed form — uses the regularized incomplete Gamma function:

F(x; k, λ) = γ(k, λx) / Γ(k)

where γ(k, λx) = ∫₀^{λx} t^{k−1}e^{−t}dt is the lower incomplete Gamma function. Computed numerically in scipy.

Anchor queries (k=5, λ=0.5, scale=2):

QueryResultInterpretation
P(T ≤ 10)0.559556% chance all 5 jobs finish within 10 hours
P(T > 12)0.285129% chance it takes more than 12 hours
Median9.67 hHalf the time, all 5 jobs finish within 9.67 h
95th percentile16.81 hNeed to budget 16.8 h to cover 95% of scenarios

Mean, Variance, and Skewness

Mean: E[T] = k/λ = k × (mean of Exponential)

Derivation from sum structure: T is the sum of k independent Exponentials, each with mean 1/λ. By linearity of expectation:

E[T] = E[T₁] + ... + E[T_k] = k × (1/λ) = k/λ

Anchor: E[T] = 5/0.5 = 10 hours. This is the peak of the mode? No — the mode is at (k−1)/λ = 4/0.5 = 8 hours. The mean is pulled right by the positive tail.

Variance: Var(T) = k/λ²

Derivation from independence: variance of a sum of independent variables is the sum of variances:

Var(T) = Var(T₁) + ... + Var(T_k) = k × (1/λ²) = k/λ²

Anchor: Var(T) = 5/0.25 = 20, SD = √20 ≈ 4.47 hours

Skewness: 2/√k

Anchor (k=5): skewness = 2/√5 ≈ 0.894 — moderately right-skewed.

For k=20: skewness = 2/√20 ≈ 0.447 — nearly symmetric.

As k → ∞: skewness → 0. This is the CLT in action — Gamma(k, λ) is the sum of k IID Exponentials, so it converges to Normal(k/λ, k/λ²) by the CLT. The Gamma PDF panels above show this convergence.

Special Cases

ParametersSpecial distributionRelationship
Gamma(1, λ)Exponential(λ)Simplest case: single event
Gamma(k, 1/2)Chi-squared(2k)Chi-sq is a scaled Gamma
Gamma(k/2, 1/2)Chi-squared(k)Standard Chi-sq parameterization
Gamma(k=integer, λ)Erlang(k, λ)Integer-shape special case

Chi-squared connection (preview for spec 34): X₁², X₂², ..., X_k² where Xᵢ ~ N(0,1). Their sum X₁²+...+X_k² ~ χ²(k). This χ²(k) distribution is exactly Gamma(k/2, 1/2). Equivalently: Gamma(α, β) = (2/λ) × χ²(2α) in distribution. The Gamma is the parent distribution of the Chi-squared.

Erlang connection: the Erlang distribution is Gamma with integer shape, used in queuing theory. For our anchor, T ~ Erlang(5, 0.5) — an integer number of processing stages.

Gamma Function Deep Dive

Three properties worth knowing:

  1. Recurrence: Γ(k+1) = k × Γ(k). This lets you compute any Γ by stepping down to Γ(1)=1.

  2. Non-integer values: Γ(1/2) = √π ≈ 1.772. This is what appears in the Normal distribution's normalization constant: 1/√(2π) comes from the Gamma function evaluated at 1/2.

  3. Stirling's approximation: log Γ(k) ≈ k log k − k for large k — used when computing the PDF in log space to avoid overflow.

For the anchor normalization: Γ(5) = 4! = 24. The PDF integrates to:

∫₀^∞ λ^k x^{k−1} e^{−λx}/Γ(k) dx = (λ^k/Γ(k)) × Γ(k)/λ^k = 1 ✓

Conjugate Prior for Poisson Rate

The Gamma distribution is the conjugate prior for the Poisson rate parameter λ:

Prior: λ ~ Gamma(α, β) Likelihood: X₁, ..., X_n ~ Poisson(λ) — observing event counts Posterior: λ | data ~ Gamma(α + Σxᵢ, β + n)

Update rule (using shape-rate parameterization):

  • α_posterior = α_prior + total observed events
  • β_posterior = β_prior + number of time periods observed
  • Posterior mean = α_posterior / β_posterior

Concrete example: ML training cluster has an error rate. Prior belief: λ ~ Gamma(3, 1) — like having seen 3 errors in 1 hour (mean = 3 errors/hour). Observe 15 errors over 4 hours.

Posterior: λ | data ~ Gamma(3+15, 1+4) = Gamma(18, 5)

Posterior mean = 18/5 = 3.6 errors/hour

Compare to frequentist estimate: λ̂ = total errors / total time = 15/4 = 3.75 errors/hour. The Bayesian estimate shrinks toward the prior mean of 3.0, weighted by the relative strength of prior vs data.

Code

python
from scipy import stats
import numpy as np

# Anchor: k=5 jobs, lambda=0.5 jobs/hour → scale=2 hours/job
k = 5
lam = 0.5
scale = 1 / lam  # = 2.0

dist = stats.gamma(a=k, scale=scale)

# Mean, variance, skewness
mean_theory = k / lam
var_theory   = k / lam**2
skewness     = 2 / np.sqrt(k)
print(f"Gamma({k}, λ={lam})  →  scale={scale}")
print(f"Mean:     {mean_theory:.2f} h  (E[T]=k/λ)")
print(f"Variance: {var_theory:.2f}     (Var=k/λ²)  SD={var_theory**0.5:.4f}")
print(f"Skewness: {skewness:.4f}  (= 2/√k)")

# CDF queries
print(f"\nP(T ≤ 10) = {dist.cdf(10):.4f}")
print(f"P(T > 12) = {dist.sf(12):.4f}")
print(f"Median:    {dist.median():.2f} h")
print(f"95th pct:  {dist.ppf(0.95):.2f} h")

# PDF at specific points
print(f"\nf(10) = {dist.pdf(10):.4f}")
print(f"f(15) = {dist.pdf(15):.4f}")

# Special case: Gamma(1, 0.5) = Exponential(0.5)
exp_via_gamma = stats.gamma(a=1, scale=2)
exp_direct    = stats.expon(scale=2)
print(f"\nGamma(1,2) CDF at x=4: {exp_via_gamma.cdf(4):.4f}")
print(f"Expon(2)   CDF at x=4: {exp_direct.cdf(4):.4f}  (should match)")

# Conjugate prior for Poisson rate
alpha_prior, beta_prior = 3, 1   # Gamma(3,1) prior (shape-rate)
observed_events, periods = 15, 4
alpha_post = alpha_prior + observed_events
beta_post  = beta_prior  + periods
print(f"\nPoisson conjugacy:")
print(f"Prior: Gamma({alpha_prior},{beta_prior}) → mean rate = {alpha_prior/beta_prior:.1f} errors/h")
print(f"After {observed_events} errors in {periods} hours:")
print(f"Posterior: Gamma({alpha_post},{beta_post}) → mean rate = {alpha_post/beta_post:.2f} errors/h")
print(f"Frequentist MLE: {observed_events/periods:.2f} errors/h")
Gamma(5, λ=0.5) → scale=2.0 Mean: 10.00 h (E[T]=k/λ) Variance: 20.00 (Var=k/λ²) SD=4.4721 Skewness: 0.8944 (= 2/√k) P(T ≤ 10) = 0.5595 P(T > 12) = 0.2851 Median: 9.67 h 95th pct: 16.81 h f(10) = 0.0878 f(15) = 0.0366 Gamma(1,2) CDF at x=4: 0.8647 Expon(2) CDF at x=4: 0.8647 (should match) Poisson conjugacy: Prior: Gamma(3,1) → mean rate = 3.0 errors/h After 15 errors in 4 hours: Posterior: Gamma(18,5) → mean rate = 3.60 errors/h Frequentist MLE: 3.75 errors/h

ML Applications

1 — Training pipeline scheduling. When k pipeline stages (data preprocessing, training, evaluation, upload) each take Exponential time, the total job duration is Gamma distributed. This lets you compute P(pipeline finishes within SLA) and set timeout policies.

2 — Bayesian Poisson rate estimation. Error rates, click rates, and API call rates are Poisson processes. The Gamma prior updates analytically as new observations arrive — useful in online monitoring systems that must update beliefs without reprocessing all historical data.

3 — Erlang queuing models. In M/Erlang queues, service times follow Erlang(k, λ) = Gamma(k, λ) with integer k. This models services that require k sequential phases, each Exponential — more realistic than a single Exponential when there are multiple processing stages (e.g., inference with preprocessing + model forward pass + post-processing).

4 — Gamma priors in Bayesian regression. In Bayesian linear regression, placing a Gamma prior on the precision (inverse variance) τ = 1/σ² yields a Gamma-Normal conjugate model. The posterior for τ is also Gamma, giving closed-form expressions for the posterior predictive distribution.

  • Exponential distribution: Gamma(1, λ) — the single-event special case
  • Chi-squared distribution: Gamma(k/2, 1/2) — used for sum of squared standard Normal variables
  • Beta distribution: like the Beta, the Gamma appears as a conjugate prior in Bayesian inference, but for rate rather than proportion parameters
  • Poisson distribution: the Gamma is the conjugate prior for the Poisson rate parameter

Limitations

  • Sum-of-Exponentials interpretation requires i.i.d.: the Gamma(k, λ) decomposition assumes all k stages have the same rate λ. If job times have different rates (pre-processing is faster than training), the sum is a hypo-exponential distribution, not Gamma.
  • Integer k restriction for Erlang: for queuing applications, only integer k makes sense physically. Non-integer α shapes are mathematical but have no physical sum-of-Exponentials interpretation.
  • scipy parameterization trap: scipy.stats.gamma(a=5, scale=2) uses scale (not rate). Always verify: dist.mean() should equal k × scale = 5 × 2 = 10. If it doesn't, the parameters are swapped.

Test Your Understanding

  1. Five data processing jobs have Exponential(λ=0.5) durations. What is the probability that all five finish within 8 hours? Use the CDF formula and state what scipy call you would use.

  2. Compare Gamma(5, λ=0.5) with Gamma(20, λ=0.5). Both have the same λ but different k. Compute the mean, variance, and skewness for each. How does doubling k from 10 to 20 affect the coefficient of variation (CV = SD/mean)?

  3. Prove that Gamma(1, λ) = Exponential(λ) by substituting k=1 into the Gamma PDF formula and simplifying. What values do Γ(1) and x^{k−1} take?

  4. The Poisson conjugacy example used a Gamma(3, 1) prior and observed 15 errors in 4 hours. What would the posterior mean be if you used an uninformative prior Gamma(0.001, 0.001) instead? As n → ∞, does the prior choice matter?

  5. You are told a service's end-to-end latency is Gamma(k=3, λ=2) milliseconds (three sequential Exponential stages, each with mean 0.5 ms). Your SLA requires P(latency > 3 ms) < 0.05. Does this service meet the SLA? Compute the answer using the survival function.

Comments (0)

No comments yet. Be the first to comment!

Leave a comment