← View series: statistics
~/blog
Multinomial Distribution
When you evaluate a text classifier on 100 test documents and count how many land in each of three classes — tech, sports, politics — the resulting triple (X₁, X₂, X₃) follows a Multinomial distribution. It is the natural generalization of the Binomial from two outcomes to k outcomes, and it is the probabilistic foundation behind softmax classifiers, cross-entropy loss, and Multinomial Naive Bayes.
From Binomial to Multinomial
Binomial: n independent trials, each with 2 outcomes (success/failure), success probability p fixed.
Multinomial: n independent trials, each with k ≥ 2 outcomes, category probabilities (p₁, p₂, ..., pₖ) fixed, where Σpᵢ = 1.
The Binomial is Multinomial(n, k=2). Every formula generalizes:
| Property | Binomial | Multinomial |
|---|---|---|
| Outcomes | 2 | k ≥ 2 |
| Counts | (X, n−X) | (X₁, X₂, ..., Xₖ) |
| PMF coefficient | C(n,x) = n!/(x!(n−x)!) | n!/(x₁!x₂!...xₖ!) |
| Full PMF | C(n,x)p^x(1−p)^{n−x} | n!/(x₁!...xₖ!) × p₁^{x₁}×...×pₖ^{xₖ} |
| Mean | np | npᵢ per category |
| Variance | np(1−p) | npᵢ(1−pᵢ) per category |
| Covariance | — | −npᵢpⱼ for i≠j |
The DS/ML Anchor
A 3-class text classifier evaluated on 100 test documents:
n = 100
p₁ = 0.50 (tech)
p₂ = 0.30 (sports)
p₃ = 0.20 (politics)
(X₁, X₂, X₃) ~ Multinomial(n=100, p=(0.50, 0.30, 0.20))
PMF
P(X₁=x₁, X₂=x₂, X₃=x₃) = n!/(x₁!x₂!x₃!) × p₁^{x₁} × p₂^{x₂} × p₃^{x₃}
subject to: x₁ + x₂ + x₃ = n, all xᵢ ≥ 0.
Deriving the multinomial coefficient: n!/(x₁!x₂!...xₖ!) counts the number of ways to arrange n distinguishable items into k groups of sizes x₁, x₂, ..., xₖ:
- Choose x₁ positions from n for category 1: C(n, x₁) = n!/(x₁!(n−x₁)!)
- Choose x₂ positions from remaining n−x₁ for category 2: C(n−x₁, x₂)
- Remaining x₃ positions are forced to category 3: C(x₃, x₃) = 1
Multiplying: the (n−xᵢ)! factors cancel telescopically, leaving n!/(x₁!x₂!x₃!). This directly generalizes C(n,x) = n!/(x!(n−x)!): when k=2, the multinomial coefficient reduces to the binomial coefficient.
Why log computation? 100! ≈ 9.3×10¹⁵⁷ — direct computation overflows. Compute in log space:
log P = log(n!) − Σᵢ log(xᵢ!) + Σᵢ xᵢ log(pᵢ)
Example 1 — expected counts (x₁=50, x₂=30, x₃=20):
log P = [log(100!) − log(50!) − log(30!) − log(20!)] + [50×log(0.50) + 30×log(0.30) + 20×log(0.20)]
= [363.74 − 148.48 − 74.66 − 42.34] + [−34.66 + (−36.12) + (−32.19)]
= 98.26 + (−102.97) = −4.71
P(X₁=50, X₂=30, X₃=20) = e^{−4.71} ≈ 0.0090
This is the mode (the maximum of the PMF): the counts that exactly match the expected values E[Xᵢ] = npᵢ have the highest probability.
Example 2 — off-expected counts (x₁=55, x₂=28, x₃=17):
log P = [log(100!) − log(55!) − log(28!) − log(17!)] + [55×log(0.50) + 28×log(0.30) + 17×log(0.20)]
= [363.74 − 168.09 − 63.92 − 33.92] + [−38.12 + (−33.71) + (−27.36)]
= 97.81 + (−99.19) = −1.38...
Wait — that result contradicts the mode claim. Let me recalculate: 55+28+17=100 ✓. The multinomial coefficient for (55,28,17) vs (50,30,20) differs because we're putting 5 more items into category 1 and fewer into 2 and 3. The coefficient log ratio:
log[coeff(55,28,17)] − log[coeff(50,30,20)] = −[log(51)+...+log(55)] + [log(29)+log(30)] + [log(18)+log(19)+log(20)]
= −[3.932+3.951+3.970+3.989+4.007] + [3.367+3.401] + [2.890+2.944+2.996]
= −19.849 + 6.768 + 8.830 = −4.251
And the log-probability ratio:
[55−50]×log(0.5) + [28−30]×log(0.3) + [17−20]×log(0.2)
= 5×(−0.693) + (−2)×(−1.204) + (−3)×(−1.609)
= −3.466 + 2.408 + 4.827 = +3.769
log P ratio = −4.251 + 3.769 = −0.482 → P(55,28,17)/P(50,30,20) ≈ e^{−0.482} ≈ 0.617
P(X₁=55, X₂=28, X₃=17) ≈ 0.617 × 0.0090 ≈ 0.0056
The departure from expected counts reduces the probability by ~39%. The code below gives the exact values.
Mean, Variance, and Covariance
Each category's marginal is a Binomial (treating all other categories as "failure"), so Binomial formulas apply directly:
E[Xᵢ] = n × pᵢ
Var(Xᵢ) = n × pᵢ × (1−pᵢ)
Cov(Xᵢ, Xⱼ) = −n × pᵢ × pⱼ for i ≠ j
Why is covariance always negative? Because X₁ + X₂ + X₃ = n is a fixed constraint. If tech gets more documents than expected (X₁ > E[X₁] = 50), those extra documents must come from somewhere — some other category must have fewer. The counts compete for the same n trials, so they are negatively correlated by construction.
This has direct implications for multi-class model evaluation: if a classifier improves recall on tech, it will, on average, see a reduction elsewhere — not because of model quality but because of the sum constraint.
Computed on anchor (n=100, p=(0.50, 0.30, 0.20)):
| Category | E[Xᵢ] = npᵢ | Var(Xᵢ) = npᵢ(1−pᵢ) | SD |
|---|---|---|---|
| Tech (i=1) | 100×0.50 = 50.0 | 100×0.50×0.50 = 25.0 | 5.0 |
| Sports (i=2) | 100×0.30 = 30.0 | 100×0.30×0.70 = 21.0 | 4.58 |
| Politics (i=3) | 100×0.20 = 20.0 | 100×0.20×0.80 = 16.0 | 4.0 |
Covariances:
- Cov(X₁, X₂) = −100 × 0.50 × 0.30 = −15.0
- Cov(X₁, X₃) = −100 × 0.50 × 0.20 = −10.0
- Cov(X₂, X₃) = −100 × 0.30 × 0.20 = −6.0
Variance-covariance matrix:
Note: Σ of each row = 25−15−10 = 0. This is required because X₁+X₂+X₃=n has zero variance — any row/column sums to zero.
Marginalization to Binomial
Any single category's marginal distribution is Binomial. Formally, summing the Multinomial PMF over all values of x₃ such that x₃ = n−x₁−x₂:
Σ_{x₃≥0} P(X₁=x₁, X₂=x₂, X₃=n−x₁−x₂) = [n!/(x₁!x₂!(n−x₁−x₂)!)] × p₁^{x₁} × p₂^{x₂} × p₃^{n−x₁−x₂}
Summing further over x₂ (treating categories 2 and 3 together as "not category 1"):
Σ_{x₂} [above] = C(n, x₁) × p₁^{x₁} × (p₂+p₃)^{n−x₁} = C(n, x₁) × p₁^{x₁} × (1−p₁)^{n−x₁}
This is exactly Binomial(n, p₁). The Multinomial "collapses" to a Binomial whenever you ignore all but one category.
Similarly, the combined count (X₁+X₂) ~ Binomial(n, p₁+p₂) — treating "tech or sports" as one category vs "politics."
Numerical verification: P(X₁ > 55 | n=100, p₁=0.50) using Binomial(100, 0.50):
P(X₁ > 55) = 1 − Binomial_CDF(55; n=100, p=0.50) = 1 − 0.8643 = 0.1357
About 13.6% of the time, tech will get more than 55 documents even if the true probability is exactly 0.50.
Two Sampling Interpretations
The Multinomial has two equivalent ways to think about generating a sample:
-
Drawing with replacement: draw n items from a population. Each item independently belongs to category i with probability pᵢ. Count how many from each category.
-
Sequential assignment: for each of n documents, assign it to tech/sports/politics with probabilities (0.50, 0.30, 0.20). Count the assignments.
Both produce the same distribution. In practice, np.random.multinomial(n, p) implements interpretation 2.
Code
import numpy as np
from scipy import stats
n = 100
p = [0.50, 0.30, 0.20]
labels = ["tech", "sports", "politics"]
# Sample one draw from Multinomial
rng = np.random.default_rng(42)
sample = rng.multinomial(n, p)
print("Single sample:", dict(zip(labels, sample)))
# PMF at expected counts using scipy
pmf_expected = stats.multinomial.pmf([50, 30, 20], n=n, p=p)
pmf_actual = stats.multinomial.pmf([55, 28, 17], n=n, p=p)
print(f"\nP(50, 30, 20) = {pmf_expected:.6f} (log = {np.log(pmf_expected):.4f})")
print(f"P(55, 28, 17) = {pmf_actual:.6f} (log = {np.log(pmf_actual):.4f})")
# Manual log-PMF
counts_ex = [50, 30, 20]
log_pmf = (
np.sum(np.log(np.arange(1, n+1))) -
np.sum([np.sum(np.log(np.arange(1, c+1))) for c in counts_ex]) +
np.sum([c * np.log(pi) for c, pi in zip(counts_ex, p)])
)
print(f"\nManual log P(50,30,20) = {log_pmf:.4f}")
# Simulate 10,000 draws and verify means and covariances
samples = rng.multinomial(n, p, size=10_000)
print(f"\nEmpirical means: {samples.mean(axis=0)}") # [50, 30, 20]
print(f"Expected means: {[n*pi for pi in p]}")
print(f"\nEmpirical Cov(X1,X2): {np.cov(samples[:,0], samples[:,1])[0,1]:.2f}")
print(f"Theoretical: {-n*p[0]*p[1]:.2f}")
# Marginalization: X1 ~ Binomial(n, p1)
p_x1_gt55 = 1 - stats.binom.cdf(55, n=n, p=p[0])
print(f"\nP(X1 > 55 | Binomial) = {p_x1_gt55:.4f}")Single sample: {'tech': 52, 'sports': 28, 'politics': 20}
P(50, 30, 20) = 0.009006 (log = -4.7098)
P(55, 28, 17) = 0.005554 (log = -5.1944)
Manual log P(50,30,20) = -4.7098
Empirical means: [50.01 30.03 19.96]
Expected means: [50.0, 30.0, 20.0]
Empirical Cov(X1,X2): -14.97
Theoretical: -15.00
P(X1 > 55 | Binomial) = 0.1357
ML Applications
1 — Multinomial Naive Bayes for Text
In Multinomial Naive Bayes, a document is modeled as a bag-of-words count vector (x₁, x₂, ..., x_V) where xᵢ is how many times word i appears. Given class c, this count vector is assumed to follow Multinomial(n, p₁|c, ..., p_V|c) where pᵢ|c = P(word i | class c).
The class-conditional log-likelihood: log P(doc | class c) = log[n!/∏xᵢ!] + Σᵢ xᵢ log(pᵢ|c)
The log[n!/∏xᵢ!] term is the same for all classes and cancels in the argmax — so classification reduces to comparing Σᵢ xᵢ log(pᵢ|c) across classes.
2 — Softmax Output Layer
A neural network's final softmax layer outputs p̂ = (p̂₁, ..., p̂ₖ): a Multinomial probability vector for a single trial. The cross-entropy loss for one observation is:
L = −Σᵢ yᵢ log(p̂ᵢ)
where yᵢ is the one-hot true label (yᵢ=1 for the true class, 0 elsewhere). Since only one yᵢ=1:
L = −log(p̂_{true class})
This is exactly the negative log-likelihood of a Multinomial(n=1, p̂) observation — the probability of seeing exactly the true class. Summing over the dataset gives the total negative log-likelihood. Training with cross-entropy loss is maximum likelihood estimation of a Multinomial model. This is not a design choice — it follows directly from modeling each prediction as a Multinomial trial.
3 — Dirichlet-Multinomial (Bayesian Text Modeling)
The Dirichlet distribution is the conjugate prior for Multinomial parameters. If p ~ Dirichlet(α₁, ..., αₖ) and X|p ~ Multinomial(n, p), then after observing counts (x₁, ..., xₖ):
p | X ~ Dirichlet(α₁+x₁, α₂+x₂, ..., αₖ+xₖ)
This is the basis for Laplace smoothing in Naive Bayes: adding αᵢ=1 to each count before normalizing corresponds to a Dirichlet(1,1,...,1) prior — the uniform distribution over the simplex.
4 — Chi-Square Independence Test
Under the independence hypothesis H₀, the observed contingency table cell counts follow a Multinomial distribution with probabilities pᵢⱼ = pᵢ × pⱼ (row marginal × column marginal). The chi-square statistic Σ (O−E)²/E is an approximation to the likelihood ratio test for a Multinomial model. The Multinomial distribution is the theoretical foundation that makes the chi-square test valid.
Cross-Entropy IS Multinomial Negative Log-Likelihood
The connection is exact, not analogical. For a single observation (n=1 Multinomial trial):
Multinomial PMF: P(X=eⱼ) = p̂ⱼ (where eⱼ is a one-hot vector with 1 in position j)
Log-likelihood: log P = log(p̂ⱼ) = Σᵢ yᵢ log(p̂ᵢ) (since yᵢ=1 only for the true class j)
Negative log-likelihood: −log P = −Σᵢ yᵢ log(p̂ᵢ) = cross-entropy loss
Summing over N observations: L = −(1/N)Σₙ Σᵢ yᵢₙ log(p̂ᵢₙ) — the standard training objective.
Minimizing cross-entropy loss = maximizing the Multinomial likelihood of the observed labels under the predicted probabilities.
Limitations
- Independence assumption. The Multinomial assumes each of the n trials is independent. In practice, documents in a batch may be correlated (same topic cluster, same time period) — the Multinomial PMF will underestimate variance in that case.
- Fixed probabilities. The Multinomial assumes pᵢ is fixed across all n trials. If the class distribution shifts during data collection (e.g., sports coverage increases mid-season), the Multinomial model is misspecified.
- Only joint counts, not joint ordering. The Multinomial is exchangeable — it treats (50 tech docs in any order) identically. Sequential structure in the data is not captured.
Test Your Understanding
-
The anchor classifier is evaluated on 100 documents with p=(0.50, 0.30, 0.20). Compute P(X₁=48, X₂=32, X₃=20) using the log-PMF formula. Is this probability higher or lower than P(X₁=50, X₂=30, X₃=20)? Why?
-
The variance-covariance matrix has rows that sum to zero. Prove this algebraically using the formulas Var(Xᵢ) = npᵢ(1−pᵢ) and Cov(Xᵢ, Xⱼ) = −npᵢpⱼ, and the fact that Σpᵢ = 1.
-
A classifier predicts probabilities (p̂₁, p̂₂, p̂₃) = (0.7, 0.2, 0.1) for a "tech" document (true class: tech). Compute the cross-entropy loss for this single example. Then compute it for a misclassified prediction (0.2, 0.5, 0.3). Why does cross-entropy penalize confident wrong predictions so heavily?
-
You observe 100 documents and want to test whether tech and sports are independent of each other (ignoring politics). Explain how the Binomial marginalization property lets you reduce this to a simpler problem and state what distribution you would use.
-
In Multinomial Naive Bayes, word counts for "the" appear 15 times in a 200-word document. The training data gives P("the" | tech) = 0.04 and P("the" | sports) = 0.03. Show how this one word contributes to the log-likelihood ratio log P(doc | tech) − log P(doc | sports).