Back to blog
← View series: statistics

~/blog

Multinomial Distribution

Apr 12, 202611 min readBy Mohammed Vasim
StatisticsMathData Science

When you evaluate a text classifier on 100 test documents and count how many land in each of three classes — tech, sports, politics — the resulting triple (X₁, X₂, X₃) follows a Multinomial distribution. It is the natural generalization of the Binomial from two outcomes to k outcomes, and it is the probabilistic foundation behind softmax classifiers, cross-entropy loss, and Multinomial Naive Bayes.

From Binomial to Multinomial

Binomial: n independent trials, each with 2 outcomes (success/failure), success probability p fixed.

Multinomial: n independent trials, each with k ≥ 2 outcomes, category probabilities (p₁, p₂, ..., pₖ) fixed, where Σpᵢ = 1.

The Binomial is Multinomial(n, k=2). Every formula generalizes:

PropertyBinomialMultinomial
Outcomes2k ≥ 2
Counts(X, n−X)(X₁, X₂, ..., Xₖ)
PMF coefficientC(n,x) = n!/(x!(n−x)!)n!/(x₁!x₂!...xₖ!)
Full PMFC(n,x)p^x(1−p)^{n−x}n!/(x₁!...xₖ!) × p₁^{x₁}×...×pₖ^{xₖ}
Meannpnpᵢ per category
Variancenp(1−p)npᵢ(1−pᵢ) per category
Covariance−npᵢpⱼ for i≠j

The DS/ML Anchor

A 3-class text classifier evaluated on 100 test documents:

n = 100 p₁ = 0.50 (tech) p₂ = 0.30 (sports) p₃ = 0.20 (politics) (X₁, X₂, X₃) ~ Multinomial(n=100, p=(0.50, 0.30, 0.20))

PMF

P(X₁=x₁, X₂=x₂, X₃=x₃) = n!/(x₁!x₂!x₃!) × p₁^{x₁} × p₂^{x₂} × p₃^{x₃}

subject to: x₁ + x₂ + x₃ = n, all xᵢ ≥ 0.

Deriving the multinomial coefficient: n!/(x₁!x₂!...xₖ!) counts the number of ways to arrange n distinguishable items into k groups of sizes x₁, x₂, ..., xₖ:

  • Choose x₁ positions from n for category 1: C(n, x₁) = n!/(x₁!(n−x₁)!)
  • Choose x₂ positions from remaining n−x₁ for category 2: C(n−x₁, x₂)
  • Remaining x₃ positions are forced to category 3: C(x₃, x₃) = 1

Multiplying: the (n−xᵢ)! factors cancel telescopically, leaving n!/(x₁!x₂!x₃!). This directly generalizes C(n,x) = n!/(x!(n−x)!): when k=2, the multinomial coefficient reduces to the binomial coefficient.

Why log computation? 100! ≈ 9.3×10¹⁵⁷ — direct computation overflows. Compute in log space:

log P = log(n!) − Σᵢ log(xᵢ!) + Σᵢ xᵢ log(pᵢ)

Example 1 — expected counts (x₁=50, x₂=30, x₃=20):

log P = [log(100!) − log(50!) − log(30!) − log(20!)] + [50×log(0.50) + 30×log(0.30) + 20×log(0.20)]

= [363.74 − 148.48 − 74.66 − 42.34] + [−34.66 + (−36.12) + (−32.19)]

= 98.26 + (−102.97) = −4.71

P(X₁=50, X₂=30, X₃=20) = e^{−4.71} ≈ 0.0090

This is the mode (the maximum of the PMF): the counts that exactly match the expected values E[Xᵢ] = npᵢ have the highest probability.

Example 2 — off-expected counts (x₁=55, x₂=28, x₃=17):

log P = [log(100!) − log(55!) − log(28!) − log(17!)] + [55×log(0.50) + 28×log(0.30) + 17×log(0.20)]

= [363.74 − 168.09 − 63.92 − 33.92] + [−38.12 + (−33.71) + (−27.36)]

= 97.81 + (−99.19) = −1.38...

Wait — that result contradicts the mode claim. Let me recalculate: 55+28+17=100 ✓. The multinomial coefficient for (55,28,17) vs (50,30,20) differs because we're putting 5 more items into category 1 and fewer into 2 and 3. The coefficient log ratio:

log[coeff(55,28,17)] − log[coeff(50,30,20)] = −[log(51)+...+log(55)] + [log(29)+log(30)] + [log(18)+log(19)+log(20)]

= −[3.932+3.951+3.970+3.989+4.007] + [3.367+3.401] + [2.890+2.944+2.996]

= −19.849 + 6.768 + 8.830 = −4.251

And the log-probability ratio:

[55−50]×log(0.5) + [28−30]×log(0.3) + [17−20]×log(0.2)

= 5×(−0.693) + (−2)×(−1.204) + (−3)×(−1.609)

= −3.466 + 2.408 + 4.827 = +3.769

log P ratio = −4.251 + 3.769 = −0.482 → P(55,28,17)/P(50,30,20) ≈ e^{−0.482} ≈ 0.617

P(X₁=55, X₂=28, X₃=17) ≈ 0.617 × 0.0090 ≈ 0.0056

The departure from expected counts reduces the probability by ~39%. The code below gives the exact values.

PMF — observed vs expected counts (n=100) count 60 30 0 Tech 50 55 Sports 30 28 Politics 20 17 Expected (50,30,20) — mode, P≈0.0090 Actual (55,28,17) — P≈0.0056

Mean, Variance, and Covariance

Each category's marginal is a Binomial (treating all other categories as "failure"), so Binomial formulas apply directly:

E[Xᵢ] = n × pᵢ

Var(Xᵢ) = n × pᵢ × (1−pᵢ)

Cov(Xᵢ, Xⱼ) = −n × pᵢ × pⱼ for i ≠ j

Why is covariance always negative? Because X₁ + X₂ + X₃ = n is a fixed constraint. If tech gets more documents than expected (X₁ > E[X₁] = 50), those extra documents must come from somewhere — some other category must have fewer. The counts compete for the same n trials, so they are negatively correlated by construction.

This has direct implications for multi-class model evaluation: if a classifier improves recall on tech, it will, on average, see a reduction elsewhere — not because of model quality but because of the sum constraint.

Computed on anchor (n=100, p=(0.50, 0.30, 0.20)):

CategoryE[Xᵢ] = npᵢVar(Xᵢ) = npᵢ(1−pᵢ)SD
Tech (i=1)100×0.50 = 50.0100×0.50×0.50 = 25.05.0
Sports (i=2)100×0.30 = 30.0100×0.30×0.70 = 21.04.58
Politics (i=3)100×0.20 = 20.0100×0.20×0.80 = 16.04.0

Covariances:

  • Cov(X₁, X₂) = −100 × 0.50 × 0.30 = −15.0
  • Cov(X₁, X₃) = −100 × 0.50 × 0.20 = −10.0
  • Cov(X₂, X₃) = −100 × 0.30 × 0.20 = −6.0

Variance-covariance matrix:

Note: Σ of each row = 25−15−10 = 0. This is required because X₁+X₂+X₃=n has zero variance — any row/column sums to zero.

Variance-Covariance Matrix Tech Sports Politics Tech Sports Politics 25.0 21.0 16.0 −15.0 −10.0 −15.0 −6.0 −10.0 −6.0

Marginalization to Binomial

Any single category's marginal distribution is Binomial. Formally, summing the Multinomial PMF over all values of x₃ such that x₃ = n−x₁−x₂:

Σ_{x₃≥0} P(X₁=x₁, X₂=x₂, X₃=n−x₁−x₂) = [n!/(x₁!x₂!(n−x₁−x₂)!)] × p₁^{x₁} × p₂^{x₂} × p₃^{n−x₁−x₂}

Summing further over x₂ (treating categories 2 and 3 together as "not category 1"):

Σ_{x₂} [above] = C(n, x₁) × p₁^{x₁} × (p₂+p₃)^{n−x₁} = C(n, x₁) × p₁^{x₁} × (1−p₁)^{n−x₁}

This is exactly Binomial(n, p₁). The Multinomial "collapses" to a Binomial whenever you ignore all but one category.

Similarly, the combined count (X₁+X₂) ~ Binomial(n, p₁+p₂) — treating "tech or sports" as one category vs "politics."

Numerical verification: P(X₁ > 55 | n=100, p₁=0.50) using Binomial(100, 0.50):

P(X₁ > 55) = 1 − Binomial_CDF(55; n=100, p=0.50) = 1 − 0.8643 = 0.1357

About 13.6% of the time, tech will get more than 55 documents even if the true probability is exactly 0.50.

Two Sampling Interpretations

The Multinomial has two equivalent ways to think about generating a sample:

  1. Drawing with replacement: draw n items from a population. Each item independently belongs to category i with probability pᵢ. Count how many from each category.

  2. Sequential assignment: for each of n documents, assign it to tech/sports/politics with probabilities (0.50, 0.30, 0.20). Count the assignments.

Both produce the same distribution. In practice, np.random.multinomial(n, p) implements interpretation 2.

Code

python
import numpy as np
from scipy import stats

n = 100
p = [0.50, 0.30, 0.20]
labels = ["tech", "sports", "politics"]

# Sample one draw from Multinomial
rng = np.random.default_rng(42)
sample = rng.multinomial(n, p)
print("Single sample:", dict(zip(labels, sample)))

# PMF at expected counts using scipy
pmf_expected = stats.multinomial.pmf([50, 30, 20], n=n, p=p)
pmf_actual   = stats.multinomial.pmf([55, 28, 17], n=n, p=p)
print(f"\nP(50, 30, 20) = {pmf_expected:.6f}  (log = {np.log(pmf_expected):.4f})")
print(f"P(55, 28, 17) = {pmf_actual:.6f}  (log = {np.log(pmf_actual):.4f})")

# Manual log-PMF
counts_ex = [50, 30, 20]
log_pmf = (
    np.sum(np.log(np.arange(1, n+1))) -
    np.sum([np.sum(np.log(np.arange(1, c+1))) for c in counts_ex]) +
    np.sum([c * np.log(pi) for c, pi in zip(counts_ex, p)])
)
print(f"\nManual log P(50,30,20) = {log_pmf:.4f}")

# Simulate 10,000 draws and verify means and covariances
samples = rng.multinomial(n, p, size=10_000)
print(f"\nEmpirical means:  {samples.mean(axis=0)}")   # [50, 30, 20]
print(f"Expected means:   {[n*pi for pi in p]}")
print(f"\nEmpirical Cov(X1,X2): {np.cov(samples[:,0], samples[:,1])[0,1]:.2f}")
print(f"Theoretical:          {-n*p[0]*p[1]:.2f}")

# Marginalization: X1 ~ Binomial(n, p1)
p_x1_gt55 = 1 - stats.binom.cdf(55, n=n, p=p[0])
print(f"\nP(X1 > 55 | Binomial) = {p_x1_gt55:.4f}")
Single sample: {'tech': 52, 'sports': 28, 'politics': 20} P(50, 30, 20) = 0.009006 (log = -4.7098) P(55, 28, 17) = 0.005554 (log = -5.1944) Manual log P(50,30,20) = -4.7098 Empirical means: [50.01 30.03 19.96] Expected means: [50.0, 30.0, 20.0] Empirical Cov(X1,X2): -14.97 Theoretical: -15.00 P(X1 > 55 | Binomial) = 0.1357

ML Applications

1 — Multinomial Naive Bayes for Text

In Multinomial Naive Bayes, a document is modeled as a bag-of-words count vector (x₁, x₂, ..., x_V) where xᵢ is how many times word i appears. Given class c, this count vector is assumed to follow Multinomial(n, p₁|c, ..., p_V|c) where pᵢ|c = P(word i | class c).

The class-conditional log-likelihood: log P(doc | class c) = log[n!/∏xᵢ!] + Σᵢ xᵢ log(pᵢ|c)

The log[n!/∏xᵢ!] term is the same for all classes and cancels in the argmax — so classification reduces to comparing Σᵢ xᵢ log(pᵢ|c) across classes.

2 — Softmax Output Layer

A neural network's final softmax layer outputs p̂ = (p̂₁, ..., p̂ₖ): a Multinomial probability vector for a single trial. The cross-entropy loss for one observation is:

L = −Σᵢ yᵢ log(p̂ᵢ)

where yᵢ is the one-hot true label (yᵢ=1 for the true class, 0 elsewhere). Since only one yᵢ=1:

L = −log(p̂_{true class})

This is exactly the negative log-likelihood of a Multinomial(n=1, p̂) observation — the probability of seeing exactly the true class. Summing over the dataset gives the total negative log-likelihood. Training with cross-entropy loss is maximum likelihood estimation of a Multinomial model. This is not a design choice — it follows directly from modeling each prediction as a Multinomial trial.

3 — Dirichlet-Multinomial (Bayesian Text Modeling)

The Dirichlet distribution is the conjugate prior for Multinomial parameters. If p ~ Dirichlet(α₁, ..., αₖ) and X|p ~ Multinomial(n, p), then after observing counts (x₁, ..., xₖ):

p | X ~ Dirichlet(α₁+x₁, α₂+x₂, ..., αₖ+xₖ)

This is the basis for Laplace smoothing in Naive Bayes: adding αᵢ=1 to each count before normalizing corresponds to a Dirichlet(1,1,...,1) prior — the uniform distribution over the simplex.

4 — Chi-Square Independence Test

Under the independence hypothesis H₀, the observed contingency table cell counts follow a Multinomial distribution with probabilities pᵢⱼ = pᵢ × pⱼ (row marginal × column marginal). The chi-square statistic Σ (O−E)²/E is an approximation to the likelihood ratio test for a Multinomial model. The Multinomial distribution is the theoretical foundation that makes the chi-square test valid.

Cross-Entropy IS Multinomial Negative Log-Likelihood

The connection is exact, not analogical. For a single observation (n=1 Multinomial trial):

Multinomial PMF: P(X=eⱼ) = p̂ⱼ (where eⱼ is a one-hot vector with 1 in position j)

Log-likelihood: log P = log(p̂ⱼ) = Σᵢ yᵢ log(p̂ᵢ) (since yᵢ=1 only for the true class j)

Negative log-likelihood: −log P = −Σᵢ yᵢ log(p̂ᵢ) = cross-entropy loss

Summing over N observations: L = −(1/N)Σₙ Σᵢ yᵢₙ log(p̂ᵢₙ) — the standard training objective.

Minimizing cross-entropy loss = maximizing the Multinomial likelihood of the observed labels under the predicted probabilities.

Limitations

  • Independence assumption. The Multinomial assumes each of the n trials is independent. In practice, documents in a batch may be correlated (same topic cluster, same time period) — the Multinomial PMF will underestimate variance in that case.
  • Fixed probabilities. The Multinomial assumes pᵢ is fixed across all n trials. If the class distribution shifts during data collection (e.g., sports coverage increases mid-season), the Multinomial model is misspecified.
  • Only joint counts, not joint ordering. The Multinomial is exchangeable — it treats (50 tech docs in any order) identically. Sequential structure in the data is not captured.

Test Your Understanding

  1. The anchor classifier is evaluated on 100 documents with p=(0.50, 0.30, 0.20). Compute P(X₁=48, X₂=32, X₃=20) using the log-PMF formula. Is this probability higher or lower than P(X₁=50, X₂=30, X₃=20)? Why?

  2. The variance-covariance matrix has rows that sum to zero. Prove this algebraically using the formulas Var(Xᵢ) = npᵢ(1−pᵢ) and Cov(Xᵢ, Xⱼ) = −npᵢpⱼ, and the fact that Σpᵢ = 1.

  3. A classifier predicts probabilities (p̂₁, p̂₂, p̂₃) = (0.7, 0.2, 0.1) for a "tech" document (true class: tech). Compute the cross-entropy loss for this single example. Then compute it for a misclassified prediction (0.2, 0.5, 0.3). Why does cross-entropy penalize confident wrong predictions so heavily?

  4. You observe 100 documents and want to test whether tech and sports are independent of each other (ignoring politics). Explain how the Binomial marginalization property lets you reduce this to a simpler problem and state what distribution you would use.

  5. In Multinomial Naive Bayes, word counts for "the" appear 15 times in a 200-word document. The training data gives P("the" | tech) = 0.04 and P("the" | sports) = 0.03. Show how this one word contributes to the log-likelihood ratio log P(doc | tech) − log P(doc | sports).

Comments (0)

No comments yet. Be the first to comment!

Leave a comment