Variants of Naive Bayes
Bayes' theorem gives the posterior $P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$. To make this computable, you need a model for the likelihood $P(x \mid c)$: how features are distributed given the class. Three different assumptions about this distribution give three different classifiers: Multinomial NB for count data, Bernoulli NB for binary presence/absence, and Gaussian NB for continuous measurements.
Anchor dataset: Email spam classification. Four training emails, one test email.
# Training data: word counts per email, label (1=spam, 0=ham)
train_emails = [
{"free": 3, "money": 2, "hello": 0, "meeting": 1}, # spam
{"free": 0, "money": 0, "hello": 2, "meeting": 3}, # ham
{"free": 2, "money": 1, "hello": 0, "meeting": 0}, # spam
{"free": 0, "money": 0, "hello": 1, "meeting": 2}, # ham
]
train_labels = [1, 0, 1, 0]
test_email = {"free": 1, "money": 1, "hello": 0, "meeting": 0}
The Naive Assumption
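Naive Bayes classifies by computing:
$$\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)$$
The product form comes from assuming conditional independence: given the class, each feature is independent of all others.
$$P(x_1, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$$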
This is almost certainly violated in real data: "free" and "money" co-occur in spam far more than independence would predict. But the class prediction (argmax over $c$) is often correct even when individual probabilities are off. Naive Bayes can be thought of as a linear classifier in log-space:
$$\log P(c \mid x) = \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c) + \text{const}$$
For count and binary features, each term $\log P(x_i \mid c)$ is linear in $x_i$, giving the same functional form as logistic regression, but with parameters set analytically from counts rather than by gradient descent.
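To make the linear form concrete, here is a minimal sketch, assuming count features where each word contributes its probability raised to its count. The values in `theta` and `prior` are made up for illustration and do not come from the anchor dataset. It checks that the log of the product score equals a bias plus a weighted sum of the counts.

```python
import math

# Hypothetical per-class word probabilities and class prior (illustrative only).
theta = {"free": 0.5, "money": 0.3, "hello": 0.2}
prior = 0.4

# Count features for one document.
x = {"free": 2, "money": 1, "hello": 0}

# Product form: P(c) * prod_i theta_i ** x_i
product_score = prior * math.prod(theta[w] ** x[w] for w in x)

# Linear form: b + sum_i w_i * x_i, with w_i = log(theta_i) and b = log P(c)
bias = math.log(prior)
weights = {w: math.log(p) for w, p in theta.items()}
linear_score = bias + sum(weights[w] * x[w] for w in x)

print(math.log(product_score), linear_score)
```

The two printed numbers agree up to floating-point error, which is why implementations usually work with sums of logs rather than products of small probabilities.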
Variant 1: Multinomial Naive Bayes
Multinomial NB treats each feature as a word count. The likelihood is the probability of observing this bag-of-words given the class.
Step 1: Class priors
$$P(\text{spam}) = \frac{2}{4} = 0.5, \qquad P(\text{ham}) = \frac{2}{4} = 0.5$$
Step 2: Word likelihoods with Laplace smoothing
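Count total word occurrences per class. Vocabulary $V = \{\text{free}, \text{money}, \text{hello}, \text{meeting}\}$, so $|V| = 4$.
With Laplace smoothing ($\alpha = 1$):
$$P(w \mid c) = \frac{\text{count}(w, c) + \alpha}{\sum_{w' \in V} \text{count}(w', c) + \alpha |V|}$$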
Spam emails (emails 1 and 3): free=3+2=5, money=2+1=3, hello=0, meeting=1+0=1. Total = 9. With smoothing: denominator = 9 + 4 = 13.
Ham emails (emails 2 and 4): free=0, money=0, hello=2+1=3, meeting=3+2=5. Total = 8. With smoothing: denominator = 8 + 4 = 12.
| Word | Spam count | $P(w \mid \text{spam})$ | Ham count | $P(w \mid \text{ham})$ |
|---|---|---|---|---|
| free | 5 | 6/13 ≈ 0.462 | 0 | 1/12 ≈ 0.083 |
| money | 3 | 4/13 ≈ 0.308 | 0 | 1/12 ≈ 0.083 |
| hello | 0 | 1/13 ≈ 0.077 | 3 | 4/12 ≈ 0.333 |
| meeting | 1 | 2/13 ≈ 0.154 | 5 | 6/12 = 0.500 |
Step 3: Classify the test email {free:1, money:1, hello:0, meeting:0}
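For Multinomial NB:
$$P(c \mid \text{email}) \propto P(c) \prod_{w \in V} P(w \mid c)^{x_w}$$
where $x_w$ is the count of word $w$ in the email. Zero-count words contribute $P(w \mid c)^0 = 1$, so absent words are ignored.
Unnormalized posteriors:
$$\text{spam}: \; 0.5 \times \tfrac{6}{13} \times \tfrac{4}{13} \approx 0.0710 \qquad \text{ham}: \; 0.5 \times \tfrac{1}{12} \times \tfrac{1}{12} \approx 0.0035$$
Normalized: $P(\text{spam} \mid \text{email}) = \frac{0.0710}{0.0710 + 0.0035} \approx 0.953$ → SPAM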
The email mentions "free" and "money", two words that appear only in spam in this training set. The result matches intuition.
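The three steps above can be sketched in plain Python. This is a minimal illustration on the anchor dataset, not a production implementation; the helper names `fit_multinomial_nb` and `predict_multinomial_nb` are made up for this example, and it multiplies raw probabilities (fine for four words, but real code would sum logs).

```python
train_emails = [
    {"free": 3, "money": 2, "hello": 0, "meeting": 1},  # spam
    {"free": 0, "money": 0, "hello": 2, "meeting": 3},  # ham
    {"free": 2, "money": 1, "hello": 0, "meeting": 0},  # spam
    {"free": 0, "money": 0, "hello": 1, "meeting": 2},  # ham
]
train_labels = [1, 0, 1, 0]
test_email = {"free": 1, "money": 1, "hello": 0, "meeting": 0}


def fit_multinomial_nb(emails, labels, alpha=1.0):
    """Estimate class priors and Laplace-smoothed word likelihoods."""
    vocab = sorted(emails[0])
    classes = sorted(set(labels))
    priors = {c: labels.count(c) / len(labels) for c in classes}
    likelihoods = {}
    for c in classes:
        # Total occurrences of each word across all emails of class c.
        counts = {w: sum(e[w] for e, y in zip(emails, labels) if y == c) for w in vocab}
        denom = sum(counts.values()) + alpha * len(vocab)
        likelihoods[c] = {w: (counts[w] + alpha) / denom for w in vocab}
    return priors, likelihoods


def predict_multinomial_nb(email, priors, likelihoods):
    """Return normalized posteriors; words with count 0 contribute a factor of 1."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for w, n in email.items():
            score *= likelihoods[c][w] ** n
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


priors, likelihoods = fit_multinomial_nb(train_emails, train_labels)
posterior = predict_multinomial_nb(test_email, priors, likelihoods)
print(round(posterior[1], 3))  # 0.953
```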
Variant 2: Bernoulli Naive Bayes
Bernoulli NB converts word counts to binary presence/absence. It treats each word as a separate Bernoulli variable: 1 if present, 0 if absent.
Step 1: Binary conversion
| Email | free | money | hello | meeting | Label |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 1 | spam |
| 2 | 0 | 0 | 1 | 1 | ham |
| 3 | 1 | 1 | 0 | 0 | spam |
| 4 | 0 | 0 | 1 | 1 | ham |
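Step 2: Word presence likelihoods (2 spam emails, 2 ham emails; $\alpha = 1$, denominator $N_c + 2\alpha = 2 + 2 = 4$)
| Word | $P(x_w = 1 \mid \text{spam})$ | $P(x_w = 1 \mid \text{ham})$ |
|---|---|---|
| free | (2+1)/4 = 0.75 | (0+1)/4 = 0.25 |
| money | (2+1)/4 = 0.75 | (0+1)/4 = 0.25 |
| hello | (0+1)/4 = 0.25 | (2+1)/4 = 0.75 |
| meeting | (1+1)/4 = 0.50 | (2+1)/4 = 0.75 |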
Step 3: Classify test email {free:1, money:1, hello:0, meeting:0}
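The critical difference: Bernoulli NB uses both present and absent words. Absent words contribute $1 - P(x_w = 1 \mid c)$.
$$P(c \mid \text{email}) \propto P(c) \prod_{w \in V} P(x_w = 1 \mid c)^{x_w} \, \bigl(1 - P(x_w = 1 \mid c)\bigr)^{1 - x_w}$$
where $x_w \in \{0, 1\}$ indicates whether word $w$ is present.
Unnormalized posteriors:
$$\text{spam}: \; 0.5 \times 0.75 \times 0.75 \times (1 - 0.25) \times (1 - 0.50) \approx 0.1055$$
$$\text{ham}: \; 0.5 \times 0.25 \times 0.25 \times (1 - 0.75) \times (1 - 0.75) \approx 0.00195$$
Normalized: $P(\text{spam} \mid \text{email}) = \frac{0.1055}{0.1055 + 0.00195} \approx 0.982$ → SPAM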
The absent words (hello=0, meeting=0) actually increased confidence in spam here: ham emails always contain hello and meeting, so their absence is evidence against ham. Multinomial NB silently ignored them.
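The Bernoulli calculation can be reproduced with a short sketch. The function names `fit_bernoulli_nb` and `predict_bernoulli_nb` are hypothetical, chosen for this example; the smoothing denominator is the number of emails in the class plus 2α, matching the table above.

```python
train_emails = [
    {"free": 3, "money": 2, "hello": 0, "meeting": 1},  # spam
    {"free": 0, "money": 0, "hello": 2, "meeting": 3},  # ham
    {"free": 2, "money": 1, "hello": 0, "meeting": 0},  # spam
    {"free": 0, "money": 0, "hello": 1, "meeting": 2},  # ham
]
train_labels = [1, 0, 1, 0]
test_email = {"free": 1, "money": 1, "hello": 0, "meeting": 0}


def fit_bernoulli_nb(emails, labels, alpha=1.0):
    """Estimate priors and smoothed P(word present | class)."""
    vocab = sorted(emails[0])
    classes = sorted(set(labels))
    priors = {c: labels.count(c) / len(labels) for c in classes}
    presence = {}
    for c in classes:
        docs = [e for e, y in zip(emails, labels) if y == c]
        presence[c] = {
            # Count emails containing the word at least once, then smooth.
            w: (sum(1 for e in docs if e[w] > 0) + alpha) / (len(docs) + 2 * alpha)
            for w in vocab
        }
    return priors, presence


def predict_bernoulli_nb(email, priors, presence):
    """Every vocabulary word contributes: p if present, (1 - p) if absent."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for w, p in presence[c].items():
            score *= p if email.get(w, 0) > 0 else (1.0 - p)
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


priors, presence = fit_bernoulli_nb(train_emails, train_labels)
posterior = predict_bernoulli_nb(test_email, priors, presence)
print(round(posterior[1], 3))  # 0.982
```

Note that the prediction loop iterates over the whole vocabulary rather than over the words in the email, which is exactly where the absent-word terms enter.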
Variant 3: Gaussian Naive Bayes
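For continuous features, assume each feature follows a Gaussian distribution within each class:
$$P(x_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_{ic}^2}} \exp\left(-\frac{(x_i - \mu_{ic})^2}{2\sigma_{ic}^2}\right)$$
Parameters $\mu_{ic}$ and $\sigma_{ic}^2$ are estimated from training data (mean and variance of feature $i$ in class $c$).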
Iris example — sepal length:
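Approximate parameters from the Iris dataset (sepal length in cm):
- Setosa: $\mu \approx 5.01$, $\sigma \approx 0.35$
- Versicolor: $\mu \approx 5.94$, $\sigma \approx 0.52$
For a test sample with sepal_length = 5.5:
$$P(5.5 \mid \text{Setosa}) \approx 0.43, \qquad P(5.5 \mid \text{Versicolor}) \approx 0.54$$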
Sepal length alone slightly favors Versicolor. Combined with petal_length likelihood (which separates Setosa sharply), the joint posterior correctly classifies most Iris samples.
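A small sketch of the single-feature comparison follows. The (mean, standard deviation) pairs in `params` are assumed, rounded values for Iris sepal length, so substitute your own estimates; equal class priors are also an assumption here.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Assumed approximate sepal-length parameters in cm: (mean, std) per class.
params = {"setosa": (5.01, 0.35), "versicolor": (5.94, 0.52)}
x = 5.5

likelihoods = {name: gaussian_pdf(x, mu, sigma) for name, (mu, sigma) in params.items()}

# With equal priors (an assumption), the posterior is the normalized likelihood.
total = sum(likelihoods.values())
posteriors = {name: value / total for name, value in likelihoods.items()}

for name in params:
    print(name, round(likelihoods[name], 2), round(posteriors[name], 2))
```

With these assumed parameters the Versicolor density is only modestly higher than the Setosa one, so a single feature gives a weak preference; multiplying in further per-feature densities is what sharpens the decision.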
Laplace Smoothing — The Zero Probability Problem
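Without smoothing: if "bitcoin" never appears in 1,000 spam training emails:
$$P(\text{bitcoin} \mid \text{spam}) = \frac{0}{\sum_{w'} \text{count}(w', \text{spam})} = 0$$
Any email containing "bitcoin" gets $P(\text{spam} \mid \text{email}) = 0$, regardless of every other word. One unseen word destroys the classification.
Laplace smoothing adds $\alpha$ to every word count:
$$P(w \mid c) = \frac{\text{count}(w, c) + \alpha}{\sum_{w' \in V} \text{count}(w', c) + \alpha |V|}$$
$\alpha = 1$: add-1 (uniform) smoothing. $\alpha < 1$: smaller pseudocount, less aggressively uniform. The right $\alpha$ is a hyperparameter; cross-validate it (post 03).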
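To get a feel for how the pseudocount changes an unseen word's probability, here is a minimal sketch. The token total (20,000) and vocabulary size (5,000) are hypothetical numbers picked for illustration; `smoothed_prob` is a made-up helper implementing the smoothed count ratio.

```python
import math


def smoothed_prob(count, total, vocab_size, alpha):
    """Laplace-smoothed estimate: (count + alpha) / (total + alpha * vocab_size)."""
    return (count + alpha) / (total + alpha * vocab_size)


# Hypothetical corpus: "bitcoin" seen 0 times among 20,000 spam tokens,
# with a vocabulary of 5,000 distinct words.
for alpha in (0.0, 0.1, 1.0):
    p = smoothed_prob(count=0, total=20_000, vocab_size=5_000, alpha=alpha)
    print(f"alpha={alpha}: P(bitcoin | spam) = {p:.2e}")

# Sanity check against the anchor dataset: "free" in spam with alpha = 1.
p_free_spam = smoothed_prob(count=5, total=9, vocab_size=4, alpha=1.0)
print(round(p_free_spam, 3))  # 0.462
```

With `alpha=0.0` the estimate is exactly zero, which is the failure case described above; any positive `alpha` keeps it strictly positive, and a larger `alpha` moves more probability mass toward unseen words.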
Multinomial vs Bernoulli NB
| Aspect | Multinomial NB | Bernoulli NB |
|---|---|---|
| Feature type | Word counts (or TF) | Binary presence (0/1) |
| Absent words | Ignored ($P(w \mid c)^0 = 1$) | Penalized: $1 - P(x_w = 1 \mid c)$ term |
| Uses word frequency | Yes | No — only presence matters |
| Better for | Long documents | Short texts, boolean features |
| Typical use | News categorization, TF-IDF | Spam detection, boolean attributes |
The absent-word difference is the practical distinction: Bernoulli penalizes words that are characteristic of the class but don't appear, which can hurt or help depending on the task.
Variants Summary
| Variant | Feature type | Likelihood model | Use case |
|---|---|---|---|
| Gaussian NB | Continuous | Gaussian PDF per feature per class | Iris, sensor data, real-valued measurements |
| Multinomial NB | Counts / frequencies | Count ratio with Laplace smoothing | Document classification, TF matrices |
| Bernoulli NB | Binary 0/1 | Bernoulli with present/absent terms | Short-text spam, boolean attributes |
Test Your Understanding
- Multinomial NB ignores absent words ($P(w \mid c)^0 = 1$). If you add a fifth word "bitcoin" to the vocabulary but it appears zero times in training, does Laplace smoothing change the classification of the test email {free:1, money:1}? Why or why not?
- In Bernoulli NB, the absent word "meeting" contributed $1 - P(\text{meeting} \mid \text{spam}) = 0.5$ to the spam likelihood. In Multinomial NB, "meeting" contributes $P(\text{meeting} \mid \text{spam})^0 = 1$. Which treatment is more conservative (less confident) about spam, and why?
- Gaussian NB assumes features are normally distributed within each class. The Iris dataset has sepal_length near-Gaussian, but word counts in text are typically Poisson or power-law distributed. What happens to the likelihood estimates if you apply Gaussian NB to word count features?
- Laplace smoothing with $\alpha$ adds the same count to every word. If your vocabulary has 50,000 words but only 500 are ever seen in spam, how does $\alpha = 1$ vs $\alpha < 1$ affect rare vs common word probabilities?
- Both Multinomial NB and Bernoulli NB classify the test email as spam, but with different confidences (0.953 vs 0.982). These probabilities are not calibrated: they overstate confidence. Why does the independence assumption inflate posterior probabilities toward 0 and 1?