
Variants of Naive Bayes

Jun 26, 2026•7 min read•By Mohammed Vasim

Machine Learning, AI, Data Science

Bayes' theorem gives the posterior $P (y ∣ x) \propto P (y) \times P (x ∣ y)$. To make this computable, you need a model for the likelihood $P (x ∣ y)$: how the features are distributed given the class. Three different assumptions about this distribution give three different classifiers: Multinomial NB for count data, Bernoulli NB for binary presence/absence, and Gaussian NB for continuous measurements.

Anchor dataset: Email spam classification. Four training emails, one test email.

```python
# Training data: word counts per email, label (1=spam, 0=ham)
train_emails = [
    {"free": 3, "money": 2, "hello": 0, "meeting": 1},  # spam
    {"free": 0, "money": 0, "hello": 2, "meeting": 3},  # ham
    {"free": 2, "money": 1, "hello": 0, "meeting": 0},  # spam
    {"free": 0, "money": 0, "hello": 1, "meeting": 2},  # ham
]
train_labels = [1, 0, 1, 0]

test_email = {"free": 1, "money": 1, "hello": 0, "meeting": 0}
```

The Naive Assumption

Naive Bayes classifies by computing:

$P (y ∣ x) \propto P (y) \times \prod_{j = 1}^{p} P (x_{j} ∣ y)$

The product form comes from assuming conditional independence: given the class, each feature is independent of all others. $P (x_{1}, x_{2}, \dots ∣ y) = P (x_{1} ∣ y) \times P (x_{2} ∣ y) \times \dots$

Real data almost certainly violates this: "free" and "money" co-occur in spam far more often than independence would predict. The class prediction (argmax over $y$) is nevertheless often correct even when the individual probabilities are off. For count features (the multinomial model below), Naive Bayes is a linear classifier in log-space:

$\log P (y ∣ x) = \log P (y) + \sum_{j} x_{j} \log P (word_{j} ∣ y) + \text{const}$

The score is linear in the counts $x_{j}$, with weights $\log P (word_{j} ∣ y)$. This is the same functional form as logistic regression, but the parameters are set analytically from counts rather than fit by gradient descent.
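To make the linear form concrete, the sketch below rewrites the spam-versus-ham decision as a bias plus a weighted sum of counts. It borrows the smoothed word probabilities computed later in this post (Multinomial NB, Step 2). The names `weights`, `bias`, `log_odds`, and `sigmoid` are illustrative choices, not part of any library.

```python
import math

# Smoothed P(word | class) from the Multinomial NB section of this post,
# in the order: free, money, hello, meeting.
theta_spam = [6 / 13, 4 / 13, 1 / 13, 2 / 13]
theta_ham = [1 / 12, 1 / 12, 4 / 12, 6 / 12]

# One weight per word: the log-ratio of its probability under each class.
weights = [math.log(s / h) for s, h in zip(theta_spam, theta_ham)]
# The bias is the log-ratio of the class priors (0.5 vs 0.5, so it is zero here).
bias = math.log(0.5 / 0.5)


def log_odds(x):
    """Log of P(spam | x) / P(ham | x): linear in the word counts x."""
    return bias + sum(w * xj for w, xj in zip(weights, x))


def sigmoid(z):
    """Map log-odds back to a probability for the two-class case."""
    return 1.0 / (1.0 + math.exp(-z))


x_test = [1, 1, 0, 0]  # counts for free, money, hello, meeting
z = log_odds(x_test)
print("log-odds:", round(z, 3), "P(spam | x):", round(sigmoid(z), 3))
```

A positive log-odds means spam. The constant term drops out of the comparison because it is the same for both classes.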

Variant 1: Multinomial Naive Bayes

Multinomial NB treats each feature as a word count. The likelihood $P (x ∣ y)$ is the probability of observing this bag-of-words given the class.

Step 1: Class priors

$P (spam) = 2/4 = 0.5, P (ham) = 2/4 = 0.5$

Step 2: Word likelihoods with Laplace smoothing

Count total word occurrences per class. Vocabulary $V = \{\text{free}, \text{money}, \text{hello}, \text{meeting}\}$, $∣ V ∣ = 4$.

With Laplace smoothing ($α = 1$): $P (word ∣ class) = \frac{\text{count} (word, class) + 1}{\text{total words in class} + ∣ V ∣}$

Spam emails (emails 1 and 3): free=3+2=5, money=2+1=3, hello=0+0=0, meeting=1+0=1. Total = 9. With smoothing: denominator = 9 + 4 = 13.

Ham emails (emails 2 and 4): free=0+0=0, money=0+0=0, hello=2+1=3, meeting=3+2=5. Total = 8. With smoothing: denominator = 8 + 4 = 12.

Word	Spam count	P(word | spam)	Ham count	P(word | ham)
free	5	$(5 + 1) /13 = 6/13 = 0.462$	0	$(0 + 1) /12 = 1/12 = 0.083$
money	3	$(3 + 1) /13 = 4/13 = 0.308$	0	$(0 + 1) /12 = 1/12 = 0.083$
hello	0	$(0 + 1) /13 = 1/13 = 0.077$	3	$(3 + 1) /12 = 4/12 = 0.333$
meeting	1	$(1 + 1) /13 = 2/13 = 0.154$	5	$(5 + 1) /12 = 6/12 = 0.500$
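Steps 1 and 2 can be reproduced with a short script. This is a minimal sketch using only the standard library. The helper names `class_priors` and `word_likelihoods` are made up for this example, and `alpha` is the smoothing pseudocount.

```python
from collections import Counter

train_emails = [
    {"free": 3, "money": 2, "hello": 0, "meeting": 1},  # spam
    {"free": 0, "money": 0, "hello": 2, "meeting": 3},  # ham
    {"free": 2, "money": 1, "hello": 0, "meeting": 0},  # spam
    {"free": 0, "money": 0, "hello": 1, "meeting": 2},  # ham
]
train_labels = [1, 0, 1, 0]  # 1=spam, 0=ham
vocab = ["free", "money", "hello", "meeting"]


def class_priors(labels):
    """Fraction of training emails in each class."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def word_likelihoods(emails, labels, target, alpha=1.0):
    """Smoothed P(word | target class) from summed word counts."""
    counts = Counter()
    for email, label in zip(emails, labels):
        if label == target:
            counts.update(email)  # adds this email's counts word by word
    denom = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}


priors = class_priors(train_labels)
p_spam = word_likelihoods(train_emails, train_labels, target=1)
p_ham = word_likelihoods(train_emails, train_labels, target=0)
for w in vocab:
    print(f"{w:8s} spam={p_spam[w]:.3f} ham={p_ham[w]:.3f}")
```

Each class's smoothed probabilities sum to 1 over the vocabulary, which is a quick sanity check on the denominator.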

Step 3: Classify the test email {free:1, money:1, hello:0, meeting:0}

For Multinomial NB: $P (email ∣ y) = \prod_{j} P (word_{j} ∣ y)^{count_{j}}$

Zero-count words contribute $P (word ∣ y)^{0} = 1$ — absent words are ignored.

$P (email ∣ spam) = 0.462^{1} \times 0.308^{1} \times 0.077^{0} \times 0.154^{0} = 0.462 \times 0.308 = 0.142$

$P (email ∣ ham) = 0.083^{1} \times 0.083^{1} \times 0.333^{0} \times 0.500^{0} = 0.083 \times 0.083 = 0.007$

Unnormalized posteriors:

$P (spam ∣ email) \propto 0.5 \times 0.142 = 0.071$
$P (ham ∣ email) \propto 0.5 \times 0.007 = 0.0035$

Normalized: $P (spam ∣ email) = 0.071/ (0.071 + 0.0035) = 0.953$ → SPAM

The email mentions "free" and "money" — two words almost exclusive to spam in this training set. The result matches intuition.
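The same classification can be done in log-space, which is how it is usually computed to avoid underflow when documents have many words. The sketch below hard-codes the smoothed probabilities from the table in Step 2. `log_joint` and `normalize` are hypothetical helper names.

```python
import math

# Smoothed P(word | class) from the Step 2 table.
p_word = {
    "spam": {"free": 6 / 13, "money": 4 / 13, "hello": 1 / 13, "meeting": 2 / 13},
    "ham": {"free": 1 / 12, "money": 1 / 12, "hello": 4 / 12, "meeting": 6 / 12},
}
prior = {"spam": 0.5, "ham": 0.5}
test_email = {"free": 1, "money": 1, "hello": 0, "meeting": 0}


def log_joint(email, label):
    """log P(y) + sum_j count_j * log P(word_j | y); zero counts add nothing."""
    return math.log(prior[label]) + sum(
        count * math.log(p_word[label][word]) for word, count in email.items()
    )


def normalize(log_scores):
    """Turn unnormalized log scores into posteriors (log-sum-exp for stability)."""
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    total = sum(exp_scores.values())
    return {k: v / total for k, v in exp_scores.items()}


posterior = normalize({label: log_joint(test_email, label) for label in prior})
prediction = max(posterior, key=posterior.get)
print(prediction, round(posterior["spam"], 3))  # spam 0.953
```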

Variant 2: Bernoulli Naive Bayes

Bernoulli NB converts word counts to binary presence/absence. It treats each word as a separate Bernoulli variable: 1 if present, 0 if absent.

Step 1: Binary conversion

Email	free	money	hello	meeting	Label
1	1	1	0	1	spam
2	0	0	1	1	ham
3	1	1	0	0	spam
4	0	0	1	1	ham

Step 2: Word presence likelihoods (2 spam emails, 2 ham emails; $α = 1$, so the denominator is $n_{c} + 2$, where $n_{c}$ is the number of training emails in class $c$)

$P (word = 1 ∣ class) = \frac{\text{emails in class containing word} + 1}{n_{c} + 2}$

Word	P(word = 1 | spam)	P(word = 1 | ham)
free	$(2 + 1) / (2 + 2) = 3/4 = 0.75$	$(0 + 1) / (2 + 2) = 1/4 = 0.25$
money	$(2 + 1) / (2 + 2) = 3/4 = 0.75$	$(0 + 1) / (2 + 2) = 1/4 = 0.25$
hello	$(0 + 1) / (2 + 2) = 1/4 = 0.25$	$(2 + 1) / (2 + 2) = 3/4 = 0.75$
meeting	$(1 + 1) / (2 + 2) = 2/4 = 0.50$	$(2 + 1) / (2 + 2) = 3/4 = 0.75$
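The presence table can be derived directly from the count data. This sketch counts emails rather than word occurrences, and adds `alpha` to the numerator and `2 * alpha` to the denominator (one pseudo-email with the word, one without). The function name `presence_likelihoods` is illustrative.

```python
train_emails = [
    {"free": 3, "money": 2, "hello": 0, "meeting": 1},  # spam
    {"free": 0, "money": 0, "hello": 2, "meeting": 3},  # ham
    {"free": 2, "money": 1, "hello": 0, "meeting": 0},  # spam
    {"free": 0, "money": 0, "hello": 1, "meeting": 2},  # ham
]
train_labels = ["spam", "ham", "spam", "ham"]
vocab = ["free", "money", "hello", "meeting"]


def presence_likelihoods(emails, labels, target, alpha=1.0):
    """Smoothed P(word = 1 | target class), counting emails rather than words."""
    docs = [e for e, label in zip(emails, labels) if label == target]
    n_c = len(docs)
    return {
        # Any count above zero is treated as "present".
        w: (sum(1 for d in docs if d.get(w, 0) > 0) + alpha) / (n_c + 2 * alpha)
        for w in vocab
    }


p_spam = presence_likelihoods(train_emails, train_labels, "spam")
p_ham = presence_likelihoods(train_emails, train_labels, "ham")
print(p_spam)
print(p_ham)
```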

Step 3: Classify test email {free:1, money:1, hello:0, meeting:0}

The critical difference: Bernoulli NB uses both present and absent words. Absent words contribute $P (word = 0 ∣ y) = 1 - P (word = 1 ∣ y)$ .

$P (email ∣ spam) = P (free = 1 ∣ spam) \times P (money = 1 ∣ spam) \times P (hello = 0 ∣ spam) \times P (meeting = 0 ∣ spam)$
$= 0.75 \times 0.75 \times (1 - 0.25) \times (1 - 0.50) = 0.75 \times 0.75 \times 0.75 \times 0.50 = 0.211$

$P (email ∣ ham) = P (free = 1 ∣ ham) \times P (money = 1 ∣ ham) \times P (hello = 0 ∣ ham) \times P (meeting = 0 ∣ ham)$
$= 0.25 \times 0.25 \times (1 - 0.75) \times (1 - 0.75) = 0.25 \times 0.25 \times 0.25 \times 0.25 = 0.0039$

Unnormalized posteriors:

$P (spam ∣ email) \propto 0.5 \times 0.211 = 0.1055$
$P (ham ∣ email) \propto 0.5 \times 0.0039 = 0.00195$

Normalized: $P (spam ∣ email) = 0.1055/ (0.1055 + 0.00195) = 0.982$ → SPAM

The absent words (hello=0, meeting=0) actually increased confidence in spam here: ham emails always contain hello and meeting, so their absence is evidence against ham. Multinomial NB silently ignored them.
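A minimal sketch of the Bernoulli calculation follows, with the presence probabilities copied from the Step 2 table. The loop runs over the whole vocabulary, not just the words in the email, which is exactly where the absent-word terms come from. `bernoulli_likelihood` is a hypothetical helper name.

```python
# P(word = 1 | class) from the Bernoulli Step 2 table.
p_present = {
    "spam": {"free": 0.75, "money": 0.75, "hello": 0.25, "meeting": 0.50},
    "ham": {"free": 0.25, "money": 0.25, "hello": 0.75, "meeting": 0.75},
}
prior = {"spam": 0.5, "ham": 0.5}
test_email = {"free": 1, "money": 1, "hello": 0, "meeting": 0}


def bernoulli_likelihood(email, label):
    """Multiply p for present words and (1 - p) for absent ones."""
    likelihood = 1.0
    for word, p in p_present[label].items():
        present = email.get(word, 0) > 0
        likelihood *= p if present else (1.0 - p)
    return likelihood


unnormalized = {
    label: prior[label] * bernoulli_likelihood(test_email, label) for label in prior
}
total = sum(unnormalized.values())
posterior = {label: value / total for label, value in unnormalized.items()}
print(round(posterior["spam"], 3))  # 0.982
```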

Variant 3: Gaussian Naive Bayes

For continuous features, assume each feature follows a Gaussian distribution within each class:

$P (x_{j} ∣ y) = \frac{1}{\sqrt{2 π σ_{j y}^{2}}} \exp (- \frac{(x_{j} - μ_{j y})^{2}}{2 σ_{j y}^{2}})$

Parameters $μ_{j y}$ and $σ_{j y}^{2}$ are estimated from training data (mean and variance of feature $j$ in class $y$ ).
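Parameter estimation amounts to one mean and one variance per feature per class. The sketch below shows this for a single feature. The measurements are made-up illustrative values (not Iris data), and using the population variance (the maximum-likelihood estimate) is an assumption; the sample variance is another reasonable choice.

```python
from statistics import fmean, pvariance

# Hypothetical measurements of one feature, grouped by class.
samples = {
    "class_a": [4.9, 5.1, 5.0, 5.2, 4.8],
    "class_b": [5.9, 6.1, 6.0, 5.7, 6.3],
}


def fit_gaussian(values):
    """Return (mean, variance) of one feature within one class."""
    mu = fmean(values)
    var = pvariance(values, mu)  # divides by n, not n - 1
    return mu, var


params = {label: fit_gaussian(values) for label, values in samples.items()}
for label, (mu, var) in params.items():
    print(f"{label}: mu={mu:.2f}, var={var:.3f}")
```

With several features, the same fit is repeated independently for each one, which is the naive assumption at work.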

Iris example — sepal length:

Approximate parameters from the Iris dataset:

Setosa: $μ = 5.01$ , $σ = 0.35$
Versicolor: $μ = 5.94$ , $σ = 0.51$

For a test sample with sepal_length = 5.5:

$P (5.5 ∣ Setosa) = \frac{1}{\sqrt{2 π (0.35)^{2}}} \exp (- \frac{(5.5 - 5.01)^{2}}{2 (0.35)^{2}}) = \frac{1}{0.877} \exp (- 0.980) = 0.428$

$P (5.5 ∣ Versicolor) = \frac{1}{\sqrt{2 π (0.51)^{2}}} \exp (- \frac{(5.5 - 5.94)^{2}}{2 (0.51)^{2}}) = \frac{1}{1.278} \exp (- 0.372) = 0.539$

Sepal length alone slightly favors Versicolor. Combined with petal_length likelihood (which separates Setosa sharply), the joint posterior correctly classifies most Iris samples.
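The two sepal-length likelihoods can be checked numerically. This sketch evaluates the Gaussian density with the approximate means and standard deviations quoted above. `gaussian_pdf` is written out by hand here rather than taken from a library.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma at x."""
    var = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


# Approximate sepal-length parameters quoted in the text: (mean, std dev).
params = {"Setosa": (5.01, 0.35), "Versicolor": (5.94, 0.51)}
x = 5.5

densities = {
    label: gaussian_pdf(x, mu, sigma) for label, (mu, sigma) in params.items()
}
favored = max(densities, key=densities.get)
for label, value in densities.items():
    print(f"{label}: {value:.3f}")
print("favored by sepal length alone:", favored)
```

These values are densities rather than probabilities, so they can exceed 1 for narrow distributions. Only their relative size matters for the argmax, and with several features one would typically sum log-densities instead of multiplying.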

Laplace Smoothing — The Zero Probability Problem

Without smoothing, suppose "bitcoin" never appears in any of 1,000 spam training emails. Its count in the spam class is 0, so whatever the total spam word count $N_{spam}$ is:

$P (bitcoin ∣ spam) = 0/N_{spam} = 0$

Any email containing "bitcoin" gets $P (spam ∣ email) \propto P (spam) \times 0 \times \dots = 0$, regardless of every other word. One unseen word destroys the classification.

Laplace smoothing adds $α$ to every word count:

$P (word ∣ class) = \frac{count ( word, class ) + α}{total words in class + α \times ∣ V ∣}$

$α = 1$ : add-1 (uniform) smoothing. $α < 1$ : smaller pseudocount — less aggressively uniform. The right $α$ is a hyperparameter; cross-validate it (post 03).
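The effect of $α$ is easy to see on the spam counts from this post. In the sketch below, "bitcoin" is added to the vocabulary with a count of zero as a hypothetical extension of the anchor dataset, and `smoothed` is an illustrative helper name.

```python
def smoothed(counts, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocabulary size)."""
    denom = sum(counts.values()) + alpha * len(counts)
    return {word: (c + alpha) / denom for word, c in counts.items()}


# Spam word counts from the anchor dataset, plus an unseen word (hypothetical).
spam_counts = {"free": 5, "money": 3, "hello": 0, "meeting": 1, "bitcoin": 0}

for alpha in (0.0, 0.01, 1.0):
    probs = smoothed(spam_counts, alpha)
    print(f"alpha={alpha}: P(bitcoin)={probs['bitcoin']:.4f}, P(free)={probs['free']:.4f}")
```

With `alpha=0.0` the unseen word gets probability zero. A larger `alpha` lifts unseen words further and takes correspondingly more mass away from frequent ones.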

Multinomial vs Bernoulli NB

Aspect	Multinomial NB	Bernoulli NB
Feature type	Word counts (or TF)	Binary presence (0/1)
Absent words	Ignored ( $P^{0} = 1$ )	Penalized: $(1 - P)$ term
Uses word frequency	Yes	No — only presence matters
Better for	Long documents	Short texts, boolean features
Typical use	News categorization, TF-IDF	Spam detection, boolean attributes

The absent-word difference is the practical distinction: Bernoulli penalizes words that are characteristic of the class but don't appear, which can hurt or help depending on the task.
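The "uses word frequency" row can be illustrated by scoring two hypothetical emails that differ only in how often "free" appears. The sketch reuses the smoothed probabilities from the two worked examples and assumes equal priors. All function names are illustrative.

```python
import math

vocab = ["free", "money", "hello", "meeting"]
# Multinomial P(word | class) and Bernoulli P(word = 1 | class) from the tables above.
theta = {
    "spam": {"free": 6 / 13, "money": 4 / 13, "hello": 1 / 13, "meeting": 2 / 13},
    "ham": {"free": 1 / 12, "money": 1 / 12, "hello": 4 / 12, "meeting": 6 / 12},
}
phi = {
    "spam": {"free": 0.75, "money": 0.75, "hello": 0.25, "meeting": 0.50},
    "ham": {"free": 0.25, "money": 0.25, "hello": 0.75, "meeting": 0.75},
}


def spam_posterior(log_lik):
    """P(spam | email) under equal priors, given a per-class log-likelihood."""
    return 1.0 / (1.0 + math.exp(log_lik("ham") - log_lik("spam")))


def multinomial_posterior(email):
    return spam_posterior(
        lambda c: sum(email.get(w, 0) * math.log(theta[c][w]) for w in vocab)
    )


def bernoulli_posterior(email):
    return spam_posterior(
        lambda c: sum(
            math.log(phi[c][w] if email.get(w, 0) > 0 else 1.0 - phi[c][w])
            for w in vocab
        )
    )


once = {"free": 1}  # hypothetical email: "free" appears once
five = {"free": 5}  # hypothetical email: "free" appears five times
for name, email in (("once", once), ("five", five)):
    print(name, round(multinomial_posterior(email), 4), round(bernoulli_posterior(email), 4))
```

The multinomial posterior climbs as the count grows, while the Bernoulli posterior is identical for both emails because only presence enters the calculation.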

Variants Summary

Variant	Feature type	Likelihood model	Use case
Gaussian NB	Continuous	Gaussian PDF per feature per class	Iris, sensor data, real-valued measurements
Multinomial NB	Counts / frequencies	Count ratio with Laplace smoothing	Document classification, TF matrices
Bernoulli NB	Binary 0/1	Bernoulli with present/absent terms	Short-text spam, boolean attributes

Test Your Understanding

Multinomial NB ignores absent words ( $P^{0} = 1$ ). If you add a fifth word "bitcoin" to the vocabulary but it appears zero times in training, does Laplace smoothing change the classification of the test email {free:1, money:1}? Why or why not?
In Bernoulli NB, the absent word "meeting" contributed $(1 - P (meeting = 1 ∣ spam)) = 1 - 0.50 = 0.50$ to the spam likelihood. In Multinomial NB, "meeting" contributes $0.15 4^{0} = 1$ . Which treatment is more conservative (less confident) about spam, and why?
Gaussian NB assumes features are normally distributed within each class. The Iris dataset has sepal_length near-Gaussian, but word counts in text are typically Poisson or power-law distributed. What happens to the likelihood estimates if you apply Gaussian NB to word count features?
Laplace smoothing with $α = 1$ adds the same count to every word. If your vocabulary has 50,000 words but only 500 are ever seen in spam, how does $α = 1$ vs $α = 0.01$ affect rare vs common word probabilities?
Both Multinomial NB and Bernoulli NB classify the test email as spam, but with different confidences (0.953 vs 0.982). These probabilities are not calibrated — they overstate confidence. Why does the independence assumption inflate posterior probabilities toward 0 and 1?