Variants of Naive Bayes

Jun 26, 2026 · 7 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Bayes' theorem gives the posterior $P(y \mid x) \propto P(x \mid y)\,P(y)$. To make this computable, you need a model for the likelihood $P(x \mid y)$: how features are distributed given the class. Three different assumptions about this distribution give three different classifiers: Multinomial NB for count data, Bernoulli NB for binary presence/absence, and Gaussian NB for continuous measurements.

Anchor dataset: Email spam classification. Four training emails, one test email.

```python
# Training data: word counts per email, label (1=spam, 0=ham)
train_emails = [
    {"free": 3, "money": 2, "hello": 0, "meeting": 1},  # spam
    {"free": 0, "money": 0, "hello": 2, "meeting": 3},  # ham
    {"free": 2, "money": 1, "hello": 0, "meeting": 0},  # spam
    {"free": 0, "money": 0, "hello": 1, "meeting": 2},  # ham
]
train_labels = [1, 0, 1, 0]

test_email = {"free": 1, "money": 1, "hello": 0, "meeting": 0}
```

The Naive Assumption

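Naive Bayes classifies by computing:

$$\hat{y} = \arg\max_y \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

The product form comes from assuming conditional independence: given the class, each feature is independent of all others.

$$P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$

This is almost certainly violated in real data: "free" and "money" co-occur in spam far more than independence would predict. But the class prediction (argmax over $y$) is often correct even when individual probabilities are off. Naive Bayes can be thought of as a linear classifier in log-space:

$$\log P(y \mid x) = \log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y) + \text{const}$$

For count and binary features each term is linear in $x_i$ (for example $x_i \log P(w_i \mid y)$ in the multinomial case), so the score is linear in the features. That is the same functional form as logistic regression, but with parameters set analytically from counts rather than by gradient descent.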
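The log-space view can be sketched in a few lines. The probabilities in `theta_spam` below are hypothetical values chosen for illustration, not the ones estimated later in the post, and `log_score` / `product_score` are illustrative helper names. The sketch only shows that the "bias plus weighted sum" form agrees with the product form.

```python
import math

def log_score(prior, theta, counts):
    """Log-space score: log prior (bias) + sum_i x_i * log theta_i (weights)."""
    bias = math.log(prior)
    weights = {w: math.log(p) for w, p in theta.items()}
    return bias + sum(counts.get(w, 0) * weights[w] for w in weights)

def product_score(prior, theta, counts):
    """The same score computed as a plain product of probabilities."""
    score = prior
    for w, p in theta.items():
        score *= p ** counts.get(w, 0)
    return score

# Hypothetical per-class word probabilities, for illustration only
theta_spam = {"free": 0.5, "money": 0.3, "hello": 0.1, "meeting": 0.1}
x = {"free": 1, "money": 1, "hello": 0, "meeting": 0}

log_s = log_score(0.5, theta_spam, x)
prod_s = product_score(0.5, theta_spam, x)
print(math.exp(log_s), prod_s)  # the two values agree up to rounding
```

Working with sums of logs rather than products also avoids numerical underflow when a document has many words.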

Variant 1: Multinomial Naive Bayes

Multinomial NB treats each feature as a word count. The likelihood is the probability of observing this bag-of-words given the class.

Step 1: Class priors

$$P(\text{spam}) = \frac{2}{4} = 0.5, \qquad P(\text{ham}) = \frac{2}{4} = 0.5$$

Step 2: Word likelihoods with Laplace smoothing

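Count total word occurrences per class. Vocabulary $V = \{\text{free}, \text{money}, \text{hello}, \text{meeting}\}$, $|V| = 4$.

With Laplace smoothing ($\alpha = 1$):

$$P(w \mid y) = \frac{\text{count}(w, y) + \alpha}{\sum_{w' \in V} \text{count}(w', y) + \alpha |V|}$$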

Spam emails (emails 1 and 3): free=3+2=5, money=2+1=3, hello=0, meeting=1+0=1. Total = 9. With smoothing: denominator = 9 + 4 = 13.

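| Word | Spam count | $P(w \mid \text{spam})$ | Ham count | $P(w \mid \text{ham})$ |
|---|---|---|---|---|
| free | 5 | 6/13 ≈ 0.462 | 0 | 1/12 ≈ 0.083 |
| money | 3 | 4/13 ≈ 0.308 | 0 | 1/12 ≈ 0.083 |
| hello | 0 | 1/13 ≈ 0.077 | 3 | 4/12 ≈ 0.333 |
| meeting | 1 | 2/13 ≈ 0.154 | 5 | 6/12 = 0.500 |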

Ham denominator: total words = 0+0+2+3+0+0+1+2 = 8. With smoothing: 8 + 4 = 12.
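Steps 1 and 2 can be reproduced from the anchor dataset with a short sketch. The helper names `class_priors` and `word_likelihoods` are made up for this example; the data is the training set defined at the top of the post.

```python
from collections import Counter

train_emails = [
    {"free": 3, "money": 2, "hello": 0, "meeting": 1},  # spam
    {"free": 0, "money": 0, "hello": 2, "meeting": 3},  # ham
    {"free": 2, "money": 1, "hello": 0, "meeting": 0},  # spam
    {"free": 0, "money": 0, "hello": 1, "meeting": 2},  # ham
]
train_labels = [1, 0, 1, 0]
vocab = ["free", "money", "hello", "meeting"]

def class_priors(labels):
    """Fraction of training emails in each class."""
    n = len(labels)
    return {c: k / n for c, k in Counter(labels).items()}

def word_likelihoods(emails, labels, vocab, alpha=1.0):
    """Smoothed P(word | class) from total word occurrences per class."""
    result = {}
    for c in sorted(set(labels)):
        counts = Counter()
        for email, label in zip(emails, labels):
            if label == c:
                counts.update(email)  # adds this email's word counts
        denom = sum(counts[w] for w in vocab) + alpha * len(vocab)
        result[c] = {w: (counts[w] + alpha) / denom for w in vocab}
    return result

priors = class_priors(train_labels)
likelihoods = word_likelihoods(train_emails, train_labels, vocab)
print(priors)
print(likelihoods[1])  # spam: 6/13, 4/13, 1/13, 2/13
print(likelihoods[0])  # ham: 1/12, 1/12, 4/12, 6/12
```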

Step 3: Classify the test email {free:1, money:1, hello:0, meeting:0}

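For Multinomial NB:

$$P(y \mid x) \propto P(y) \prod_{w \in V} P(w \mid y)^{x_w}$$

Zero-count words contribute $P(w \mid y)^0 = 1$, so absent words are ignored.

Unnormalized posteriors:

$$\text{spam}: \; 0.5 \cdot \frac{6}{13} \cdot \frac{4}{13} \approx 0.0710$$

$$\text{ham}: \; 0.5 \cdot \frac{1}{12} \cdot \frac{1}{12} \approx 0.00347$$

Normalized: $P(\text{spam} \mid x) = \frac{0.0710}{0.0710 + 0.00347} \approx 0.953$, so the prediction is SPAM.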

The email mentions "free" and "money" — two words almost exclusive to spam in this training set. The result matches intuition.
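The classification step can be checked numerically. This sketch hard-codes the priors and the smoothed likelihoods from the worked example (counts 5, 3, 0, 1 over a denominator of 13 for spam; 0, 0, 3, 5 over 12 for ham). The function name `multinomial_posteriors` is illustrative, and the scores are computed in log-space before normalizing.

```python
import math

priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {
    "spam": {"free": 6 / 13, "money": 4 / 13, "hello": 1 / 13, "meeting": 2 / 13},
    "ham": {"free": 1 / 12, "money": 1 / 12, "hello": 4 / 12, "meeting": 6 / 12},
}
test_email = {"free": 1, "money": 1, "hello": 0, "meeting": 0}

def multinomial_posteriors(counts, priors, likelihoods):
    """Normalized class posteriors; zero-count words add 0 to the log score."""
    log_scores = {
        c: math.log(priors[c])
        + sum(n * math.log(likelihoods[c][w]) for w, n in counts.items())
        for c in priors
    }
    # Subtract the max before exponentiating for numerical stability
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {c: v / total for c, v in exp_scores.items()}

posteriors = multinomial_posteriors(test_email, priors, likelihoods)
prediction = max(posteriors, key=posteriors.get)
print(prediction, round(posteriors["spam"], 3))
```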

Variant 2: Bernoulli Naive Bayes

Bernoulli NB converts word counts to binary presence/absence. It treats each word as a separate Bernoulli variable: 1 if present, 0 if absent.

Step 1: Binary conversion

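| Email | free | money | hello | meeting | Label |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 1 | spam |
| 2 | 0 | 0 | 1 | 1 | ham |
| 3 | 1 | 1 | 0 | 0 | spam |
| 4 | 0 | 0 | 1 | 1 | ham |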

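Step 2: Word presence likelihoods (2 spam emails, 2 ham emails; $\alpha = 1$, denominator $2 + 2\alpha = 4$)

| Word | $P(x_w = 1 \mid \text{spam})$ | $P(x_w = 1 \mid \text{ham})$ |
|---|---|---|
| free | (2+1)/4 = 0.75 | (0+1)/4 = 0.25 |
| money | (2+1)/4 = 0.75 | (0+1)/4 = 0.25 |
| hello | (0+1)/4 = 0.25 | (2+1)/4 = 0.75 |
| meeting | (1+1)/4 = 0.50 | (2+1)/4 = 0.75 |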

Step 3: Classify test email {free:1, money:1, hello:0, meeting:0}

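The critical difference: Bernoulli NB uses both present and absent words. Absent words contribute $1 - P(x_w = 1 \mid y)$.

Unnormalized posteriors:

$$\text{spam}: \; 0.5 \cdot 0.75 \cdot 0.75 \cdot (1 - 0.25) \cdot (1 - 0.50) \approx 0.1055$$

$$\text{ham}: \; 0.5 \cdot 0.25 \cdot 0.25 \cdot (1 - 0.75) \cdot (1 - 0.75) \approx 0.00195$$

Normalized: $P(\text{spam} \mid x) = \frac{0.1055}{0.1055 + 0.00195} \approx 0.982$, so the prediction is SPAM.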

The absent words (hello=0, meeting=0) actually increased confidence in spam here: ham emails always contain hello and meeting, so their absence is evidence against ham. Multinomial NB silently ignored them.
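The Bernoulli walkthrough can be sketched end to end as follows. The helper names `fit_bernoulli` and `bernoulli_posteriors` are hypothetical; the data is the anchor dataset, binarized on the fly by testing whether each count is above zero.

```python
train_emails = [
    {"free": 3, "money": 2, "hello": 0, "meeting": 1},  # spam
    {"free": 0, "money": 0, "hello": 2, "meeting": 3},  # ham
    {"free": 2, "money": 1, "hello": 0, "meeting": 0},  # spam
    {"free": 0, "money": 0, "hello": 1, "meeting": 2},  # ham
]
train_labels = [1, 0, 1, 0]
vocab = ["free", "money", "hello", "meeting"]
test_email = {"free": 1, "money": 1, "hello": 0, "meeting": 0}

def fit_bernoulli(emails, labels, vocab, alpha=1.0):
    """Smoothed P(word present | class): (docs containing word + alpha) / (docs + 2*alpha)."""
    params = {}
    for c in sorted(set(labels)):
        docs = [e for e, y in zip(emails, labels) if y == c]
        params[c] = {
            w: (sum(1 for e in docs if e.get(w, 0) > 0) + alpha) / (len(docs) + 2 * alpha)
            for w in vocab
        }
    return params

def bernoulli_posteriors(email, priors, params, vocab):
    """Every vocabulary word contributes: p if present, (1 - p) if absent."""
    scores = {}
    for c, p in params.items():
        score = priors[c]
        for w in vocab:
            score *= p[w] if email.get(w, 0) > 0 else 1.0 - p[w]
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

params = fit_bernoulli(train_emails, train_labels, vocab)
posteriors = bernoulli_posteriors(test_email, {1: 0.5, 0: 0.5}, params, vocab)
print(round(posteriors[1], 3))  # posterior probability of spam
```

Unlike the multinomial version, the inner loop runs over the whole vocabulary rather than only the words in the email, which is where the absent-word evidence enters.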

Variant 3: Gaussian Naive Bayes

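For continuous features, assume each feature follows a Gaussian distribution within each class:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right)$$

Parameters $\mu_{y,i}$ and $\sigma_{y,i}^2$ are estimated from training data (mean and variance of feature $i$ in class $y$).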

Iris example — sepal length:

Approximate parameters from the Iris dataset:

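  • Setosa: $\mu \approx 5.01$, $\sigma \approx 0.35$
  • Versicolor: $\mu \approx 5.94$, $\sigma \approx 0.52$

For a test sample with sepal_length = 5.5:

$$P(5.5 \mid \text{Setosa}) \approx 0.43, \qquad P(5.5 \mid \text{Versicolor}) \approx 0.54$$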

Sepal length alone slightly favors Versicolor. Combined with petal_length likelihood (which separates Setosa sharply), the joint posterior correctly classifies most Iris samples.
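A minimal sketch of the per-feature Gaussian likelihood for this example. The `(mean, std)` pairs are assumed approximations of the Iris sepal-length statistics for the two classes, entered by hand rather than computed from the dataset, and `gaussian_pdf` is an illustrative helper name.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    var = std ** 2
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Assumed approximate sepal-length parameters (mean, std) per class
params = {"setosa": (5.01, 0.35), "versicolor": (5.94, 0.52)}

densities = {c: gaussian_pdf(5.5, m, s) for c, (m, s) in params.items()}
favored = max(densities, key=densities.get)
print(densities, favored)
```

These values are densities, not probabilities, so they can exceed 1 for narrow distributions; only their ratio across classes matters for the argmax.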

Laplace Smoothing — The Zero Probability Problem

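Without smoothing: if "bitcoin" never appears in 1,000 spam training emails:

$$P(\text{bitcoin} \mid \text{spam}) = \frac{0}{\sum_{w'} \text{count}(w', \text{spam})} = 0$$

Any email containing "bitcoin" gets $P(\text{spam} \mid x) = 0$, regardless of every other word. One unseen word destroys the classification.

Laplace smoothing adds $\alpha$ to every word count:

$$P(w \mid y) = \frac{\text{count}(w, y) + \alpha}{\sum_{w'} \text{count}(w', y) + \alpha |V|}$$

$\alpha = 1$: add-1 (uniform) smoothing. $0 < \alpha < 1$: smaller pseudocount, less aggressively uniform. The right $\alpha$ is a hyperparameter; cross-validate it (post 03).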
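The effect of the pseudocount can be seen on the spam counts from the worked example (free=5, money=3, hello=0, meeting=1). In this sketch `smoothed_prob` is a hypothetical helper, and 0.1 is an arbitrary small value chosen only for comparison.

```python
def smoothed_prob(count, total, vocab_size, alpha):
    """Additive (Laplace) smoothing: (count + alpha) / (total + alpha * |V|)."""
    return (count + alpha) / (total + alpha * vocab_size)

spam_counts = {"free": 5, "money": 3, "hello": 0, "meeting": 1}
total = sum(spam_counts.values())  # 9
vocab_size = len(spam_counts)      # 4

# Probability of the unseen word "hello" under different pseudocounts
for alpha in (0.0, 0.1, 1.0):
    p_hello = smoothed_prob(spam_counts["hello"], total, vocab_size, alpha)
    print(alpha, round(p_hello, 4))

# With alpha = 1 the smoothed probabilities still sum to 1
smoothed = {w: smoothed_prob(c, total, vocab_size, 1.0) for w, c in spam_counts.items()}
```

With `alpha = 0.0` the unseen word gets probability exactly zero, which is the failure described above; any positive `alpha` keeps it strictly positive while taking a little mass from the seen words.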

Multinomial vs Bernoulli NB

| Aspect | Multinomial NB | Bernoulli NB |
|---|---|---|
| Feature type | Word counts (or TF) | Binary presence (0/1) |
| Absent words | Ignored ($P(w \mid y)^0 = 1$) | Penalized: $1 - P(x_w = 1 \mid y)$ term |
| Uses word frequency | Yes | No, only presence matters |
| Better for | Long documents | Short texts, boolean features |
| Typical use | News categorization, TF-IDF | Spam detection, boolean attributes |

The absent-word difference is the practical distinction: Bernoulli penalizes words that are characteristic of the class but don't appear, which can hurt or help depending on the task.

Variants Summary

| Variant | Feature type | Likelihood model | Use case |
|---|---|---|---|
| Gaussian NB | Continuous | Gaussian PDF per feature per class | Iris, sensor data, real-valued measurements |
| Multinomial NB | Counts / frequencies | Count ratio with Laplace smoothing | Document classification, TF matrices |
| Bernoulli NB | Binary 0/1 | Bernoulli with present/absent terms | Short-text spam, boolean attributes |

Test Your Understanding

  1. Multinomial NB ignores absent words ($P(w \mid y)^0 = 1$). If you add a fifth word "bitcoin" to the vocabulary but it appears zero times in training, does Laplace smoothing change the classification of the test email {free:1, money:1}? Why or why not?

  2. In Bernoulli NB, the absent word "meeting" contributed $1 - P(x_{\text{meeting}} = 1 \mid \text{spam}) = 0.5$ to the spam likelihood. In Multinomial NB, "meeting" contributes $P(\text{meeting} \mid \text{spam})^0 = 1$. Which treatment is more conservative (less confident) about spam, and why?

  3. Gaussian NB assumes features are normally distributed within each class. The Iris dataset has sepal_length near-Gaussian, but word counts in text are typically Poisson or power-law distributed. What happens to the likelihood estimates if you apply Gaussian NB to word count features?

  4. Laplace smoothing with $\alpha$ adds the same count to every word. If your vocabulary has 50,000 words but only 500 are ever seen in spam, how does $\alpha = 1$ vs a much smaller $\alpha$ affect rare vs common word probabilities?

  5. Both Multinomial NB and Bernoulli NB classify the test email as spam, but with different confidences (0.953 vs 0.982). These probabilities are not calibrated — they overstate confidence. Why does the independence assumption inflate posterior probabilities toward 0 and 1?
