
Naive Bayes: Practical Implementation

Jun 26, 2026•6 min read•By Mohammed Vasim

Machine Learning•AI•Data Science

Three variants, three datasets. This post runs each Naive Bayes classifier on the data it's designed for, inspects what the model learned, and shows exactly where each variant wins and loses.

Gaussian NB on Iris

python

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

print("Class priors:", gnb.class_prior_.round(3))
print("\nClass means (sepal_len, sepal_wid, petal_len, petal_wid):")
for cls, name in enumerate(iris.target_names):
    print(f"  {name}: {gnb.theta_[cls].round(2)}")
print("\nClass variances:")
for cls, name in enumerate(iris.target_names):
    print(f"  {name}: {gnb.var_[cls].round(4)}")

Class priors: [0.333 0.333 0.333]

Class means (sepal_len, sepal_wid, petal_len, petal_wid):
  setosa:     [5.00 3.41 1.46 0.25]
  versicolor: [5.93 2.77 4.22 1.30]
  virginica:  [6.60 2.97 5.56 2.04]

Class variances:
  setosa:     [0.1180 0.1350 0.0293 0.0106]
  versicolor: [0.2665 0.0974 0.2188 0.0411]
  virginica:  [0.3934 0.1022 0.2973 0.0738]

The model learned $μ$ and $σ^{2}$ for each of 4 features × 3 classes = 12 Gaussian distributions. Setosa is tightly clustered (small variance, especially in petal dimensions). Versicolor and Virginica overlap in sepal length — the classifier must rely on petal features to distinguish them.
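
What `fit` estimates here can be sketched in plain Python: one prior per class, plus one mean and one variance per class and feature. This is a minimal illustration, not scikit-learn's implementation. The helper `fit_gaussian_nb` and the toy data are hypothetical, and scikit-learn additionally adds a small smoothing term to the variances (its `var_smoothing` parameter), which the sketch omits.

```python
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Return {class: (prior, means, variances)} from maximum-likelihood estimates."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    params = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        # Population variance (divide by n), one value per feature.
        variances = [
            sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)
        ]
        params[label] = (n / len(X), means, variances)
    return params

# Hypothetical toy data: two features, two classes (not the Iris values).
X_toy = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [14.0, 2.0]]
y_toy = ["a", "a", "b", "b"]

params = fit_gaussian_nb(X_toy, y_toy)
for label, (prior, means, variances) in params.items():
    print(label, prior, means, variances)
```

With 4 features and 3 classes, the same loop would produce the 12 (mean, variance) pairs discussed above.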

python

y_pred = gnb.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Accuracy: 0.9667

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       0.91      1.00      0.95        10
   virginica       1.00      0.90      0.95        10

    accuracy                           0.9667        30

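Accuracy is 96.7% (29 of 30 test samples). Setosa is classified perfectly. The single error is a Virginica sample predicted as Versicolor because its petal measurements fall in the Versicolor range, which is why Virginica recall drops to 0.90 and Versicolor precision to 0.91.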
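
The report figures can be reproduced by hand from a confusion matrix. The matrix below is inferred from the report (one Virginica sample predicted as Versicolor); the original code does not print it, so treat it as an assumption.

```python
# Confusion matrix inferred from the report above (assumption):
# rows = true class, columns = predicted class.
labels = ["setosa", "versicolor", "virginica"]
confusion = [
    [10, 0, 0],
    [0, 10, 0],
    [0, 1, 9],
]

def precision(k):
    """Correct predictions of class k divided by all predictions of class k."""
    return confusion[k][k] / sum(row[k] for row in confusion)

def recall(k):
    """Correct predictions of class k divided by all true members of class k."""
    return confusion[k][k] / sum(confusion[k])

def f1(k):
    p, r = precision(k), recall(k)
    return 2 * p * r / (p + r)

accuracy = sum(confusion[k][k] for k in range(3)) / sum(map(sum, confusion))

for k, name in enumerate(labels):
    print(f"{name:>10}  precision={precision(k):.2f}  recall={recall(k):.2f}  f1={f1(k):.2f}")
print(f"accuracy={accuracy:.4f}")
```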

Class Probabilities for a New Sample

python

sample = np.array([[5.5, 2.8, 4.0, 1.2]])  # looks like Versicolor
proba = gnb.predict_proba(sample)
print(f"P(Setosa)={proba[0,0]:.4f}, P(Versicolor)={proba[0,1]:.4f}, P(Virginica)={proba[0,2]:.4f}")
print(f"Predicted: {iris.target_names[gnb.predict(sample)[0]]}")

P(Setosa)=0.0000, P(Versicolor)=0.8923, P(Virginica)=0.1077
Predicted: versicolor

Setosa probability is effectively 0: the Setosa petal-length mean is 1.46 with variance 0.0293 (standard deviation about 0.17), so this sample's petal_len=4.0 lies about 15 standard deviations from that mean. Versicolor wins at 89%.
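
The distance in standard deviations is a z-score, and it can be checked directly from the printed means and variances. The sketch below uses only the petal-length column; the values are copied from the rounded output above, so the results are approximate.

```python
import math

# Petal-length (mean, variance) per class, copied from the printed output above.
petal_len_stats = {
    "setosa": (1.46, 0.0293),
    "versicolor": (4.22, 0.2188),
    "virginica": (5.56, 0.2973),
}
sample_petal_len = 4.0

def z_score(x, mean, variance):
    """Signed distance from the class mean, in units of standard deviation."""
    return (x - mean) / math.sqrt(variance)

z = {
    name: z_score(sample_petal_len, mean, variance)
    for name, (mean, variance) in petal_len_stats.items()
}
for name, value in z.items():
    print(f"{name:>10}: {value:+.1f} standard deviations")
```

A Gaussian density falls off with the square of this distance, which is why a class that is many standard deviations away contributes essentially nothing to the posterior.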

Multinomial NB on 20 Newsgroups

python

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

categories = ['sci.space', 'rec.sport.hockey', 'talk.politics.guns', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories,
                           remove=('headers','footers','quotes'))
test  = fetch_20newsgroups(subset='test',  categories=categories,
                           remove=('headers','footers','quotes'))

print(f"Train: {len(train.data)} docs, Test: {len(test.data)} docs")

Train: 2323 docs, Test: 1546 docs

python

pipeline = Pipeline([
    ('vect',  CountVectorizer(max_features=10000, stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf',   MultinomialNB(alpha=1.0)),
])

pipeline.fit(train.data, train.target)
y_pred = pipeline.predict(test.data)

print(f"Test Accuracy: {accuracy_score(test.target, y_pred):.4f}")
# Class indices follow test.target_names (alphabetical), not the order of `categories`.
print(classification_report(test.target, y_pred, target_names=test.target_names))

Test Accuracy: 0.8928

                    precision  recall  f1-score  support
      comp.graphics      0.89    0.79     0.84      389
  rec.sport.hockey       0.96    0.94     0.95      399
         sci.space       0.86    0.94     0.90      394
talk.politics.guns       0.85    0.89     0.87      364

89.3% accuracy on 4-class text classification — without any feature engineering beyond count vectorization and TF-IDF. Hockey is easiest (F1=0.95, distinct vocabulary). Graphics is hardest (F1=0.84, some overlap with sci.space in technical terms).

Inspecting What the Model Learned

python

vectorizer = pipeline.named_steps['vect']
clf = pipeline.named_steps['clf']
feature_names = vectorizer.get_feature_names_out()

print("Top 10 words per category (by log probability):")
# Class indices follow train.target_names (alphabetical), not the order of `categories`.
for i, category in enumerate(train.target_names):
    top_idx = clf.feature_log_prob_[i].argsort()[-10:][::-1]
    top_words = [feature_names[j] for j in top_idx]
    print(f"  {category}: {top_words}")

Top 10 words per category (by log probability):
  comp.graphics: ['image', 'gif', 'graphics', 'color', 'pixel', 'jpeg', 'format', 'file', 'images', 'display']
  rec.sport.hockey: ['hockey', 'nhl', 'team', 'game', 'players', 'season', 'league', 'ice', 'play', 'games']
  sci.space: ['space', 'nasa', 'earth', 'orbit', 'shuttle', 'launch', 'mission', 'satellite', 'moon', 'solar']
  talk.politics.guns: ['gun', 'guns', 'firearms', 'rights', 'weapon', 'weapons', 'amendment', 'handgun', 'carry', 'people']

The top-ranked words are recognizable topic markers, learned with no hand-picked features: the model saw only TF-IDF-weighted word counts, the class labels, and the naive independence assumption.

Alpha (Smoothing) Sweep

python

from sklearn.model_selection import cross_val_score

alphas = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0]
print(f"{'alpha':>8} {'CV acc':>10} {'std':>8}")
for alpha in alphas:
    pipe = Pipeline([
        ('vect', CountVectorizer(max_features=10000, stop_words='english')),
        ('clf',  MultinomialNB(alpha=alpha)),
    ])
    scores = cross_val_score(pipe, train.data, train.target, cv=5, scoring='accuracy')
    print(f"{alpha:>8} {scores.mean():>10.4f} {scores.std():>8.4f}")

   alpha     CV acc      std
   0.001     0.8321   0.0142
    0.01     0.8601   0.0118
     0.1     0.8842   0.0095
     0.5     0.8878   0.0088
     1.0     0.8861   0.0091
     2.0     0.8814   0.0097
     5.0     0.8702   0.0115

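Accuracy peaks near $α \approx 0.5$, although the gap to $α = 1.0$ (0.8878 vs 0.8861) is well within one standard deviation across folds. Too small ($α = 0.001$): words never seen in a class get near-zero probabilities, and a single such word can dominate the decision. Too large ($α = 5.0$): smoothing pulls all word probabilities toward uniform and the model loses discriminative signal. The sweet spot gives each unseen word a small but non-zero probability.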
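
The effect of α is easy to see in the additive-smoothing formula that MultinomialNB uses, $P(w \mid c) = (N_{wc} + α) / (N_{c} + αV)$, where $N_{wc}$ is the count of word $w$ in class $c$, $N_{c}$ the total word count of the class, and $V$ the vocabulary size. The sketch below plugs in hypothetical counts (they are not taken from 20 Newsgroups) and compares a word seen 500 times with a word never seen in the class.

```python
def smoothed_prob(count, class_total, vocab_size, alpha):
    """Additive (Laplace/Lidstone) smoothing of a per-class word probability."""
    return (count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical numbers for illustration only.
vocab_size = 10_000
class_total = 50_000

print(f"{'alpha':>8} {'P(unseen)':>12} {'P(common)':>12} {'ratio':>10}")
for alpha in (0.001, 0.5, 5.0):
    p_unseen = smoothed_prob(0, class_total, vocab_size, alpha)
    p_common = smoothed_prob(500, class_total, vocab_size, alpha)
    print(f"{alpha:>8} {p_unseen:>12.2e} {p_common:>12.2e} {p_common / p_unseen:>10.0f}")
```

The ratio between the two probabilities is $(500 + α) / α$: very large for tiny α, so one unseen word can swamp the score, and close to flat for large α.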

Bernoulli NB for Binary Features

python

from sklearn.naive_bayes import BernoulliNB

pipeline_bnb = Pipeline([
    ('vect', CountVectorizer(max_features=10000, stop_words='english', binary=True)),
    ('clf',  BernoulliNB(alpha=1.0)),
])

pipeline_bnb.fit(train.data, train.target)
y_pred_bnb = pipeline_bnb.predict(test.data)
print(f"BernoulliNB Accuracy: {accuracy_score(test.target, y_pred_bnb):.4f}")

BernoulliNB Accuracy: 0.8668

python

print("\nModel comparison on 20 Newsgroups:")
print(f"  MultinomialNB (alpha=1.0): {accuracy_score(test.target, y_pred):.4f}")
print(f"  BernoulliNB   (alpha=1.0): {accuracy_score(test.target, y_pred_bnb):.4f}")

Model comparison on 20 Newsgroups:
  MultinomialNB (alpha=1.0): 0.8928
  BernoulliNB   (alpha=1.0): 0.8668

MultinomialNB wins on 20 Newsgroups by 2.6 points. The reason: newsgroup posts are long (hundreds of words) and word frequency carries real signal — a post mentioning "hockey" 8 times is more likely about hockey than one mentioning it once. Bernoulli throws away that frequency information. For short texts (SMS spam, tweet sentiment), Bernoulli often matches or beats Multinomial because frequency is less informative.
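
The difference between the two likelihoods can be sketched on a three-word vocabulary. All probabilities below are hypothetical, and the helper functions are illustrative rather than scikit-learn's code. The multinomial score grows with every repeat of "hockey", while the Bernoulli score only registers that the word is present and also scores the absence of "space".

```python
import math

def multinomial_log_lik(counts, word_probs):
    """Sum of count * log P(word | class): every repeat of a word adds evidence."""
    return sum(n * math.log(word_probs[w]) for w, n in counts.items())

def bernoulli_log_lik(counts, presence_probs):
    """Each vocabulary word contributes once: log p if present, log(1 - p) if absent."""
    return sum(
        math.log(p) if counts.get(w, 0) > 0 else math.log(1.0 - p)
        for w, p in presence_probs.items()
    )

# Hypothetical per-class parameters (not learned from 20 Newsgroups).
word_probs = {
    "rec.sport.hockey": {"hockey": 0.5, "game": 0.3, "space": 0.2},
    "sci.space": {"hockey": 0.1, "game": 0.3, "space": 0.6},
}
presence_probs = {
    "rec.sport.hockey": {"hockey": 0.9, "game": 0.7, "space": 0.1},
    "sci.space": {"hockey": 0.05, "game": 0.6, "space": 0.9},
}

def log_odds(log_lik, params, counts):
    """log P(doc | hockey) - log P(doc | space), assuming equal class priors."""
    return log_lik(counts, params["rec.sport.hockey"]) - log_lik(counts, params["sci.space"])

once = {"hockey": 1, "game": 1}
eight = {"hockey": 8, "game": 1}

for name, doc in (("hockey x1", once), ("hockey x8", eight)):
    m = log_odds(multinomial_log_lik, word_probs, doc)
    b = log_odds(bernoulli_log_lik, presence_probs, doc)
    print(f"{name}: multinomial log-odds={m:.2f}, bernoulli log-odds={b:.2f}")
```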

Why Naive Bayes Works Despite Violated Assumptions

The independence assumption is wrong — "gun" and "firearms" co-occur in politics.guns posts far more than independence predicts. But this correlation doesn't prevent correct classification; it means the model double-counts correlated evidence, pushing posteriors toward 0 and 1 (overconfident predictions).

The log-space view makes this clear:

$\log P(y \mid x) = \log P(y) + \sum_{j} x_{j} \log P(x_{j} \mid y) - \log P(x)$

This is a linear classifier with fixed weights $\log P(x_{j} \mid y)$ set analytically from counts. Like any linear classifier, it draws linear decision boundaries in feature space, and the term $\log P(x)$ is the same for every class, so it never changes which class wins. The naive assumption is wrong about the probabilities but often right about which class scores highest.
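
A minimal sketch of this linear view, with hypothetical word probabilities for two classes: the score is a log prior plus a weighted sum of counts, and normalizing by the evidence turns the scores into posteriors without changing their order.

```python
import math

# Hypothetical parameters: equal priors and made-up word probabilities.
log_prior = {"hockey": math.log(0.5), "space": math.log(0.5)}
weights = {
    "hockey": {"ice": math.log(0.6), "team": math.log(0.3), "orbit": math.log(0.1)},
    "space": {"ice": math.log(0.1), "team": math.log(0.2), "orbit": math.log(0.7)},
}

def linear_score(counts, cls):
    """log P(y) + sum_j x_j * log P(x_j | y): linear in the word counts x_j."""
    return log_prior[cls] + sum(n * weights[cls][w] for w, n in counts.items())

def posterior(counts):
    """Subtract log P(x), computed with the log-sum-exp trick, and exponentiate."""
    scores = {cls: linear_score(counts, cls) for cls in log_prior}
    top = max(scores.values())
    log_evidence = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {cls: math.exp(s - log_evidence) for cls, s in scores.items()}

doc = {"ice": 2, "team": 1}
scores = {cls: linear_score(doc, cls) for cls in log_prior}
probs = posterior(doc)
print(scores)
print(probs)
```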

When does it fail? When correlated features pull the decision boundary in the wrong direction — e.g., a spam filter that treats "free" and "free!!!" as independent features, double-counting the spam signal from punctuation-inflated word variants.
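
The double-counting effect can be shown with a single number. In the sketch below the likelihood ratio for "free" is hypothetical, and "free!!!" is assumed to carry exactly the same ratio, which is the worst case of perfectly correlated features.

```python
def spam_posterior(likelihood_ratios, prior_spam=0.5):
    """P(spam | features) when every feature is multiplied in as if independent."""
    odds = prior_spam / (1.0 - prior_spam)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)

# Hypothetical ratio P(token | spam) / P(token | ham) for the word "free".
free_ratio = 3.0

counted_once = spam_posterior([free_ratio])               # "free" only
counted_twice = spam_posterior([free_ratio, free_ratio])  # "free" and "free!!!" as separate features

print(f"evidence counted once:  {counted_once:.2f}")
print(f"evidence counted twice: {counted_twice:.2f}")
```

The second feature adds no new information, yet the posterior moves from 0.75 to 0.90. The predicted class is unchanged here, but the confidence is inflated.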

Speed Comparison

Model	Training	Inference	Memory
Naive Bayes	$O (n \times p)$	$O (p)$	$O (K \times p)$
Logistic Regression	$O (n \times p \times iter)$	$O (p)$	$O (p)$
SVM (RBF)	$O (n^{2} p)$ to $O (n^{3})$	$O (n_{SV} \times p)$	$O (n_{SV} \times p)$

In the table, $n$ is the number of training samples, $p$ the number of features, $K$ the number of classes, and $n_{SV}$ the number of support vectors. Naive Bayes training is a single pass through the data to accumulate counts: no iteration, no gradient, no matrix operations. This makes it ideal for streaming data, online updates (updating counts as new emails arrive), and very large datasets where SVM or logistic regression would be prohibitively slow.
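
A count-based model makes the online update almost trivial. The class below is a hypothetical minimal sketch (`OnlineNB` and its methods are names invented for this example); scikit-learn offers the same idea through the `partial_fit` method of its Naive Bayes estimators.

```python
import math
from collections import Counter, defaultdict

class OnlineNB:
    """Multinomial Naive Bayes that learns by accumulating counts (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()

    def update(self, tokens, label):
        """Fold one labeled document into the counts; no retraining needed."""
        self.class_docs[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        """Return the class with the highest smoothed log score."""
        n_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n in self.class_docs.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log((count + self.alpha) / (total + self.alpha * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical stream of labeled messages.
model = OnlineNB()
model.update(["free", "prize", "win"], "spam")
model.update(["meeting", "tomorrow"], "ham")
first = model.predict(["free", "win"])

model.update(["free", "lunch", "meeting"], "ham")  # new email arrives, counts are updated
second = model.predict(["meeting"])
print(first, second)
```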

Test Your Understanding

1. Gaussian NB learned 12 Gaussian distributions (4 features × 3 classes). If you add a 5th feature that is the product of petal_length and petal_width, Gaussian NB adds 3 more Gaussians. Logistic regression adds 1 more weight. Which model benefits more from the new feature, and why?
2. The alpha sweep peaks at $α \approx 0.5$ rather than $α = 1$. If you used the full vocabulary (all words, not max_features=10000), would you expect the optimal alpha to increase or decrease? Relate your answer to the Laplace formula denominator.
3. MultinomialNB achieves 89.3% and BernoulliNB achieves 86.7% on 20 Newsgroups. If you ran both on a dataset of 10-word SMS messages, which would you expect to perform better and why?
4. feature_log_prob_[i] stores $\log P(word_{j} \mid y_{i})$. For a long test document with 500 words, the log-probability is the sum of 500 terms. What numerical issue can arise when multiplying 500 probabilities instead of summing their logs? How does sklearn avoid it?
5. Naive Bayes can be updated online: when a new labeled email arrives, add its word counts to the class totals. Logistic regression requires retraining from scratch (or careful SGD updates). Name one real-world application where this online-update property is critical.
