Back to blog
← View series: machine learning

~/blog

Decision Trees: Entropy and Gini Impurity

Jun 26, 20266 min readBy Mohammed Vasim
Machine LearningAIData Science

A decision tree is a classifier that asks a sequence of yes/no questions about input features, following a path from root to leaf, and returns the prediction at the leaf. The interesting part is not the prediction — it's how the tree decides which question to ask first. It needs a measure of how "mixed" a set of labels is. That measure is either entropy or Gini impurity.

Anchor dataset: 10-sample loan approval dataset. Features: income level (Low/High), employed (Yes/No).

python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'income':   ['Low','Low','Low','High','High','High','Low','High','Low','High'],
    'employed': ['No', 'Yes','Yes','No',  'Yes', 'Yes', 'No', 'No',  'Yes','Yes'],
    'approved': ['No', 'No', 'Yes','No',  'Yes', 'Yes', 'No', 'Yes', 'Yes','Yes']
})
# Approved: Yes=7, No=3 out of 10

What Is a Decision Tree?

A decision tree partitions training samples into progressively purer groups:

  1. At each internal node, test one feature
  2. Route samples to the left or right child based on the test outcome
  3. At each leaf, predict the majority class among samples that reached it

The construction is top-down and greedy: at each node, choose the single feature split that most reduces impurity in the resulting children — then recurse. There is no backtracking.

To compare splits, we need a measure of how mixed the class labels are at a node. Two standard measures: entropy and Gini impurity.

Root Node Impurity

Before any split: 7 approved (Yes), 3 not approved (No) out of 10 total.

Entropy

Range: for classes. : node is pure (all one class). : perfectly mixed for binary classification.

Computing for the root:

0.882 bits of uncertainty: if you randomly guessed the label of a randomly drawn sample, this measures how surprised you'd be. A perfectly mixed 50/50 node has bit — maximum surprise.

p₁ (probability of class 1) H (entropy in bits) <text x="55" y="213" font-size="8" fill="#64748b">0.0</text> <text x="153" y="213" font-size="8" fill="#64748b">0.25</text> <text x="249" y="213" font-size="8" fill="#64748b">0.5</text> <text x="345" y="213" font-size="8" fill="#64748b">0.75</text> <text x="445" y="213" font-size="8" fill="#64748b">1.0</text> <text x="44" y="203" text-anchor="end" font-size="8" fill="#64748b">0</text> <text x="44" y="111" text-anchor="end" font-size="8" fill="#64748b">0.5</text> <text x="44" y="20" text-anchor="end" font-size="8" fill="#64748b">1.0</text> <polyline points="55,200 95,156 135,112 175,75 215,45 255,18 295,45 335,75 375,112 415,156 455,200" fill="none" stroke="#3b82f6" stroke-width="2.5"/> <circle cx="55" cy="200" r="4" fill="#22c55e"/> <text x="58" y="195" font-size="8" fill="#22c55e">pure</text> <circle cx="455" cy="200" r="4" fill="#22c55e"/> <text x="430" y="195" font-size="8" fill="#22c55e">pure</text> <circle cx="255" cy="18" r="5" fill="#ef4444"/> <text x="260" y="30" font-size="8" fill="#ef4444">50/50 → H=1</text> <circle cx="335" cy="50" r="6" fill="#f59e0b" stroke="#d97706" stroke-width="1.5"/> <line x1="335" y1="56" x2="335" y2="200" stroke="#f59e0b" stroke-width="1" stroke-dasharray="3,2"/> <text x="340" y="80" font-size="9" fill="#d97706" font-weight="bold">root (0.7, 0.882)</text>

Gini Impurity

Range: . : pure. Interpretation: the probability that two randomly drawn samples from this node have different classes.

Computing for the root:

If you randomly pick two samples from this node, there's a 42% chance they have different labels.

Computing Impurity After a Split on "Income"

Split the 10 samples by income level:

  • Left (Low income): 5 samples → approved = [No, No, Yes, No, Yes] → Yes=2, No=3
  • Right (High income): 5 samples → approved = [No, Yes, Yes, Yes, Yes] → Yes=4, No=1

Left node: ,

Right node: ,

Weighted average (5 samples each, total 10):

Computing Impurity After a Split on "Employed"

  • Left (Employed=No): 4 samples → approved = [No, No, No, Yes] → Yes=1, No=3
  • Right (Employed=Yes): 6 samples → approved = [No, Yes, Yes, Yes, Yes, Yes] → Yes=5, No=1

Left: ,

Right: ,

Weighted average (4/10 left, 6/10 right):

Comparing Both Splits

SplitLeft HRight HWeighted HInfo GainGini Gain
Income0.9710.7220.847
Employed0.8110.6500.714

Decision: split on Employed first. Both entropy (IG=0.168) and Gini (gain=0.06) agree — Employed tells us much more about loan approval than Income does.

sklearn Confirmation

python
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
import numpy as np

le_inc = LabelEncoder(); le_emp = LabelEncoder(); le_app = LabelEncoder()
X = np.column_stack([le_inc.fit_transform(data['income']),
                     le_emp.fit_transform(data['employed'])])
y = le_app.fit_transform(data['approved'])

dt_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42)
dt_entropy.fit(X, y)
print("Feature importances (Entropy):", dt_entropy.feature_importances_.round(4))

dt_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
dt_gini.fit(X, y)
print("Feature importances (Gini):   ", dt_gini.feature_importances_.round(4))
Feature importances (Entropy): [0.17 0.83] # [income, employed] Feature importances (Gini): [0.04 0.96]

Employed accounts for 83% of importance by entropy, 96% by Gini — matching our manual IG calculation (Employed's IG was 5× larger than Income's).

Entropy vs Gini — When to Use Which

PropertyEntropyGini
Formula
ComputationRequires Only squares — faster
SensitivitySlightly more sensitive to class imbalanceLess sensitive
sklearn defaultcriterion='entropy' or 'log_loss'criterion='gini' (default)
Typical outcomeOften produces the same treePreferred for speed

In practice, entropy and Gini produce identical or very similar trees. Gini is sklearn's default because it avoids the logarithm computation. For most datasets, the choice between them doesn't matter.

Test Your Understanding

  1. A node has 100 samples: 50 class A, 30 class B, 20 class C. Compute the entropy and Gini impurity. Which is higher in relative terms (what fraction of the maximum possible impurity does each achieve)?

  2. The root has bits. After splitting on Employed, the weighted bits. After splitting on Income, weighted bits. A third feature, "credit score" (Low/High), produces weighted bits. Rank all three features by Information Gain.

  3. At the left node (Employed=No): Yes=1, No=3. Entropy = 0.811 bits. If you added one more sample to this node — either a Yes or a No — which would reduce entropy more, and by how much?

  4. Gini impurity is the probability that two randomly drawn samples have different classes. For the root node (Yes=7, No=3): verify the formula numerically. If you draw two samples with replacement, what is and ?

  5. A fully unconstrained decision tree on this 10-sample dataset would create at most how many leaves? Could it achieve training accuracy of 100%? Would entropy at every leaf be exactly 0?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment