← View series: machine learning
~/blog
Decision Trees: Entropy and Gini Impurity
A decision tree is a classifier that asks a sequence of yes/no questions about input features, following a path from root to leaf, and returns the prediction at the leaf. The interesting part is not the prediction — it's how the tree decides which question to ask first. It needs a measure of how "mixed" a set of labels is. That measure is either entropy or Gini impurity.
Anchor dataset: 10-sample loan approval dataset. Features: income level (Low/High), employed (Yes/No).
import numpy as np
import pandas as pd
data = pd.DataFrame({
'income': ['Low','Low','Low','High','High','High','Low','High','Low','High'],
'employed': ['No', 'Yes','Yes','No', 'Yes', 'Yes', 'No', 'No', 'Yes','Yes'],
'approved': ['No', 'No', 'Yes','No', 'Yes', 'Yes', 'No', 'Yes', 'Yes','Yes']
})
# Approved: Yes=7, No=3 out of 10What Is a Decision Tree?
A decision tree partitions training samples into progressively purer groups:
- At each internal node, test one feature
- Route samples to the left or right child based on the test outcome
- At each leaf, predict the majority class among samples that reached it
The construction is top-down and greedy: at each node, choose the single feature split that most reduces impurity in the resulting children — then recurse. There is no backtracking.
To compare splits, we need a measure of how mixed the class labels are at a node. Two standard measures: entropy and Gini impurity.
Root Node Impurity
Before any split: 7 approved (Yes), 3 not approved (No) out of 10 total.
Entropy
Range: for classes. : node is pure (all one class). : perfectly mixed for binary classification.
Computing for the root:
0.882 bits of uncertainty: if you randomly guessed the label of a randomly drawn sample, this measures how surprised you'd be. A perfectly mixed 50/50 node has bit — maximum surprise.
<text x="55" y="213" font-size="8" fill="#64748b">0.0</text>
<text x="153" y="213" font-size="8" fill="#64748b">0.25</text>
<text x="249" y="213" font-size="8" fill="#64748b">0.5</text>
<text x="345" y="213" font-size="8" fill="#64748b">0.75</text>
<text x="445" y="213" font-size="8" fill="#64748b">1.0</text>
<text x="44" y="203" text-anchor="end" font-size="8" fill="#64748b">0</text>
<text x="44" y="111" text-anchor="end" font-size="8" fill="#64748b">0.5</text>
<text x="44" y="20" text-anchor="end" font-size="8" fill="#64748b">1.0</text>
<polyline points="55,200 95,156 135,112 175,75 215,45 255,18 295,45 335,75 375,112 415,156 455,200" fill="none" stroke="#3b82f6" stroke-width="2.5"/>
<circle cx="55" cy="200" r="4" fill="#22c55e"/>
<text x="58" y="195" font-size="8" fill="#22c55e">pure</text>
<circle cx="455" cy="200" r="4" fill="#22c55e"/>
<text x="430" y="195" font-size="8" fill="#22c55e">pure</text>
<circle cx="255" cy="18" r="5" fill="#ef4444"/>
<text x="260" y="30" font-size="8" fill="#ef4444">50/50 → H=1</text>
<circle cx="335" cy="50" r="6" fill="#f59e0b" stroke="#d97706" stroke-width="1.5"/>
<line x1="335" y1="56" x2="335" y2="200" stroke="#f59e0b" stroke-width="1" stroke-dasharray="3,2"/>
<text x="340" y="80" font-size="9" fill="#d97706" font-weight="bold">root (0.7, 0.882)</text>
Gini Impurity
Range: . : pure. Interpretation: the probability that two randomly drawn samples from this node have different classes.
Computing for the root:
If you randomly pick two samples from this node, there's a 42% chance they have different labels.
Computing Impurity After a Split on "Income"
Split the 10 samples by income level:
- Left (Low income): 5 samples → approved = [No, No, Yes, No, Yes] → Yes=2, No=3
- Right (High income): 5 samples → approved = [No, Yes, Yes, Yes, Yes] → Yes=4, No=1
Left node: ,
Right node: ,
Weighted average (5 samples each, total 10):
Computing Impurity After a Split on "Employed"
- Left (Employed=No): 4 samples → approved = [No, No, No, Yes] → Yes=1, No=3
- Right (Employed=Yes): 6 samples → approved = [No, Yes, Yes, Yes, Yes, Yes] → Yes=5, No=1
Left: ,
Right: ,
Weighted average (4/10 left, 6/10 right):
Comparing Both Splits
| Split | Left H | Right H | Weighted H | Info Gain | Gini Gain |
|---|---|---|---|---|---|
| Income | 0.971 | 0.722 | 0.847 | ||
| Employed | 0.811 | 0.650 | 0.714 |
Decision: split on Employed first. Both entropy (IG=0.168) and Gini (gain=0.06) agree — Employed tells us much more about loan approval than Income does.
sklearn Confirmation
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
import numpy as np
le_inc = LabelEncoder(); le_emp = LabelEncoder(); le_app = LabelEncoder()
X = np.column_stack([le_inc.fit_transform(data['income']),
le_emp.fit_transform(data['employed'])])
y = le_app.fit_transform(data['approved'])
dt_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42)
dt_entropy.fit(X, y)
print("Feature importances (Entropy):", dt_entropy.feature_importances_.round(4))
dt_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
dt_gini.fit(X, y)
print("Feature importances (Gini): ", dt_gini.feature_importances_.round(4))Feature importances (Entropy): [0.17 0.83] # [income, employed]
Feature importances (Gini): [0.04 0.96]
Employed accounts for 83% of importance by entropy, 96% by Gini — matching our manual IG calculation (Employed's IG was 5× larger than Income's).
Entropy vs Gini — When to Use Which
| Property | Entropy | Gini |
|---|---|---|
| Formula | ||
| Computation | Requires | Only squares — faster |
| Sensitivity | Slightly more sensitive to class imbalance | Less sensitive |
| sklearn default | criterion='entropy' or 'log_loss' | criterion='gini' (default) |
| Typical outcome | Often produces the same tree | Preferred for speed |
In practice, entropy and Gini produce identical or very similar trees. Gini is sklearn's default because it avoids the logarithm computation. For most datasets, the choice between them doesn't matter.
Test Your Understanding
-
A node has 100 samples: 50 class A, 30 class B, 20 class C. Compute the entropy and Gini impurity. Which is higher in relative terms (what fraction of the maximum possible impurity does each achieve)?
-
The root has bits. After splitting on Employed, the weighted bits. After splitting on Income, weighted bits. A third feature, "credit score" (Low/High), produces weighted bits. Rank all three features by Information Gain.
-
At the left node (Employed=No): Yes=1, No=3. Entropy = 0.811 bits. If you added one more sample to this node — either a Yes or a No — which would reduce entropy more, and by how much?
-
Gini impurity is the probability that two randomly drawn samples have different classes. For the root node (Yes=7, No=3): verify the formula numerically. If you draw two samples with replacement, what is and ?
-
A fully unconstrained decision tree on this 10-sample dataset would create at most how many leaves? Could it achieve training accuracy of 100%? Would entropy at every leaf be exactly 0?