Back to blog
← View series: machine learning

Decision Trees: Entropy and Gini Impurity Information Gain and Full Tree Construction Splitting Numerical Features in Decision Trees Decision Tree Pruning: Pre-Pruning and Post-Pruning Decision Tree Regression Decision Tree: Diabetes Prediction Project

~/blog

Decision Trees: Entropy and Gini Impurity

Jun 26, 2026•6 min read•By Mohammed Vasim

Machine LearningAIData Science

A decision tree is a classifier that asks a sequence of yes/no questions about input features, following a path from root to leaf, and returns the prediction at the leaf. The interesting part is not the prediction — it's how the tree decides which question to ask first. It needs a measure of how "mixed" a set of labels is. That measure is either entropy or Gini impurity.

Anchor dataset: 10-sample loan approval dataset. Features: income level (Low/High), employed (Yes/No).

python

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'income':   ['Low','Low','Low','High','High','High','Low','High','Low','High'],
    'employed': ['No', 'Yes','Yes','No',  'Yes', 'Yes', 'No', 'No',  'Yes','Yes'],
    'approved': ['No', 'No', 'Yes','No',  'Yes', 'Yes', 'No', 'Yes', 'Yes','Yes']
})
# Approved: Yes=7, No=3 out of 10

What Is a Decision Tree?

A decision tree partitions training samples into progressively purer groups:

At each internal node, test one feature
Route samples to the left or right child based on the test outcome
At each leaf, predict the majority class among samples that reached it

The construction is top-down and greedy: at each node, choose the single feature split that most reduces impurity in the resulting children — then recurse. There is no backtracking.

To compare splits, we need a measure of how mixed the class labels are at a node. Two standard measures: entropy and Gini impurity.

Root Node Impurity

Before any split: 7 approved (Yes), 3 not approved (No) out of 10 total.

$P (Yes) = 7/10 = 0.7, P (No) = 3/10 = 0.3$

Entropy

$H = - \sum_{k} p_{k} lo g_{2} (p_{k})$

Range: $[0, lo g_{2} (K)]$ for $K$ classes. $H = 0$ : node is pure (all one class). $H = 1$ : perfectly mixed for binary classification.

Computing for the root:

$H = - (0.7 \times lo g_{2} 0.7) - (0.3 \times lo g_{2} 0.3)$

$lo g_{2} (0.7) = ln (0.7) / ln (2) = - 0.357/0.693 = - 0.515$ $lo g_{2} (0.3) = ln (0.3) / ln (2) = - 1.204/0.693 = - 1.737$

$H = - (0.7 \times - 0.515) - (0.3 \times - 1.737) = 0.3605 + 0.5211 = 0.882 bits$

0.882 bits of uncertainty: if you randomly guessed the label of a randomly drawn sample, this measures how surprised you'd be. A perfectly mixed 50/50 node has $H = 1$ bit — maximum surprise.

<text x="55" y="213" font-size="8" fill="#64748b">0.0</text>
<text x="153" y="213" font-size="8" fill="#64748b">0.25</text>
<text x="249" y="213" font-size="8" fill="#64748b">0.5</text>
<text x="345" y="213" font-size="8" fill="#64748b">0.75</text>
<text x="445" y="213" font-size="8" fill="#64748b">1.0</text>

<text x="44" y="203" text-anchor="end" font-size="8" fill="#64748b">0</text>
<text x="44" y="111" text-anchor="end" font-size="8" fill="#64748b">0.5</text>
<text x="44" y="20" text-anchor="end" font-size="8" fill="#64748b">1.0</text>

<polyline points="55,200 95,156 135,112 175,75 215,45 255,18 295,45 335,75 375,112 415,156 455,200" fill="none" stroke="#3b82f6" stroke-width="2.5"/>

<circle cx="55" cy="200" r="4" fill="#22c55e"/>
<text x="58" y="195" font-size="8" fill="#22c55e">pure</text>
<circle cx="455" cy="200" r="4" fill="#22c55e"/>
<text x="430" y="195" font-size="8" fill="#22c55e">pure</text>
<circle cx="255" cy="18" r="5" fill="#ef4444"/>
<text x="260" y="30" font-size="8" fill="#ef4444">50/50 → H=1</text>

<circle cx="335" cy="50" r="6" fill="#f59e0b" stroke="#d97706" stroke-width="1.5"/>
<line x1="335" y1="56" x2="335" y2="200" stroke="#f59e0b" stroke-width="1" stroke-dasharray="3,2"/>
<text x="340" y="80" font-size="9" fill="#d97706" font-weight="bold">root (0.7, 0.882)</text>

Gini Impurity

$Gini = 1 - \sum_{k} p_{k}^{2}$

Range: $[0, 1 - 1/ K]$ . $Gini = 0$ : pure. Interpretation: the probability that two randomly drawn samples from this node have different classes.

Computing for the root:

$Gini = 1 - (0. 7^{2} + 0. 3^{2}) = 1 - (0.49 + 0.09) = 1 - 0.58 = 0.42$

If you randomly pick two samples from this node, there's a 42% chance they have different labels.

Computing Impurity After a Split on "Income"

Split the 10 samples by income level:

Left (Low income): 5 samples → approved = [No, No, Yes, No, Yes] → Yes=2, No=3
Right (High income): 5 samples → approved = [No, Yes, Yes, Yes, Yes] → Yes=4, No=1

Left node: $P (Yes) = 2/5 = 0.4$ , $P (No) = 3/5 = 0.6$

$H_{left} = - (0.4 \times lo g_{2} 0.4) - (0.6 \times lo g_{2} 0.6) = - (0.4 \times - 1.322) - (0.6 \times - 0.737) = 0.529 + 0.442 = 0.971 bits$ $Gini_{left} = 1 - (0. 4^{2} + 0. 6^{2}) = 1 - 0.52 = 0.48$

Right node: $P (Yes) = 4/5 = 0.8$ , $P (No) = 1/5 = 0.2$

$H_{right} = - (0.8 \times - 0.322) - (0.2 \times - 2.322) = 0.258 + 0.464 = 0.722 bits$ $Gini_{right} = 1 - (0. 8^{2} + 0. 2^{2}) = 1 - 0.68 = 0.32$

Weighted average (5 samples each, total 10):

$H_{Income} = \frac{5}{10} \times 0.971 + \frac{5}{10} \times 0.722 = 0.847 bits$ $Gini_{Income} = 0.5 \times 0.48 + 0.5 \times 0.32 = 0.40$

Computing Impurity After a Split on "Employed"

Left (Employed=No): 4 samples → approved = [No, No, No, Yes] → Yes=1, No=3
Right (Employed=Yes): 6 samples → approved = [No, Yes, Yes, Yes, Yes, Yes] → Yes=5, No=1

Left: $P (Yes) = 0.25$ , $P (No) = 0.75$ $H_{left} = - (0.25 \times - 2) - (0.75 \times - 0.415) = 0.5 + 0.311 = 0.811 bits$

Right: $P (Yes) = 5/6$ , $P (No) = 1/6$ $H_{right} = - (5/6 \times - 0.263) - (1/6 \times - 2.585) = 0.219 + 0.431 = 0.650 bits$

Weighted average (4/10 left, 6/10 right):

$H_{Employed} = \frac{4}{10} \times 0.811 + \frac{6}{10} \times 0.650 = 0.324 + 0.390 = 0.714 bits$

Comparing Both Splits

Split	Left H	Right H	Weighted H	Info Gain	Gini Gain
Income	0.971	0.722	0.847	$0.882 - 0.847 = 0.035$	$0.42 - 0.40 = 0.02$
Employed	0.811	0.650	0.714	$0.882 - 0.714 = 0.168$	$0.42 - 0.36 = 0.06$

Decision: split on Employed first. Both entropy (IG=0.168) and Gini (gain=0.06) agree — Employed tells us much more about loan approval than Income does.

sklearn Confirmation

python

from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
import numpy as np

le_inc = LabelEncoder(); le_emp = LabelEncoder(); le_app = LabelEncoder()
X = np.column_stack([le_inc.fit_transform(data['income']),
                     le_emp.fit_transform(data['employed'])])
y = le_app.fit_transform(data['approved'])

dt_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42)
dt_entropy.fit(X, y)
print("Feature importances (Entropy):", dt_entropy.feature_importances_.round(4))

dt_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
dt_gini.fit(X, y)
print("Feature importances (Gini):   ", dt_gini.feature_importances_.round(4))

Feature importances (Entropy): [0.17 0.83]  # [income, employed]
Feature importances (Gini):    [0.04 0.96]

Employed accounts for 83% of importance by entropy, 96% by Gini — matching our manual IG calculation (Employed's IG was 5× larger than Income's).

Entropy vs Gini — When to Use Which

Property	Entropy	Gini
Formula	$- \sum p_{k} lo g_{2} p_{k}$	$1 - \sum p_{k}^{2}$
Computation	Requires $lo g ()$	Only squares — faster
Sensitivity	Slightly more sensitive to class imbalance	Less sensitive
sklearn default	`criterion='entropy'` or `'log_loss'`	`criterion='gini'` (default)
Typical outcome	Often produces the same tree	Preferred for speed

In practice, entropy and Gini produce identical or very similar trees. Gini is sklearn's default because it avoids the logarithm computation. For most datasets, the choice between them doesn't matter.

Test Your Understanding

A node has 100 samples: 50 class A, 30 class B, 20 class C. Compute the entropy and Gini impurity. Which is higher in relative terms (what fraction of the maximum possible impurity does each achieve)?
The root has $H = 0.882$ bits. After splitting on Employed, the weighted $H = 0.714$ bits. After splitting on Income, weighted $H = 0.847$ bits. A third feature, "credit score" (Low/High), produces weighted $H = 0.750$ bits. Rank all three features by Information Gain.
At the left node (Employed=No): Yes=1, No=3. Entropy = 0.811 bits. If you added one more sample to this node — either a Yes or a No — which would reduce entropy more, and by how much?
Gini impurity is the probability that two randomly drawn samples have different classes. For the root node (Yes=7, No=3): verify the formula numerically. If you draw two samples with replacement, what is $P (both same class)$ and $P (different classes)$ ?
A fully unconstrained decision tree on this 10-sample dataset would create at most how many leaves? Could it achieve training accuracy of 100%? Would entropy at every leaf be exactly 0?