Information Gain and Full Tree Construction

Jun 26, 2026 • 6 min read • By Mohammed Vasim

Machine Learning • AI • Data Science

Post 01 showed that splitting on Employed reduces entropy more than splitting on Income. This post builds the full decision tree level by level — computing Information Gain at each node, choosing splits, and tracing predictions on new samples.

Same anchor dataset: 10-sample loan approval.

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'income':   ['Low','Low','Low','High','High','High','Low','High','Low','High'],
    'employed': ['No', 'Yes','Yes','No',  'Yes', 'Yes', 'No', 'No',  'Yes','Yes'],
    'approved': ['No', 'No', 'Yes','No',  'Yes', 'Yes', 'No', 'Yes', 'Yes','Yes']
})
```

Information Gain — Formal Definition

$IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} \times H(S_v)$

Where $S$ is the current node's sample set, $A$ is the feature being tested, and $S_v$ is the subset where feature $A$ equals value $v$. Information Gain measures how much entropy decreases after observing feature $A$.

High IG: the feature resolves most of the uncertainty in the labels. Zero IG: the feature tells us nothing — class distributions in each child are identical to the parent.
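The definition translates almost line for line into code. The following is a minimal sketch in plain Python (standard library only); the function names `entropy` and `information_gain` and the toy labels are illustrative choices, not part of the loan dataset.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(S, A) = H(S) - sum over v of |S_v|/|S| * H(S_v)."""
    n = len(labels)
    # Group the labels by the feature value they co-occur with (the S_v subsets).
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Toy check with made-up labels (not the loan data):
labels  = ['Yes', 'Yes', 'No', 'No']
perfect = ['a', 'a', 'b', 'b']   # separates the two classes exactly
useless = ['a', 'b', 'a', 'b']   # each child keeps the parent's 50/50 mix

print(information_gain(perfect, labels))  # 1.0
print(information_gain(useless, labels))  # 0.0
```

The two toy features illustrate the extremes described above: one removes all uncertainty, the other removes none.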

Level 0: Root Node

All 10 samples: 6 Yes and 4 No, so $H(root) = -(0.6 \log_2 0.6 + 0.4 \log_2 0.4) = 0.971$ bits (entropy as defined in post 01).

Feature	Weighted $H$ after split	IG
Income	0.846	0.125
Employed	0.715	0.256

Best split: Employed. Creates two children:

Left branch (Employed=No): Rows 0, 3, 6, 7 → Income=[Low,High,Low,High], Approved=[No,No,No,Yes] → Yes=1, No=3

Right branch (Employed=Yes): Rows 1, 2, 4, 5, 8, 9 → Approved=[No,Yes,Yes,Yes,Yes,Yes] → Yes=5, No=1
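The root-level comparison can be reproduced with a few lines of plain Python. This is a sketch rather than the post's own code: the dataset is retyped as lists so the block runs without pandas, and the helper names (`weighted_child_entropy`, `root_scores`) are assumptions made for illustration.

```python
from collections import Counter
from math import log2

# Same 10 rows as the DataFrame above, as plain lists.
income   = ['Low','Low','Low','High','High','High','Low','High','Low','High']
employed = ['No','Yes','Yes','No','Yes','Yes','No','No','Yes','Yes']
approved = ['No','No','Yes','No','Yes','Yes','No','Yes','Yes','Yes']

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def weighted_child_entropy(feature_values, labels):
    """Size-weighted average entropy of the children created by a split."""
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

h_root = entropy(approved)
root_scores = {}
for name, column in {'income': income, 'employed': employed}.items():
    h_after = weighted_child_entropy(column, approved)
    root_scores[name] = h_root - h_after
    print(f"{name:<9} weighted H = {h_after:.3f}  IG = {root_scores[name]:.3f}")

# The feature with the largest gain becomes the root split.
best_feature = max(root_scores, key=root_scores.get)
print("best split:", best_feature)
```

Running it on a different table only requires swapping the three lists; the selection rule (largest IG wins) stays the same.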

Level 1: Left Node (Employed=No, 4 samples)

Samples: Income=[Low,High,Low,High], Approved=[No,No,No,Yes]

$P(Yes) = 1/4 = 0.25$, $H_{left} = 0.811$ bits. Not pure, so can we split further?

Only one remaining feature (Income). Test it:

Income=Low (2 samples): [No, No] → Yes=0, No=2. $H = 0$ (pure!)
Income=High (2 samples): [No, Yes] → Yes=1, No=1. $H = 1.0$ bits

Weighted $H = (2/4) \times 0 + (2/4) \times 1.0 = 0.5$ bits

$IG(Income \mid Employed=No) = 0.811 - 0.5 = 0.311$ bits

Split on Income.

Left-Left (Employed=No, Income=Low): 2 samples, both No → $H = 0$. Predict: No ✓
Left-Right (Employed=No, Income=High): 2 samples [No, Yes] → $H = 1.0$. No features remain and the leaf is a 1:1 tie. Predict: No (majority class of the left parent, which is 3/4 No)

Level 1: Right Node (Employed=Yes, 6 samples)

Samples: Income=[Low,Low,High,High,Low,High], Approved=[No,Yes,Yes,Yes,Yes,Yes]

$P(Yes) = 5/6$, $H_{right} = 0.650$ bits. Still impure.

Test Income:

Income=Low (3 samples, rows 1, 2, 8): [No, Yes, Yes] → Yes=2, No=1. $H = 0.918$ bits
Income=High (3 samples, rows 4, 5, 9): [Yes, Yes, Yes] → Yes=3, No=0. $H = 0$ (pure!)

Weighted $H = (3/6) \times 0.918 + (3/6) \times 0 = 0.459$ bits

$IG(Income \mid Employed=Yes) = 0.650 - 0.459 = 0.191$ bits

Split on Income.

Right-Left (Employed=Yes, Income=Low): 3 samples [No, Yes, Yes] → $H = 0.918$. No features remain. Predict: Yes (the leaf's own majority, 2/3 Yes)
Right-Right (Employed=Yes, Income=High): 3 samples, all Yes → $H = 0$. Predict: Yes ✓
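The level-1 calculations apply the same formula as the root, restricted to the rows that reached each child. Below is a sketch of that restriction in plain Python; `income_gain_within` is a hypothetical helper name, and the rows are retyped from the dataset at the top of the post.

```python
from collections import Counter
from math import log2

# (income, employed, approved) for rows 0..9 of the loan dataset.
rows = list(zip(
    ['Low','Low','Low','High','High','High','Low','High','Low','High'],
    ['No','Yes','Yes','No','Yes','Yes','No','No','Yes','Yes'],
    ['No','No','Yes','No','Yes','Yes','No','Yes','Yes','Yes'],
))

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def income_gain_within(employed_value):
    """IG of Income, computed only on rows where employed == employed_value."""
    subset = [(inc, app) for inc, emp, app in rows if emp == employed_value]
    labels = [app for _, app in subset]
    weighted = 0.0
    for value in sorted({inc for inc, _ in subset}):
        child = [app for inc, app in subset if inc == value]
        weighted += len(child) / len(subset) * entropy(child)
        print(f"  Income={value}: {dict(Counter(child))}, H = {entropy(child):.3f}")
    return entropy(labels) - weighted

for branch in ('No', 'Yes'):
    print(f"Employed={branch}")
    print(f"  IG(Income) = {income_gain_within(branch):.3f}")
```

Printing the class counts per child makes it easy to check the hand calculation against the raw rows.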

Final Tree Structure

              Employed?
             /           \
           No              Yes
         Income?          Income?
        /       \         /       \
      Low       High   Low       High
    [No,No]  [No,Yes]  [No,Yes,Yes]  [Yes,Yes,Yes]
    →No      →No(tie)  →Yes(2:1)     →Yes

Figure: the final tree as a diagram. Two leaves are pure: (Employed=No, Income=Low) predicts No and (Employed=Yes, Income=High) predicts Yes. The other two are impure: (Employed=No, Income=High) is a 1:1 tie resolved by the parent's majority class (No), and (Employed=Yes, Income=Low) holds 2 Yes and 1 No, so it predicts its own majority (Yes).
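The level-by-level procedure can be written as one short recursive function. What follows is a minimal ID3-style sketch under the rules used in this walkthrough: stop when a node is pure or no features remain, and break an exact tie with the parent's majority class. The nested-dict tree format and all function names are illustrative assumptions, not a library API.

```python
from collections import Counter
from math import log2

DATA = [
    {'income': i, 'employed': e, 'approved': a}
    for i, e, a in zip(
        ['Low','Low','Low','High','High','High','Low','High','Low','High'],
        ['No','Yes','Yes','No','Yes','Yes','No','No','Yes','Yes'],
        ['No','No','Yes','No','Yes','Yes','No','Yes','Yes','Yes'],
    )
]
TARGET = 'approved'

def entropy(rows):
    counts = Counter(r[TARGET] for r in rows)
    return sum(-(c / len(rows)) * log2(c / len(rows)) for c in counts.values())

def partition(rows, feature):
    """Split rows into {feature value: rows with that value}."""
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r)
    return groups

def info_gain(rows, feature):
    children = partition(rows, feature).values()
    return entropy(rows) - sum(len(c) / len(rows) * entropy(c) for c in children)

def majority(rows, fallback):
    """Most common label; on an exact tie, defer to the parent's majority."""
    ranked = Counter(r[TARGET] for r in rows).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return fallback
    return ranked[0][0]

def build(rows, features, parent_label=None):
    label = majority(rows, parent_label)
    # Stop when the node is pure or there is nothing left to split on.
    if entropy(rows) == 0 or not features:
        return label
    best = max(features, key=lambda f: info_gain(rows, f))
    rest = [f for f in features if f != best]
    return {best: {value: build(subset, rest, label)
                   for value, subset in partition(rows, best).items()}}

tree = build(DATA, ['income', 'employed'])
print(tree)
```

Internal nodes are dicts keyed by the feature they test, and leaves are plain labels, so the printed structure can be read top-down like the diagram.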

Prediction Trace for New Samples

Income	Employed	Tree path	Prediction
Low	No	Employed=No → Income=Low → Leaf	No
High	Yes	Employed=Yes → Income=High → Leaf	Yes
High	No	Employed=No → Income=High → Tie leaf	No
Low	Yes	Employed=Yes → Income=Low → Impure leaf (2 Yes, 1 No)	Yes
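Tracing a prediction is a walk from the root to a leaf, following the branch that matches the sample at each node. The sketch below hard-codes the final tree as nested dicts (an assumed representation chosen for illustration) and records the path taken for each of the four samples in the table.

```python
# Final tree from this post: {feature: {value: subtree or leaf label}}.
TREE = {
    'employed': {
        'No':  {'income': {'Low': 'No',  'High': 'No'}},
        'Yes': {'income': {'Low': 'Yes', 'High': 'Yes'}},
    }
}

def predict(tree, sample):
    """Walk from the root to a leaf; return (label, path taken)."""
    path = []
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))      # each internal node tests exactly one feature
        value = sample[feature]
        path.append(f"{feature}={value}")
        node = node[feature][value]     # follow the matching branch
    return node, ' → '.join(path)

new_samples = [
    {'income': 'Low',  'employed': 'No'},
    {'income': 'High', 'employed': 'Yes'},
    {'income': 'High', 'employed': 'No'},
    {'income': 'Low',  'employed': 'Yes'},
]

predictions = []
for sample in new_samples:
    label, path = predict(TREE, sample)
    predictions.append(label)
    print(f"{path} → {label}")
```

A sample with a feature value the tree never saw would raise a `KeyError` here; handling unseen values is left out to keep the sketch short.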

Information Gain Ratio (C4.5 Variant)

Plain Information Gain is biased toward features with many distinct values. A feature that assigns each sample to its own unique bucket always achieves IG = H(root) — perfect information, but it just memorizes the data (like using a unique sample ID).

The fix: normalize IG by the entropy of the feature itself:

$SplitInfo(A) = -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$

$GainRatio(A) = \frac{IG(A)}{SplitInfo(A)}$

For Income (5 Low, 5 High — equal split):

$SplitInfo(Income) = -2 \times (0.5 \times \log_2 0.5) = 1.0$ bit

$GainRatio(Income) = 0.125 / 1.0 = 0.125$

For Employed (4 No, 6 Yes — unequal split):

$SplitInfo(Employed) = -(0.4 \times \log_2 0.4 + 0.6 \times \log_2 0.6) = 0.529 + 0.442 = 0.971$ bits

$GainRatio(Employed) = 0.256 / 0.971 = 0.264$

Both Gain Ratio and plain IG select Employed as the root split, so for this dataset the adjustment doesn't change the decision. On datasets with high-cardinality features (zip code, user ID), the SplitInfo denominator grows with the number of distinct values, which penalizes a split on a unique identifier that plain IG would otherwise favor.
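Gain Ratio needs only one extra quantity beyond IG: the entropy of the feature's own value distribution. Here is a short sketch in plain Python, with the loan data retyped as lists; `info_gain` and `gain_ratio` are illustrative names, and the guard for a zero SplitInfo (a feature with a single value) is an added assumption.

```python
from collections import Counter
from math import log2

income   = ['Low','Low','Low','High','High','High','Low','High','Low','High']
employed = ['No','Yes','Yes','No','Yes','Yes','No','No','Yes','Yes']
approved = ['No','No','Yes','No','Yes','Yes','No','Yes','Yes','Yes']

def entropy(values):
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in Counter(values).values())

def info_gain(feature, labels):
    groups = {}
    for v, y in zip(feature, labels):
        groups.setdefault(v, []).append(y)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

def gain_ratio(feature, labels):
    # SplitInfo(A) is the entropy of A's own values, ignoring the class labels.
    split_info = entropy(feature)
    if split_info == 0:   # single-valued feature: nothing to split on
        return 0.0
    return info_gain(feature, labels) / split_info

for name, column in [('income', income), ('employed', employed)]:
    print(f"{name:<9} SplitInfo = {entropy(column):.3f}  "
          f"IG = {info_gain(column, approved):.3f}  "
          f"GainRatio = {gain_ratio(column, approved):.3f}")
```

Because SplitInfo depends only on how the feature partitions the samples, it can be computed once per feature per node, independently of the labels.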

sklearn Tree Visualization