Back to blog
← View series: machine learning

~/blog

Information Gain and Full Tree Construction

Jun 26, 20266 min readBy Mohammed Vasim
Machine LearningAIData Science

Post 01 showed that splitting on Employed reduces entropy more than splitting on Income. This post builds the full decision tree level by level — computing Information Gain at each node, choosing splits, and tracing predictions on new samples.

Same anchor dataset: 10-sample loan approval.

python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'income':   ['Low','Low','Low','High','High','High','Low','High','Low','High'],
    'employed': ['No', 'Yes','Yes','No',  'Yes', 'Yes', 'No', 'No',  'Yes','Yes'],
    'approved': ['No', 'No', 'Yes','No',  'Yes', 'Yes', 'No', 'Yes', 'Yes','Yes']
})

Information Gain — Formal Definition

Where is the current node's sample set, is the feature being tested, and is the subset where feature equals value . Information Gain measures how much entropy decreases after observing feature .

High IG: the feature resolves most of the uncertainty in the labels. Zero IG: the feature tells us nothing — class distributions in each child are identical to the parent.

Level 0: Root Node

All 10 samples. bits (computed in post 01).

FeatureWeighted after splitIG
Income0.8470.035
Employed0.7140.168

Best split: Employed. Creates two children:

Left branch (Employed=No): Rows 0, 3, 6, 7 → Income=[Low,High,Low,High], Approved=[No,No,No,Yes] → Yes=1, No=3

Right branch (Employed=Yes): Rows 1, 2, 4, 5, 8, 9 → Approved=[No,Yes,Yes,Yes,Yes,Yes] → Yes=5, No=1

Level 1: Left Node (Employed=No, 4 samples)

Samples: Income=[Low,High,Low,High], Approved=[No,No,No,Yes]

, bits. Not pure — can we split further?

Only one remaining feature (Income). Test it:

  • Income=Low (2 samples): [No, No] → Yes=0, No=2. (pure!)
  • Income=High (2 samples): [No, Yes] → Yes=1, No=1. bits

Weighted bits

Split on Income.

  • Left-Left (Employed=No, Income=Low): 2 samples, both No → . Predict: No
  • Left-Right (Employed=No, Income=High): 2 samples [No, Yes] → . No features remain. Tie (1:1). Predict: No (majority class of the left parent, which is 3/4 No)

Level 1: Right Node (Employed=Yes, 6 samples)

Samples: Income=[Low,High,High,High,Low,High], Approved=[No,Yes,Yes,Yes,Yes,Yes]

, bits. Still impure.

Test Income:

  • Income=Low (2 samples, rows 1 and 8): [No, Yes] → Yes=1, No=1. bits
  • Income=High (4 samples, rows 2, 4, 5, 9): [Yes, Yes, Yes, Yes] → Yes=4, No=0. (pure!)

Weighted bits

Split on Income.

  • Right-Left (Employed=Yes, Income=Low): 2 samples [No, Yes] → . Tie. Predict: Yes (majority of right parent, which is 5/6 Yes)
  • Right-Right (Employed=Yes, Income=High): 4 samples, all Yes → . Predict: Yes

Final Tree Structure

Employed? / \ No Yes Income? Income? / \ / \ Low High Low High [No,No] [No,Yes] [No,Yes] [Yes,Yes,Yes,Yes] →No →No(tie) →Yes(tie) →Yes Employed? H=0.882, n=10 <line x1="240" y1="60" x2="130" y2="115" stroke="#94a3b8" stroke-width="1.5"/> <line x1="320" y1="60" x2="430" y2="115" stroke="#94a3b8" stroke-width="1.5"/> <text x="155" y="95" font-size="9" fill="#64748b">No (n=4)</text> <text x="345" y="95" font-size="9" fill="#64748b">Yes (n=6)</text> <rect x="55" y="115" width="150" height="50" rx="6" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/> <text x="130" y="135" text-anchor="middle" font-size="11" font-weight="bold" fill="#92400e">Income?</text> <text x="130" y="152" text-anchor="middle" font-size="9" fill="#92400e">H=0.811, n=4</text> <rect x="355" y="115" width="150" height="50" rx="6" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/> <text x="430" y="135" text-anchor="middle" font-size="11" font-weight="bold" fill="#92400e">Income?</text> <text x="430" y="152" text-anchor="middle" font-size="9" fill="#92400e">H=0.650, n=6</text> <line x1="95" y1="165" x2="60" y2="210" stroke="#94a3b8" stroke-width="1.5"/> <line x1="165" y1="165" x2="200" y2="210" stroke="#94a3b8" stroke-width="1.5"/> <line x1="395" y1="165" x2="360" y2="210" stroke="#94a3b8" stroke-width="1.5"/> <line x1="465" y1="165" x2="500" y2="210" stroke="#94a3b8" stroke-width="1.5"/> <text x="68" y="196" font-size="8" fill="#64748b">Low</text> <text x="185" y="196" font-size="8" fill="#64748b">High</text> <text x="340" y="196" font-size="8" fill="#64748b">Low</text> <text x="486" y="196" font-size="8" fill="#64748b">High</text> <rect x="20" y="210" width="85" height="45" rx="4" fill="#dcfce7" stroke="#22c55e" stroke-width="1.5"/> <text x="62" y="228" text-anchor="middle" font-size="10" font-weight="bold" fill="#15803d">No ✓</text> <text x="62" y="242" text-anchor="middle" font-size="8" fill="#15803d">n=2, H=0</text> <text x="62" y="252" text-anchor="middle" font-size="8" fill="#15803d">pure</text> <rect x="160" y="210" width="85" height="45" rx="4" fill="#fee2e2" stroke="#ef4444" stroke-width="1.5"/> <text x="202" y="228" text-anchor="middle" font-size="10" font-weight="bold" fill="#991b1b">No (tie)</text> <text x="202" y="242" text-anchor="middle" font-size="8" fill="#991b1b">n=2, H=1.0</text> <text x="202" y="252" text-anchor="middle" font-size="8" fill="#991b1b">50/50</text> <rect x="320" y="210" width="85" height="45" rx="4" fill="#fee2e2" stroke="#ef4444" stroke-width="1.5"/> <text x="362" y="228" text-anchor="middle" font-size="10" font-weight="bold" fill="#991b1b">Yes (tie)</text> <text x="362" y="242" text-anchor="middle" font-size="8" fill="#991b1b">n=2, H=1.0</text> <text x="362" y="252" text-anchor="middle" font-size="8" fill="#991b1b">50/50</text> <rect x="460" y="210" width="85" height="45" rx="4" fill="#dcfce7" stroke="#22c55e" stroke-width="1.5"/> <text x="502" y="228" text-anchor="middle" font-size="10" font-weight="bold" fill="#15803d">Yes ✓</text> <text x="502" y="242" text-anchor="middle" font-size="8" fill="#15803d">n=4, H=0</text> <text x="502" y="252" text-anchor="middle" font-size="8" fill="#15803d">pure</text>

Two leaves are pure (green); two are impure ties (red) — resolved by the parent's majority class.

Prediction Trace for New Samples

IncomeEmployedTree pathPrediction
LowNoEmployed=No → Income=Low → LeafNo
HighYesEmployed=Yes → Income=High → LeafYes
HighNoEmployed=No → Income=High → Tie leafNo
LowYesEmployed=Yes → Income=Low → Tie leafYes

Information Gain Ratio (C4.5 Variant)

Plain Information Gain is biased toward features with many distinct values. A feature that assigns each sample to its own unique bucket always achieves IG = H(root) — perfect information, but it just memorizes the data (like using a unique sample ID).

The fix: normalize IG by the entropy of the feature itself:

For Income (5 Low, 5 High — equal split):

For Employed (4 No, 6 Yes — unequal split):

Both Gain Ratio and plain IG select Employed as the root split. For this dataset, the adjustment doesn't change the decision — but on datasets with high-cardinality features (zip code, user ID), it prevents the model from trivially splitting on a unique identifier.

sklearn Tree Visualization

python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.preprocessing import LabelEncoder
import numpy as np

le_inc = LabelEncoder(); le_emp = LabelEncoder(); le_app = LabelEncoder()
X = np.column_stack([le_inc.fit_transform(data['income']),
                     le_emp.fit_transform(data['employed'])])
y = le_app.fit_transform(data['approved'])

dt = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=42)
dt.fit(X, y)

print(export_text(dt, feature_names=['income', 'employed']))
|--- employed <= 0.50 | |--- income <= 0.50 | | |--- class: 0 | |--- income > 0.50 | | |--- class: 0 |--- employed > 0.50 | |--- income <= 0.50 | | |--- class: 1 | |--- income > 0.50 | | |--- class: 1

The root split is on employed — matches our manual calculation. Low Income (≤0.5) maps to No (class 0) under both Employed=No branches.

Test Your Understanding

  1. At the tie leaf (Employed=No, Income=High): 1 Yes, 1 No. The tree predicts No (following the parent majority). If this leaf received 10 new predictions where 7 were actually Yes, what would be the local accuracy of this leaf? Is this evidence of underfitting or a data limitation?

  2. Information Gain at the root was IG(Employed)=0.168. At the right node (Employed=Yes), IG(Income)=0.317 — larger than the root. Does this mean Income is a better feature than Employed overall? Explain why deeper nodes can have higher IG than the root.

  3. GainRatio penalizes features with unequal splits. A feature with only 1 sample in one branch and 9 in the other has bits — lower than a 50/50 split. If IG = 0.2 bits for this feature, what is its GainRatio? Is this higher or lower than a feature with IG=0.15 and a 50/50 split?

  4. The tree has depth 2 with 4 leaves for 10 samples. The unpruned tree (no max_depth) could create up to 10 leaves. Compute the entropy of each possible pure leaf (1 sample each). What would be the training accuracy?

  5. export_text encodes "Low" and "High" as 0 and 1 (alphabetical LabelEncoder order). If you forgot to check the encoding and assumed 0=High and 1=Low, your prediction for {income=High, employed=Yes} would be wrong. How would you verify the encoding direction from the sklearn output?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment