← View series: machine learning
~/blog
Information Gain and Full Tree Construction
Post 01 showed that splitting on Employed reduces entropy more than splitting on Income. This post builds the full decision tree level by level — computing Information Gain at each node, choosing splits, and tracing predictions on new samples.
Same anchor dataset: 10-sample loan approval.
import pandas as pd
import numpy as np
data = pd.DataFrame({
'income': ['Low','Low','Low','High','High','High','Low','High','Low','High'],
'employed': ['No', 'Yes','Yes','No', 'Yes', 'Yes', 'No', 'No', 'Yes','Yes'],
'approved': ['No', 'No', 'Yes','No', 'Yes', 'Yes', 'No', 'Yes', 'Yes','Yes']
})Information Gain — Formal Definition
Where is the current node's sample set, is the feature being tested, and is the subset where feature equals value . Information Gain measures how much entropy decreases after observing feature .
High IG: the feature resolves most of the uncertainty in the labels. Zero IG: the feature tells us nothing — class distributions in each child are identical to the parent.
Level 0: Root Node
All 10 samples. bits (computed in post 01).
| Feature | Weighted after split | IG |
|---|---|---|
| Income | 0.847 | 0.035 |
| Employed | 0.714 | 0.168 |
Best split: Employed. Creates two children:
Left branch (Employed=No): Rows 0, 3, 6, 7 → Income=[Low,High,Low,High], Approved=[No,No,No,Yes] → Yes=1, No=3
Right branch (Employed=Yes): Rows 1, 2, 4, 5, 8, 9 → Approved=[No,Yes,Yes,Yes,Yes,Yes] → Yes=5, No=1
Level 1: Left Node (Employed=No, 4 samples)
Samples: Income=[Low,High,Low,High], Approved=[No,No,No,Yes]
, bits. Not pure — can we split further?
Only one remaining feature (Income). Test it:
- Income=Low (2 samples): [No, No] → Yes=0, No=2. (pure!)
- Income=High (2 samples): [No, Yes] → Yes=1, No=1. bits
Weighted bits
Split on Income.
- Left-Left (Employed=No, Income=Low): 2 samples, both No → . Predict: No ✓
- Left-Right (Employed=No, Income=High): 2 samples [No, Yes] → . No features remain. Tie (1:1). Predict: No (majority class of the left parent, which is 3/4 No)
Level 1: Right Node (Employed=Yes, 6 samples)
Samples: Income=[Low,High,High,High,Low,High], Approved=[No,Yes,Yes,Yes,Yes,Yes]
, bits. Still impure.
Test Income:
- Income=Low (2 samples, rows 1 and 8): [No, Yes] → Yes=1, No=1. bits
- Income=High (4 samples, rows 2, 4, 5, 9): [Yes, Yes, Yes, Yes] → Yes=4, No=0. (pure!)
Weighted bits
Split on Income.
- Right-Left (Employed=Yes, Income=Low): 2 samples [No, Yes] → . Tie. Predict: Yes (majority of right parent, which is 5/6 Yes)
- Right-Right (Employed=Yes, Income=High): 4 samples, all Yes → . Predict: Yes ✓
Final Tree Structure
Employed?
/ \
No Yes
Income? Income?
/ \ / \
Low High Low High
[No,No] [No,Yes] [No,Yes] [Yes,Yes,Yes,Yes]
→No →No(tie) →Yes(tie) →Yes
<line x1="240" y1="60" x2="130" y2="115" stroke="#94a3b8" stroke-width="1.5"/>
<line x1="320" y1="60" x2="430" y2="115" stroke="#94a3b8" stroke-width="1.5"/>
<text x="155" y="95" font-size="9" fill="#64748b">No (n=4)</text>
<text x="345" y="95" font-size="9" fill="#64748b">Yes (n=6)</text>
<rect x="55" y="115" width="150" height="50" rx="6" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/>
<text x="130" y="135" text-anchor="middle" font-size="11" font-weight="bold" fill="#92400e">Income?</text>
<text x="130" y="152" text-anchor="middle" font-size="9" fill="#92400e">H=0.811, n=4</text>
<rect x="355" y="115" width="150" height="50" rx="6" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/>
<text x="430" y="135" text-anchor="middle" font-size="11" font-weight="bold" fill="#92400e">Income?</text>
<text x="430" y="152" text-anchor="middle" font-size="9" fill="#92400e">H=0.650, n=6</text>
<line x1="95" y1="165" x2="60" y2="210" stroke="#94a3b8" stroke-width="1.5"/>
<line x1="165" y1="165" x2="200" y2="210" stroke="#94a3b8" stroke-width="1.5"/>
<line x1="395" y1="165" x2="360" y2="210" stroke="#94a3b8" stroke-width="1.5"/>
<line x1="465" y1="165" x2="500" y2="210" stroke="#94a3b8" stroke-width="1.5"/>
<text x="68" y="196" font-size="8" fill="#64748b">Low</text>
<text x="185" y="196" font-size="8" fill="#64748b">High</text>
<text x="340" y="196" font-size="8" fill="#64748b">Low</text>
<text x="486" y="196" font-size="8" fill="#64748b">High</text>
<rect x="20" y="210" width="85" height="45" rx="4" fill="#dcfce7" stroke="#22c55e" stroke-width="1.5"/>
<text x="62" y="228" text-anchor="middle" font-size="10" font-weight="bold" fill="#15803d">No ✓</text>
<text x="62" y="242" text-anchor="middle" font-size="8" fill="#15803d">n=2, H=0</text>
<text x="62" y="252" text-anchor="middle" font-size="8" fill="#15803d">pure</text>
<rect x="160" y="210" width="85" height="45" rx="4" fill="#fee2e2" stroke="#ef4444" stroke-width="1.5"/>
<text x="202" y="228" text-anchor="middle" font-size="10" font-weight="bold" fill="#991b1b">No (tie)</text>
<text x="202" y="242" text-anchor="middle" font-size="8" fill="#991b1b">n=2, H=1.0</text>
<text x="202" y="252" text-anchor="middle" font-size="8" fill="#991b1b">50/50</text>
<rect x="320" y="210" width="85" height="45" rx="4" fill="#fee2e2" stroke="#ef4444" stroke-width="1.5"/>
<text x="362" y="228" text-anchor="middle" font-size="10" font-weight="bold" fill="#991b1b">Yes (tie)</text>
<text x="362" y="242" text-anchor="middle" font-size="8" fill="#991b1b">n=2, H=1.0</text>
<text x="362" y="252" text-anchor="middle" font-size="8" fill="#991b1b">50/50</text>
<rect x="460" y="210" width="85" height="45" rx="4" fill="#dcfce7" stroke="#22c55e" stroke-width="1.5"/>
<text x="502" y="228" text-anchor="middle" font-size="10" font-weight="bold" fill="#15803d">Yes ✓</text>
<text x="502" y="242" text-anchor="middle" font-size="8" fill="#15803d">n=4, H=0</text>
<text x="502" y="252" text-anchor="middle" font-size="8" fill="#15803d">pure</text>
Two leaves are pure (green); two are impure ties (red) — resolved by the parent's majority class.
Prediction Trace for New Samples
| Income | Employed | Tree path | Prediction |
|---|---|---|---|
| Low | No | Employed=No → Income=Low → Leaf | No |
| High | Yes | Employed=Yes → Income=High → Leaf | Yes |
| High | No | Employed=No → Income=High → Tie leaf | No |
| Low | Yes | Employed=Yes → Income=Low → Tie leaf | Yes |
Information Gain Ratio (C4.5 Variant)
Plain Information Gain is biased toward features with many distinct values. A feature that assigns each sample to its own unique bucket always achieves IG = H(root) — perfect information, but it just memorizes the data (like using a unique sample ID).
The fix: normalize IG by the entropy of the feature itself:
For Income (5 Low, 5 High — equal split):
For Employed (4 No, 6 Yes — unequal split):
Both Gain Ratio and plain IG select Employed as the root split. For this dataset, the adjustment doesn't change the decision — but on datasets with high-cardinality features (zip code, user ID), it prevents the model from trivially splitting on a unique identifier.
sklearn Tree Visualization
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.preprocessing import LabelEncoder
import numpy as np
le_inc = LabelEncoder(); le_emp = LabelEncoder(); le_app = LabelEncoder()
X = np.column_stack([le_inc.fit_transform(data['income']),
le_emp.fit_transform(data['employed'])])
y = le_app.fit_transform(data['approved'])
dt = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=42)
dt.fit(X, y)
print(export_text(dt, feature_names=['income', 'employed']))|--- employed <= 0.50
| |--- income <= 0.50
| | |--- class: 0
| |--- income > 0.50
| | |--- class: 0
|--- employed > 0.50
| |--- income <= 0.50
| | |--- class: 1
| |--- income > 0.50
| | |--- class: 1
The root split is on employed — matches our manual calculation. Low Income (≤0.5) maps to No (class 0) under both Employed=No branches.
Test Your Understanding
-
At the tie leaf (Employed=No, Income=High): 1 Yes, 1 No. The tree predicts No (following the parent majority). If this leaf received 10 new predictions where 7 were actually Yes, what would be the local accuracy of this leaf? Is this evidence of underfitting or a data limitation?
-
Information Gain at the root was IG(Employed)=0.168. At the right node (Employed=Yes), IG(Income)=0.317 — larger than the root. Does this mean Income is a better feature than Employed overall? Explain why deeper nodes can have higher IG than the root.
-
GainRatio penalizes features with unequal splits. A feature with only 1 sample in one branch and 9 in the other has bits — lower than a 50/50 split. If IG = 0.2 bits for this feature, what is its GainRatio? Is this higher or lower than a feature with IG=0.15 and a 50/50 split?
-
The tree has depth 2 with 4 leaves for 10 samples. The unpruned tree (no max_depth) could create up to 10 leaves. Compute the entropy of each possible pure leaf (1 sample each). What would be the training accuracy?
-
export_textencodes "Low" and "High" as 0 and 1 (alphabetical LabelEncoder order). If you forgot to check the encoding and assumed 0=High and 1=Low, your prediction for {income=High, employed=Yes} would be wrong. How would you verify the encoding direction from the sklearn output?