← View series: machine learning
~/blog
Splitting Numerical Features in Decision Trees
The previous posts used categorical features with a finite set of values. For numerical features like sq_ft or age, the split can occur at infinitely many points. Decision trees handle this by generating a finite set of threshold candidates, evaluating each, and picking the best one.
Anchor dataset: 8-sample house classification. Predict whether a house is expensive (price > $300k).
import numpy as np
import pandas as pd
data = pd.DataFrame({
'sq_ft': [650, 850, 1100, 1200, 1400, 1600, 1900, 2100],
'bedrooms': [2, 2, 3, 3, 3, 4, 4, 5],
'expensive': [0, 0, 0, 0, 1, 1, 1, 1] # 1 if price > $300k
})
# 4 affordable (0), 4 expensive (1). Root H = 1.0 bit (perfectly mixed)Root entropy: bit.
Why Categorical Splits Don't Work for Numbers
For categorical features with values, we test each possible binary partition (left branch vs right branch). For 2 values: one partition. For 3 values: 3 partitions.
For a continuous feature with unique values: there are distinct orderings — but infinitely many possible split points. The solution: threshold candidates are the midpoints between consecutive sorted unique values.
Any split point between and produces the same left/right partition. The exact midpoint is conventional — what matters is which samples go left and which go right.
Generating Threshold Candidates for sq_ft
Sorted unique values: [650, 850, 1100, 1200, 1400, 1600, 1900, 2100]
7 midpoint candidates:
| Threshold | Left: sq_ft | Right: sq_ft |
|---|---|---|
| [650] | [850,1100,1200,1400,1600,1900,2100] | |
| [650,850] | [1100,1200,1400,1600,1900,2100] | |
| [650,850,1100] | [1200,1400,1600,1900,2100] | |
| [650,850,1100,1200] | [1400,1600,1900,2100] | |
| [650,850,1100,1200,1400] | [1600,1900,2100] | |
| [650,850,1100,1200,1400,1600] | [1900,2100] | |
| [650,...,1900] | [2100] |
Computing IG for Each Threshold
Root bit. For each threshold: compute left/right class distributions, entropy, weighted average, and IG.
| Left | Left | Right | Right | Weighted | IG | |
|---|---|---|---|---|---|---|
| 750 | [0] | 0.0 | [0,0,0,1,1,1,1] | 0.592 | 0.482 | |
| 975 | [0,0] | 0.0 | [0,0,1,1,1,1] | 1.000 | 0.250 | |
| 1150 | [0,0,0] | 0.0 | [0,1,1,1,1] | 0.722 | 0.549 | |
| 1300 | [0,0,0,0] | 0.0 | [1,1,1,1] | 0.0 | (4/8)(0)+(4/8)(0)=0 | 1.0 ★ |
| 1500 | [0,0,0,0,1] | 0.722 | [1,1,1] | 0.0 | 0.549 | |
| 1750 | [0,0,0,0,1,1] | 1.000 | [1,1] | 0.0 | 0.250 | |
| 2000 | [0,0,0,0,1,1,1] | 0.985 | [1] | 0.0 | 0.137 |
Computing for right node (6 expensive out of 7, 1 affordable):
Best split: with IG = 1.0 — all 4 affordable houses have sq_ft ≤ 1300 and all 4 expensive have sq_ft > 1300. Perfect separation in one split.
<rect x="10" y="18" width="270" height="200" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
<rect x="300" y="18" width="270" height="200" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
<line x1="10" y1="218" x2="280" y2="218" stroke="#334155" stroke-width="1.5"/>
<text x="145" y="232" text-anchor="middle" font-size="9" fill="#334155">sq_ft</text>
<line x1="300" y1="218" x2="570" y2="218" stroke="#334155" stroke-width="1.5"/>
<circle cx="25" cy="190" r="7" fill="#3b82f6"/>
<circle cx="55" cy="190" r="7" fill="#3b82f6"/>
<circle cx="87" cy="190" r="7" fill="#3b82f6"/>
<circle cx="101" cy="190" r="7" fill="#3b82f6"/>
<rect x="118" y="184" width="13" height="13" fill="#ef4444"/>
<rect x="147" y="184" width="13" height="13" fill="#ef4444"/>
<rect x="191" y="184" width="13" height="13" fill="#ef4444"/>
<rect x="241" y="184" width="13" height="13" fill="#ef4444"/>
<line x1="38" y1="18" x2="38" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/>
<line x1="70" y1="18" x2="70" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/>
<line x1="94" y1="18" x2="94" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/>
<line x1="110" y1="18" x2="110" y2="218" stroke="#22c55e" stroke-width="2.5"/>
<text x="113" y="35" font-size="9" fill="#22c55e" font-weight="bold">t=1300</text>
<text x="113" y="47" font-size="9" fill="#22c55e" font-weight="bold">IG=1.0</text>
<line x1="132" y1="18" x2="132" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/>
<line x1="170" y1="18" x2="170" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/>
<line x1="218" y1="18" x2="218" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/>
<text x="11" y="214" font-size="7" fill="#64748b">650</text>
<text x="230" y="214" font-size="7" fill="#64748b">2100</text>
<rect x="310" y="122" width="22" height="96" fill="#94a3b8" rx="2"/>
<rect x="342" y="168" width="22" height="50" fill="#94a3b8" rx="2"/>
<rect x="374" y="108" width="22" height="110" fill="#94a3b8" rx="2"/>
<rect x="406" y="18" width="22" height="200" fill="#22c55e" rx="2"/>
<rect x="438" y="108" width="22" height="110" fill="#94a3b8" rx="2"/>
<rect x="470" y="168" width="22" height="50" fill="#94a3b8" rx="2"/>
<rect x="502" y="192" width="22" height="26" fill="#94a3b8" rx="2"/>
<text x="321" y="213" font-size="7" fill="#64748b">750</text>
<text x="353" y="213" font-size="7" fill="#64748b">975</text>
<text x="380" y="213" font-size="7" fill="#64748b">1150</text>
<text x="407" y="213" font-size="7" fill="#22c55e" font-weight="bold">1300</text>
<text x="435" y="213" font-size="7" fill="#64748b">1500</text>
<text x="465" y="213" font-size="7" fill="#64748b">1750</text>
<text x="500" y="213" font-size="7" fill="#64748b">2000</text>
<text x="410" y="15" font-size="9" fill="#22c55e">1.0</text>
All affordable houses (blue circles) are left of the green line; all expensive (red squares) are right. The IG bar chart shows towering above all others.
Threshold Candidates for Bedrooms
Sorted unique bedroom values: [2, 3, 4, 5]. Three candidates: 2.5, 3.5, 4.5.
| Left | Left | Right | Right | Weighted | IG | |
|---|---|---|---|---|---|---|
| 2.5 | [0,0] | 0.0 | [0,0,1,1,1,1] | 1.000 | 0.250 | |
| 3.5 | [0,0,0,0] | 0.0 | [1,1,1,1] | 0.0 | 0 | 1.0 ★ |
| 4.5 | [0,0,0,0,1] | 0.722 | [1,1] | 0.0 | 0.549 |
bedrooms ≤ 3.5 also achieves IG = 1.0 — a perfect tie with sq_ft ≤ 1300.
How the Algorithm Handles Tied Splits
When multiple thresholds (or features) produce the same IG, sklearn picks the first feature encountered in sorted order. With sq_ft as feature 0 and bedrooms as feature 1, sq_ft wins the tie.
This means feature column order in the input matrix can affect the tree structure when splits are tied. In practice, exact ties are rare on large datasets with continuous features — unique thresholds from different features rarely produce identical IG.
sklearn Confirmation
from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np
X = data[['sq_ft', 'bedrooms']].values
y = data['expensive'].values
dt = DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=42)
dt.fit(X, y)
print("Root split:")
print(f" Feature: {'sq_ft' if dt.tree_.feature[0]==0 else 'bedrooms'}")
print(f" Threshold: {dt.tree_.threshold[0]:.2f}")
print(f" Left samples: {dt.tree_.n_node_samples[1]}")
print(f" Right samples: {dt.tree_.n_node_samples[2]}")
print()
print(export_text(dt, feature_names=['sq_ft', 'bedrooms']))Root split:
Feature: sq_ft
Threshold: 1300.00
Left samples: 4
Right samples: 4
|--- sq_ft <= 1300.00
| |--- class: 0
|--- sq_ft > 1300.00
| |--- class: 1
One split, two leaves, 100% training accuracy. The threshold is exactly 1300 (sklearn uses the actual feature value at the split, not the midpoint).
Computational Complexity
At each node:
- For each of features: sort values → , then scan thresholds →
- Total per node:
For a full tree of depth (worst case: balanced binary tree with nodes, each containing samples):
For a full tree with (one sample per leaf): .
High Cardinality and max_features
With 100,000 unique sq_ft values: 99,999 threshold candidates per feature. With 100 features: 10 million IG computations per node.
sklearn's max_features='sqrt' subsamples features at each split: only features are considered. This is the key trick used in Random Forest — each tree considers a random subset of features, making trees different from each other (decorrelated), which reduces variance when averaging.
Test Your Understanding
-
For : the left node has 1 sample (all affordable), the right has 7. The weighted entropy is . IG = 0.482. Why isn't this as good as (IG=1.0), even though the left node is also pure?
-
The algorithm evaluates midpoints, not arbitrary real-valued thresholds. Prove that using the midpoint vs any value between and produces the same left/right partition and hence the same IG.
-
bedrooms ≤ 3.5andsq_ft ≤ 1300both achieve IG=1.0. If bedrooms were listed first in the DataFrame (column 0), which split would sklearn choose? Write the exact code change needed to verify this. -
Computational complexity is per node. sklearn's
max_features='sqrt'reduces this to . For and : compute the speedup factor. What is the tradeoff? -
After the root split at , both child nodes are pure (). Would the tree continue splitting if you removed the
max_depth=1constraint? What does sklearn do at a pure node?