Back to blog
← View series: machine learning

~/blog

Splitting Numerical Features in Decision Trees

Jun 26, 20266 min readBy Mohammed Vasim
Machine LearningAIData Science

The previous posts used categorical features with a finite set of values. For numerical features like sq_ft or age, the split can occur at infinitely many points. Decision trees handle this by generating a finite set of threshold candidates, evaluating each, and picking the best one.

Anchor dataset: 8-sample house classification. Predict whether a house is expensive (price > $300k).

python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'sq_ft':     [650, 850, 1100, 1200, 1400, 1600, 1900, 2100],
    'bedrooms':  [2, 2, 3, 3, 3, 4, 4, 5],
    'expensive': [0, 0, 0, 0, 1, 1, 1, 1]  # 1 if price > $300k
})
# 4 affordable (0), 4 expensive (1). Root H = 1.0 bit (perfectly mixed)

Root entropy: bit.

Why Categorical Splits Don't Work for Numbers

For categorical features with values, we test each possible binary partition (left branch vs right branch). For 2 values: one partition. For 3 values: 3 partitions.

For a continuous feature with unique values: there are distinct orderings — but infinitely many possible split points. The solution: threshold candidates are the midpoints between consecutive sorted unique values.

Any split point between and produces the same left/right partition. The exact midpoint is conventional — what matters is which samples go left and which go right.

Generating Threshold Candidates for sq_ft

Sorted unique values: [650, 850, 1100, 1200, 1400, 1600, 1900, 2100]

7 midpoint candidates:

Threshold Left: sq_ft Right: sq_ft
[650][850,1100,1200,1400,1600,1900,2100]
[650,850][1100,1200,1400,1600,1900,2100]
[650,850,1100][1200,1400,1600,1900,2100]
[650,850,1100,1200][1400,1600,1900,2100]
[650,850,1100,1200,1400][1600,1900,2100]
[650,850,1100,1200,1400,1600][1900,2100]
[650,...,1900][2100]

Computing IG for Each Threshold

Root bit. For each threshold: compute left/right class distributions, entropy, weighted average, and IG.

Left Left Right Right Weighted IG
750[0]0.0[0,0,0,1,1,1,1]0.5920.482
975[0,0]0.0[0,0,1,1,1,1]1.0000.250
1150[0,0,0]0.0[0,1,1,1,1]0.7220.549
1300[0,0,0,0]0.0[1,1,1,1]0.0(4/8)(0)+(4/8)(0)=01.0 ★
1500[0,0,0,0,1]0.722[1,1,1]0.00.549
1750[0,0,0,0,1,1]1.000[1,1]0.00.250
2000[0,0,0,0,1,1,1]0.985[1]0.00.137

Computing for right node (6 expensive out of 7, 1 affordable):

Best split: with IG = 1.0 — all 4 affordable houses have sq_ft ≤ 1300 and all 4 expensive have sq_ft > 1300. Perfect separation in one split.

sq_ft thresholds Information Gain per threshold <rect x="10" y="18" width="270" height="200" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/> <rect x="300" y="18" width="270" height="200" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/> <line x1="10" y1="218" x2="280" y2="218" stroke="#334155" stroke-width="1.5"/> <text x="145" y="232" text-anchor="middle" font-size="9" fill="#334155">sq_ft</text> <line x1="300" y1="218" x2="570" y2="218" stroke="#334155" stroke-width="1.5"/> <circle cx="25" cy="190" r="7" fill="#3b82f6"/> <circle cx="55" cy="190" r="7" fill="#3b82f6"/> <circle cx="87" cy="190" r="7" fill="#3b82f6"/> <circle cx="101" cy="190" r="7" fill="#3b82f6"/> <rect x="118" y="184" width="13" height="13" fill="#ef4444"/> <rect x="147" y="184" width="13" height="13" fill="#ef4444"/> <rect x="191" y="184" width="13" height="13" fill="#ef4444"/> <rect x="241" y="184" width="13" height="13" fill="#ef4444"/> <line x1="38" y1="18" x2="38" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/> <line x1="70" y1="18" x2="70" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/> <line x1="94" y1="18" x2="94" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/> <line x1="110" y1="18" x2="110" y2="218" stroke="#22c55e" stroke-width="2.5"/> <text x="113" y="35" font-size="9" fill="#22c55e" font-weight="bold">t=1300</text> <text x="113" y="47" font-size="9" fill="#22c55e" font-weight="bold">IG=1.0</text> <line x1="132" y1="18" x2="132" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/> <line x1="170" y1="18" x2="170" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/> <line x1="218" y1="18" x2="218" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/> <text x="11" y="214" font-size="7" fill="#64748b">650</text> <text x="230" y="214" font-size="7" fill="#64748b">2100</text> <rect x="310" y="122" width="22" height="96" fill="#94a3b8" rx="2"/> <rect x="342" y="168" width="22" height="50" fill="#94a3b8" rx="2"/> <rect x="374" y="108" width="22" height="110" fill="#94a3b8" rx="2"/> <rect x="406" y="18" width="22" height="200" fill="#22c55e" rx="2"/> <rect x="438" y="108" width="22" height="110" fill="#94a3b8" rx="2"/> <rect x="470" y="168" width="22" height="50" fill="#94a3b8" rx="2"/> <rect x="502" y="192" width="22" height="26" fill="#94a3b8" rx="2"/> <text x="321" y="213" font-size="7" fill="#64748b">750</text> <text x="353" y="213" font-size="7" fill="#64748b">975</text> <text x="380" y="213" font-size="7" fill="#64748b">1150</text> <text x="407" y="213" font-size="7" fill="#22c55e" font-weight="bold">1300</text> <text x="435" y="213" font-size="7" fill="#64748b">1500</text> <text x="465" y="213" font-size="7" fill="#64748b">1750</text> <text x="500" y="213" font-size="7" fill="#64748b">2000</text> <text x="410" y="15" font-size="9" fill="#22c55e">1.0</text>

All affordable houses (blue circles) are left of the green line; all expensive (red squares) are right. The IG bar chart shows towering above all others.

Threshold Candidates for Bedrooms

Sorted unique bedroom values: [2, 3, 4, 5]. Three candidates: 2.5, 3.5, 4.5.

Left Left Right Right Weighted IG
2.5[0,0]0.0[0,0,1,1,1,1]1.0000.250
3.5[0,0,0,0]0.0[1,1,1,1]0.001.0 ★
4.5[0,0,0,0,1]0.722[1,1]0.00.549

bedrooms ≤ 3.5 also achieves IG = 1.0 — a perfect tie with sq_ft ≤ 1300.

How the Algorithm Handles Tied Splits

When multiple thresholds (or features) produce the same IG, sklearn picks the first feature encountered in sorted order. With sq_ft as feature 0 and bedrooms as feature 1, sq_ft wins the tie.

This means feature column order in the input matrix can affect the tree structure when splits are tied. In practice, exact ties are rare on large datasets with continuous features — unique thresholds from different features rarely produce identical IG.

sklearn Confirmation

python
from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np

X = data[['sq_ft', 'bedrooms']].values
y = data['expensive'].values

dt = DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=42)
dt.fit(X, y)

print("Root split:")
print(f"  Feature: {'sq_ft' if dt.tree_.feature[0]==0 else 'bedrooms'}")
print(f"  Threshold: {dt.tree_.threshold[0]:.2f}")
print(f"  Left samples: {dt.tree_.n_node_samples[1]}")
print(f"  Right samples: {dt.tree_.n_node_samples[2]}")
print()
print(export_text(dt, feature_names=['sq_ft', 'bedrooms']))
Root split: Feature: sq_ft Threshold: 1300.00 Left samples: 4 Right samples: 4 |--- sq_ft <= 1300.00 | |--- class: 0 |--- sq_ft > 1300.00 | |--- class: 1

One split, two leaves, 100% training accuracy. The threshold is exactly 1300 (sklearn uses the actual feature value at the split, not the midpoint).

Computational Complexity

At each node:

  • For each of features: sort values → , then scan thresholds →
  • Total per node:

For a full tree of depth (worst case: balanced binary tree with nodes, each containing samples):

For a full tree with (one sample per leaf): .

High Cardinality and max_features

With 100,000 unique sq_ft values: 99,999 threshold candidates per feature. With 100 features: 10 million IG computations per node.

sklearn's max_features='sqrt' subsamples features at each split: only features are considered. This is the key trick used in Random Forest — each tree considers a random subset of features, making trees different from each other (decorrelated), which reduces variance when averaging.

Test Your Understanding

  1. For : the left node has 1 sample (all affordable), the right has 7. The weighted entropy is . IG = 0.482. Why isn't this as good as (IG=1.0), even though the left node is also pure?

  2. The algorithm evaluates midpoints, not arbitrary real-valued thresholds. Prove that using the midpoint vs any value between and produces the same left/right partition and hence the same IG.

  3. bedrooms ≤ 3.5 and sq_ft ≤ 1300 both achieve IG=1.0. If bedrooms were listed first in the DataFrame (column 0), which split would sklearn choose? Write the exact code change needed to verify this.

  4. Computational complexity is per node. sklearn's max_features='sqrt' reduces this to . For and : compute the speedup factor. What is the tradeoff?

  5. After the root split at , both child nodes are pure (). Would the tree continue splitting if you removed the max_depth=1 constraint? What does sklearn do at a pure node?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment