Back to blog
← View series: machine learning

Decision Trees: Entropy and Gini Impurity Information Gain and Full Tree Construction Splitting Numerical Features in Decision Trees Decision Tree Pruning: Pre-Pruning and Post-Pruning Decision Tree Regression Decision Tree: Diabetes Prediction Project

~/blog

Splitting Numerical Features in Decision Trees

Jun 26, 2026•6 min read•By Mohammed Vasim

Machine LearningAIData Science

The previous posts used categorical features with a finite set of values. For numerical features like sq_ft or age, the split can occur at infinitely many points. Decision trees handle this by generating a finite set of threshold candidates, evaluating each, and picking the best one.

Anchor dataset: 8-sample house classification. Predict whether a house is expensive (price > $300k).

python

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'sq_ft':     [650, 850, 1100, 1200, 1400, 1600, 1900, 2100],
    'bedrooms':  [2, 2, 3, 3, 3, 4, 4, 5],
    'expensive': [0, 0, 0, 0, 1, 1, 1, 1]  # 1 if price > $300k
})
# 4 affordable (0), 4 expensive (1). Root H = 1.0 bit (perfectly mixed)

Root entropy: $H = - (0.5 lo g_{2} 0.5 + 0.5 lo g_{2} 0.5) = 1.0$ bit.

Why Categorical Splits Don't Work for Numbers

For categorical features with $K$ values, we test each possible binary partition (left branch vs right branch). For 2 values: one partition. For 3 values: 3 partitions.

For a continuous feature with $n$ unique values: there are $n - 1$ distinct orderings — but infinitely many possible split points. The solution: threshold candidates are the $n - 1$ midpoints between consecutive sorted unique values.

Any split point between $x_{i}$ and $x_{i + 1}$ produces the same left/right partition. The exact midpoint is conventional — what matters is which samples go left and which go right.

Generating Threshold Candidates for sq_ft

Sorted unique values: [650, 850, 1100, 1200, 1400, 1600, 1900, 2100]

7 midpoint candidates:

Threshold $t$	Left: sq_ft $\leq t$	Right: sq_ft $> t$
$(650 + 850) /2 = 750$	[650]	[850,1100,1200,1400,1600,1900,2100]
$(850 + 1100) /2 = 975$	[650,850]	[1100,1200,1400,1600,1900,2100]
$(1100 + 1200) /2 = 1150$	[650,850,1100]	[1200,1400,1600,1900,2100]
$(1200 + 1400) /2 = 1300$	[650,850,1100,1200]	[1400,1600,1900,2100]
$(1400 + 1600) /2 = 1500$	[650,850,1100,1200,1400]	[1600,1900,2100]
$(1600 + 1900) /2 = 1750$	[650,850,1100,1200,1400,1600]	[1900,2100]
$(1900 + 2100) /2 = 2000$	[650,...,1900]	[2100]

Computing IG for Each Threshold

Root $H = 1.0$ bit. For each threshold: compute left/right class distributions, entropy, weighted average, and IG.

$t$	Left $y$	Left $H$	Right $y$	Right $H$	Weighted $H$	IG
750	[0]	0.0	[0,0,0,1,1,1,1]	0.592	$(1/8) (0) + (7/8) (0.592) = 0.518$	0.482
975	[0,0]	0.0	[0,0,1,1,1,1]	1.000	$(2/8) (0) + (6/8) (1.0) = 0.750$	0.250
1150	[0,0,0]	0.0	[0,1,1,1,1]	0.722	$(3/8) (0) + (5/8) (0.722) = 0.451$	0.549
1300	[0,0,0,0]	0.0	[1,1,1,1]	0.0	(4/8)(0)+(4/8)(0)=0	1.0 ★
1500	[0,0,0,0,1]	0.722	[1,1,1]	0.0	$(5/8) (0.722) = 0.451$	0.549
1750	[0,0,0,0,1,1]	1.000	[1,1]	0.0	$(6/8) (1.0) = 0.750$	0.250
2000	[0,0,0,0,1,1,1]	0.985	[1]	0.0	$(7/8) (0.985) = 0.863$	0.137

Computing $H$ for $t = 750$ right node (6 expensive out of 7, 1 affordable):

$H = - (1/7) lo g_{2} (1/7) - (6/7) lo g_{2} (6/7) = - (1/7) (- 2.807) - (6/7) (- 0.222) = 0.401 + 0.190 = 0.592$

Best split: $t = 1300$ with IG = 1.0 — all 4 affordable houses have sq_ft ≤ 1300 and all 4 expensive have sq_ft > 1300. Perfect separation in one split.

<rect x="10" y="18" width="270" height="200" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
<rect x="300" y="18" width="270" height="200" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>

<line x1="10" y1="218" x2="280" y2="218" stroke="#334155" stroke-width="1.5"/>
<text x="145" y="232" text-anchor="middle" font-size="9" fill="#334155">sq_ft</text>
<line x1="300" y1="218" x2="570" y2="218" stroke="#334155" stroke-width="1.5"/>

<circle cx="25" cy="190" r="7" fill="#3b82f6"/>
<circle cx="55" cy="190" r="7" fill="#3b82f6"/>
<circle cx="87" cy="190" r="7" fill="#3b82f6"/>
<circle cx="101" cy="190" r="7" fill="#3b82f6"/>
<rect x="118" y="184" width="13" height="13" fill="#ef4444"/>
<rect x="147" y="184" width="13" height="13" fill="#ef4444"/>
<rect x="191" y="184" width="13" height="13" fill="#ef4444"/>
<rect x="241" y="184" width="13" height="13" fill="#ef4444"/>

<line x1="38" y1="18" x2="38" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/>
<line x1="70" y1="18" x2="70" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/>
<line x1="94" y1="18" x2="94" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/>
<line x1="110" y1="18" x2="110" y2="218" stroke="#22c55e" stroke-width="2.5"/>
<text x="113" y="35" font-size="9" fill="#22c55e" font-weight="bold">t=1300</text>
<text x="113" y="47" font-size="9" fill="#22c55e" font-weight="bold">IG=1.0</text>
<line x1="132" y1="18" x2="132" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/>
<line x1="170" y1="18" x2="170" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/>
<line x1="218" y1="18" x2="218" y2="218" stroke="#94a3b8" stroke-width="1" stroke-dasharray="2,2"/>

<text x="11" y="214" font-size="7" fill="#64748b">650</text>
<text x="230" y="214" font-size="7" fill="#64748b">2100</text>

<rect x="310" y="122" width="22" height="96" fill="#94a3b8" rx="2"/>
<rect x="342" y="168" width="22" height="50" fill="#94a3b8" rx="2"/>
<rect x="374" y="108" width="22" height="110" fill="#94a3b8" rx="2"/>
<rect x="406" y="18" width="22" height="200" fill="#22c55e" rx="2"/>
<rect x="438" y="108" width="22" height="110" fill="#94a3b8" rx="2"/>
<rect x="470" y="168" width="22" height="50" fill="#94a3b8" rx="2"/>
<rect x="502" y="192" width="22" height="26" fill="#94a3b8" rx="2"/>

<text x="321" y="213" font-size="7" fill="#64748b">750</text>
<text x="353" y="213" font-size="7" fill="#64748b">975</text>
<text x="380" y="213" font-size="7" fill="#64748b">1150</text>
<text x="407" y="213" font-size="7" fill="#22c55e" font-weight="bold">1300</text>
<text x="435" y="213" font-size="7" fill="#64748b">1500</text>
<text x="465" y="213" font-size="7" fill="#64748b">1750</text>
<text x="500" y="213" font-size="7" fill="#64748b">2000</text>

<text x="410" y="15" font-size="9" fill="#22c55e">1.0</text>

All affordable houses (blue circles) are left of the green line; all expensive (red squares) are right. The IG bar chart shows $t = 1300$ towering above all others.

Threshold Candidates for Bedrooms

Sorted unique bedroom values: [2, 3, 4, 5]. Three candidates: 2.5, 3.5, 4.5.

$t$	Left $y$	Left $H$	Right $y$	Right $H$	Weighted $H$	IG
2.5	[0,0]	0.0	[0,0,1,1,1,1]	1.000	$(2/8) (0) + (6/8) (1.0) = 0.750$	0.250
3.5	[0,0,0,0]	0.0	[1,1,1,1]	0.0	0	1.0 ★
4.5	[0,0,0,0,1]	0.722	[1,1]	0.0	$(5/8) (0.722) = 0.451$	0.549

bedrooms ≤ 3.5 also achieves IG = 1.0 — a perfect tie with sq_ft ≤ 1300.

How the Algorithm Handles Tied Splits

When multiple thresholds (or features) produce the same IG, sklearn picks the first feature encountered in sorted order. With sq_ft as feature 0 and bedrooms as feature 1, sq_ft wins the tie.

This means feature column order in the input matrix can affect the tree structure when splits are tied. In practice, exact ties are rare on large datasets with continuous features — unique thresholds from different features rarely produce identical IG.

sklearn Confirmation

python

from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np

X = data[['sq_ft', 'bedrooms']].values
y = data['expensive'].values

dt = DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=42)
dt.fit(X, y)

print("Root split:")
print(f"  Feature: {'sq_ft' if dt.tree_.feature[0]==0 else 'bedrooms'}")
print(f"  Threshold: {dt.tree_.threshold[0]:.2f}")
print(f"  Left samples: {dt.tree_.n_node_samples[1]}")
print(f"  Right samples: {dt.tree_.n_node_samples[2]}")
print()
print(export_text(dt, feature_names=['sq_ft', 'bedrooms']))

Root split:
  Feature: sq_ft
  Threshold: 1300.00
  Left samples: 4
  Right samples: 4

|--- sq_ft <= 1300.00
|   |--- class: 0
|--- sq_ft >  1300.00
|   |--- class: 1

One split, two leaves, 100% training accuracy. The threshold is exactly 1300 (sklearn uses the actual feature value at the split, not the midpoint).

Computational Complexity

At each node:

For each of $p$ features: sort $n$ values → $O (n lo g n)$ , then scan $n - 1$ thresholds → $O (n)$
Total per node: $O (n p lo g n)$

For a full tree of depth $d$ (worst case: balanced binary tree with $2^{d}$ nodes, each containing $n / 2^{d}$ samples):

$Total = \sum_{k = 0}^{d} 2^{k} \times O (\frac{n}{2 ^{k}} \cdot p \cdot lo g \frac{n}{2 ^{k}}) \approx O (n \cdot p \cdot d)$

For a full tree with $d = lo g n$ (one sample per leaf): $O (n p lo g n)$ .

High Cardinality and max_features

With 100,000 unique sq_ft values: 99,999 threshold candidates per feature. With 100 features: 10 million IG computations per node.

sklearn's max_features='sqrt' subsamples features at each split: only $p$ features are considered. This is the key trick used in Random Forest — each tree considers a random subset of features, making trees different from each other (decorrelated), which reduces variance when averaging.

Test Your Understanding

For $t = 750$ : the left node has 1 sample (all affordable), the right has 7. The weighted entropy is $0 + (7/8) \times H (1/7, 6/7) = 0.518$ . IG = 0.482. Why isn't this as good as $t = 1300$ (IG=1.0), even though the left node is also pure?
The algorithm evaluates $n - 1$ midpoints, not arbitrary real-valued thresholds. Prove that using the midpoint vs any value between $x_{i}$ and $x_{i + 1}$ produces the same left/right partition and hence the same IG.
bedrooms ≤ 3.5 and sq_ft ≤ 1300 both achieve IG=1.0. If bedrooms were listed first in the DataFrame (column 0), which split would sklearn choose? Write the exact code change needed to verify this.
Computational complexity is $O (n p lo g n)$ per node. sklearn's max_features='sqrt' reduces this to $O (n p lo g n)$ . For $n = 10, 000$ and $p = 100$ : compute the speedup factor. What is the tradeoff?
After the root split at $t = 1300$ , both child nodes are pure ( $H = 0$ ). Would the tree continue splitting if you removed the max_depth=1 constraint? What does sklearn do at a pure node?