Decision Tree Regression

Jun 26, 2026 • 7 min read • By Mohammed Vasim

Machine Learning, AI, Data Science

A classification tree uses entropy or Gini to measure node impurity. A regression tree uses variance — the same algorithmic skeleton, a different objective. At each leaf, instead of a majority vote, it predicts the mean of the samples that reached it.

Anchor dataset: 6 houses with sq_ft and price (in thousands of dollars), the same anchor used in the linear regression section.

python

import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430])
# True OLS: ŷ = 53.33 + 0.20×sq_ft (from linear regression section)

Regression Trees vs Classification Trees

Aspect	Classification Tree	Regression Tree
Impurity measure	Entropy / Gini	Variance (MSE)
Leaf prediction	Majority class	Mean of leaf samples
Split criterion	Maximize IG or Gini gain	Maximize variance reduction
Output type	Discrete class	Continuous value

Everything else — threshold search, stopping conditions, pruning — is identical.
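To make the leaf-prediction row concrete, here is a minimal sketch in plain Python. The helper names `classification_leaf` and `regression_leaf` and the class labels are made up for illustration; only the two prices come from the anchor dataset.

```python
from collections import Counter
from statistics import mean


def classification_leaf(labels):
    """Majority vote: the most frequent class among the leaf's samples."""
    return Counter(labels).most_common(1)[0][0]


def regression_leaf(targets):
    """Mean of the target values among the leaf's samples."""
    return mean(targets)


# Hypothetical class labels, used only to illustrate the vote.
print(classification_leaf(["yes", "no", "yes"]))  # yes

# Prices of the two smallest houses in the anchor dataset.
print(regression_leaf([180, 220]))  # 200
```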

Root Node Variance

The variance at a node measures how spread out the target values are:

$\mathrm{Var}(S) = \frac{1}{n} \sum_{i} (y_{i} - \bar{y})^{2}$

At the root: $\bar{y} = (180 + 220 + 280 + 340 + 370 + 430)/6 = 1820/6 = 303.33$

$\mathrm{Var}(\text{root}) = \frac{(180 - 303.33)^{2} + (220 - 303.33)^{2} + (280 - 303.33)^{2} + (340 - 303.33)^{2} + (370 - 303.33)^{2} + (430 - 303.33)^{2}}{6}$

$= \frac{15211.1 + 6944.4 + 544.4 + 1344.4 + 4444.4 + 16044.4}{6} \approx \frac{44533.3}{6} \approx 7422.2$
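The root variance can be checked in a few lines. This is a minimal sketch in plain Python so the arithmetic stays visible; the helper name `variance` is an illustrative choice, and it divides by $n$ (population variance), matching the formula above.

```python
# Prices of the six anchor houses.
y = [180, 220, 280, 340, 370, 430]


def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


root_mean = sum(y) / len(y)
root_var = variance(y)
print(f"mean = {root_mean:.2f}, variance = {root_var:.2f}")
# mean = 303.33, variance = 7422.22
```

With NumPy, `np.var(y)` returns the same value because its default also divides by $n$.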

Threshold Search for sq_ft

Split criterion for regression:

$IG_{reg}(S, t) = \mathrm{Var}(S) - \frac{|S_{l}|}{|S|} \mathrm{Var}(S_{l}) - \frac{|S_{r}|}{|S|} \mathrm{Var}(S_{r})$

Here $S_{l}$ holds the samples with sq_ft ≤ $t$ and $S_{r}$ holds the rest.

5 midpoint candidates: 750, 975, 1250, 1500, 1750.

$t$	Left $y$	Left $\bar{y}$	Left Var	Right $y$	Right $\bar{y}$	Right Var	Weighted Var	$IG_{reg}$
750	[180]	180	0	[220,280,340,370,430]	328	5256	$(1/6)(0) + (5/6)(5256) = 4380$	$7422 - 4380 = 3042$
975	[180,220]	200	400	[280,340,370,430]	355	2925	$(2/6)(400) + (4/6)(2925) = 2083$	$7422 - 2083 = 5339$
1250	[180,220,280]	226.7	1689	[340,370,430]	380	1400	$(3/6)(1689) + (3/6)(1400) = 1544$	$7422 - 1544 = 5878$
1500	[180,220,280,340]	255	3675	[370,430]	400	900	$(4/6)(3675) + (2/6)(900) = 2750$	$7422 - 2750 = 4672$
1750	[180,220,280,340,370]	278	5056	[430]	430	0	$(5/6)(5056) + (1/6)(0) = 4213$	$7422 - 4213 = 3209$

Best split: $t = 1250$ with $IG_{reg} = 5878$. Splitting at sq_ft = 1250 cuts the weighted variance from 7422 to 1544, a 79% reduction.
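The threshold search can be sketched in plain Python. The function names (`variance`, `candidate_thresholds`, `variance_reduction`) are illustrative, and the sketch assumes the convention used here: candidates are midpoints between consecutive distinct sq_ft values, and samples with sq_ft ≤ $t$ go left.

```python
sq_ft = [650, 850, 1100, 1400, 1600, 1900]
price = [180, 220, 280, 340, 370, 430]


def variance(values):
    """Population variance (divide by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def candidate_thresholds(xs):
    """Midpoints between consecutive distinct feature values."""
    distinct = sorted(set(xs))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def variance_reduction(xs, ys, threshold):
    """Parent variance minus the size-weighted variance of the two children."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    weighted = len(left) / n * variance(left) + len(right) / n * variance(right)
    return variance(ys) - weighted


gains = {t: variance_reduction(sq_ft, price, t) for t in candidate_thresholds(sq_ft)}
for t, gain in gains.items():
    print(f"t = {t:6.0f}  IG_reg = {gain:8.2f}")

best_t = max(gains, key=gains.get)
print("best threshold:", best_t)  # best threshold: 1250.0
```

Midpoints are only one convention; any value strictly between two neighbouring sq_ft values produces the same partition of the training data.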

Level 1: Left Node (sq_ft ≤ 1250, samples [650, 850, 1100])

$y = [180, 220, 280]$, $\bar{y} = 226.7$, $\mathrm{Var} = 1689$.

Test thresholds $t = 750$ and $t = 975$:

$t = 750$: Left=[180], Var=0; Right=[220,280], $\bar{y} = 250$, Var=900. Weighted: $(1/3)(0) + (2/3)(900) = 600$. $IG_{reg} = 1689 - 600 = 1089$.
$t = 975$: Left=[180,220], $\bar{y} = 200$, Var=400; Right=[280], Var=0. Weighted: $(2/3)(400) + (1/3)(0) = 267$. $IG_{reg} = 1689 - 267 = 1422$.

Best: $t = 975$. Splits into:

Left-Left (sq_ft ≤ 975): [650, 850] → $\bar{y} = 200$, predict $200k
Left-Right (975 < sq_ft ≤ 1250): [1100] → $\bar{y} = 280$, predict $280k

Level 1: Right Node (sq_ft > 1250, samples [1400, 1600, 1900])

$y = [340, 370, 430]$, $\bar{y} = 380$, $\mathrm{Var} = 1400$.

Test thresholds $t = 1500$ and $t = 1750$:

$t = 1500$: Left=[340], Var=0; Right=[370,430], $\bar{y} = 400$, Var=900. Weighted: $(1/3)(0) + (2/3)(900) = 600$. $IG_{reg} = 1400 - 600 = 800$.
$t = 1750$: Left=[340,370], $\bar{y} = 355$, Var=225; Right=[430], Var=0. Weighted: $(2/3)(225) + (1/3)(0) = 150$. $IG_{reg} = 1400 - 150 = 1250$.

Best: $t = 1750$. Splits into:

Right-Left (1250 < sq_ft ≤ 1750): [1400, 1600] → $\bar{y} = 355$, predict $355k
Right-Right (sq_ft > 1750): [1900] → $\bar{y} = 430$, predict $430k
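The root split and the two level-1 splits follow the same recipe, so the whole construction can be written as one recursive function. The sketch below is a from-scratch illustration, not scikit-learn's implementation; the names `best_split`, `build`, and `leaves` and the dictionary layout of a node are assumptions made for this example.

```python
sq_ft = [650, 850, 1100, 1400, 1600, 1900]
price = [180, 220, 280, 340, 370, 430]


def variance(values):
    """Population variance (divide by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, gain) for the best midpoint split, or None if no split exists."""
    distinct = sorted(set(xs))
    best = None
    for a, b in zip(distinct, distinct[1:]):
        t = (a + b) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(ys)
        gain = variance(ys) - weighted
        if best is None or gain > best[1]:
            best = (t, gain)
    return best


def build(xs, ys, depth, max_depth):
    """Grow the tree recursively; a leaf stores the mean of its targets."""
    split = best_split(xs, ys) if depth < max_depth else None
    if split is None or split[1] <= 0:
        return {"leaf": sum(ys) / len(ys)}
    t = split[0]
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return {
        "threshold": t,
        "left": build(*zip(*left), depth + 1, max_depth),
        "right": build(*zip(*right), depth + 1, max_depth),
    }


def leaves(node, lo=float("-inf"), hi=float("inf")):
    """Yield (lower bound, upper bound, prediction) for every leaf, left to right."""
    if "leaf" in node:
        yield lo, hi, node["leaf"]
    else:
        yield from leaves(node["left"], lo, node["threshold"])
        yield from leaves(node["right"], node["threshold"], hi)


tree = build(sq_ft, price, depth=0, max_depth=2)
for lo, hi, pred in leaves(tree):
    print(f"{lo} < sq_ft <= {hi}: predict {pred:.0f}")
```

With `max_depth=2` the recursion stops after the two level-1 splits, leaving four leaves with means 200, 280, 355, and 430.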

The 4-Leaf Staircase

Leaf	Condition	Samples	$\hat{y}$
1	sq_ft ≤ 975	650, 850	$200k
2	975 < sq_ft ≤ 1250	1100	$280k
3	1250 < sq_ft ≤ 1750	1400, 1600	$355k
4	sq_ft > 1750	1900	$430k
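Once the thresholds and leaf means are known, prediction is just a lookup. A minimal sketch using the standard-library `bisect` module; the constants are copied from the table above, and the names `THRESHOLDS`, `LEAF_MEANS`, and `predict` are illustrative.

```python
from bisect import bisect_left

THRESHOLDS = [975, 1250, 1750]     # split points, in increasing order
LEAF_MEANS = [200, 280, 355, 430]  # prediction for each of the 4 leaves


def predict(sq_ft):
    """Return the leaf mean for one house.

    bisect_left counts the thresholds strictly below sq_ft, so a value equal
    to a threshold stays in the left leaf (the "<=" side of the split).
    """
    return LEAF_MEANS[bisect_left(THRESHOLDS, sq_ft)]


for x in [650, 975, 976, 1500, 2500]:
    print(x, predict(x))
```

The last query, 2500 sq_ft, lies outside the training range and still gets 430: a regression tree cannot predict beyond the largest leaf mean it has seen.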

[Figure: price (in $k) vs. sq_ft for the six houses, showing the data points, the dashed blue linear regression line, the orange tree staircase, and dotted vertical lines at the split thresholds 975, 1250, and 1750.]

The orange staircase shows the tree's predictions: constant within each leaf region. The blue dashed line is the linear regression fit. For this near-linear dataset, the linear model tracks the data better; the staircase misses by 15 to 20 inside the two leaves that hold two samples each.

Predictions vs Actual — Tree vs Linear

sq_ft	$y_{true}$	Tree $\hat{y}$	Linear $\hat{y}$	Tree abs. error	Linear abs. error
650	180	200	183.3	20	3.3
850	220	200	223.3	20	3.3
1100	280	280	273.3	0	6.7
1400	340	355	333.3	15	6.7
1600	370	355	373.3	15	3.3
1900	430	430	433.3	0	3.3

$\text{Tree MSE} = (400 + 400 + 0 + 225 + 225 + 0)/6 = 208.3$

$\text{Linear MSE} = (11.1 + 11.1 + 44.4 + 44.4 + 11.1 + 11.1)/6 = 22.2$

Linear regression wins decisively on this near-linear dataset. The tree is limited by its piecewise-constant prediction: leaves 1 and 3 each contain 2 samples whose true values differ, forcing the average to miss both.
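Both MSE values can be recomputed without any library. The sketch below fits the line with the closed-form least-squares formulas and takes the tree predictions as given (the four leaf means); the variable and function names are illustrative.

```python
sq_ft = [650, 850, 1100, 1400, 1600, 1900]
price = [180, 220, 280, 340, 370, 430]
tree_pred = [200, 200, 280, 355, 355, 430]  # leaf means of the 4-leaf tree


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Closed-form simple linear regression (ordinary least squares).
n = len(sq_ft)
x_mean = sum(sq_ft) / n
y_mean = sum(price) / n
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(sq_ft, price))
s_xx = sum((x - x_mean) ** 2 for x in sq_ft)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean
linear_pred = [intercept + slope * x for x in sq_ft]

tree_mse = mse(price, tree_pred)
linear_mse = mse(price, linear_pred)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")  # slope = 0.20, intercept = 53.33
print(f"tree MSE   = {tree_mse:.2f}")                       # tree MSE   = 208.33
print(f"linear MSE = {linear_mse:.2f}")                     # linear MSE = 22.22
```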

sklearn Implementation

python

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

dt_reg = DecisionTreeRegressor(criterion='squared_error', max_depth=2, random_state=42)
dt_reg.fit(X, y)

y_pred = dt_reg.predict(X)
print(f"Tree MSE: {mean_squared_error(y, y_pred):.2f}")
print(f"Tree R²:  {dt_reg.score(X, y):.4f}")
print(f"Predictions:     {y_pred}")
print(f"Unique leaf predictions: {np.unique(y_pred)}")

Tree MSE: 208.33
Tree R²:  0.9719
Predictions:     [200. 200. 280. 355. 355. 430.]
Unique leaf predictions: [200. 280. 355. 430.]

The 4 unique prediction values are the leaf means: [200, 280, 355, 430]. Despite a high R² of 0.972, the tree's MSE of 208 is about 9× the linear model's MSE of 22.

max_depth Effect on Regression

python

print(f"{'depth':>8} {'MSE':>10} {'steps':>8} {'leaves':>8}")
for d in [1, 2, 3, None]:
    dt = DecisionTreeRegressor(max_depth=d, random_state=42)
    dt.fit(X, y)
    y_p = dt.predict(X)
    mse = mean_squared_error(y, y_p)
    steps = len(np.unique(y_p))
    print(f"{str(d):>8} {mse:>10.2f} {steps:>8} {dt.get_n_leaves():>8}")

   depth        MSE    steps   leaves
       1    1544.44        2        2
       2     208.33        4        4
       3       0.00        6        6   ← one leaf per sample
    None       0.00        6        6   ← same tree, memorizes training set

At depth=3, the 6 samples land in 6 single-sample leaves, so the training MSE is already 0 and max_depth=None grows exactly the same tree. Both memorize the training set, and a training MSE of 0 says nothing about how well the tree generalizes.
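The training MSE at each depth can also be derived without scikit-learn. The sketch below is a from-scratch greedy splitter for this one-feature dataset; `sse` and `grow` are illustrative names, and it minimizes the children's combined sum of squared errors, which is equivalent to maximizing variance reduction.

```python
sq_ft = [650, 850, 1100, 1400, 1600, 1900]
price = [180, 220, 280, 340, 370, 430]


def sse(ys):
    """Sum of squared deviations from the mean (n times the variance)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def grow(pairs, depth_left):
    """Greedily split (x, y) pairs sorted by x; return (training SSE, leaf count)."""
    ys = [y for _, y in pairs]
    if depth_left == 0 or sse(ys) == 0:
        return sse(ys), 1
    # A cut is only valid between two samples with different x values.
    cuts = [i for i in range(1, len(pairs)) if pairs[i - 1][0] != pairs[i][0]]
    if not cuts:
        return sse(ys), 1
    best = min(cuts, key=lambda i: sse(ys[:i]) + sse(ys[i:]))
    left_sse, left_leaves = grow(pairs[:best], depth_left - 1)
    right_sse, right_leaves = grow(pairs[best:], depth_left - 1)
    return left_sse + right_sse, left_leaves + right_leaves


pairs = sorted(zip(sq_ft, price))
results = {}
for depth in [1, 2, 3]:
    total_sse, n_leaves = grow(pairs, depth)
    results[depth] = (total_sse / len(pairs), n_leaves)
    print(f"depth={depth}  MSE={results[depth][0]:8.2f}  leaves={n_leaves}")
```

Every extra level can only lower the training error, which is why training MSE alone is a poor guide for choosing `max_depth`.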

When Does a Regression Tree Beat Linear Regression?

The staircase is piecewise constant — it assumes the target is flat within each region. Linear regression assumes a global linear trend. Trees win when:

The true relationship has sharp breakpoints (e.g., a salary cap at a specific experience level)
The relationship is nonlinear with different slopes in different regions
There are strong feature interactions (the effect of feature A depends on feature B)

Linear regression wins when the true relationship is approximately linear (as here).
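The breakpoint case is easy to demonstrate. The data below is hypothetical (a salary that jumps from 50 to 90 after 4 years of experience), and the stump's threshold of 4.5 is picked by hand rather than searched, so treat this as an illustration rather than a benchmark.

```python
# Hypothetical step-shaped data: salary jumps after 4 years of experience.
years = [1, 2, 3, 4, 5, 6, 7, 8]
salary = [50, 50, 50, 50, 90, 90, 90, 90]


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Linear regression via the closed-form least-squares formulas.
n = len(years)
x_mean = sum(years) / n
y_mean = sum(salary) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(years, salary)) / sum(
    (x - x_mean) ** 2 for x in years
)
intercept = y_mean - slope * x_mean
linear_pred = [intercept + slope * x for x in years]

# Depth-1 regression tree (a stump) with a hand-picked threshold at 4.5.
threshold = 4.5
left = [y for x, y in zip(years, salary) if x <= threshold]
right = [y for x, y in zip(years, salary) if x > threshold]
left_mean = sum(left) / len(left)
right_mean = sum(right) / len(right)
stump_pred = [left_mean if x <= threshold else right_mean for x in years]

stump_mse = mse(salary, stump_pred)
linear_mse = mse(salary, linear_pred)
print(f"stump MSE  = {stump_mse:.2f}")
print(f"linear MSE = {linear_mse:.2f}")
```

A single split captures the jump exactly, while the straight line has to cut through both plateaus and leaves a nonzero error on every point.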

Test Your Understanding

At the root, the best threshold was $t = 1250$ with $IG_{reg} = 5878$. The threshold $t = 750$ produced $IG_{reg} = 3042$. The left node of $t = 750$ is pure (1 sample, Var=0), yet its IG is lower. Why does a single-sample pure left node not maximize variance reduction?
Leaf 1 (sq_ft ≤ 975) predicts $200k for both houses: the one at 650 sq_ft (true $180k) and the one at 850 sq_ft (true $220k). The prediction error is ±$20k for both. What would the prediction be if you used the median instead of the mean? Would this reduce or increase MSE on this leaf?
At depth=3, the tree has 6 leaves for 6 samples and a training MSE of 0. Could a depth=4 tree (if allowed) change anything on this dataset? What would its training MSE be, and what does that tell you about training MSE as a measure of model quality?
The tree with max_depth=None achieves MSE=0 on training data. If you added a 7th house with sq_ft=1100 and price=$290k (different from training sample 3's $280k), what would the depth=None tree predict for sq_ft=1100? How does the presence of two training samples at the same sq_ft affect the tree?
DecisionTreeRegressor uses MSE by default (criterion='squared_error'). An alternative is MAE (criterion='absolute_error'). The MAE-optimal leaf prediction is the median instead of the mean. For leaf 1 with $y = [180, 220]$ : the mean is 200 and the median is also 200 (average of two). For a leaf with $y = [180, 220, 1000]$ (one outlier): what is the mean vs median, and which leaf prediction minimizes MAE?
