Decision Tree Regression
A classification tree uses entropy or Gini to measure node impurity. A regression tree uses variance — the same algorithmic skeleton, a different objective. At each leaf, instead of a majority vote, it predicts the mean of the samples that reached it.
Anchor dataset: 6 houses with sq_ft and price — the same anchor from the linear regression section.
import numpy as np
X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430])
# True OLS: ŷ = 53.33 + 0.20×sq_ft (from linear regression section)

Regression Trees vs Classification Trees
| Aspect | Classification Tree | Regression Tree |
|---|---|---|
| Impurity measure | Entropy / Gini | Variance (MSE) |
| Leaf prediction | Majority class | Mean of leaf samples |
| Split criterion | Maximize IG or Gini gain | Maximize variance reduction |
| Output type | Discrete class | Continuous value |
Everything else — threshold search, stopping conditions, pruning — is identical.
Root Node Variance
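The variance at a node measures how spread out the target values are:
$$\text{Var}(S) = \frac{1}{n}\sum_{i \in S}\left(y_i - \bar{y}\right)^2, \qquad \bar{y} = \frac{1}{n}\sum_{i \in S} y_i$$
At the root:
$$\bar{y} = \frac{180 + 220 + 280 + 340 + 370 + 430}{6} = 303.33, \qquad \text{Var} = \frac{44533.33}{6} \approx 7422.2$$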
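As a quick check of the root variance, the definition can be evaluated directly. This is a minimal sketch in plain Python; the helper name `node_variance` is made up for illustration and is not part of any library.

```python
# Population variance of the targets at a node (divide by n, not n - 1).
def node_variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

prices = [180, 220, 280, 340, 370, 430]  # same targets as y above, in $k
root_mean = sum(prices) / len(prices)
root_var = node_variance(prices)
print(f"mean = {root_mean:.2f}, variance = {root_var:.2f}")
# prints: mean = 303.33, variance = 7422.22
```

`np.var(y)` returns the same number, because NumPy's default `ddof=0` also divides by n.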
Threshold Search for sq_ft
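Split criterion for regression: choose the threshold $t$ that maximizes the variance reduction
$$\Delta\text{Var}(t) = \text{Var}(S) - \left[\frac{n_L}{n}\,\text{Var}(S_L) + \frac{n_R}{n}\,\text{Var}(S_R)\right]$$
Five midpoint candidates: 750, 975, 1250, 1500, 1750.
| Threshold | Left $y$ | Left mean | Left Var | Right $y$ | Right mean | Right Var | Weighted Var | Variance reduction |
|---|---|---|---|---|---|---|---|---|
| 750 | [180] | 180 | 0 | [220,280,340,370,430] | 328 | 5256 | 4380 | 3042 |
| 975 | [180,220] | 200 | 400 | [280,340,370,430] | 355 | 2925 | 2083 | 5339 |
| 1250 | [180,220,280] | 226.7 | 1689 | [340,370,430] | 380 | 1400 | 1544 | 5878 |
| 1500 | [180,220,280,340] | 255 | 3675 | [370,430] | 400 | 900 | 2750 | 4672 |
| 1750 | [180,220,280,340,370] | 278 | 5056 | [430] | 430 | 0 | 4213 | 3209 |
Best split: $t = 1250$ with weighted variance $\tfrac{3}{6}(1689) + \tfrac{3}{6}(1400) \approx 1544$ and $\Delta\text{Var} \approx 5878$. The split at sq_ft = 1250 reduces the variance from 7422 to 1544, a 79% reduction.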
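The threshold search can be sketched as a loop over midpoints between consecutive sorted feature values. The function names `node_variance` and `scan_thresholds` are illustrative assumptions, not sklearn internals; sklearn's own splitter is more elaborate but optimizes the same quantity for `criterion='squared_error'`.

```python
def node_variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def scan_thresholds(xs, ys):
    """Return (threshold, weighted_variance, reduction) for each midpoint split."""
    pairs = sorted(zip(xs, ys))
    parent_var = node_variance([y for _, y in pairs])
    results = []
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Child variances weighted by the share of samples in each child.
        weighted = (len(left) * node_variance(left)
                    + len(right) * node_variance(right)) / len(pairs)
        results.append((t, weighted, parent_var - weighted))
    return results

sq_ft = [650, 850, 1100, 1400, 1600, 1900]
prices = [180, 220, 280, 340, 370, 430]  # in $k
splits = scan_thresholds(sq_ft, prices)
for t, weighted, reduction in splits:
    print(f"t={t:>6.0f}  weighted var={weighted:>8.1f}  reduction={reduction:>8.1f}")

best_t, best_weighted, best_reduction = max(splits, key=lambda s: s[2])
print(f"best threshold: {best_t:.0f}")
# prints: best threshold: 1250
```

Weighting by sample count is what keeps a tiny pure child from dominating: a one-sample leaf has zero variance but only a 1/6 share of the total.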
Level 1: Left Node (sq_ft ≤ 1250, samples [650, 850, 1100])
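$n = 3$, $\bar{y} = 226.7$, $\text{Var} = 1688.9$.
Test thresholds $t = 750$ and $t = 975$:
- $t = 750$: Left=[180], Var=0; Right=[220,280], $\bar{y} = 250$, Var=900. Weighted: $\tfrac{1}{3}(0) + \tfrac{2}{3}(900) = 600$. Reduction: $1688.9 - 600 = 1088.9$.
- $t = 975$: Left=[180,220], $\bar{y} = 200$, Var=400; Right=[280], Var=0. Weighted: $\tfrac{2}{3}(400) + \tfrac{1}{3}(0) = 266.7$. Reduction: $1688.9 - 266.7 = 1422.2$.
Best: $t = 975$. Splits into:
- Left-Left (sq_ft ≤ 975): [650, 850] → $\bar{y} = 200$, predict \$200k
- Left-Right (975 < sq_ft ≤ 1250): [1100] → $\bar{y} = 280$, predict \$280k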
Level 1: Right Node (sq_ft > 1250, samples [1400, 1600, 1900])
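$n = 3$, $\bar{y} = 380$, $\text{Var} = 1400$.
Test thresholds $t = 1500$ and $t = 1750$:
- $t = 1500$: Left=[340], Var=0; Right=[370,430], $\bar{y} = 400$, Var=900. Weighted: $\tfrac{1}{3}(0) + \tfrac{2}{3}(900) = 600$. Reduction: $1400 - 600 = 800$.
- $t = 1750$: Left=[340,370], $\bar{y} = 355$, Var=225; Right=[430], Var=0. Weighted: $\tfrac{2}{3}(225) + \tfrac{1}{3}(0) = 150$. Reduction: $1400 - 150 = 1250$.
Best: $t = 1750$. Splits into:
- Right-Left (1250 < sq_ft ≤ 1750): [1400, 1600] → $\bar{y} = 355$, predict \$355k
- Right-Right (sq_ft > 1750): [1900] → $\bar{y} = 430$, predict \$430k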
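The level-by-level procedure above is the same search applied recursively to each child node. Below is a minimal sketch under stated assumptions: a single feature, dictionary-based nodes, and hypothetical helper names (`best_threshold`, `build`, `leaves`). It is not how sklearn stores its trees.

```python
def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def best_threshold(pairs):
    """Midpoint threshold with the lowest weighted child variance, or None."""
    pairs = sorted(pairs)
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal feature values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * variance(left)
                    + len(right) * variance(right)) / len(pairs)
        if best is None or weighted < best[1]:
            best = (t, weighted)
    return None if best is None else best[0]

def build(pairs, max_depth, depth=0):
    ys = [y for _, y in pairs]
    # Stop at the depth limit or when the node is already pure.
    t = best_threshold(pairs) if depth < max_depth and variance(ys) > 0 else None
    if t is None:
        return {"predict": mean(ys), "n": len(ys)}
    left = [p for p in pairs if p[0] <= t]
    right = [p for p in pairs if p[0] > t]
    return {"threshold": t,
            "left": build(left, max_depth, depth + 1),
            "right": build(right, max_depth, depth + 1)}

def leaves(node):
    """Leaf predictions, left to right."""
    if "predict" in node:
        return [node["predict"]]
    return leaves(node["left"]) + leaves(node["right"])

data = list(zip([650, 850, 1100, 1400, 1600, 1900],
                [180, 220, 280, 340, 370, 430]))
tree = build(data, max_depth=2)
print(tree["threshold"], tree["left"]["threshold"], tree["right"]["threshold"])
# prints: 1250.0 975.0 1750.0
print(leaves(tree))
# prints: [200.0, 280.0, 355.0, 430.0]
```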
The 4-Leaf Staircase
| Leaf | Condition | Samples (sq_ft) | Prediction |
|---|---|---|---|
| 1 | sq_ft ≤ 975 | 650, 850 | $200k |
| 2 | 975 < sq_ft ≤ 1250 | 1100 | $280k |
| 3 | 1250 < sq_ft ≤ 1750 | 1400, 1600 | $355k |
| 4 | sq_ft > 1750 | 1900 | $430k |
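Once the tree is built, prediction is just a lookup of which interval a value falls into. Here is a small sketch of the four-leaf staircase using the standard-library `bisect` module; the constant and function names are illustrative only.

```python
from bisect import bisect_left

THRESHOLDS = [975, 1250, 1750]      # split points, ascending
LEAF_MEANS = [200, 280, 355, 430]   # leaf predictions in $k, left to right

def predict_price(sq_ft):
    # bisect_left counts the thresholds strictly below sq_ft, so a value equal
    # to a threshold stays in the left leaf (the "sq_ft <= t" side of the split).
    return LEAF_MEANS[bisect_left(THRESHOLDS, sq_ft)]

for size in [650, 975, 1000, 1500, 2500]:
    print(size, predict_price(size))
# prints:
# 650 200
# 975 200
# 1000 280
# 1500 355
# 2500 430
```

A 2,500 sq_ft house gets the same 430 as the 1,900 sq_ft one: a tree cannot extrapolate beyond its outermost leaves.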
[Figure: price in $k (y-axis, 180 to 430) against sq_ft (x-axis, 650 to 1900) for the six houses. Blue dashed line: linear regression fit. Orange staircase: tree predictions. Gray dashed verticals: split thresholds at 975, 1250, and 1750.]
The orange staircase shows tree predictions: constant within each leaf region. The blue dashed line is the linear regression fit. For this near-linear dataset, the linear model tracks the data better; the tree's staircase has visible errors at the leaf boundaries.
Predictions vs Actual — Tree vs Linear
| sq_ft | Actual | Tree | Linear | Tree abs. error | Linear abs. error |
|---|---|---|---|---|---|
| 650 | 180 | 200 | 183 | 20 | 3 |
| 850 | 220 | 200 | 223 | 20 | 3 |
| 1100 | 280 | 280 | 273 | 0 | 7 |
| 1400 | 340 | 355 | 333 | 15 | 7 |
| 1600 | 370 | 355 | 373 | 15 | 3 |
| 1900 | 430 | 430 | 433 | 0 | 3 |
Linear regression wins decisively on this near-linear dataset. The tree is limited by its piecewise-constant prediction: leaves 1 and 3 each contain 2 samples whose true values differ, forcing the average to miss both.
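The gap between the two models can be summarized as a mean squared error. The sketch below recomputes both from the predictions in the table, using the OLS coefficients quoted earlier (intercept 53.33, slope 0.20) rather than refitting; the helper `mse` is defined here for illustration.

```python
sq_ft = [650, 850, 1100, 1400, 1600, 1900]
actual = [180, 220, 280, 340, 370, 430]        # prices in $k
tree_pred = [200, 200, 280, 355, 355, 430]     # leaf means of the 4-leaf tree
linear_pred = [53.33 + 0.20 * x for x in sq_ft]  # OLS line quoted in the text

def mse(y_true, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_hat)) / len(y_true)

tree_mse = mse(actual, tree_pred)
linear_mse = mse(actual, linear_pred)
print(f"tree MSE   = {tree_mse:.2f}")
print(f"linear MSE = {linear_mse:.2f}")
print(f"ratio      = {tree_mse / linear_mse:.1f}")
# prints:
# tree MSE   = 208.33
# linear MSE = 22.22
# ratio      = 9.4
```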
sklearn Implementation
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
dt_reg = DecisionTreeRegressor(criterion='squared_error', max_depth=2, random_state=42)
dt_reg.fit(X, y)
y_pred = dt_reg.predict(X)
print(f"Tree MSE: {mean_squared_error(y, y_pred):.2f}")
print(f"Tree R²: {dt_reg.score(X, y):.4f}")
print(f"Predictions: {y_pred}")
print(f"Unique leaf predictions: {np.unique(y_pred)}")

Tree MSE: 208.33
Tree R²: 0.9719
Predictions: [200. 200. 280. 355. 355. 430.]
Unique leaf predictions: [200. 280. 355. 430.]
The 4 unique prediction values are the mean of each leaf: [200, 280, 355, 430]. Despite high R²=0.972, the MSE of 208 is 9× worse than the linear model's MSE of 22.
max_depth Effect on Regression
print(f"{'depth':>8} {'MSE':>10} {'steps':>8} {'leaves':>8}")
for d in [1, 2, 3, None]:
    dt = DecisionTreeRegressor(max_depth=d, random_state=42)
    dt.fit(X, y)
    y_p = dt.predict(X)
    mse = mean_squared_error(y, y_p)
    steps = len(np.unique(y_p))  # number of distinct predicted values
    print(f"{str(d):>8} {mse:>10.2f} {steps:>8} {dt.get_n_leaves():>8}")

   depth        MSE    steps   leaves
       1    1544.44        2        2
       2     208.33        4        4
       3       0.00        6        6  ← one leaf per sample
    None       0.00        6        6  ← same tree as depth 3
At depth=1 the training MSE equals the weighted variance of the root split (1544.44). At depth=3 the tree already has 6 leaves for 6 samples: each sample sits alone in its own leaf, so the training MSE is 0. depth=None grows the identical tree, because every leaf is already pure. Both are perfect memorization of the training set, and a training MSE of 0 says nothing about how well the tree generalizes.
When Does a Regression Tree Beat Linear Regression?
The staircase is piecewise constant — it assumes the target is flat within each region. Linear regression assumes a global linear trend. Trees win when:
- The true relationship has sharp breakpoints (e.g., a salary cap at a specific experience level)
- The relationship is nonlinear with different slopes in different regions
- There are strong feature interactions (the effect of feature A depends on feature B)
Linear regression wins when the true relationship is approximately linear (as here).
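To make the breakpoint case concrete, here is a sketch on made-up data: salary is flat at 50 for years 1 to 5, then jumps to 90 for years 6 to 10. The data and the helper names (`fit_line`, `fit_stump`) are assumptions for illustration only. A depth-1 tree (a stump) is compared against a closed-form least-squares line.

```python
def mse(y_true, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_hat)) / len(y_true)

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def fit_stump(xs, ys):
    """Depth-1 regression tree: best threshold and the two leaf means."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        # Minimizing total squared error is equivalent to minimizing
        # the weighted child variance.
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

years = list(range(1, 11))
salary = [50] * 5 + [90] * 5   # hypothetical data with one sharp breakpoint

intercept, slope = fit_line(years, salary)
t, left_mean, right_mean = fit_stump(years, salary)

line_mse = mse(salary, [intercept + slope * x for x in years])
stump_mse = mse(salary, [left_mean if x <= t else right_mean for x in years])
print(f"stump threshold = {t}, stump MSE = {stump_mse:.2f}, line MSE = {line_mse:.2f}")
```

The stump places its threshold at the jump and fits the training data exactly, while the straight line has to cut through the step and leaves a large residual on both sides.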
Test Your Understanding
- At the root, the best threshold was $t = 1250$ with a variance reduction of about 5878. The threshold $t = 750$ produced a reduction of only about 3042. The left node of $t = 750$ is pure (1 sample, Var=0), yet its variance reduction is lower. Why does a single-sample pure left node not maximize variance reduction?
- Leaf 1 (sq_ft ≤ 975) predicts \$200k. Its two samples are 650 sq_ft (true \$180k) and 850 sq_ft (true \$220k), so the error is ±\$20k for both. What would the prediction be if you used the median instead of the mean? Would this reduce or increase MSE on this leaf?
- At depth=3, the tree has 6 leaves for 6 samples and a training MSE of 0. Could a depth=4 tree (if allowed) change anything? Why does the tree stop splitting here, and what does a training MSE of 0 tell you about error on unseen houses?
- The tree with max_depth=None achieves MSE=0 on training data. If you added a 7th house with sq_ft=1100 and a price different from the existing 1100 sq_ft house's \$280k, what would the depth=None tree predict for sq_ft=1100? How does the presence of two training samples at the same sq_ft affect the tree?
- DecisionTreeRegressor uses MSE by default (criterion='squared_error'). An alternative is MAE (criterion='absolute_error'). The MAE-optimal leaf prediction is the median instead of the mean. For leaf 1 with $y = [180, 220]$: the mean is 200 and the median is also 200 (average of two). For a leaf whose targets include one outlier: how do the mean and the median differ, and which leaf prediction minimizes MAE?