Decision Tree Regression

Jun 26, 2026 · 7 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

A classification tree uses entropy or Gini to measure node impurity. A regression tree uses variance — the same algorithmic skeleton, a different objective. At each leaf, instead of a majority vote, it predicts the mean of the samples that reached it.

Anchor dataset: 6 houses with sq_ft and price — the same anchor from the linear regression section.

python
import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)  # sq_ft, shape (6, 1)
y = np.array([180, 220, 280, 340, 370, 430])                     # price in $k
# OLS fit on this data: ŷ = 53.33 + 0.20 × sq_ft (from the linear regression section)

Regression Trees vs Classification Trees

| Aspect | Classification Tree | Regression Tree |
|---|---|---|
| Impurity measure | Entropy / Gini | Variance (MSE) |
| Leaf prediction | Majority class | Mean of leaf samples |
| Split criterion | Maximize IG or Gini gain | Maximize variance reduction |
| Output type | Discrete class | Continuous value |

Everything else — threshold search, stopping conditions, pruning — is identical.
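A minimal sketch of that shared skeleton, assuming NumPy: only the impurity function and the leaf rule differ between the two tree types. The function names and the toy class labels are made up for illustration and do not come from any library.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels (classification trees)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def variance(values):
    """Population variance of the targets (regression trees)."""
    values = np.asarray(values, dtype=float)
    return float(np.mean((values - values.mean()) ** 2))

def majority_class(labels):
    """Leaf prediction for classification: the most frequent label."""
    classes, counts = np.unique(labels, return_counts=True)
    return classes[np.argmax(counts)]

def leaf_mean(values):
    """Leaf prediction for regression: the mean target."""
    return float(np.mean(values))

toy_labels = np.array(["cheap", "cheap", "pricey"])  # hypothetical labels
toy_prices = np.array([180, 220, 280])               # prices in $k

print(gini(toy_labels), majority_class(toy_labels))
print(variance(toy_prices), leaf_mean(toy_prices))
```

The threshold search and the stopping rules can call either pair of functions without knowing which kind of tree is being grown.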

Root Node Variance

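The variance at a node measures how spread out the target values are around the node mean:

$$\text{Var}(S) = \frac{1}{n}\sum_{i \in S}\left(y_i - \bar{y}_S\right)^2$$

At the root, $\bar{y} = 1820/6 = 303.33$, so:

$$\text{Var}(\text{root}) = \frac{(180-303.33)^2 + \dots + (430-303.33)^2}{6} = \frac{44533.3}{6} \approx 7422.2$$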
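The root value can be checked numerically. This is a small sketch assuming NumPy and the population form of the variance (divide by n, not n − 1), which is the form that matches the MSE view of impurity.

```python
import numpy as np

y = np.array([180, 220, 280, 340, 370, 430], dtype=float)  # prices in $k

root_mean = y.mean()
# Mean squared deviation from the node mean; np.var uses the same ddof=0 default.
root_var = np.mean((y - root_mean) ** 2)

print(f"mean = {root_mean:.2f}, variance = {root_var:.2f}")
```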

Threshold Search for sq_ft

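Split criterion for regression: choose the threshold $t$ that maximizes the variance reduction, which is the same as minimizing the size-weighted variance of the two children:

$$\Delta\text{Var}(t) = \text{Var}(S) - \left[\frac{n_L}{n}\,\text{Var}(S_L) + \frac{n_R}{n}\,\text{Var}(S_R)\right]$$

Here $S_L$ holds the samples with sq_ft ≤ $t$ and $S_R$ holds the rest.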

5 midpoint candidates: 750, 975, 1250, 1500, 1750.

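| Threshold $t$ | Left $y$ | Left $\bar{y}$ | Left Var | Right $y$ | Right $\bar{y}$ | Right Var | Weighted Var |
|---|---|---|---|---|---|---|---|
| 750 | [180] | 180 | 0 | [220,280,340,370,430] | 328 | 5256 | (1/6)(0)+(5/6)(5256)=4380 |
| 975 | [180,220] | 200 | 400 | [280,340,370,430] | 355 | 2925 | (2/6)(400)+(4/6)(2925)=2083.3 |
| 1250 | [180,220,280] | 226.7 | 1688.9 | [340,370,430] | 380 | 1400 | (3/6)(1688.9)+(3/6)(1400)=1544.4 |
| 1500 | [180,220,280,340] | 255 | 3675 | [370,430] | 400 | 900 | (4/6)(3675)+(2/6)(900)=2750 |
| 1750 | [180,220,280,340,370] | 278 | 5056 | [430] | 430 | 0 | (5/6)(5056)+(1/6)(0)=4213.3 |

Best split: $t = 1250$ with a weighted variance of 1544.4. The split at sq_ft = 1250 reduces the variance from 7422.2 to 1544.4, a 79% reduction.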
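The search over candidate thresholds can be reproduced in a few lines. This is a sketch assuming NumPy; `weighted_variance` is a helper name chosen here, not a library function.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def weighted_variance(x, y, t):
    """Size-weighted variance of the two children produced by x <= t."""
    left, right = y[x <= t], y[x > t]
    return (len(left) * np.var(left) + len(right) * np.var(right)) / len(y)

# Candidate thresholds: midpoints between consecutive sorted feature values.
xs = np.sort(x)
candidates = (xs[:-1] + xs[1:]) / 2

scores = {float(t): float(weighted_variance(x, y, t)) for t in candidates}
best_t = min(scores, key=scores.get)

for t, s in scores.items():
    print(f"t = {t:6.0f}  weighted var = {s:8.2f}  reduction = {np.var(y) - s:8.2f}")
print("best threshold:", best_t)
```

Minimizing the weighted variance and maximizing the reduction pick the same threshold, because the parent variance is the same for every candidate.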

Level 1: Left Node (sq_ft ≤ 1250, samples [650, 850, 1100])

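$n = 3$, $\bar{y} = 226.7$, $\text{Var} = 1688.9$.

Test thresholds $t = 750$ and $t = 975$:

  • $t = 750$: Left=[180], Var=0; Right=[220,280], $\bar{y} = 250$, Var=900. Weighted: (1/3)(0) + (2/3)(900) = 600. Reduction: 1688.9 − 600 = 1088.9.
  • $t = 975$: Left=[180,220], $\bar{y} = 200$, Var=400; Right=[280], Var=0. Weighted: (2/3)(400) + (1/3)(0) = 266.7. Reduction: 1688.9 − 266.7 = 1422.2.

Best: $t = 975$. Splits into: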

  • Left-Left (sq_ft ≤ 975): [650, 850] → predict $200k
  • Left-Right (975 < sq_ft ≤ 1250): [1100] → predict $280k

Level 1: Right Node (sq_ft > 1250, samples [1400, 1600, 1900])

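$n = 3$, $\bar{y} = 380$, $\text{Var} = 1400$.

Test thresholds $t = 1500$ and $t = 1750$:

  • $t = 1500$: Left=[340], Var=0; Right=[370,430], $\bar{y} = 400$, Var=900. Weighted: (1/3)(0) + (2/3)(900) = 600. Reduction: 1400 − 600 = 800.
  • $t = 1750$: Left=[340,370], $\bar{y} = 355$, Var=225; Right=[430], Var=0. Weighted: (2/3)(225) + (1/3)(0) = 150. Reduction: 1400 − 150 = 1250.

Best: $t = 1750$. Splits into: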

  • Right-Left (1250 < sq_ft ≤ 1750): [1400, 1600] → predict $355k
  • Right-Right (sq_ft > 1750): [1900] → predict $430k
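The two levels worked through by hand follow one recursive pattern: find the best split, partition, repeat on each side. Below is a simplified sketch of that recursion, assuming NumPy and a single feature; `best_split` and `build` are illustrative names, and this is not how scikit-learn implements it internally.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted variance) of the best midpoint split."""
    xs = np.unique(x)                      # sorted distinct feature values
    best = None
    for t in (xs[:-1] + xs[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        w = (len(left) * np.var(left) + len(right) * np.var(right)) / len(y)
        if best is None or w < best[1]:
            best = (float(t), float(w))
    return best

def build(x, y, depth, max_depth, lo=-np.inf, hi=np.inf, leaves=None):
    """Grow the tree recursively; collect (lo, hi, mean) for each leaf."""
    leaves = [] if leaves is None else leaves
    # Stop at the depth limit, or when the node cannot or need not be split.
    if depth == max_depth or len(np.unique(x)) < 2 or np.var(y) == 0:
        leaves.append((lo, hi, float(y.mean())))
        return leaves
    t, _ = best_split(x, y)
    mask = x <= t
    build(x[mask], y[mask], depth + 1, max_depth, lo, t, leaves)
    build(x[~mask], y[~mask], depth + 1, max_depth, t, hi, leaves)
    return leaves

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

for lo, hi, mean in build(x, y, depth=0, max_depth=2):
    print(f"{lo:>7} < sq_ft <= {hi:<7} -> predict {mean:.0f}k")
```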

The 4-Leaf Staircase

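| Leaf | Condition | Samples (sq_ft) | Prediction |
|---|---|---|---|
| 1 | sq_ft ≤ 975 | 650, 850 | $200k |
| 2 | 975 < sq_ft ≤ 1250 | 1100 | $280k |
| 3 | 1250 < sq_ft ≤ 1750 | 1400, 1600 | $355k |
| 4 | sq_ft > 1750 | 1900 | $430k |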
[Figure: Price ($k) against sq_ft for the 6 houses. The orange staircase is the tree's prediction, with split thresholds marked at 975, 1250, and 1750; the blue dashed line is the linear regression fit.]

The orange staircase shows tree predictions: constant within each leaf region. The blue dashed line is the linear regression fit. For this near-linear dataset, the linear model tracks the data better; the staircase is off by $15k to $20k for every sample that shares a leaf with another sample (leaves 1 and 3).
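Once the thresholds and leaf means are known, prediction is just an interval lookup. A minimal sketch, assuming NumPy; `tree_predict` is a name chosen here for illustration.

```python
import numpy as np

thresholds = np.array([975, 1250, 1750])             # split points from the search
leaf_means = np.array([200.0, 280.0, 355.0, 430.0])  # mean price ($k) per leaf

def tree_predict(sq_ft):
    """Piecewise-constant prediction: find the leaf interval, return its mean."""
    # right=True closes each interval on the right, matching "sq_ft <= t".
    leaf_index = np.digitize(sq_ft, thresholds, right=True)
    return leaf_means[leaf_index]

print(tree_predict(np.array([650, 850, 1100, 1400, 1600, 1900])))
print(tree_predict(np.array([700, 975, 976, 3000])))
```

Any house above 1750 sq_ft gets the leaf 4 mean, however large it is: a tree cannot extrapolate beyond the range of target values it saw in training.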

Predictions vs Actual — Tree vs Linear

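| sq_ft | Actual $y$ | Tree $\hat{y}$ | Linear $\hat{y}$ | Tree error | Linear error |
|---|---|---|---|---|---|
| 650 | 180 | 200 | 183 | 20 | 3 |
| 850 | 220 | 200 | 223 | 20 | 3 |
| 1100 | 280 | 280 | 273 | 0 | 7 |
| 1400 | 340 | 355 | 333 | 15 | 7 |
| 1600 | 370 | 355 | 373 | 15 | 3 |
| 1900 | 430 | 430 | 433 | 0 | 3 |

All values are in $k; errors are absolute.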

Linear regression wins decisively on this near-linear dataset. The tree is limited by its piecewise-constant prediction: leaves 1 and 3 each contain 2 samples whose true values differ, forcing the average to miss both.
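The comparison can be reproduced without scikit-learn. This sketch assumes NumPy, uses `np.polyfit` for the least-squares line, and hard-codes the depth-2 leaf means as the tree predictions.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Least-squares line; np.polyfit returns coefficients from highest degree down.
slope, intercept = np.polyfit(x, y, deg=1)
linear_pred = intercept + slope * x

# Depth-2 tree predictions: the leaf means from the staircase.
tree_pred = np.array([200, 200, 280, 355, 355, 430], dtype=float)

linear_mse = np.mean((y - linear_pred) ** 2)
tree_mse = np.mean((y - tree_pred) ** 2)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"linear MSE = {linear_mse:.2f}, tree MSE = {tree_mse:.2f}")
```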

sklearn Implementation

python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

dt_reg = DecisionTreeRegressor(criterion='squared_error', max_depth=2, random_state=42)
dt_reg.fit(X, y)

y_pred = dt_reg.predict(X)
print(f"Tree MSE: {mean_squared_error(y, y_pred):.2f}")
print(f"Tree R²:  {dt_reg.score(X, y):.4f}")
print(f"Predictions:     {y_pred}")
print(f"Unique leaf predictions: {np.unique(y_pred)}")
Output:
Tree MSE: 208.33
Tree R²:  0.9719
Predictions:     [200. 200. 280. 355. 355. 430.]
Unique leaf predictions: [200. 280. 355. 430.]

The 4 unique prediction values are the mean of each leaf: [200, 280, 355, 430]. Despite high R²=0.972, the MSE of 208 is 9× worse than the linear model's MSE of 22.
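To confirm that the fitted model matches the hand-built tree, the learned structure can be read from the estimator's `tree_` attribute. A short sketch, assuming a reasonably recent scikit-learn; it refits the same depth-2 model so the block runs on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430])

model = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)
tree = model.tree_

is_leaf = tree.children_left == -1               # leaves have no children
split_thresholds = np.sort(tree.threshold[~is_leaf])
leaf_values = np.sort(tree.value[is_leaf].ravel())

print("split thresholds:", split_thresholds)
print("leaf values:     ", leaf_values)
```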

max_depth Effect on Regression

python
print(f"{'depth':>8} {'MSE':>10} {'steps':>8} {'leaves':>8}")
for d in [1, 2, 3, None]:
    dt = DecisionTreeRegressor(max_depth=d, random_state=42)
    dt.fit(X, y)
    y_p = dt.predict(X)
    mse = mean_squared_error(y, y_p)
    steps = len(np.unique(y_p))
    print(f"{str(d):>8} {mse:>10.2f} {steps:>8} {dt.get_n_leaves():>8}")
Output:
   depth        MSE    steps   leaves
       1    1544.44        2        2
       2     208.33        4        4
       3       0.00        6        6   ← one leaf per sample
    None       0.00        6        6   ← same tree, memorizes training set

At depth=3 every sample already sits in its own leaf, and a single-sample leaf predicts that sample's value exactly, so training MSE is 0. max_depth=None grows the same 6-leaf tree because no leaf can be split further. Both memorize the training set, so their zero training error is not evidence that they generalize.
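Training error alone cannot show whether a fully grown tree generalizes. One rough probe, sketched here under the assumption that leave-one-out over six points is informative enough for illustration, is to hold out each house in turn and predict it from the other five. The helper names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def loo_mse(fit_predict):
    """Leave-one-out MSE: train on 5 houses, predict the held-out one."""
    errors = []
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        pred = fit_predict(X[train], y[train], X[i:i + 1])
        errors.append((y[i] - pred) ** 2)
    return float(np.mean(errors))

def full_tree(X_train, y_train, X_test):
    """Unrestricted tree: one leaf per training sample."""
    model = DecisionTreeRegressor(max_depth=None, random_state=42)
    return model.fit(X_train, y_train).predict(X_test)[0]

def linear(X_train, y_train, X_test):
    """Least-squares line fitted on the training houses only."""
    slope, intercept = np.polyfit(X_train.ravel(), y_train, deg=1)
    return intercept + slope * X_test[0, 0]

tree_loo = loo_mse(full_tree)
linear_loo = loo_mse(linear)
print(f"held-out MSE  tree: {tree_loo:.1f}  linear: {linear_loo:.1f}")
```

On held-out houses the memorizing tree can only return the price of a neighboring training sample, so its error is much larger than the linear model's on this dataset.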

When Does a Regression Tree Beat Linear Regression?

The staircase is piecewise constant — it assumes the target is flat within each region. Linear regression assumes a global linear trend. Trees win when:

  • The true relationship has sharp breakpoints (e.g., a salary cap at a specific experience level)
  • The relationship is nonlinear with different slopes in different regions
  • There are strong feature interactions (the effect of feature A depends on feature B)

Linear regression wins when the true relationship is approximately linear (as here).
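The first case, a sharp breakpoint, is easy to demonstrate. The data below is synthetic and made up purely for illustration (a target that jumps from 10 to 50 at x = 5); the sketch assumes NumPy and compares a depth-1 tree (a stump) against a straight-line fit.

```python
import numpy as np

# Synthetic step-shaped data, invented for this example.
x = np.arange(10, dtype=float)
y = np.where(x < 5, 10.0, 50.0)

# Linear fit over the whole range.
slope, intercept = np.polyfit(x, y, deg=1)
linear_mse = np.mean((y - (intercept + slope * x)) ** 2)

# Depth-1 tree: try every midpoint, keep the lowest weighted variance.
midpoints = (x[:-1] + x[1:]) / 2

def stump_mse(t):
    """Training MSE of a stump that predicts each side's mean."""
    left, right = y[x <= t], y[x > t]
    return (len(left) * np.var(left) + len(right) * np.var(right)) / len(y)

best_t = min(midpoints, key=stump_mse)
print(f"stump split at {best_t}, MSE = {stump_mse(best_t):.2f}")
print(f"linear MSE = {linear_mse:.2f}")
```

A single split recovers the step exactly, while the line has to cut through the jump and misses on both sides.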

Test Your Understanding

  1. At the root, the best threshold was $t = 1250$ with a weighted variance of 1544.4. The threshold $t = 750$ produced a weighted variance of 4380. The left node of $t = 750$ is pure (1 sample, Var=0), yet its variance reduction is lower. Why does a single-sample pure left node not maximize variance reduction?

  2. Leaf 1 (sq_ft ≤ 975) predicts $200k for both 650 sq_ft (true $180k) and 850 sq_ft (true $220k), an error of ±$20k for both. What would the prediction be if you used the median instead of the mean? Would this reduce or increase MSE on this leaf?

  3. At depth=3 the tree has 6 leaves for 6 samples and a training MSE of 0. Could a depth=4 tree (if allowed) change the fit at all? What would its training MSE and leaf count be, and why?

  4. The tree with max_depth=None achieves MSE=0 on training data. If you added a 7th house with sq_ft=1100 and a price different from the existing house's $280k, what would the depth=None tree predict for sq_ft=1100? How does the presence of two training samples at the same sq_ft affect the tree?

  5. DecisionTreeRegressor uses MSE by default (criterion='squared_error'). An alternative is MAE (criterion='absolute_error'). The MAE-optimal leaf prediction is the median instead of the mean. For leaf 1 with $y = [180, 220]$: the mean is 200 and the median is also 200 (average of the two). For a leaf that contains one outlier, a price far above the rest: how do the mean and median differ, and which leaf prediction minimizes MAE?
