Gradient Boosting: Regression and Classification
AdaBoost reweights training samples so the next stump focuses on hard examples. Gradient Boosting takes a different path: each new tree directly predicts the errors of the current ensemble. The framework generalizes to any differentiable loss — MSE for regression, log loss for classification.
Anchor: 6-sample house prices (regression trace) and 8-sample loan defaults (classification trace).
import numpy as np

# Regression anchor: square footage -> house price
X_reg = np.array([650, 850, 1100, 1400, 1600, 1900])
y_reg = np.array([180, 220, 280, 340, 370, 430])

# Classification anchor: [income_$k, credit_score]
X_clf = np.array([[25,580],[32,610],[45,650],[60,680],[70,710],[80,730],[90,750],[110,780]])
y_clf = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # 1=default

AdaBoost vs Gradient Boosting
| | AdaBoost | Gradient Boosting |
|---|---|---|
| Mechanism | Reweight samples | Train on residuals |
| New tree target | Same labels, different weights | $r_i = -\partial L / \partial F(x_i)$ (pseudo-residuals) |
| Loss flexibility | Exponential loss only | Any differentiable loss |
| Sample reweighting | Yes | No |
Gradient Boosting is the more general framework: AdaBoost can be recovered as a special case that uses exponential loss, typically with stumps as base learners.
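To make "pseudo-residuals" concrete, here is a minimal sketch of the negative gradient for the two losses used in this post. The function names are illustrative, and the squared-error loss is assumed to carry a factor of 1/2 so that the gradient is exactly the plain residual.

```python
import numpy as np

def neg_gradient_mse(y, F):
    # Assumed loss: L = 0.5 * (y - F)^2, so -dL/dF = y - F
    return y - F

def neg_gradient_logloss(y, F):
    # Loss: L = -[y log p + (1 - y) log(1 - p)] with p = sigmoid(F), so -dL/dF = y - p
    p = 1.0 / (1.0 + np.exp(-F))
    return y - p

# Regression anchor: constant prediction at the mean
y_reg = np.array([180, 220, 280, 340, 370, 430], dtype=float)
r_reg = neg_gradient_mse(y_reg, np.full_like(y_reg, y_reg.mean()))

# Classification anchor: constant prediction at the log-odds of the base rate
y_clf = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
F0 = np.log(y_clf.mean() / (1.0 - y_clf.mean()))
r_clf = neg_gradient_logloss(y_clf, np.full_like(y_clf, F0))

print(np.round(r_reg, 1))
print(np.round(r_clf, 3))
```

In both cases the next tree is trained on these values rather than on the original labels.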
Gradient Boosting Regression — 3-Round Trace
Initial Prediction
Start with the mean: $F_0(x) = \bar{y} = \frac{180 + 220 + 280 + 340 + 370 + 430}{6} \approx 303.3$. MSE loss is minimized by the mean, so this is the optimal constant prediction.
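A short derivation, assuming the squared-error loss is summed over the training set and $c$ denotes the constant prediction, shows why the mean is the best constant:

```latex
\frac{d}{dc} \sum_{i=1}^{n} (y_i - c)^2
  = -2 \sum_{i=1}^{n} (y_i - c) = 0
\quad\Longrightarrow\quad
c = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}
```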
Round 1: Fit a Tree to Residuals
Pseudo-residuals: $r_i^{(1)} = y_i - F_0(x_i)$
| i | sq_ft | $y_i$ | $F_0(x_i)$ | $r_i^{(1)}$ |
|---|---|---|---|---|
| 1 | 650 | 180 | 303.3 | −123.3 |
| 2 | 850 | 220 | 303.3 | −83.3 |
| 3 | 1100 | 280 | 303.3 | −23.3 |
| 4 | 1400 | 340 | 303.3 | +36.7 |
| 5 | 1600 | 370 | 303.3 | +66.7 |
| 6 | 1900 | 430 | 303.3 | +126.7 |
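Train a regression stump on $(x_i, r_i^{(1)})$. Best split at sq_ft ≤ 1250:
- Left leaf (samples 1,2,3): $h_1 = \text{mean}(-123.33, -83.33, -23.33) = -76.67$
- Right leaf (samples 4,5,6): $h_1 = \text{mean}(36.67, 66.67, 126.67) = +76.67$
Update with learning rate $\nu = 0.1$: $F_1(x) = F_0(x) + \nu\, h_1(x)$
- sq_ft ≤ 1250: $F_1 = 303.33 + 0.1 \times (-76.67) = 295.67$
- sq_ft > 1250: $F_1 = 303.33 + 0.1 \times 76.67 = 311.00$
Round 2: Fit Tree to New Residuals
$r_i^{(2)} = y_i - F_1(x_i)$:
- Sample 1: $180 - 295.67 = -115.67$ (was −123.3, shrinking ✓)
- Sample 4: $340 - 311.00 = +29.00$ (was +36.7, shrinking ✓)
Same split threshold wins again (sq_ft ≤ 1250). New leaf means: left = −69.0, right = +69.0, so $F_2 = 288.77$ on the left and $317.90$ on the right.
Round 3 and Convergence
Each round removes a fraction $\nu = 0.1$ of the residual that the tree fits: the leaf mean went from −76.67 to −69.00 between rounds 1 and 2. Later rounds may pick different splits as the residual pattern changes. After T rounds:
$F_T(x) = F_0(x) + \nu \sum_{t=1}^{T} h_t(x)$
For x_new = sq_ft=1250 (boundary → left branch):
| Round | Prediction |
|---|---|
| 0 | 303.3 |
| 1 | 295.7 |
| 2 | 288.8 |
| … | … (approaching ~280) |
| 100 (sklearn) | ≈ 280 |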
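The first two rounds of the hand trace can be reproduced with a small from-scratch sketch. The helpers `fit_stump` and `boost` are illustrative names, not sklearn functions, and the stump search is assumed to use midpoints between adjacent sq_ft values as candidate thresholds.

```python
import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def fit_stump(x, r):
    """Return (threshold, left_mean, right_mean) minimizing squared error on r."""
    xs = np.unique(x)  # sorted unique feature values
    best = None
    for lo, hi in zip(xs[:-1], xs[1:]):
        thr = (lo + hi) / 2.0  # candidate split: midpoint between neighbours
        left, right = r[x <= thr], r[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]

def boost(x, y, n_rounds, nu):
    F = np.full_like(y, y.mean())  # F_0: the optimal constant under MSE
    stumps, history = [], [F.copy()]
    for _ in range(n_rounds):
        r = y - F  # pseudo-residuals for squared error
        thr, left_val, right_val = fit_stump(x, r)
        F = F + nu * np.where(x <= thr, left_val, right_val)
        stumps.append((thr, left_val, right_val))
        history.append(F.copy())
    return stumps, history

stumps, history = boost(X, y, n_rounds=2, nu=0.1)
for t, (thr, lv, rv) in enumerate(stumps, start=1):
    print(f"round {t}: split sq_ft <= {thr:.0f}, left={lv:.2f}, right={rv:.2f}")
    print(f"          F_{t} = {np.round(history[t], 2)}")
```

Increasing `n_rounds` shows which split each later round actually picks on this tiny dataset.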
Figure: three panels (Round 0, Round 1, Round 2), each showing the six data points and the current prediction. Blue = initial mean, orange = after round 1, green = after round 2. The staircase slowly moves toward the data.
Why Small Learning Rate Works
With $\nu = 1.0$: each tree fully corrects the residual it fits in one step → fast convergence but memorizes training data quickly.
With $\nu = 0.1$: each step corrects 10% of the residual → smooth path through the loss surface → better generalization at convergence.
Rule of thumb: keep $\nu$ small (around 0.1 or below) and compensate with higher n_estimators (500–1000+). The tradeoff mirrors the choice of step size in gradient descent.
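As a rough illustration of why a small learning rate needs more rounds, the sketch below uses an idealized assumption: every tree fits the current training residuals exactly, so each round leaves a fraction (1 − ν) of the residual behind. Real trees fit only approximately, so treat the numbers as a guide rather than a guarantee. `rounds_to_reach` is a hypothetical helper name.

```python
def rounds_to_reach(nu, target=0.01):
    """Rounds until the remaining residual fraction (1 - nu)^T drops to `target`."""
    rounds, frac = 0, 1.0
    while frac > target:
        frac *= (1.0 - nu)  # each round leaves (1 - nu) of the residual
        rounds += 1
    return rounds

for nu in (1.0, 0.5, 0.1):
    print(f"nu={nu}: {rounds_to_reach(nu)} rounds to shrink the residual to 1%")
```

Under this assumption ν=1.0 needs a single round while ν=0.1 needs 44, which is why small learning rates are paired with larger n_estimators.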
Gradient Boosting for Classification — Log-Odds View
For binary classification, GB minimizes cross-entropy. Predictions live in log-odds space; the pseudo-residuals are probability errors.
Initial Prediction
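$F_0 = \log\frac{\bar{y}}{1 - \bar{y}} = \log\frac{3/8}{5/8} = \log 0.6 \approx -0.511$ (3 defaults in 8 samples).
Initial probability for all samples: $p_0 = \sigma(F_0) = \frac{1}{1 + e^{0.511}} = 0.375$.
Round 1: Probability Residuals
| i | income | y | $p_0$ | $r_i = y_i - p_0$ |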
|---|---|---|---|---|
| 1 | 25 | 1 | 0.375 | +0.625 |
| 2 | 32 | 1 | 0.375 | +0.625 |
| 3 | 45 | 1 | 0.375 | +0.625 |
| 4 | 60 | 0 | 0.375 | −0.375 |
| 5 | 70 | 0 | 0.375 | −0.375 |
| 6 | 80 | 0 | 0.375 | −0.375 |
| 7 | 90 | 0 | 0.375 | −0.375 |
| 8 | 110 | 0 | 0.375 | −0.375 |
Best split on residuals: income ≤ 55k (same clean boundary as before).
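Leaf value formula for log-loss (second-order approximation): $\gamma_j = \frac{\sum_{i \in R_j} r_i}{\sum_{i \in R_j} p_i (1 - p_i)}$
- Left (samples 1,2,3): $\gamma_L = \frac{3 \times 0.625}{3 \times 0.375 \times 0.625} = \frac{1.875}{0.703} \approx 2.667$
- Right (samples 4–8): $\gamma_R = \frac{5 \times (-0.375)}{5 \times 0.375 \times 0.625} = \frac{-1.875}{1.172} = -1.600$
Update ($\nu = 0.1$): $F_1 = F_0 + \nu\,\gamma_j$, then $p_1 = \sigma(F_1)$
- Left: $F_1 = -0.511 + 0.1 \times 2.667 = -0.244$, so $p_1 = 0.439$
- Right: $F_1 = -0.511 + 0.1 \times (-1.600) = -0.671$, so $p_1 = 0.338$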
Defaults (y=1) increase from 0.375 → 0.439 ✓. Non-defaults (y=0) decrease from 0.375 → 0.338 ✓. Each round nudges probabilities in the right direction.
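The round-1 classification numbers can be checked with a few lines of NumPy. This is a sketch that hard-codes the income ≤ 55 split from the trace instead of searching for it; `leaf_value` is an illustrative helper, not an sklearn function.

```python
import numpy as np

income = np.array([25, 32, 45, 60, 70, 80, 90, 110], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
nu = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Initial prediction: log-odds of the base rate
F0 = np.log(y.mean() / (1.0 - y.mean()))
p0 = sigmoid(np.full_like(y, F0))
r = y - p0  # probability residuals

left = income <= 55  # split taken from the hand trace

def leaf_value(mask):
    # Newton-style leaf value: sum of residuals over sum of p * (1 - p)
    return r[mask].sum() / (p0[mask] * (1.0 - p0[mask])).sum()

gamma_left, gamma_right = leaf_value(left), leaf_value(~left)
F1 = F0 + nu * np.where(left, gamma_left, gamma_right)
p1 = sigmoid(F1)

print(f"F0={F0:.3f}, gamma_left={gamma_left:.3f}, gamma_right={gamma_right:.3f}")
print("p1 =", np.round(p1, 3))
```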
sklearn Implementation
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score
import numpy as np
# Regression: California Housing
ch = fetch_california_housing()
X_r, y_r = ch.data, ch.target
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(X_r, y_r, test_size=0.2, random_state=42)
gb_reg = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3,
    subsample=0.8, random_state=42
)
gb_reg.fit(Xr_tr, yr_tr)
y_pred_r = gb_reg.predict(Xr_te)
print(f"GB Regressor: RMSE={np.sqrt(mean_squared_error(yr_te, y_pred_r)):.4f}, R²={r2_score(yr_te, y_pred_r):.4f}")

# Classification: Breast Cancer
bc = load_breast_cancer()
X_c, y_c = bc.data, bc.target
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(X_c, y_c, test_size=0.2, random_state=42, stratify=y_c)
gb_clf = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3,
    subsample=0.8, random_state=42
)
gb_clf.fit(Xc_tr, yc_tr)
y_prob_c = gb_clf.predict_proba(Xc_te)[:, 1]
print(f"GB Classifier: Test={gb_clf.score(Xc_te, yc_te):.4f}, AUC={roc_auc_score(yc_te, y_prob_c):.4f}")

Output:
GB Regressor: RMSE=0.4512, R²=0.8124
GB Classifier: Test=0.9737, AUC=0.9961
GB Regressor (RMSE=0.451) beats Random Forest (RMSE=0.503) on California Housing — GB's sequential residual fitting extracts more signal given enough rounds.
Hyperparameter Sweep: n_estimators × learning_rate
configs = [(50, 0.5), (100, 0.2), (200, 0.1), (500, 0.05), (1000, 0.01)]
print(f"{'n':>6} | {'lr':>5} | {'RMSE':>8} | {'R²':>8}")
for n, lr in configs:
    # Same depth and subsample as gb_reg above; only n and lr vary
    gb = GradientBoostingRegressor(n_estimators=n, learning_rate=lr,
                                   max_depth=3, subsample=0.8, random_state=42)
    gb.fit(Xr_tr, yr_tr)
    pred = gb.predict(Xr_te)
    rmse = np.sqrt(mean_squared_error(yr_te, pred))
    r2 = r2_score(yr_te, pred)
    print(f"{n:>6} | {lr:>5.2f} | {rmse:>8.4f} | {r2:>8.4f}")

Output:
     n |    lr |     RMSE |       R²
    50 |  0.50 |   0.4721 |   0.8043
   100 |  0.20 |   0.4589 |   0.8098
   200 |  0.10 |   0.4512 |   0.8124
   500 |  0.05 |   0.4489 |   0.8133
  1000 |  0.01 |   0.4621 |   0.8087
n=200/lr=0.1 and n=500/lr=0.05 give nearly identical results: the product n×lr (20 and 25 here) is a rough proxy for the effective step budget. n=1000/lr=0.01 degrades because its budget is only 10, roughly what 100 rounds at 0.1 would provide, and that is not enough.
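The budget arithmetic for the five configurations is easy to tabulate. Treat n×lr as a heuristic for comparing settings, not an exact equivalence between them.

```python
configs = [(50, 0.5), (100, 0.2), (200, 0.1), (500, 0.05), (1000, 0.01)]

# n * lr as a crude "step budget" for each configuration
budgets = [(n, lr, n * lr) for n, lr in configs]
for n, lr, budget in budgets:
    print(f"n={n:>4}  lr={lr:<5}  budget={budget:g}")
```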
max_depth Sweep
print(f"{'depth':>8} | {'RMSE':>8}")
for depth in [1, 2, 3, 5, 7]:
    # Same n_estimators, learning rate and subsample as gb_reg; only depth varies
    gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                   max_depth=depth, subsample=0.8, random_state=42)
    gb.fit(Xr_tr, yr_tr)
    rmse = np.sqrt(mean_squared_error(yr_te, gb.predict(Xr_te)))
    print(f"{depth:>8} | {rmse:>8.4f}")

Output:
   depth |     RMSE
       1 |   0.5612  (stumps: additive model, no feature interactions)
       2 |   0.4831
       3 |   0.4512  ← sweet spot
       5 |   0.4612  (mild overfitting)
       7 |   0.4789  (more overfitting)
GB with max_depth=3 (up to 8 leaves): each tree can capture interactions among up to three features. Unlike AdaBoost (which typically uses stumps, depth=1), GB benefits from slightly deeper trees, but depth=5+ starts overfitting here even with learning rate regularization.
Stochastic Gradient Boosting: subsample
print(f"{'subsample':>10} | {'RMSE':>8}")
for ss in [0.5, 0.6, 0.8, 1.0]:
    gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                   max_depth=3, subsample=ss, random_state=42)
    gb.fit(Xr_tr, yr_tr)
    rmse = np.sqrt(mean_squared_error(yr_te, gb.predict(Xr_te)))
    print(f"{ss:>10} | {rmse:>8.4f}")

Output:
 subsample |     RMSE
       0.5 |   0.4601
       0.6 |   0.4532
       0.8 |   0.4512  ← best
       1.0 |   0.4578
subsample=0.8: train each tree on a random 80% of the training data, drawn without replacement. This differs from the bootstrap (with replacement) but has a similar effect: the injected randomness decorrelates consecutive trees and acts as regularization. subsample=1.0 (full dataset) is slightly worse here because every tree sees the same data and can overfit the same patterns.
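A quick simulation, using a hypothetical dataset size of 1000 rows, shows how much data two independent 80% subsamples share and how many unique rows a bootstrap sample contains. The helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 200  # hypothetical dataset size and number of repetitions

def subsample_overlap(frac):
    # Two independent draws without replacement, as with subsample=frac
    a = rng.choice(n, size=int(frac * n), replace=False)
    b = rng.choice(n, size=int(frac * n), replace=False)
    return np.intersect1d(a, b).size / n

def bootstrap_unique_fraction():
    # One bootstrap sample: n draws with replacement
    return np.unique(rng.integers(0, n, size=n)).size / n

overlap = np.mean([subsample_overlap(0.8) for _ in range(trials)])
unique = np.mean([bootstrap_unique_fraction() for _ in range(trials)])
print(f"shared rows between two 80% subsamples: {overlap:.3f} of the dataset")
print(f"unique rows in one bootstrap sample:    {unique:.3f} of the dataset")
```

The first number lands near 0.64 (0.8 × 0.8) and the second near 0.632 (1 − 1/e).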
Feature Importance
importances = gb_reg.feature_importances_
for name, imp in sorted(zip(ch.feature_names, importances), key=lambda x: -x[1]):
    print(f"  {name:20s}: {imp:.4f}")

Output:
  MedInc              : 0.3812
  Latitude            : 0.1723
  Longitude           : 0.1634
  AveOccup            : 0.1201
  HouseAge            : 0.0823
  AveRooms            : 0.0481
  AveBedrms           : 0.0231
  Population          : 0.0095
MedInc (median income) accounts for 38% of feature importance — the dominant predictor of California housing prices. Geography (Latitude + Longitude = 34%) is the second strongest signal.
Staged Prediction: RMSE Over Rounds
from sklearn.metrics import mean_squared_error
import numpy as np

train_rmse = []
test_rmse = []
# staged_predict yields the ensemble's prediction after each boosting round
for y_pred_tr, y_pred_te in zip(
    gb_reg.staged_predict(Xr_tr),
    gb_reg.staged_predict(Xr_te)
):
    train_rmse.append(np.sqrt(mean_squared_error(yr_tr, y_pred_tr)))
    test_rmse.append(np.sqrt(mean_squared_error(yr_te, y_pred_te)))

print(f"Round 1: Train={train_rmse[0]:.4f}, Test={test_rmse[0]:.4f}")
print(f"Round 50: Train={train_rmse[49]:.4f}, Test={test_rmse[49]:.4f}")
print(f"Round 100: Train={train_rmse[99]:.4f}, Test={test_rmse[99]:.4f}")
print(f"Round 200: Train={train_rmse[199]:.4f}, Test={test_rmse[199]:.4f}")

Output:
Round 1: Train=0.9021, Test=0.9089
Round 50: Train=0.5231, Test=0.5312
Round 100: Train=0.4712, Test=0.4789
Round 200: Train=0.4121, Test=0.4512
Figure: train (blue) and test (orange) RMSE versus boosting round (x-axis: rounds 1 to 200; y-axis: RMSE from 0.41 to 0.90). Train RMSE keeps falling, test RMSE falls and then flattens; a shaded band marks the region labeled "optimal".
Train RMSE keeps decreasing as trees are added, since each new tree is fit to reduce the remaining training error. Test RMSE flattens out around round 120–150, and adding more trees beyond the plateau risks overfitting. The shaded region marks the optimal n_estimators for this dataset.
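One way to turn the staged curve into a choice of n_estimators is to take the round with the lowest held-out error. The sketch below does this on a small synthetic dataset from `make_regression`, an assumption made so the example runs quickly. It also scores on a separate validation split rather than the test set, which avoids tuning on test data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not California Housing), split into train and validation
X, y = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

gb = GradientBoostingRegressor(n_estimators=150, learning_rate=0.1,
                               max_depth=3, random_state=0)
gb.fit(X_tr, y_tr)

# Validation RMSE after each boosting round
val_rmse = np.array([
    np.sqrt(mean_squared_error(y_val, pred))
    for pred in gb.staged_predict(X_val)
])
best_round = int(np.argmin(val_rmse)) + 1  # rounds are 1-indexed
print(f"best round: {best_round}, validation RMSE: {val_rmse[best_round - 1]:.4f}")
```

Refitting with `n_estimators=best_round` gives the same model as truncating the ensemble at that round when `subsample=1.0` and the random seed is unchanged.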
GB vs AdaBoost vs Random Forest
| Aspect | Random Forest | AdaBoost | Gradient Boosting |
|---|---|---|---|
| Method | Bagging (parallel) | Sequential reweighting | Sequential residual fitting |
| Base learner | Deep trees | Stumps (depth=1) | Shallow trees (depth=2–5) |
| Loss function | Fixed (Gini/MSE) | Exponential loss | Any differentiable loss |
| Speed | Fast (parallel) | Medium (sequential) | Slowest (sequential) |
| Typical accuracy | Good | Good | Best (when tuned) |
| Overfitting risk | Low | Medium | Medium–High |
| Key hyperparams | n, max_features | n, learning_rate | n, lr, max_depth, subsample |
Test Your Understanding
- The leaf value formula for classification is $\gamma_j = \sum_{i \in R_j} r_i \,/\, \sum_{i \in R_j} p_i(1 - p_i)$. For the left leaf (3 samples, all defaults, $r_i = 0.625$, $p_i = 0.375$): verify the computation. Why does the denominator use $\sum p_i(1 - p_i)$ instead of simply the leaf size $n_j$ (as in regression)? What does this term represent in the second-order Taylor expansion of cross-entropy loss?
- The n_estimators × learning_rate table shows n=1000/lr=0.01 gives RMSE=0.462, worse than n=200/lr=0.1 (RMSE=0.451). The effective step budget is 1000×0.01=10 vs 200×0.1=20. Why is n=1000/lr=0.01 with budget 10 worse than n=200/lr=0.1 with budget 20, even though 1000 trees > 200 trees?
- subsample=0.8 outperforms subsample=1.0. Each tree in Stochastic GB sees a random 80% subset, drawn without replacement (unlike bootstrap). At round $t$, two consecutive trees share 80%×80%=64% of the data in expectation. How does this compare to Random Forest, where each bootstrap sample contains about 63% of the unique training rows, and what different regularization effect does each achieve?
- GB with max_depth=3 gives RMSE=0.4512, while max_depth=1 (stumps, like AdaBoost) gives RMSE=0.5612. But AdaBoost with max_depth=1 achieves test accuracy comparable to GB on classification tasks. Why does GB benefit more from max_depth=3 than AdaBoost does, even though both are boosting methods?
- The staged prediction curve shows train RMSE decreasing steadily while test RMSE plateaus. In theory, once training error reaches a minimum, adding more trees cannot decrease test error, but it also shouldn't increase it (the new trees only add to the existing sum). What breaks this reasoning and causes test error to eventually increase with too many rounds?