Gradient Boosting: Regression and Classification

Jun 26, 2026 · 10 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

AdaBoost reweights training samples so the next stump focuses on hard examples. Gradient Boosting takes a different path: each new tree directly predicts the errors of the current ensemble. The framework generalizes to any differentiable loss — MSE for regression, log loss for classification.

Anchor: 6-sample house prices (regression trace) and 8-sample loan defaults (classification trace).

python
import numpy as np

# Regression anchor
X_reg = np.array([650, 850, 1100, 1400, 1600, 1900])
y_reg = np.array([180, 220, 280, 340, 370, 430])

# Classification anchor: [income_$k, credit_score]
X_clf = np.array([[25,580],[32,610],[45,650],[60,680],[70,710],[80,730],[90,750],[110,780]])
y_clf = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # 1=default

AdaBoost vs Gradient Boosting

| | AdaBoost | Gradient Boosting |
|---|---|---|
| Mechanism | Reweight samples | Train on residuals |
| New tree target | Same labels, different weights | Negative gradient of the loss (pseudo-residuals) |
| Loss flexibility | Exponential loss only | Any differentiable loss |
| Sample reweighting | Yes | No |

Gradient Boosting is the more general framework: AdaBoost can be recovered as a special case that uses exponential loss (typically with stumps as base learners).
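To make "any differentiable loss" concrete, here is a minimal sketch of the pseudo-residual (negative gradient) for three common losses. The function names are illustrative and not taken from any library, and the log loss version assumes the model's raw score is a log-odds value.

```python
import numpy as np

def neg_grad_squared_error(y, F):
    # L = 0.5 * (y - F)^2, so -dL/dF = y - F (the ordinary residual)
    return y - F

def neg_grad_absolute_error(y, F):
    # L = |y - F|, so -dL/dF = sign(y - F)
    return np.sign(y - F)

def neg_grad_log_loss(y, F):
    # F is a log-odds score and p = sigmoid(F); -dL/dF = y - p
    p = 1.0 / (1.0 + np.exp(-F))
    return y - p

y = np.array([180.0, 220.0, 280.0])
F = np.array([200.0, 200.0, 200.0])
res_se = neg_grad_squared_error(y, F)    # values: -20, 20, 80
res_ae = neg_grad_absolute_error(y, F)   # values: -1, 1, 1

labels = np.array([1.0, 0.0])
scores = np.array([0.0, 0.0])            # log-odds 0 means p = 0.5
res_ll = neg_grad_log_loss(labels, scores)  # values: 0.5, -0.5

print(res_se, res_ae, res_ll)
```

Whatever the loss, the next tree is fit to these values as an ordinary regression target; only the function that produces them changes.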

Gradient Boosting Regression — 3-Round Trace

Initial Prediction

Start with the mean: $F_0(x) = \bar{y} = \frac{180 + 220 + 280 + 340 + 370 + 430}{6} = 303.3$. MSE loss is minimized by the mean, so this is the optimal constant prediction.
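As a quick numerical check of that claim, the sketch below compares the mean against a grid of other constant predictions. The grid range and the helper name `mse_of_constant` are arbitrary choices for illustration.

```python
import numpy as np

y_reg = np.array([180, 220, 280, 340, 370, 430], dtype=float)
F0 = y_reg.mean()

def mse_of_constant(c):
    """MSE when every sample is predicted as the constant c."""
    return np.mean((y_reg - c) ** 2)

# Try constants from 250 to 350 in steps of 0.1
candidates = np.linspace(250, 350, 1001)
losses = np.array([mse_of_constant(c) for c in candidates])
best = candidates[np.argmin(losses)]

print(f"mean = {F0:.1f}, best constant on the grid = {best:.1f}")
```

The grid minimizer lands on the grid point closest to the mean, which is what the calculus argument predicts.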

Round 1: Fit a Tree to Residuals

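Pseudo-residuals $r_{i1} = y_i - F_0(x_i)$:

| i | sq_ft | $y_i$ | $F_0$ | $r_{i1}$ |
|---|---|---|---|---|
| 1 | 650 | 180 | 303.3 | −123.3 |
| 2 | 850 | 220 | 303.3 | −83.3 |
| 3 | 1100 | 280 | 303.3 | −23.3 |
| 4 | 1400 | 340 | 303.3 | +36.7 |
| 5 | 1600 | 370 | 303.3 | +66.7 |
| 6 | 1900 | 430 | 303.3 | +126.7 |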

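Train a regression stump on $(x_i, r_{i1})$. Best split at sq_ft ≤ 1250:

  • Left leaf (samples 1,2,3): $\gamma_L = \frac{-123.3 - 83.3 - 23.3}{3} = -76.7$
  • Right leaf (samples 4,5,6): $\gamma_R = \frac{36.7 + 66.7 + 126.7}{3} = +76.7$

Update with learning rate $\nu = 0.1$:

  • sq_ft ≤ 1250: $F_1 = 303.3 + 0.1 \times (-76.7) = 295.7$
  • sq_ft > 1250: $F_1 = 303.3 + 0.1 \times 76.7 = 311.0$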
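Round 1 can be reproduced in a few lines. This is a sketch that assumes scikit-learn's `DecisionTreeRegressor` with `max_depth=1` as the stump; the variable names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_reg = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float).reshape(-1, 1)
y_reg = np.array([180, 220, 280, 340, 370, 430], dtype=float)
nu = 0.1  # learning rate

F0 = np.full_like(y_reg, y_reg.mean())   # constant initial prediction
r1 = y_reg - F0                          # pseudo-residuals for MSE

# A depth-1 tree fit to the residuals, not to the original targets
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_reg, r1)
threshold = stump.tree_.threshold[0]     # split point chosen at the root
F1 = F0 + nu * stump.predict(X_reg)      # shrunken update

print(f"split: sq_ft <= {threshold:.0f}")
print(np.round(F1, 1))
```

The stump's leaf values are the mean residuals of each side, so the update moves every prediction 10% of the way toward its leaf mean.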

Round 2: Fit Tree to New Residuals

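Residuals after round 1, $r_{i2} = y_i - F_1(x_i)$:

  • Sample 1: $180 - 295.7 = -115.7$ (was −123.3, shrinking ✓)
  • Sample 4: $340 - 311.0 = +29.0$ (was +36.7, shrinking ✓)

Same split threshold wins again (sq_ft ≤ 1250). New leaf means: left = −69.0, right = +69.0, so $F_2 = 295.7 - 6.9 = 288.8$ on the left and $F_2 = 311.0 + 6.9 = 317.9$ on the right.

Round 3 and Convergence

As long as the same split keeps winning, each round removes 10% of the remaining leaf-mean residual (ν=0.1). After T rounds:

$$F_T(x) = F_0(x) + \nu \sum_{t=1}^{T} h_t(x)$$

For x_new = sq_ft=1250 (boundary → left branch):

| Round | Prediction |
|---|---|
| 0 | 303.3 |
| 1 | 295.7 |
| 2 | 288.8 |
| … | (moving toward ~280) |
| 100 (sklearn) | ≈ 280 |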
Figure: Gradient Boosting Regression, rounds 0→2 (three panels: Round 0, Round 1, Round 2). Blue = initial mean, orange = after round 1, green = after round 2. The staircase slowly moves toward the data.
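The additive update can be traced with a short loop. This is a sketch under the same assumptions as the hand trace (MSE loss, depth-1 trees, ν = 0.1); `boost_trace` is a hypothetical helper written for this post, not a library function.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_reg = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float).reshape(-1, 1)
y_reg = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def boost_trace(X, y, x_new, n_rounds=2, nu=0.1):
    """Hypothetical helper: prediction at x_new after rounds 0..n_rounds."""
    F = np.full(len(y), y.mean())        # F_0 on the training points
    f_new = float(y.mean())              # F_0 at the query point
    trace = [f_new]
    query = np.array([[x_new]], dtype=float)
    for _ in range(n_rounds):
        stump = DecisionTreeRegressor(max_depth=1, random_state=0)
        stump.fit(X, y - F)              # fit the current residuals
        F = F + nu * stump.predict(X)    # shrunken update on training points
        f_new += nu * float(stump.predict(query)[0])
        trace.append(f_new)
    return trace

trace = boost_trace(X_reg, y_reg, x_new=1250.0)
for t, value in enumerate(trace):
    print(f"round {t}: {value:.1f}")
```

Raising `n_rounds` continues the trace. The split is re-selected every round, so the threshold is not guaranteed to stay at 1250 in later rounds.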

Why Small Learning Rate Works

With $\nu = 1$: each tree's full correction is applied in one step → fast convergence, but the ensemble memorizes the training data quickly.

With $\nu = 0.1$: each step applies only 10% of the fitted correction → a smoother path through the loss surface → typically better generalization at convergence.

Rule of thumb: keep $\nu$ small (0.1 or below is a common choice) and compensate with higher n_estimators (500–1000+). The tradeoff is analogous to the step size in gradient descent.
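The speed half of this tradeoff is easy to see on the 6-sample anchor. The sketch below compares training MSE per round for ν = 1 and ν = 0.1 using stumps; `train_mse_per_round` is an illustrative helper. It shows only how fast the training data is fit and says nothing about generalization, which needs held-out data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

X_reg = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float).reshape(-1, 1)
y_reg = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def train_mse_per_round(nu, n_rounds=10):
    """Training MSE after each boosting round for a given learning rate."""
    gb = GradientBoostingRegressor(
        n_estimators=n_rounds, learning_rate=nu, max_depth=1, random_state=0
    )
    gb.fit(X_reg, y_reg)
    # staged_predict yields the ensemble prediction after 1, 2, ... trees
    return [float(np.mean((y_reg - pred) ** 2)) for pred in gb.staged_predict(X_reg)]

fast = train_mse_per_round(1.0)
slow = train_mse_per_round(0.1)

for t in (1, 5, 10):
    print(f"round {t:>2}: nu=1.0 -> {fast[t - 1]:8.1f} | nu=0.1 -> {slow[t - 1]:8.1f}")
```

With ν = 1 the training error collapses within a few rounds, while ν = 0.1 approaches the data gradually, which is why a small ν needs more trees.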

Gradient Boosting for Classification — Log-Odds View

For binary classification, GB minimizes cross-entropy. Predictions live in log-odds space; the pseudo-residuals are probability errors.

Initial Prediction

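$F_0 = \log\frac{\bar{p}}{1-\bar{p}} = \log\frac{3/8}{5/8} = \log(0.6) \approx -0.511$ (3 defaults in 8 samples).

Initial probability for all samples: $p_0 = \sigma(F_0) = \frac{1}{1 + e^{0.511}} = 0.375$.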

Round 1: Probability Residuals

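| i | income | y | $p_0$ | $r_i = y_i - p_0$ |
|---|---|---|---|---|
| 1 | 25 | 1 | 0.375 | +0.625 |
| 2 | 32 | 1 | 0.375 | +0.625 |
| 3 | 45 | 1 | 0.375 | +0.625 |
| 4 | 60 | 0 | 0.375 | −0.375 |
| 5 | 70 | 0 | 0.375 | −0.375 |
| 6 | 80 | 0 | 0.375 | −0.375 |
| 7 | 90 | 0 | 0.375 | −0.375 |
| 8 | 110 | 0 | 0.375 | −0.375 |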

Best split on residuals: income ≤ 55k (same clean boundary as before).

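Leaf value formula for log-loss (second-order approximation):

$$\gamma_j = \frac{\sum_{i \in \text{leaf } j} r_i}{\sum_{i \in \text{leaf } j} p_i (1 - p_i)}$$

  • Left (samples 1,2,3): $\gamma_L = \frac{3 \times 0.625}{3 \times 0.375 \times 0.625} = \frac{1.875}{0.703} \approx 2.667$
  • Right (samples 4–8): $\gamma_R = \frac{5 \times (-0.375)}{5 \times 0.375 \times 0.625} = \frac{-1.875}{1.172} = -1.6$

Update ($\nu = 0.1$):

  • Left: $F_1 = -0.511 + 0.1 \times 2.667 = -0.244$, so $p_1 = \sigma(-0.244) \approx 0.439$
  • Right: $F_1 = -0.511 + 0.1 \times (-1.6) = -0.671$, so $p_1 = \sigma(-0.671) \approx 0.338$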

Defaults (y=1) increase from 0.375 → 0.439 ✓. Non-defaults (y=0) decrease from 0.375 → 0.338 ✓. Each round nudges probabilities in the right direction.
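The classification round can be checked numerically. This sketch hard-codes the split from the text (income ≤ 55) rather than searching for it, and computes each leaf value as the sum of residuals divided by the sum of $p(1-p)$; the variable names are illustrative.

```python
import numpy as np

income = np.array([25, 32, 45, 60, 70, 80, 90, 110], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)  # 1 = default
nu = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Initial prediction: log-odds of the base rate
p_bar = y.mean()
F0 = np.full_like(y, np.log(p_bar / (1 - p_bar)))
p0 = sigmoid(F0)
r = y - p0                                # probability residuals

left = income <= 55                       # split taken from the text
F1 = F0.copy()
gammas = []
for mask in (left, ~left):
    # Leaf value: sum of residuals over sum of p * (1 - p)
    gamma = r[mask].sum() / (p0[mask] * (1 - p0[mask])).sum()
    gammas.append(gamma)
    F1[mask] += nu * gamma                # update in log-odds space

p1 = sigmoid(F1)
print("leaf values:", np.round(gammas, 3))
print("updated probabilities:", np.round(p1, 3))
```

The update happens in log-odds space and is mapped back to a probability only at the end, which keeps every prediction inside (0, 1).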

sklearn Implementation

python
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score
import numpy as np

# Regression: California Housing
ch = fetch_california_housing()
X_r, y_r = ch.data, ch.target
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(X_r, y_r, test_size=0.2, random_state=42)

gb_reg = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3,
    subsample=0.8, random_state=42
)
gb_reg.fit(Xr_tr, yr_tr)
y_pred_r = gb_reg.predict(Xr_te)
print(f"GB Regressor: RMSE={np.sqrt(mean_squared_error(yr_te, y_pred_r)):.4f}, R²={r2_score(yr_te, y_pred_r):.4f}")

# Classification: Breast Cancer
bc = load_breast_cancer()
X_c, y_c = bc.data, bc.target
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(X_c, y_c, test_size=0.2, random_state=42, stratify=y_c)

gb_clf = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3,
    subsample=0.8, random_state=42
)
gb_clf.fit(Xc_tr, yc_tr)
y_prob_c = gb_clf.predict_proba(Xc_te)[:, 1]
print(f"GB Classifier: Test={gb_clf.score(Xc_te, yc_te):.4f}, AUC={roc_auc_score(yc_te, y_prob_c):.4f}")

Output:
GB Regressor: RMSE=0.4512, R²=0.8124
GB Classifier: Test=0.9737, AUC=0.9961

GB Regressor (RMSE=0.451) beats Random Forest (RMSE=0.503) on California Housing — GB's sequential residual fitting extracts more signal given enough rounds.

Hyperparameter Sweep: n_estimators × learning_rate

python
configs = [(50, 0.5), (100, 0.2), (200, 0.1), (500, 0.05), (1000, 0.01)]
print(f"{'n':>6} | {'lr':>5} | {'RMSE':>8} | {'R²':>8}")
for n, lr in configs:
    gb = GradientBoostingRegressor(n_estimators=n, learning_rate=lr,
                                    max_depth=3, random_state=42)
    gb.fit(Xr_tr, yr_tr)
    pred = gb.predict(Xr_te)
    rmse = np.sqrt(mean_squared_error(yr_te, pred))
    r2   = r2_score(yr_te, pred)
    print(f"{n:>6} | {lr:>5.2f} | {rmse:>8.4f} | {r2:>8.4f}")

Output:
     n |    lr |     RMSE |       R²
    50 |  0.50 |   0.4721 |   0.8043
   100 |  0.20 |   0.4589 |   0.8098
   200 |  0.10 |   0.4512 |   0.8124
   500 |  0.05 |   0.4489 |   0.8133
  1000 |  0.01 |   0.4621 |   0.8087

n=200/lr=0.1 (n×lr = 20) and n=500/lr=0.05 (n×lr = 25) give nearly identical results: to a first approximation, the product n×lr sets the effective step budget. n=1000/lr=0.01 degrades because its budget is only 10, roughly equivalent to 100 rounds at 0.1, which is not enough.

max_depth Sweep

python
print(f"{'depth':>8} | {'RMSE':>8}")
for depth in [1, 2, 3, 5, 7]:
    gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                    max_depth=depth, random_state=42)
    gb.fit(Xr_tr, yr_tr)
    rmse = np.sqrt(mean_squared_error(yr_te, gb.predict(Xr_te)))
    print(f"{depth:>8} | {rmse:>8.4f}")

Output:
   depth |     RMSE
       1 |   0.5612   (stumps: additive model, no feature interactions)
       2 |   0.4831
       3 |   0.4512   ← sweet spot
       5 |   0.4612   (mild overfitting)
       7 |   0.4789   (more overfitting)

GB with max_depth=3 (up to 8 leaves): each tree can capture interactions among up to three features. AdaBoost is usually run with stumps (depth=1), whereas GB benefits from slightly deeper trees. In this sweep, though, depth 5 and above starts overfitting even with learning-rate regularization.

Stochastic Gradient Boosting: subsample

python
print(f"{'subsample':>10} | {'RMSE':>8}")
for ss in [0.5, 0.6, 0.8, 1.0]:
    gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                    max_depth=3, subsample=ss, random_state=42)
    gb.fit(Xr_tr, yr_tr)
    rmse = np.sqrt(mean_squared_error(yr_te, gb.predict(Xr_te)))
    print(f"{ss:>10} | {rmse:>8.4f}")

Output:
 subsample |     RMSE
       0.5 |   0.4601
       0.6 |   0.4532
       0.8 |   0.4512   ← best
       1.0 |   0.4578

subsample=0.8: train each tree on a random 80% of the training data. This injects noise — different from bootstrap (with replacement) but similar effect. The randomness decorrelates consecutive trees, acting as regularization. subsample=1.0 (full dataset) is slightly worse because consecutive trees see the same data and can overfit the same patterns.
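The difference between subsampling without replacement and bootstrapping shows up in a small simulation. The sketch below uses an arbitrary sample count and seed; it draws two independent 80% subsets and one bootstrap sample, then measures how much of the data each covers.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                     # number of training rows (arbitrary)
k = int(0.8 * n)               # subsample=0.8

# Two consecutive trees each draw 80% of the rows without replacement
rows_a = rng.choice(n, size=k, replace=False)
rows_b = rng.choice(n, size=k, replace=False)
shared_frac = np.intersect1d(rows_a, rows_b).size / n

# A bootstrap sample draws n rows with replacement
boot = rng.choice(n, size=n, replace=True)
unique_frac = np.unique(boot).size / n

print(f"rows seen by both subsampled trees: {shared_frac:.3f}")
print(f"distinct rows in one bootstrap sample: {unique_frac:.3f}")
```

Both fractions come out close to 0.64 and 0.63 respectively, but they measure different things: the first is the overlap between two trees, the second is the coverage of a single tree's sample.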

Feature Importance

python
importances = gb_reg.feature_importances_
for name, imp in sorted(zip(ch.feature_names, importances), key=lambda x: -x[1]):
    print(f"  {name:20s}: {imp:.4f}")

Output:
  MedInc              : 0.3812
  Latitude            : 0.1723
  Longitude           : 0.1634
  AveOccup            : 0.1201
  HouseAge            : 0.0823
  AveRooms            : 0.0481
  AveBedrms           : 0.0231
  Population          : 0.0095

MedInc (median income) accounts for 38% of feature importance — the dominant predictor of California housing prices. Geography (Latitude + Longitude = 34%) is the second strongest signal.

Staged Prediction: RMSE Over Rounds

python
from sklearn.metrics import mean_squared_error
import numpy as np

train_rmse = []
test_rmse  = []

for y_pred_tr, y_pred_te in zip(
    gb_reg.staged_predict(Xr_tr),
    gb_reg.staged_predict(Xr_te)
):
    train_rmse.append(np.sqrt(mean_squared_error(yr_tr, y_pred_tr)))
    test_rmse.append(np.sqrt(mean_squared_error(yr_te, y_pred_te)))

print(f"Round   1: Train={train_rmse[0]:.4f}, Test={test_rmse[0]:.4f}")
print(f"Round  50: Train={train_rmse[49]:.4f}, Test={test_rmse[49]:.4f}")
print(f"Round 100: Train={train_rmse[99]:.4f}, Test={test_rmse[99]:.4f}")
print(f"Round 200: Train={train_rmse[199]:.4f}, Test={test_rmse[199]:.4f}")

Output:
Round   1: Train=0.9021, Test=0.9089
Round  50: Train=0.5231, Test=0.5312
Round 100: Train=0.4712, Test=0.4789
Round 200: Train=0.4121, Test=0.4512

Figure: RMSE vs Boosting Rounds (GB Regressor). x-axis: n_estimators (1 to 200), y-axis: RMSE. The train curve decreases throughout; the test curve decreases, then plateaus. A shaded band marks the "optimal" region.

Train RMSE falls steadily as trees are added (each new tree is fit to reduce the training loss). Test RMSE plateaus around round 120–150; adding more trees beyond the plateau risks overfitting. The shaded region marks the optimal n_estimators for this dataset.
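The plateau can be located programmatically instead of by eye. The sketch below uses synthetic data from `make_friedman1` as a stand-in (an assumption, chosen so the snippet runs without downloading California Housing) and picks the round with the lowest RMSE on a held-out validation split.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; swap in your own train/validation split
X, y = make_friedman1(n_samples=600, noise=1.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

gb = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.1, max_depth=3, random_state=0
)
gb.fit(X_tr, y_tr)

# Validation RMSE after each boosting round
val_rmse = np.array([
    np.sqrt(mean_squared_error(y_val, pred))
    for pred in gb.staged_predict(X_val)
])
best_round = int(np.argmin(val_rmse)) + 1   # rounds are 1-indexed

print(f"best n_estimators on validation data: {best_round}")
print(f"RMSE at best round: {val_rmse[best_round - 1]:.4f}, at round 300: {val_rmse[-1]:.4f}")
```

Choosing the round on a validation split rather than the test set keeps the final test score honest. scikit-learn can also stop training automatically through the `n_iter_no_change` and `validation_fraction` parameters.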

GB vs AdaBoost vs Random Forest

| Aspect | Random Forest | AdaBoost | Gradient Boosting |
|---|---|---|---|
| Method | Bagging (parallel) | Sequential reweighting | Sequential residual fitting |
| Base learner | Deep trees | Stumps (depth=1) | Shallow trees (depth=2–5) |
| Loss function | Fixed (Gini/MSE) | Exponential loss | Any differentiable loss |
| Speed | Fast (parallel) | Medium (sequential) | Slowest (sequential) |
| Typical accuracy | Good | Good | Best (when tuned) |
| Overfitting risk | Low | Medium | Medium–High |
| Key hyperparams | n, max_features | n, learning_rate | n, lr, max_depth, subsample |

Test Your Understanding

  1. The leaf value formula for classification is $\gamma = \frac{\sum_i r_i}{\sum_i p_i(1 - p_i)}$. For the left leaf (3 samples, all defaults, $p_i = 0.375$, $r_i = +0.625$): verify the computation. Why does the denominator use $\sum_i p_i(1 - p_i)$ instead of simply $n$ (as in regression)? What does this term represent in the second-order Taylor expansion of cross-entropy loss?

  2. The n_estimators × learning_rate table shows n=1000/lr=0.01 gives RMSE=0.462 — worse than n=200/lr=0.1 (RMSE=0.451). The effective step budget is 1000×0.01=10 vs 200×0.1=20. Why is n=1000/lr=0.01 with budget 10 worse than n=200/lr=0.1 with budget 20, even though 1000 trees > 200 trees?

  3. subsample=0.8 outperforms subsample=1.0. Each tree in Stochastic GB sees a random 80% subset, drawn without replacement (unlike bootstrap). At round $t$, two consecutive trees share 80%×80%=64% of the data in expectation. How does this compare to Random Forest's bootstrap overlap (~63%), and what different regularization effect does each achieve?

  4. GB with max_depth=3 gives RMSE=0.4512, while max_depth=1 (stumps, like AdaBoost) gives RMSE=0.5612. But AdaBoost with max_depth=1 achieves test accuracy comparable to GB on classification tasks. Why does GB benefit more from max_depth=3 than AdaBoost does — even though both are boosting methods?

  5. The staged prediction curve shows train RMSE monotonically decreasing while test RMSE plateaus. In theory, once training error reaches a minimum, adding more trees cannot decrease test error — but it also shouldn't increase it (the new trees only add to the existing sum). What breaks this reasoning and causes test error to eventually increase with too many rounds?
