Gradient Boosting: Regression and Classification

Jun 26, 2026•10 min read•By Mohammed Vasim

Machine Learning • AI • Data Science

AdaBoost reweights training samples so the next stump focuses on hard examples. Gradient Boosting takes a different path: each new tree directly predicts the errors of the current ensemble. The framework generalizes to any differentiable loss — MSE for regression, log loss for classification.

Anchor: 6-sample house prices (regression trace) and 8-sample loan defaults (classification trace).

python

import numpy as np

# Regression anchor: square footage -> house price
X_reg = np.array([650, 850, 1100, 1400, 1600, 1900])  # sq_ft; call .reshape(-1, 1) before passing to sklearn
y_reg = np.array([180, 220, 280, 340, 370, 430])      # price

# Classification anchor: [income_$k, credit_score]
X_clf = np.array([[25,580],[32,610],[45,650],[60,680],[70,710],[80,730],[90,750],[110,780]])
y_clf = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # 1=default

AdaBoost vs Gradient Boosting

	AdaBoost	Gradient Boosting
Mechanism	Reweight samples	Train on residuals
New tree target	Same labels, different weights	$r_{i} = y_{i} - \hat{y}_{i}$ (pseudo-residuals)
Loss flexibility	Exponential loss only	Any differentiable loss
Sample reweighting	Yes	No

Gradient Boosting is the more general framework: AdaBoost can be viewed as a special case, namely stagewise boosting under exponential loss (classically with stumps as base learners).

Gradient Boosting Regression — 3-Round Trace

Initial Prediction

$F_{0} (x) = \bar{y} = (180 + 220 + 280 + 340 + 370 + 430) /6 = 303.3$

Start with the mean. MSE loss is minimized by the mean, so this is the optimal constant prediction.
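A quick numerical check of that claim, as a sketch: scan a grid of constant predictions and confirm that the MSE is lowest at the mean. The grid range and step are arbitrary choices made for illustration.

```python
import numpy as np

y_reg = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Candidate constant predictions from 250 to 350 in steps of 0.1 (illustrative grid)
candidates = np.linspace(250.0, 350.0, 1001)

# MSE of each constant c: mean over samples of (y_i - c)^2
mse = ((y_reg[None, :] - candidates[:, None]) ** 2).mean(axis=1)
best_constant = candidates[np.argmin(mse)]

print(f"mean of y     = {y_reg.mean():.1f}")
print(f"best constant = {best_constant:.1f}")
```

Both printed values should agree to the grid resolution, which is why the trace starts from the mean rather than from zero.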

Round 1: Fit a Tree to Residuals

Pseudo-residuals: $r_{i}^{(1)} = y_{i} - F_{0} (x_{i}) = y_{i} - 303.3$

i	sq_ft	$y$	$F_{0}$	$r_{1}$
1	650	180	303.3	−123.3
2	850	220	303.3	−83.3
3	1100	280	303.3	−23.3
4	1400	340	303.3	+36.7
5	1600	370	303.3	+66.7
6	1900	430	303.3	+126.7

Train a regression stump on $(X_{reg}, r_{1})$ . Best split at sq_ft ≤ 1250:

Left leaf (samples 1,2,3): $\bar{r} = (- 123.3 - 83.3 - 23.3) /3 \approx - 76.7$ (−76.67 before rounding the residuals)
Right leaf (samples 4,5,6): $\bar{r} = (36.7 + 66.7 + 126.7) /3 = 76.7$

Update with learning rate $ν = 0.1$ :

$F_{1} (x) = F_{0} (x) + ν \cdot h_{1} (x)$

sq_ft ≤ 1250: $F_{1} = 303.3 + 0.1 \times (- 76.7) = 303.3 - 7.67 = 295.6$
sq_ft > 1250: $F_{1} = 303.3 + 0.1 \times 76.7 = 303.3 + 7.67 = 311.0$
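Round 1 can be reproduced with plain NumPy. This is a minimal sketch, not sklearn's implementation: the helper `best_stump` is a hypothetical name, and it assumes candidate thresholds are the midpoints between consecutive sq_ft values.

```python
import numpy as np

X_reg = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y_reg = np.array([180, 220, 280, 340, 370, 430], dtype=float)
nu = 0.1

F0 = np.full_like(y_reg, y_reg.mean())   # initial constant prediction
r1 = y_reg - F0                          # pseudo-residuals under MSE loss

def best_stump(x, r):
    """Try midpoints between consecutive x values; return the split with lowest SSE."""
    xs = np.sort(x)
    best = None
    for lo, hi in zip(xs[:-1], xs[1:]):
        thr = (lo + hi) / 2.0
        left, right = r[x <= thr], r[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1], best[2], best[3]

thr, left_mean, right_mean = best_stump(X_reg, r1)

# Shrunken update: F1 = F0 + nu * h1(x)
F1 = F0 + nu * np.where(X_reg <= thr, left_mean, right_mean)

print(f"split: sq_ft <= {thr:.0f}")
print(f"leaf means: left={left_mean:.2f}, right={right_mean:.2f}")
print(f"F1: left={F1[0]:.2f}, right={F1[-1]:.2f}")
```

The script works with unrounded residuals, so its output can differ from the hand trace in the last digit.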

Round 2: Fit Tree to New Residuals

$r_{i}^{(2)} = y_{i} - F_{1} (x_{i})$ :

Sample 1: $180 - 295.6 = - 115.6$ (was −123.3 — shrinking ✓)
Sample 4: $340 - 311.0 = + 29.0$ (was +36.7 — shrinking ✓)

Same split threshold wins again (sq_ft ≤ 1250). New leaf means: left = −69.0, right = +69.0 (0.9 × the round-1 leaf means).

$F_{2} : left = 295.6 + 0.1 \times (- 69.0) = 288.7, right = 311.0 + 0.1 \times 69.0 = 317.9$

Round 3 and Convergence

While the same split is reused, each leaf's mean residual shrinks by 10% per round (ν=0.1), as in 76.7 → 69.0. After T rounds:

$F_{T} (x) = F_{0} (x) + ν \sum_{t = 1}^{T} h_{t} (x)$

For x_new = sq_ft=1250 (boundary → left branch):

Round	Prediction
0	303.3
1	295.6
2	288.7
…	… (shrinking toward ~280)
100 (sklearn)	≈ 280
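The per-round prediction at x_new can be traced with a hand-rolled loop. This sketch assumes depth-1 `DecisionTreeRegressor` stumps as the base learner and is not a drop-in replacement for `GradientBoostingRegressor`; `n_rounds` is set to 3 only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_reg = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float).reshape(-1, 1)
y_reg = np.array([180, 220, 280, 340, 370, 430], dtype=float)
x_new = np.array([[1250.0]])
nu, n_rounds = 0.1, 3

F_train = np.full_like(y_reg, y_reg.mean())   # F_0 on the training set
F_new = float(y_reg.mean())                   # F_0 at x_new
history = [F_new]

for t in range(1, n_rounds + 1):
    residuals = y_reg - F_train               # pseudo-residuals under MSE loss
    stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_reg, residuals)
    F_train = F_train + nu * stump.predict(X_reg)
    F_new = F_new + nu * float(stump.predict(x_new)[0])
    history.append(F_new)

for t, value in enumerate(history):
    print(f"round {t}: F({x_new[0, 0]:.0f}) = {value:.2f}")
```

In later rounds a different split can win, so the prediction at a fixed point need not move monotonically from one round to the next.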

Figure: three panels (Round 0, Round 1, Round 2) plotting the six data points against the current prediction. Blue = initial mean (303), orange = staircase after round 1 (295.6 left, 311.0 right), green = staircase after round 2. The staircase slowly moves toward the data.

Why Small Learning Rate Works

With $ν = 1.0$ : each tree fully corrects the residual in one step → fast convergence but memorizes training data quickly.

With $ν = 0.1$ : each step corrects 10% of the residual → smooth path through the loss surface → better generalization at convergence.

Rule of thumb: $ν \leq 0.1$ , compensate with higher n_estimators (500–1000+). The tradeoff is analogous to choosing a step size in gradient descent.
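To make the shrinkage concrete, here is a small arithmetic sketch. It assumes squared-error loss and that the same partition is reused every round (a simplification), in which case a leaf's mean residual is multiplied by 1 − ν per round. `remaining_fraction` is a hypothetical helper name.

```python
def remaining_fraction(nu: float, rounds: int) -> float:
    """Fraction of a leaf's initial mean residual left after `rounds` updates,
    assuming the same partition is reused every round."""
    return (1.0 - nu) ** rounds

for nu in (1.0, 0.1, 0.01):
    row = ", ".join(f"T={T}: {remaining_fraction(nu, T):.4f}" for T in (1, 10, 100))
    print(f"nu={nu:<4} -> {row}")
```

Under this assumption ν = 1.0 absorbs the leaf mean in a single round, while smaller ν spreads the same correction over many trees.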

Gradient Boosting for Classification — Log-Odds View

For binary classification, GB minimizes cross-entropy. Predictions live in log-odds space; the pseudo-residuals are probability errors.

Initial Prediction

$P (default) = 3/8 = 0.375$ (3 defaults in 8 samples).

$F_{0} = \log (\frac{P ( y = 1 )}{P ( y = 0 )}) = \log (\frac{3}{5}) = - 0.511$

Initial probability for all samples: $\hat{p}_{0} = σ (- 0.511) = \frac{1}{1 + e ^{0.511}} = 0.375$ .

Round 1: Probability Residuals

$r_{i}^{(1)} = y_{i} - \hat{p}_{0} = y_{i} - 0.375$

i	income	y	$\hat{p}_{0}$	$r_{1}$
1	25	1	0.375	+0.625
2	32	1	0.375	+0.625
3	45	1	0.375	+0.625
4	60	0	0.375	−0.375
5	70	0	0.375	−0.375
6	80	0	0.375	−0.375
7	90	0	0.375	−0.375
8	110	0	0.375	−0.375

Best split on residuals: income ≤ 55k (same clean boundary as before).

Leaf value formula for log-loss (second-order approximation):

$γ_{leaf} = \frac{\sum _{i} r _{i}}{\sum _{i} \hat{p} _{i} ( 1 - \hat{p} _{i} )}$

Left (samples 1,2,3): $\frac{3 \times 0.625}{3 \times 0.375 \times 0.625} = \frac{1.875}{0.703} = 2.667$
Right (samples 4–8): $\frac{5 \times ( - 0.375 )}{5 \times 0.375 \times 0.625} = \frac{- 1.875}{1.172} = - 1.600$
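Where the leaf formula comes from, as a short derivation sketch: write the per-sample log-loss in terms of the log-odds $F$, differentiate twice, and take one Newton step per leaf. The symbols $g_i$ and $h_i$ (first and second derivatives) are notation introduced here for illustration.

```latex
\begin{aligned}
L(y_i, F) &= -\left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right], \qquad \hat{p}_i = \sigma(F) \\
g_i &= \frac{\partial L}{\partial F} = \hat{p}_i - y_i = -r_i \\
h_i &= \frac{\partial^2 L}{\partial F^2} = \hat{p}_i (1 - \hat{p}_i) \\
\gamma_{leaf} &= -\frac{\sum_i g_i}{\sum_i h_i} = \frac{\sum_i r_i}{\sum_i \hat{p}_i (1 - \hat{p}_i)}
\end{aligned}
```

The sums run over the samples that fall in the leaf, so the numerator is the total residual and the denominator is the total curvature of the loss in that leaf.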

Update ( $ν = 0.1$ ):

Left: $F_{1} = - 0.511 + 0.1 \times 2.667 = - 0.244 \to \hat{p} = σ (- 0.244) = 0.439$
Right: $F_{1} = - 0.511 + 0.1 \times (- 1.600) = - 0.671 \to \hat{p} = σ (- 0.671) = 0.338$

Defaults (y=1) increase from 0.375 → 0.439 ✓. Non-defaults (y=0) decrease from 0.375 → 0.338 ✓. Each round nudges probabilities in the right direction.
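The whole classification round fits in a few lines of NumPy. This is a sketch of the hand trace, not sklearn's code path: the split at income ≤ 55 is hard-coded from the text rather than searched for, and `leaf_value` is a hypothetical helper name.

```python
import numpy as np

income = np.array([25, 32, 45, 60, 70, 80, 90, 110], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)   # 1 = default
nu = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Initial prediction: log-odds of the base rate, identical for every sample
base_rate = y.mean()
F0 = np.full_like(y, np.log(base_rate / (1.0 - base_rate)))
p0 = sigmoid(F0)

# Pseudo-residuals are probability errors
r1 = y - p0

# Split taken from the trace (assumption: not searched for here)
left = income <= 55

def leaf_value(mask):
    """Leaf value: sum of residuals over sum of p * (1 - p) within the leaf."""
    return r1[mask].sum() / (p0[mask] * (1.0 - p0[mask])).sum()

gamma_left, gamma_right = leaf_value(left), leaf_value(~left)

# Shrunken update in log-odds space, then map back to probabilities
F1 = F0 + nu * np.where(left, gamma_left, gamma_right)
p1 = sigmoid(F1)

print(f"F0={F0[0]:.3f}, p0={p0[0]:.3f}")
print(f"gamma: left={gamma_left:.3f}, right={gamma_right:.3f}")
print(f"p1: left={p1[0]:.3f}, right={p1[-1]:.3f}")
```

Because the update happens in log-odds space, the probabilities stay inside (0, 1) no matter how many rounds are added.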

sklearn Implementation

python

from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score
import numpy as np

# Regression: California Housing
ch = fetch_california_housing()
X_r, y_r = ch.data, ch.target
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(X_r, y_r, test_size=0.2, random_state=42)

gb_reg = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3,
    subsample=0.8, random_state=42
)
gb_reg.fit(Xr_tr, yr_tr)
y_pred_r = gb_reg.predict(Xr_te)
print(f"GB Regressor: RMSE={np.sqrt(mean_squared_error(yr_te, y_pred_r)):.4f}, R²={r2_score(yr_te, y_pred_r):.4f}")

# Classification: Breast Cancer
bc = load_breast_cancer()
X_c, y_c = bc.data, bc.target
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(X_c, y_c, test_size=0.2, random_state=42, stratify=y_c)

gb_clf = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3,
    subsample=0.8, random_state=42
)
gb_clf.fit(Xc_tr, yc_tr)
y_prob_c = gb_clf.predict_proba(Xc_te)[:, 1]
print(f"GB Classifier: Test={gb_clf.score(Xc_te, yc_te):.4f}, AUC={roc_auc_score(yc_te, y_prob_c):.4f}")

GB Regressor: RMSE=0.4512, R²=0.8124
GB Classifier: Test=0.9737, AUC=0.9961

GB Regressor (RMSE=0.451) beats Random Forest (RMSE=0.503) on California Housing — GB's sequential residual fitting extracts more signal given enough rounds.

Hyperparameter Sweep: n_estimators × learning_rate

python

configs = [(50, 0.5), (100, 0.2), (200, 0.1), (500, 0.05), (1000, 0.01)]
print(f"{'n':>6} | {'lr':>5} | {'RMSE':>8} | {'R²':>8}")
for n, lr in configs:
    gb = GradientBoostingRegressor(n_estimators=n, learning_rate=lr,
                                    max_depth=3, random_state=42)
    gb.fit(Xr_tr, yr_tr)
    pred = gb.predict(Xr_te)
    rmse = np.sqrt(mean_squared_error(yr_te, pred))
    r2   = r2_score(yr_te, pred)
    print(f"{n:>6} | {lr:>5.2f} | {rmse:>8.4f} | {r2:>8.4f}")

     n |    lr |     RMSE |       R²
    50 |  0.50 |   0.4721 |   0.8043
   100 |  0.20 |   0.4589 |   0.8098
   200 |  0.10 |   0.4512 |   0.8124
   500 |  0.05 |   0.4489 |   0.8133
  1000 |  0.01 |   0.4621 |   0.8087

n=200/lr=0.1 and n=500/lr=0.05 give nearly identical results: the product n×lr (20 and 25 here) roughly governs the effective step budget. n=1000/lr=0.01 degrades because its budget of 10 matches only 100 rounds at 0.1, which is not enough.
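The budget arithmetic for every configuration in the sweep, as a quick sketch. Here "budget" is simply the product n × lr, a heuristic rather than a quantity sklearn reports.

```python
configs = [(50, 0.5), (100, 0.2), (200, 0.1), (500, 0.05), (1000, 0.01)]

# Heuristic step budget: number of rounds times the learning rate
budgets = [round(n * lr, 2) for n, lr in configs]

for (n, lr), budget in zip(configs, budgets):
    print(f"n={n:>4}, lr={lr:<4} -> budget={budget}")
```

The first configuration has a budget of 25 yet trails in the table, which suggests the product is a rough guide and that large individual steps carry their own cost.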

max_depth Sweep

python

print(f"{'depth':>8} | {'RMSE':>8}")
for depth in [1, 2, 3, 5, 7]:
    gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                    max_depth=depth, random_state=42)
    gb.fit(Xr_tr, yr_tr)
    rmse = np.sqrt(mean_squared_error(yr_te, gb.predict(Xr_te)))
    print(f"{depth:>8} | {rmse:>8.4f}")

   depth |     RMSE
       1 |   0.5612   (stumps: additive model, no feature interactions)
       2 |   0.4831
       3 |   0.4512   ← sweet spot
       5 |   0.4612   (mild overfitting)
       7 |   0.4789   (more overfitting)

GB with max_depth=3 (up to 8 leaves): each tree can capture interactions among up to three features. Unlike AdaBoost (which typically uses stumps, depth=1), GB benefits from slightly deeper trees, but here depth=5+ starts overfitting even with learning rate regularization.
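Why depth matters for interactions can be seen on a synthetic target. This is a hedged sketch on made-up data, not California Housing: the target y = x1 · x2 is a pure two-feature interaction chosen for illustration, and the sample sizes are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = X[:, 0] * X[:, 1]          # pure interaction, no main effects

X_tr, X_te = X[:1500], X[1500:]
y_tr, y_te = y[:1500], y[1500:]

scores = {}
for depth in (1, 2):
    gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                   max_depth=depth, random_state=0)
    gb.fit(X_tr, y_tr)
    scores[depth] = r2_score(y_te, gb.predict(X_te))
    print(f"max_depth={depth}: test R2 = {scores[depth]:.3f}")
```

A sum of stumps is an additive function of single features, so it has no way to represent a product term; depth-2 trees split on two features along one path and can.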

Stochastic Gradient Boosting: subsample

python

print(f"{'subsample':>10} | {'RMSE':>8}")
for ss in [0.5, 0.6, 0.8, 1.0]:
    gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                    max_depth=3, subsample=ss, random_state=42)
    gb.fit(Xr_tr, yr_tr)
    rmse = np.sqrt(mean_squared_error(yr_te, gb.predict(Xr_te)))
    print(f"{ss:>10} | {rmse:>8.4f}")

 subsample |     RMSE
       0.5 |   0.4601
       0.6 |   0.4532
       0.8 |   0.4512   ← best
       1.0 |   0.4578

subsample=0.8: train each tree on a random 80% of the training data. This injects noise — different from bootstrap (with replacement) but similar effect. The randomness decorrelates consecutive trees, acting as regularization. subsample=1.0 (full dataset) is slightly worse because consecutive trees see the same data and can overfit the same patterns.
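The difference between the two sampling schemes shows up in how many distinct rows each tree sees. A small simulation sketch, with the dataset size of 10,000 rows chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Stochastic GB style: 80% of rows, drawn without replacement
subsample_idx = rng.choice(n, size=int(0.8 * n), replace=False)

# Bootstrap style (as in bagging): n rows, drawn with replacement
bootstrap_idx = rng.integers(0, n, size=n)

sub_unique = np.unique(subsample_idx).size / n
boot_unique = np.unique(bootstrap_idx).size / n

print(f"subsample=0.8 : {sub_unique:.3f} of rows are distinct")
print(f"bootstrap     : {boot_unique:.3f} of rows are distinct")
```

A subsample contains no duplicated rows, while a bootstrap sample repeats some rows and leaves out roughly a third of the data.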

Feature Importance

python

importances = gb_reg.feature_importances_
for name, imp in sorted(zip(ch.feature_names, importances), key=lambda x: -x[1]):
    print(f"  {name:20s}: {imp:.4f}")

  MedInc              : 0.3812
  Latitude            : 0.1723
  Longitude           : 0.1634
  AveOccup            : 0.1201
  HouseAge            : 0.0823
  AveRooms            : 0.0481
  AveBedrms           : 0.0231
  Population          : 0.0095

MedInc (median income) accounts for 38% of feature importance — the dominant predictor of California housing prices. Geography (Latitude + Longitude = 34%) is the second strongest signal.

Staged Prediction: RMSE Over Rounds

python

from sklearn.metrics import mean_squared_error
import numpy as np

train_rmse = []
test_rmse  = []

for y_pred_tr, y_pred_te in zip(
    gb_reg.staged_predict(Xr_tr),
    gb_reg.staged_predict(Xr_te)
):
    train_rmse.append(np.sqrt(mean_squared_error(yr_tr, y_pred_tr)))
    test_rmse.append(np.sqrt(mean_squared_error(yr_te, y_pred_te)))

print(f"Round   1: Train={train_rmse[0]:.4f}, Test={test_rmse[0]:.4f}")
print(f"Round  50: Train={train_rmse[49]:.4f}, Test={test_rmse[49]:.4f}")
print(f"Round 100: Train={train_rmse[99]:.4f}, Test={test_rmse[99]:.4f}")
print(f"Round 200: Train={train_rmse[199]:.4f}, Test={test_rmse[199]:.4f}")

Round   1: Train=0.9021, Test=0.9089
Round  50: Train=0.5231, Test=0.5312
Round 100: Train=0.4712, Test=0.4789
Round 200: Train=0.4121, Test=0.4512

Figure: train (blue) and test (orange) RMSE versus boosting round (1 to 200), y-axis from 0.41 to 0.90. Train RMSE keeps falling; test RMSE falls and then plateaus. A shaded band marks the optimal region.

Train RMSE keeps falling as rounds are added, because each new tree is fit to reduce the training loss. Test RMSE plateaus around round 120–150; adding more trees beyond the plateau risks overfitting. The shaded region marks the optimal n_estimators for this dataset.
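Locating the best round programmatically is a one-liner on top of `staged_predict`. The sketch below uses a synthetic dataset (`make_friedman1`) so that it runs offline; the dataset, sizes, and seeds are assumptions for illustration, not the post's California Housing setup.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a real dataset
X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                               max_depth=3, subsample=0.8, random_state=0)
gb.fit(X_tr, y_tr)

# Held-out RMSE after each boosting round
val_rmse = np.array([
    np.sqrt(mean_squared_error(y_val, pred))
    for pred in gb.staged_predict(X_val)
])

best_round = int(np.argmin(val_rmse)) + 1   # rounds are 1-indexed
print(f"best round: {best_round}, RMSE there: {val_rmse[best_round - 1]:.4f}")
print(f"RMSE at round {len(val_rmse)}: {val_rmse[-1]:.4f}")
```

Choosing the round on the same split that is later reported as the test score is optimistic, so a separate validation split is the safer choice when the final number matters.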

GB vs AdaBoost vs Random Forest

Aspect	Random Forest	AdaBoost	Gradient Boosting
Method	Bagging (parallel)	Sequential reweighting	Sequential residual fitting
Base learner	Deep trees	Stumps (depth=1)	Shallow trees (depth=2–5)
Loss function	Fixed (Gini/MSE)	Exponential loss	Any differentiable loss
Speed	Fast (parallel)	Medium (sequential)	Slowest (sequential)
Typical accuracy	Good	Good	Best (when tuned)
Overfitting risk	Low	Medium	Medium–High
Key hyperparams	n, max_features	n, learning_rate	n, lr, max_depth, subsample

Test Your Understanding

The leaf value formula for classification is $γ = \frac{\sum r _{i}}{\sum \hat{p} _{i} ( 1 - \hat{p} _{i} )}$ . For the left leaf (3 samples, all defaults, $r = 0.625$ , $\hat{p} = 0.375$ ): verify the computation. Why does the denominator use $\hat{p} (1 - \hat{p})$ instead of simply $n_{leaf}$ (as in regression)? What does this term represent in the second-order Taylor expansion of cross-entropy loss?
The n_estimators × learning_rate table shows n=1000/lr=0.01 gives RMSE=0.462 — worse than n=200/lr=0.1 (RMSE=0.451). The effective step budget is 1000×0.01=10 vs 200×0.1=20. Why is n=1000/lr=0.01 with budget 10 worse than n=200/lr=0.1 with budget 20, even though 1000 trees > 200 trees?
subsample=0.8 outperforms subsample=1.0. Each tree in Stochastic GB sees a random 80% subset, drawn without replacement (unlike bootstrap). At round $t$ , two consecutive trees share 80%×80%=64% of the data in expectation. How does this compare to Random Forest, where each bootstrap sample contains about 63% of the distinct training rows, and what different regularization effect does each achieve?
GB with max_depth=3 gives RMSE=0.4512, while max_depth=1 (stumps, like AdaBoost) gives RMSE=0.5612. But AdaBoost with max_depth=1 achieves test accuracy comparable to GB on classification tasks. Why does GB benefit more from max_depth=3 than AdaBoost does — even though both are boosting methods?
The staged prediction curve shows train RMSE monotonically decreasing while test RMSE plateaus. In theory, once training error reaches a minimum, adding more trees cannot decrease test error — but it also shouldn't increase it (the new trees only add to the existing sum). What breaks this reasoning and causes test error to eventually increase with too many rounds?
