XGBoost: Intuition and Math

Jun 26, 2026 • 8 min read • By Mohammed Vasim

Machine Learning, AI, Data Science

sklearn's GradientBoostingRegressor and XGBoost both build trees on residuals. XGBoost differs in three ways: it uses the second-order Taylor approximation of the loss (not just the first-order residual), adds explicit regularization to the objective, and uses histogram-based split finding for speed. This post derives where those differences come from.

Anchor: 6-sample house prices — same as the Gradient Boosting post.

python

import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900])
y = np.array([180, 220, 280, 340, 370, 430])
# F₀ = mean(y) = 303.3 — same starting point as vanilla GB

XGBoost vs sklearn GradientBoosting

Aspect	sklearn GradientBoosting	XGBoost
Split finding	Exhaustive $O (n d)$ per level	Histogram-based $O (B d)$ , $B = 256$ bins by default
Regularization	None built-in	L1 ( $α$ ) and L2 ( $λ$ ) on leaf weights
Missing values	Requires imputation	Learns split direction for missing
Parallel	Sequential, no parallelism	Column-parallel split finding
Speed	Slow on large datasets	Often 10–100× faster in practice (depends on data size and hardware)
Memory	Stores all samples	Cache-aware column block structure
Objective	Loss only	Loss + explicit regularization term
Taylor order	First-order (pseudo-residuals)	Second-order (gradient + hessian)

XGBoost Objective Function

At tree $t$ , XGBoost minimizes a regularized objective using a second-order Taylor expansion of the loss:

$Obj^{(t)} \approx \sum_{i} [g_{i} w_{q (x_{i})} + \frac{1}{2} h_{i} w_{q (x_{i})}^{2}] + γ T + \frac{1}{2} λ \sum_{j = 1}^{T} w_{j}^{2}$

Where:

$g_{i} = \frac{\partial L ( y _{i} , F _{t - 1} ( x _{i} ))}{\partial F}$ : first-order gradient (the negative residual for MSE)
$h_{i} = \frac{\partial ^{2} L ( y _{i} , F _{t - 1} ( x _{i} ))}{\partial F ^{2}}$ : second-order gradient (hessian); for MSE, $h_{i} = 1$
$q (x_{i})$ : index of the leaf that sample $i$ falls into
$w_{j}$ : leaf weight for leaf $j$ (what we're optimizing)
$λ$ : L2 regularization coefficient on leaf weights
$γ$ : penalty per leaf, which acts as the minimum gain required to create any split (pruning threshold)
$T$ : number of leaves (penalizes tree complexity)

For MSE loss, written as $L = \frac{1}{2} (y_{i} - \hat{y}_{i})^{2}$ : $g_{i} = \hat{y}_{i} - y_{i}$ (opposite sign of the residual) and $h_{i} = 1$ for all samples.

Optimal Leaf Weight Formula

Group samples in leaf $j$ as $I_{j}$ . Define: $G_{j} = \sum_{i \in I_{j}} g_{i}, H_{j} = \sum_{i \in I_{j}} h_{i}$

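Grouping the sum by leaf gives $Obj^{(t)} \approx \sum_{j = 1}^{T} [G_{j} w_{j} + \frac{1}{2} (H_{j} + λ) w_{j}^{2}] + γ T$ . Taking $\frac{d Obj}{d w _{j}} = 0$ :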

$w_{j}^{*} = - \frac{G _{j}}{H _{j} + λ}$

When $λ = 0$ : $w_{j}^{*} = - G_{j} / H_{j} = \frac{\sum r _{i}}{n _{j}}$ , the mean residual (with $r_{i} = y_{i} - F_{t - 1} (x_{i}) = - g_{i}$ for MSE), which is exactly vanilla GB.

When $λ > 0$ : leaf weights shrink toward zero. The larger $λ$ , the more regularized the tree.
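As a quick numerical check of the closed form, the sketch below evaluates the per-leaf objective $G_{j} w + \frac{1}{2} (H_{j} + λ) w^{2}$ at $w^{*}$ and at nearby values. The function names are illustrative (not part of any library), and $G = 230$ , $H = 3$ are example sums chosen for the demonstration.

```python
def optimal_leaf_weight(G, H, lam):
    """Closed-form minimizer of G*w + 0.5*(H + lam)*w**2."""
    return -G / (H + lam)

def leaf_objective(w, G, H, lam):
    """Per-leaf part of the objective (the gamma*T term does not depend on w)."""
    return G * w + 0.5 * (H + lam) * w ** 2

G, H = 230.0, 3.0  # example gradient and hessian sums for one leaf
for lam in (0.0, 1.0, 10.0):
    w_star = optimal_leaf_weight(G, H, lam)
    best = leaf_objective(w_star, G, H, lam)
    # Moving away from w_star in either direction should not lower the objective.
    assert best <= leaf_objective(w_star - 0.1, G, H, lam)
    assert best <= leaf_objective(w_star + 0.1, G, H, lam)
    print(f"lambda={lam:>4}: w*={w_star:.2f}, objective={best:.1f}")
```

The printed weights move toward zero as `lam` grows, which is the shrinkage described above.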

Gain Formula for Split Finding

The objective improvement from splitting a leaf into left ( $L$ ) and right ( $R$ ):

$Gain = \frac{1}{2} [\frac{G _{L}^{2}}{H _{L} + λ} + \frac{G _{R}^{2}}{H _{R} + λ} - \frac{( G _{L} + G _{R} ) ^{2}}{H _{L} + H _{R} + λ}] - γ$

Only create the split if Gain $> 0$ . $γ$ sets the minimum gain threshold — larger $γ$ prunes more aggressively.
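The gain formula translates almost directly into code. This is a minimal sketch with a hypothetical helper name, `split_gain`, and illustrative inputs; it is not XGBoost's internal implementation.

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain of splitting one leaf into left and right children."""
    def score(G, H):
        return G ** 2 / (H + lam)
    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma

# Illustrative inputs: two children with opposite-signed gradient sums.
gain = split_gain(G_L=230.0, H_L=3.0, G_R=-230.0, H_R=3.0, lam=1.0, gamma=0.0)
print(gain)  # 13225.0

# A large enough gamma turns the same split into a rejected one.
print(split_gain(230.0, 3.0, -230.0, 3.0, lam=1.0, gamma=20000.0) > 0)  # False
```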

Manual Trace: 6-Sample Anchor (Tree 1, $λ = 1$ , $γ = 0$ )

Gradient Table

$F_{0} = 303.3$ (same mean). For MSE: $g_{i} = F_{0} - y_{i}$ (note: positive when predicting too high).

$i$	sq_ft	$y$	$F_{0}$	$g_{i} = F_{0} - y_{i}$	$h_{i}$
1	650	180	303.3	+123.3	1
2	850	220	303.3	+83.3	1
3	1100	280	303.3	+23.3	1
4	1400	340	303.3	−36.7	1
5	1600	370	303.3	−66.7	1
6	1900	430	303.3	−126.7	1

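$G_{total} = 123.3 + 83.3 + 23.3 - 36.7 - 66.7 - 126.7 \approx 0$ (the rounded values sum to $- 0.2$ ; the unrounded gradients sum to exactly 0), $H_{total} = 6$ .

$G_{total} = 0$ because $F_{0} = \bar{y}$ : deviations from the mean always sum to zero, so the gradients cancel at the root.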
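The gradient table can be reproduced in a few lines of NumPy. The sketch assumes the squared-error convention $L = \frac{1}{2} (y - F)^{2}$ , so the gradient is $F - y$ and the hessian is 1.

```python
import numpy as np

y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
F0 = y.mean()

# Squared-error loss L = 0.5 * (y - F)**2: gradient is F - y, hessian is 1.
g = F0 - y
h = np.ones_like(y)

print(np.round(g, 1))            # per-sample gradients, one decimal
print(abs(g.sum()) < 1e-9)       # gradients cancel at the mean
print(h.sum())                   # H_total equals the sample count for MSE
```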

Evaluate Split at sq_ft ≤ 1250

Left (samples 1,2,3): $G_{L} = 123.3 + 83.3 + 23.3 = + 229.9$ (exactly $+ 230$ before rounding), $H_{L} = 3$ .

Right (samples 4,5,6): $G_{R} = - 36.7 - 66.7 - 126.7 = - 230.1$ (exactly $- 230$ before rounding), $H_{R} = 3$ .

Term	Computation	Value
Left score	$G_{L}^{2} / (H_{L} + λ) = 229.9^{2} / (3 + 1)$	$52854.01 / 4 = 13213.5$
Right score	$G_{R}^{2} / (H_{R} + λ) = 230.1^{2} / (3 + 1)$	$52946.01 / 4 = 13236.5$
Root score	$G_{total}^{2} / (H_{total} + λ) = 0 / (6 + 1)$	$0.0$
Gain	$(1/2) (13213.5 + 13236.5 - 0.0) - 0$	$13225.0$
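To see how this split compares with the alternatives, the sketch below scores every midpoint between consecutive sq_ft values using the same gain formula. The helper `gain_at` is an illustrative name, and the exhaustive scan is a simplification of what a real split finder does.

```python
import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
lam, gamma = 1.0, 0.0

g = y.mean() - y          # MSE gradients at F0 = mean(y)
h = np.ones_like(y)       # MSE hessians

def gain_at(threshold):
    left = X <= threshold
    G_L, H_L = g[left].sum(), h[left].sum()
    G_R, H_R = g[~left].sum(), h[~left].sum()
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam) - G**2 / (H + lam)) - gamma

# Candidate thresholds: midpoints between consecutive sorted feature values.
candidates = (X[:-1] + X[1:]) / 2
gains = {float(t): float(gain_at(t)) for t in candidates}
best_threshold = max(gains, key=gains.get)

for t, gn in gains.items():
    print(f"sq_ft <= {t:6.0f}: gain = {gn:8.1f}")
print("best:", best_threshold)  # best: 1250.0
```

The middle threshold wins because it is the only one that puts all positive gradients on one side and all negative gradients on the other.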

Optimal Leaf Weights

$w_{L}^{*} = - \frac{G _{L}}{H _{L} + λ} = - \frac{229.9}{3 + 1} = - 57.5$

$w_{R}^{*} = - \frac{G _{R}}{H _{R} + λ} = - \frac{- 230.1}{3 + 1} = + 57.5$

Update with $ν = 0.3$ (XGBoost default)

$F_{1} (x) = F_{0} + ν \cdot w^{*}$

sq_ft ≤ 1250: $303.3 + 0.3 \times (- 57.5) = 303.3 - 17.25 = 286.1$
sq_ft > 1250: $303.3 + 0.3 \times (+ 57.5) = 303.3 + 17.25 = 320.6$
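The leaf weights and the update step can be checked together. This sketch hard-codes the split at 1250 and the values $λ = 1$ , $ν = 0.3$ from the trace; it uses unrounded gradients, so intermediate values differ from the hand calculation only in the last decimal.

```python
import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
lam, nu, threshold = 1.0, 0.3, 1250.0

F0 = y.mean()
g, h = F0 - y, np.ones_like(y)
left = X <= threshold

w_left = -g[left].sum() / (h[left].sum() + lam)
w_right = -g[~left].sum() / (h[~left].sum() + lam)

# One boosting step: every sample moves by nu times its leaf weight.
F1 = F0 + nu * np.where(left, w_left, w_right)
print(np.round(F1, 1))
```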

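Compare to vanilla GB (ν=0.1, leaf = mean residual = ±76.6): update was 303.3 ± 7.66, giving 295.6/311.0. XGBoost with λ=1 uses smaller leaf weights (±57.5) but a larger ν (0.3), so its first step (±17.25) is about 2.25 times the vanilla step. Both move the predictions in the same direction; the difference is that XGBoost spreads shrinkage across λ and ν instead of relying on ν alone.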

Effect of λ (L2 Regularization)

$λ$	$w_{L}^{*}$	$w_{R}^{*}$	Gain
0	$- 229.9/3 = - 76.6$	$+ 76.7$	17633
1	$- 229.9/4 = - 57.5$	$+ 57.5$	13225
10	$- 229.9/13 = - 17.7$	$+ 17.7$	4069
100	$- 229.9/103 = - 2.23$	$+ 2.23$	514

Larger $λ$ : leaf weights shrink toward zero (the tree corrects the residual less aggressively) and Gain decreases, so at high $λ$ some splits that would otherwise have been created ( $Gain > 0$ ) may be rejected. $γ$ provides a hard cutoff: a split is created only if its gain before subtracting $γ$ exceeds $γ$ , whatever the value of $λ$ .
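A short sweep over $λ$ shows both effects at once. The sketch uses the unrounded sums $G_{L} = + 230$ , $G_{R} = - 230$ for the sq_ft ≤ 1250 split, so the leaf weights can differ from hand-rounded values in the last digit; the helper names are illustrative.

```python
G_L, H_L = 230.0, 3.0    # unrounded left sums at the sq_ft <= 1250 split
G_R, H_R = -230.0, 3.0   # unrounded right sums

def leaf_weight(G, H, lam):
    return -G / (H + lam)

def gain(lam, gamma=0.0):
    root = (G_L + G_R) ** 2 / (H_L + H_R + lam)
    return 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam) - root) - gamma

rows = [(lam, leaf_weight(G_L, H_L, lam), leaf_weight(G_R, H_R, lam), gain(lam))
        for lam in (0, 1, 10, 100)]
for lam, w_l, w_r, gn in rows:
    print(f"lambda={lam:>3}  w_L={w_l:7.2f}  w_R={w_r:7.2f}  gain={gn:8.1f}")
```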

min_child_weight

XGBoost only creates a split if each child node satisfies $H_{j} \geq \text{min\_child\_weight}$ .

MSE loss: $h_{i} = 1$ , so $H_{j} = n_{j}$ . min_child_weight=5 requires at least 5 samples per leaf, which matches min_samples_leaf=5 in sklearn when samples are unweighted.
Logistic loss: $h_{i} = \hat{p}_{i} (1 - \hat{p}_{i})$ , so $H_{j} = \sum_{i \in I_{j}} \hat{p}_{i} (1 - \hat{p}_{i})$ . For probabilities near 0 or 1, $h_{i} \approx 0$ . A leaf with 100 near-certain predictions can have $H_{j} < 5$ , so the split that would create it is rejected, which prevents overly confident leaves.
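The logistic case is easy to check numerically. The sketch below compares two hypothetical leaves of 100 samples each, one with near-certain predictions and one with uncertain predictions; the probabilities are made up for illustration.

```python
import numpy as np

def hessian_sum_logistic(p):
    """Sum of logistic-loss hessians p * (1 - p) over the samples in a leaf."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

min_child_weight = 5.0

confident_leaf = np.full(100, 0.99)   # 100 samples, all predicted at p = 0.99
uncertain_leaf = np.full(100, 0.50)   # 100 samples, all predicted at p = 0.50

for name, p in [("confident", confident_leaf), ("uncertain", uncertain_leaf)]:
    H = hessian_sum_logistic(p)
    print(f"{name}: H = {H:.2f}, allowed = {H >= min_child_weight}")
```

Both leaves hold the same number of samples, but only the uncertain one clears the threshold, which is why min_child_weight is not a plain sample count outside of MSE.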

Level-Wise vs Leaf-Wise Tree Growth

Level-wise (sklearn GB, XGBoost default):   Leaf-wise (LightGBM default):

Round 1: Root splits.                         Round 1: Root splits.
Round 2: Both children split.                 Round 2: Best child (highest gain) splits.
Round 3: All 4 grandchildren split.           Round 3: Best remaining leaf splits.
→ Balanced tree (max_depth constraint)        → Unbalanced but higher total gain

[Figure: two tree diagrams. Level-wise: Root splits into L child and R child, which split into LL, LR, RL and RR; all nodes at the same depth split together (balanced, controlled by max_depth). Leaf-wise: Root splits into L child and a low-gain R leaf; only the high-gain L child splits into LL and LR, and LL splits again into LLL; the best-gain leaf always splits next (unbalanced, deeper on high-gain paths).]

Level-wise: all nodes at the same depth split together, with tree size controlled by max_depth. Leaf-wise: always split whichever existing leaf has the highest gain, which converges faster but risks deep paths that overfit. XGBoost uses level-wise growth by default; LightGBM uses leaf-wise growth, with num_leaves capping the total number of leaves rather than the depth.
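The two growth orders can be imitated on a toy tree. In this sketch the split gains are made-up numbers and the node names (L, R, LL, ...) are only labels; real libraries compute gains from data as they go, so this illustrates the ordering only.

```python
import heapq

# Made-up split gains for a small tree; names are labels, not library objects.
gains = {"root": 10.0, "L": 8.0, "R": 1.0, "LL": 6.0, "LR": 0.5, "RL": 0.4, "RR": 0.3}
children = {"root": ("L", "R"), "L": ("LL", "LR"), "R": ("RL", "RR")}

def level_wise(max_splits):
    """Split every node at the current depth before going deeper."""
    order, frontier = [], ["root"]
    while frontier and len(order) < max_splits:
        next_frontier = []
        for node in frontier:
            if len(order) == max_splits:
                break
            order.append(node)
            next_frontier.extend(children.get(node, ()))
        frontier = next_frontier
    return order

def leaf_wise(max_splits):
    """Always split the open leaf with the highest gain (max-heap via negated gain)."""
    order, heap = [], [(-gains["root"], "root")]
    while heap and len(order) < max_splits:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in children.get(node, ()):
            heapq.heappush(heap, (-gains[child], child))
    return order

print(level_wise(3))  # ['root', 'L', 'R']
print(leaf_wise(3))   # ['root', 'L', 'LL']
```

With the same budget of three splits, the level-wise order spends one on the low-gain R node, while the leaf-wise order goes deeper on the left branch.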

Histogram-Based Split Finding

Vanilla GB (sklearn): for each feature, sort $n$ values → $n - 1$ candidate thresholds → $O (n)$ evaluations per feature per level → $O (n d)$ per level.

XGBoost: bin each feature into $B$ histogram buckets (256 by default). Only $B - 1$ thresholds per feature → $O (B d)$ threshold evaluations per level, after one pass over the data to fill the histograms. For $n = 10^{6}$ samples, this reduces the candidate thresholds per level from roughly $10^{6} \times d$ to $255 \times d$ , a ~4000× reduction.

The histogram is approximate: if the true optimal threshold falls between two bin boundaries, XGBoost uses the bin boundary. In practice, 256 bins is fine-grained enough that accuracy loss is negligible.
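The idea can be sketched with NumPy: bin the feature, accumulate gradient and hessian sums per bin, then scan bin edges with prefix sums. This is a simplified illustration that assumes plain quantile edges and $λ > 0$ ; XGBoost's actual binning (a weighted quantile sketch) is more involved, and `best_histogram_split` is a hypothetical name.

```python
import numpy as np

def best_histogram_split(x, g, h, n_bins=256, lam=1.0):
    """Best split over quantile bin edges; rule is 'x <= edge goes left'."""
    # Interior quantiles as candidate edges; duplicates collapse for
    # features with few unique values.
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    edges = np.unique(np.quantile(x, qs))
    bins = np.searchsorted(edges, x, side="left")   # values in 0..len(edges)

    # One pass over the data fills per-bin gradient and hessian sums.
    n_hist = len(edges) + 1
    G_hist = np.bincount(bins, weights=g, minlength=n_hist)
    H_hist = np.bincount(bins, weights=h, minlength=n_hist)

    # Prefix sums give left-child totals at every edge without revisiting samples.
    G_L, H_L = np.cumsum(G_hist)[:-1], np.cumsum(H_hist)[:-1]
    G, H = G_hist.sum(), H_hist.sum()
    gains = 0.5 * (G_L**2 / (H_L + lam)
                   + (G - G_L)**2 / (H - H_L + lam)
                   - G**2 / (H + lam))
    k = int(np.argmax(gains))
    return float(edges[k]), float(gains[k])

X = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
g, h = y.mean() - y, np.ones_like(y)

threshold, gain = best_histogram_split(X, g, h)
print(threshold, round(gain, 1))
```

On the 6-sample anchor the returned threshold is a bin edge between 1100 and 1400 rather than the midpoint 1250, but it separates the same samples and therefore gives the same gain.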

Missing Value Handling

XGBoost's sparsity-aware algorithm: during training, for each split candidate, evaluate both:

Route all missing values LEFT
Route all missing values RIGHT

Choose whichever direction reduces the objective more. At inference, the learned default direction is used for missing values. No imputation required — this is built into the split-finding algorithm.
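For a single split candidate, the two-way comparison looks roughly like this. The sketch is a simplified stand-in for the sparsity-aware algorithm: it assumes MSE gradients, reuses the anchor data with one value replaced by NaN as a made-up gap, and the function names are illustrative.

```python
import numpy as np

def gain(G_L, H_L, G_R, H_R, lam=1.0):
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam) - G**2 / (H + lam))

def default_direction(x, g, h, threshold, lam=1.0):
    """Try sending all missing values left, then right; keep the better one."""
    missing = np.isnan(x)
    left = np.zeros(len(x), dtype=bool)
    left[~missing] = x[~missing] <= threshold
    right = ~missing & ~left

    G_m, H_m = g[missing].sum(), h[missing].sum()
    gain_left = gain(g[left].sum() + G_m, h[left].sum() + H_m,
                     g[right].sum(), h[right].sum(), lam)
    gain_right = gain(g[left].sum(), h[left].sum(),
                      g[right].sum() + G_m, h[right].sum() + H_m, lam)
    return ("left", gain_left) if gain_left >= gain_right else ("right", gain_right)

# Anchor data with the 1100 sq_ft value replaced by NaN (a made-up gap).
x = np.array([650, 850, np.nan, 1400, 1600, 1900])
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
g, h = y.mean() - y, np.ones_like(y)

direction, best_gain = default_direction(x, g, h, threshold=1250.0)
print(direction, round(best_gain, 1))
```

Here the missing sample has a positive gradient like the other low-sq_ft houses, so routing it left gives the higher gain and left becomes the default direction for that split.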

python

import xgboost as xgb
import numpy as np

# Example: create data with NaN
X_with_nan = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0]])
y = np.array([0, 1, 1])

dtrain = xgb.DMatrix(X_with_nan, label=y)
# XGBoost handles NaN internally — no fillna needed

Test Your Understanding

At the root, $G_{total} = 0$ because $F_{0} = \bar{y}$ . The root score in the Gain formula is $G_{total}^{2} / (H_{total} + λ) = 0$ . Does this mean the root contributes nothing to the Gain, or does it serve a different purpose in the formula? What would happen to the Gain formula if you started with a non-mean $F_{0}$ (say, $F_{0} = 0$ )?
Optimal leaf weight is $w^{*} = - G_{j} / (H_{j} + λ)$ . For MSE, $g_{i} = F_{t - 1} (x_{i}) - y_{i}$ (not the usual residual sign). Verify: if we define pseudo-residuals as $r_{i} = y_{i} - F_{t - 1} (x_{i})$ (same sign as vanilla GB), then $G_{j} = - \sum r_{i}$ . Show that $w^{*} = - G_{j} / H_{j} = + mean (r_{i})$ when $λ = 0$ — i.e., XGBoost and vanilla GB agree at $λ = 0$ .
With $λ = 100$ : $w_{L}^{*} = - 229.9/103 = - 2.23$ . The prediction update is $F_{1} (left) = 303.3 + 0.3 \times (- 2.23) = 302.6$ . The true $\bar{y}_{left} = (180 + 220 + 280) /3 = 226.7$ . The model barely moves from the global mean. How many rounds would it take to approximately converge to the correct leaf prediction if every round moved the prediction by only $0.3 \times 2.23 \approx 0.67$ ?
Histogram binning uses $B = 256$ bins. For a feature with only 10 unique values (like bedrooms in the house price dataset), the histogram has only 9 boundaries regardless of $B$ . In this case, histogram XGBoost and exact vanilla GB give identical splits. For which types of features does histogram approximation actually matter, and for which is it irrelevant?
In leaf-wise tree growth (LightGBM), the model always splits the leaf with the highest gain. This can create one very deep branch while other branches stay as leaves. Why does LightGBM use num_leaves (total number of leaf nodes) instead of max_depth to control model complexity? Compare a model with num_leaves=31 and max_depth=6 to one with num_leaves=31 and no depth limit: could they have the same structure?
