← View series: machine learning
~/blog
XGBoost: Intuition and Math
sklearn's GradientBoostingRegressor and XGBoost both build trees on residuals. XGBoost differs in three ways: it uses the second-order Taylor approximation of the loss (not just the first-order residual), adds explicit regularization to the objective, and uses histogram-based split finding for speed. This post derives where those differences come from.
Anchor: 6-sample house prices — same as the Gradient Boosting post.
import numpy as np
X = np.array([650, 850, 1100, 1400, 1600, 1900])
y = np.array([180, 220, 280, 340, 370, 430])
# F₀ = mean(y) = 303.3 — same starting point as vanilla GBXGBoost vs sklearn GradientBoosting
| Aspect | sklearn GradientBoosting | XGBoost |
|---|---|---|
| Split finding | Exhaustive per level | Histogram-based , |
| Regularization | None built-in | L1 () and L2 () on leaf weights |
| Missing values | Requires imputation | Learns split direction for missing |
| Parallel | Sequential, no parallelism | Column-parallel split finding |
| Speed | Slow on large datasets | 10–100× faster in practice |
| Memory | Stores all samples | Cache-aware column block structure |
| Objective | Loss only | Loss + explicit regularization term |
| Taylor order | First-order (pseudo-residuals) | Second-order (gradient + hessian) |
XGBoost Objective Function
At tree , XGBoost minimizes a regularized objective using a second-order Taylor expansion of the loss:
Where:
- — first-order gradient (residual for MSE)
- — second-order gradient (hessian); for MSE:
- — leaf weight for leaf (what we're optimizing)
- — L2 regularization coefficient on leaf weights
- — minimum gain required to create any split (pruning threshold)
- — number of leaves (penalizes tree complexity)
For MSE loss: (opposite sign of residual), for all samples.
Optimal Leaf Weight Formula
Group samples in leaf as . Define:
Taking :
When : = mean residual — exactly vanilla GB.
When : leaf weights shrink toward zero. The larger , the more regularized the tree.
Gain Formula for Split Finding
The objective improvement from splitting a leaf into left () and right ():
Only create the split if Gain . sets the minimum gain threshold — larger prunes more aggressively.
Manual Trace: 6-Sample Anchor (Tree 1, , )
Gradient Table
(same mean). For MSE: (note: positive when predicting too high).
| sq_ft | |||||
|---|---|---|---|---|---|
| 1 | 650 | 180 | 303.3 | +123.3 | 1 |
| 2 | 850 | 220 | 303.3 | +83.3 | 1 |
| 3 | 1100 | 280 | 303.3 | +23.3 | 1 |
| 4 | 1400 | 340 | 303.3 | −36.7 | 1 |
| 5 | 1600 | 370 | 303.3 | −66.7 | 1 |
| 6 | 1900 | 430 | 303.3 | −126.7 | 1 |
, .
because — the mean prediction cancels all gradients at the root.
Evaluate Split at sq_ft ≤ 1250
Left (samples 1,2,3): , .
Right (samples 4,5,6): , .
| Term | Computation | Value |
|---|---|---|
| Left score | ||
| Right score | ||
| Root score | ||
| Gain |
Optimal Leaf Weights
Update with (XGBoost default)
- sq_ft ≤ 1250:
- sq_ft > 1250:
Compare to vanilla GB (ν=0.1, leaf = mean residual = ±76.6): update was 303.3 ± 7.66, giving 295.6/311.0. XGBoost with λ=1 uses smaller leaf weights (±57.5) but larger ν (0.3), landing at similar positions — same effect, different parameterization.
Effect of λ (L2 Regularization)
| Gain | |||
|---|---|---|---|
| 0 | 17600 | ||
| 1 | 13225 | ||
| 10 | 4293 | ||
| 100 | 543 |
Larger : leaf weights shrink toward zero (the tree corrects the residual less aggressively). Gain decreases — at high , splits that would have been created () may be rejected. provides a hard cutoff: if , the split is never created regardless of .
min_child_weight
XGBoost only creates a split if each child node satisfies .
- MSE loss: , so .
min_child_weight=5requires at least 5 samples per leaf — identical tomin_samples_leafin sklearn. - Logistic loss: . . For probabilities near 0 or 1, . A leaf with 100 near-certain predictions can have — preventing overly confident leaves.
Level-Wise vs Leaf-Wise Tree Growth
Level-wise (sklearn GB, XGBoost default): Leaf-wise (LightGBM default):
Round 1: Root splits. Round 1: Root splits.
Round 2: Both children split. Round 2: Best child (highest gain) splits.
Round 3: All 4 grandchildren split. Round 3: Best remaining leaf splits.
→ Balanced tree (max_depth constraint) → Unbalanced but higher total gain
<!-- Level-wise tree -->
<rect x="105" y="20" width="70" height="28" rx="4" fill="#dbeafe" stroke="#3b82f6" stroke-width="1.5"/>
<text x="140" y="38" text-anchor="middle" font-size="9" fill="#1e40af">Root</text>
<line x1="120" y1="48" x2="75" y2="70" stroke="#94a3b8" stroke-width="1.5"/>
<line x1="160" y1="48" x2="205" y2="70" stroke="#94a3b8" stroke-width="1.5"/>
<rect x="45" y="70" width="60" height="28" rx="4" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/>
<text x="75" y="88" text-anchor="middle" font-size="9" fill="#92400e">L child</text>
<rect x="175" y="70" width="60" height="28" rx="4" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/>
<text x="205" y="88" text-anchor="middle" font-size="9" fill="#92400e">R child</text>
<line x1="60" y1="98" x2="38" y2="120" stroke="#94a3b8" stroke-width="1.5"/>
<line x1="90" y1="98" x2="112" y2="120" stroke="#94a3b8" stroke-width="1.5"/>
<line x1="190" y1="98" x2="168" y2="120" stroke="#94a3b8" stroke-width="1.5"/>
<line x1="220" y1="98" x2="242" y2="120" stroke="#94a3b8" stroke-width="1.5"/>
<rect x="18" y="120" width="40" height="24" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/>
<text x="38" y="136" text-anchor="middle" font-size="8" fill="#15803d">LL</text>
<rect x="92" y="120" width="40" height="24" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/>
<text x="112" y="136" text-anchor="middle" font-size="8" fill="#15803d">LR</text>
<rect x="148" y="120" width="40" height="24" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/>
<text x="168" y="136" text-anchor="middle" font-size="8" fill="#15803d">RL</text>
<rect x="222" y="120" width="40" height="24" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/>
<text x="242" y="136" text-anchor="middle" font-size="8" fill="#15803d">RR</text>
<text x="140" y="165" text-anchor="middle" font-size="8" fill="#64748b">All nodes at same depth split together</text>
<text x="140" y="178" text-anchor="middle" font-size="8" fill="#64748b">Balanced → controlled by max_depth</text>
<!-- Leaf-wise tree -->
<rect x="380" y="20" width="70" height="28" rx="4" fill="#dbeafe" stroke="#3b82f6" stroke-width="1.5"/>
<text x="415" y="38" text-anchor="middle" font-size="9" fill="#1e40af">Root</text>
<line x1="395" y1="48" x2="355" y2="70" stroke="#94a3b8" stroke-width="1.5"/>
<line x1="435" y1="48" x2="475" y2="70" stroke="#94a3b8" stroke-width="1.5"/>
<rect x="325" y="70" width="60" height="28" rx="4" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/>
<text x="355" y="88" text-anchor="middle" font-size="9" fill="#92400e">L child</text>
<rect x="445" y="70" width="60" height="28" rx="4" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/>
<text x="475" y="84" text-anchor="middle" font-size="9" fill="#15803d">R leaf</text>
<text x="475" y="95" text-anchor="middle" font-size="7" fill="#15803d">(low gain)</text>
<!-- Only the LEFT (high gain) child splits -->
<line x1="340" y1="98" x2="318" y2="120" stroke="#94a3b8" stroke-width="1.5"/>
<line x1="370" y1="98" x2="392" y2="120" stroke="#94a3b8" stroke-width="1.5"/>
<rect x="298" y="120" width="40" height="24" rx="3" fill="#fef3c7" stroke="#f59e0b" stroke-width="1"/>
<text x="318" y="136" text-anchor="middle" font-size="8" fill="#92400e">LL</text>
<rect x="372" y="120" width="40" height="24" rx="3" fill="#fef3c7" stroke="#f59e0b" stroke-width="1"/>
<text x="392" y="136" text-anchor="middle" font-size="8" fill="#92400e">LR</text>
<text x="335" y="136" font-size="7" fill="#f59e0b">↑ high gain</text>
<line x1="308" y1="144" x2="295" y2="165" stroke="#94a3b8" stroke-width="1.5"/>
<rect x="275" y="165" width="40" height="20" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/>
<text x="295" y="179" text-anchor="middle" font-size="8" fill="#15803d">LLL</text>
<text x="415" y="198" text-anchor="middle" font-size="8" fill="#64748b">Best-gain leaf always splits next</text>
<text x="415" y="210" text-anchor="middle" font-size="8" fill="#64748b">Unbalanced → deeper on high-gain paths</text>
Level-wise: all nodes at the same depth split together — controlled by max_depth. Leaf-wise: always split whichever existing leaf has the highest gain — faster convergence but risks deep paths that overfit. XGBoost uses level-wise by default; LightGBM uses leaf-wise with num_leaves to control depth.
Histogram-Based Split Finding
Vanilla GB (sklearn): for each feature, sort values → candidate thresholds → evaluations per feature per level → per level.
XGBoost: bin each feature into histogram buckets. Only thresholds per feature → per level. For samples, this reduces split-finding from to operations — ~4000× reduction.
The histogram is approximate: if the true optimal threshold falls between two bin boundaries, XGBoost uses the bin boundary. In practice, 256 bins is fine-grained enough that accuracy loss is negligible.
Missing Value Handling
XGBoost's sparsity-aware algorithm: during training, for each split candidate, evaluate both:
- Route all missing values LEFT
- Route all missing values RIGHT
Choose whichever direction reduces the objective more. At inference, the learned default direction is used for missing values. No imputation required — this is built into the split-finding algorithm.
import xgboost as xgb
import numpy as np
# Example: create data with NaN
X_with_nan = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0]])
y = np.array([0, 1, 1])
dtrain = xgb.DMatrix(X_with_nan, label=y)
# XGBoost handles NaN internally — no fillna neededTest Your Understanding
-
At the root, because . The root score in the Gain formula is . Does this mean the root contributes nothing to the Gain, or does it serve a different purpose in the formula? What would happen to the Gain formula if you started with a non-mean (say, )?
-
Optimal leaf weight is . For MSE, (not the usual residual sign). Verify: if we define pseudo-residuals as (same sign as vanilla GB), then . Show that when — i.e., XGBoost and vanilla GB agree at .
-
With : . The prediction update is . The true . The model barely moves from the global mean. How many rounds would it take to approximately converge to the correct leaf prediction if every round gives a step of only 2.23?
-
Histogram binning uses bins. For a feature with only 10 unique values (like
bedroomsin the house price dataset), the histogram has only 9 boundaries regardless of . In this case, histogram XGBoost and exact vanilla GB give identical splits. For which types of features does histogram approximation actually matter, and for which is it irrelevant? -
In leaf-wise tree growth (LightGBM), the model always splits the leaf with the highest gain. This can create one very deep branch while other branches stay as leaves. Why does LightGBM use
num_leaves(total number of leaf nodes) instead ofmax_depthto control model complexity? What's a model withnum_leaves=31andmax_depth=6vs one withnum_leaves=31andmax_depth=None— could they have the same structure?