Back to blog
← View series: machine learning

~/blog

XGBoost: Intuition and Math

Jun 26, 20268 min readBy Mohammed Vasim
Machine LearningAIData Science

sklearn's GradientBoostingRegressor and XGBoost both build trees on residuals. XGBoost differs in three ways: it uses the second-order Taylor approximation of the loss (not just the first-order residual), adds explicit regularization to the objective, and uses histogram-based split finding for speed. This post derives where those differences come from.

Anchor: 6-sample house prices — same as the Gradient Boosting post.

python
import numpy as np

X = np.array([650, 850, 1100, 1400, 1600, 1900])
y = np.array([180, 220, 280, 340, 370, 430])
# F₀ = mean(y) = 303.3 — same starting point as vanilla GB

XGBoost vs sklearn GradientBoosting

Aspectsklearn GradientBoostingXGBoost
Split findingExhaustive per levelHistogram-based ,
RegularizationNone built-inL1 () and L2 () on leaf weights
Missing valuesRequires imputationLearns split direction for missing
ParallelSequential, no parallelismColumn-parallel split finding
SpeedSlow on large datasets10–100× faster in practice
MemoryStores all samplesCache-aware column block structure
ObjectiveLoss onlyLoss + explicit regularization term
Taylor orderFirst-order (pseudo-residuals)Second-order (gradient + hessian)

XGBoost Objective Function

At tree , XGBoost minimizes a regularized objective using a second-order Taylor expansion of the loss:

Where:

  • first-order gradient (residual for MSE)
  • second-order gradient (hessian); for MSE:
  • — leaf weight for leaf (what we're optimizing)
  • — L2 regularization coefficient on leaf weights
  • — minimum gain required to create any split (pruning threshold)
  • — number of leaves (penalizes tree complexity)

For MSE loss: (opposite sign of residual), for all samples.

Optimal Leaf Weight Formula

Group samples in leaf as . Define:

Taking :

When : = mean residual — exactly vanilla GB.

When : leaf weights shrink toward zero. The larger , the more regularized the tree.

Gain Formula for Split Finding

The objective improvement from splitting a leaf into left () and right ():

Only create the split if Gain . sets the minimum gain threshold — larger prunes more aggressively.

Manual Trace: 6-Sample Anchor (Tree 1, , )

Gradient Table

(same mean). For MSE: (note: positive when predicting too high).

sq_ft
1650180303.3+123.31
2850220303.3+83.31
31100280303.3+23.31
41400340303.3−36.71
51600370303.3−66.71
61900430303.3−126.71

, .

because — the mean prediction cancels all gradients at the root.

Evaluate Split at sq_ft ≤ 1250

Left (samples 1,2,3): , .

Right (samples 4,5,6): , .

TermComputationValue
Left score
Right score
Root score
Gain

Optimal Leaf Weights

Update with (XGBoost default)

  • sq_ft ≤ 1250:
  • sq_ft > 1250:

Compare to vanilla GB (ν=0.1, leaf = mean residual = ±76.6): update was 303.3 ± 7.66, giving 295.6/311.0. XGBoost with λ=1 uses smaller leaf weights (±57.5) but larger ν (0.3), landing at similar positions — same effect, different parameterization.

Effect of λ (L2 Regularization)

Gain
017600
113225
104293
100543

Larger : leaf weights shrink toward zero (the tree corrects the residual less aggressively). Gain decreases — at high , splits that would have been created () may be rejected. provides a hard cutoff: if , the split is never created regardless of .

min_child_weight

XGBoost only creates a split if each child node satisfies .

  • MSE loss: , so . min_child_weight=5 requires at least 5 samples per leaf — identical to min_samples_leaf in sklearn.
  • Logistic loss: . . For probabilities near 0 or 1, . A leaf with 100 near-certain predictions can have — preventing overly confident leaves.

Level-Wise vs Leaf-Wise Tree Growth

Level-wise (sklearn GB, XGBoost default): Leaf-wise (LightGBM default): Round 1: Root splits. Round 1: Root splits. Round 2: Both children split. Round 2: Best child (highest gain) splits. Round 3: All 4 grandchildren split. Round 3: Best remaining leaf splits. → Balanced tree (max_depth constraint) → Unbalanced but higher total gain Level-Wise (XGBoost) Leaf-Wise (LightGBM) <!-- Level-wise tree --> <rect x="105" y="20" width="70" height="28" rx="4" fill="#dbeafe" stroke="#3b82f6" stroke-width="1.5"/> <text x="140" y="38" text-anchor="middle" font-size="9" fill="#1e40af">Root</text> <line x1="120" y1="48" x2="75" y2="70" stroke="#94a3b8" stroke-width="1.5"/> <line x1="160" y1="48" x2="205" y2="70" stroke="#94a3b8" stroke-width="1.5"/> <rect x="45" y="70" width="60" height="28" rx="4" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/> <text x="75" y="88" text-anchor="middle" font-size="9" fill="#92400e">L child</text> <rect x="175" y="70" width="60" height="28" rx="4" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/> <text x="205" y="88" text-anchor="middle" font-size="9" fill="#92400e">R child</text> <line x1="60" y1="98" x2="38" y2="120" stroke="#94a3b8" stroke-width="1.5"/> <line x1="90" y1="98" x2="112" y2="120" stroke="#94a3b8" stroke-width="1.5"/> <line x1="190" y1="98" x2="168" y2="120" stroke="#94a3b8" stroke-width="1.5"/> <line x1="220" y1="98" x2="242" y2="120" stroke="#94a3b8" stroke-width="1.5"/> <rect x="18" y="120" width="40" height="24" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/> <text x="38" y="136" text-anchor="middle" font-size="8" fill="#15803d">LL</text> <rect x="92" y="120" width="40" height="24" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/> <text x="112" y="136" text-anchor="middle" font-size="8" fill="#15803d">LR</text> <rect x="148" y="120" width="40" height="24" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/> <text x="168" y="136" text-anchor="middle" font-size="8" fill="#15803d">RL</text> <rect x="222" y="120" width="40" height="24" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/> <text x="242" y="136" text-anchor="middle" font-size="8" fill="#15803d">RR</text> <text x="140" y="165" text-anchor="middle" font-size="8" fill="#64748b">All nodes at same depth split together</text> <text x="140" y="178" text-anchor="middle" font-size="8" fill="#64748b">Balanced → controlled by max_depth</text> <!-- Leaf-wise tree --> <rect x="380" y="20" width="70" height="28" rx="4" fill="#dbeafe" stroke="#3b82f6" stroke-width="1.5"/> <text x="415" y="38" text-anchor="middle" font-size="9" fill="#1e40af">Root</text> <line x1="395" y1="48" x2="355" y2="70" stroke="#94a3b8" stroke-width="1.5"/> <line x1="435" y1="48" x2="475" y2="70" stroke="#94a3b8" stroke-width="1.5"/> <rect x="325" y="70" width="60" height="28" rx="4" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/> <text x="355" y="88" text-anchor="middle" font-size="9" fill="#92400e">L child</text> <rect x="445" y="70" width="60" height="28" rx="4" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/> <text x="475" y="84" text-anchor="middle" font-size="9" fill="#15803d">R leaf</text> <text x="475" y="95" text-anchor="middle" font-size="7" fill="#15803d">(low gain)</text> <!-- Only the LEFT (high gain) child splits --> <line x1="340" y1="98" x2="318" y2="120" stroke="#94a3b8" stroke-width="1.5"/> <line x1="370" y1="98" x2="392" y2="120" stroke="#94a3b8" stroke-width="1.5"/> <rect x="298" y="120" width="40" height="24" rx="3" fill="#fef3c7" stroke="#f59e0b" stroke-width="1"/> <text x="318" y="136" text-anchor="middle" font-size="8" fill="#92400e">LL</text> <rect x="372" y="120" width="40" height="24" rx="3" fill="#fef3c7" stroke="#f59e0b" stroke-width="1"/> <text x="392" y="136" text-anchor="middle" font-size="8" fill="#92400e">LR</text> <text x="335" y="136" font-size="7" fill="#f59e0b">↑ high gain</text> <line x1="308" y1="144" x2="295" y2="165" stroke="#94a3b8" stroke-width="1.5"/> <rect x="275" y="165" width="40" height="20" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/> <text x="295" y="179" text-anchor="middle" font-size="8" fill="#15803d">LLL</text> <text x="415" y="198" text-anchor="middle" font-size="8" fill="#64748b">Best-gain leaf always splits next</text> <text x="415" y="210" text-anchor="middle" font-size="8" fill="#64748b">Unbalanced → deeper on high-gain paths</text>

Level-wise: all nodes at the same depth split together — controlled by max_depth. Leaf-wise: always split whichever existing leaf has the highest gain — faster convergence but risks deep paths that overfit. XGBoost uses level-wise by default; LightGBM uses leaf-wise with num_leaves to control depth.

Histogram-Based Split Finding

Vanilla GB (sklearn): for each feature, sort values → candidate thresholds → evaluations per feature per level → per level.

XGBoost: bin each feature into histogram buckets. Only thresholds per feature → per level. For samples, this reduces split-finding from to operations — ~4000× reduction.

The histogram is approximate: if the true optimal threshold falls between two bin boundaries, XGBoost uses the bin boundary. In practice, 256 bins is fine-grained enough that accuracy loss is negligible.

Missing Value Handling

XGBoost's sparsity-aware algorithm: during training, for each split candidate, evaluate both:

  1. Route all missing values LEFT
  2. Route all missing values RIGHT

Choose whichever direction reduces the objective more. At inference, the learned default direction is used for missing values. No imputation required — this is built into the split-finding algorithm.

python
import xgboost as xgb
import numpy as np

# Example: create data with NaN
X_with_nan = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0]])
y = np.array([0, 1, 1])

dtrain = xgb.DMatrix(X_with_nan, label=y)
# XGBoost handles NaN internally — no fillna needed

Test Your Understanding

  1. At the root, because . The root score in the Gain formula is . Does this mean the root contributes nothing to the Gain, or does it serve a different purpose in the formula? What would happen to the Gain formula if you started with a non-mean (say, )?

  2. Optimal leaf weight is . For MSE, (not the usual residual sign). Verify: if we define pseudo-residuals as (same sign as vanilla GB), then . Show that when — i.e., XGBoost and vanilla GB agree at .

  3. With : . The prediction update is . The true . The model barely moves from the global mean. How many rounds would it take to approximately converge to the correct leaf prediction if every round gives a step of only 2.23?

  4. Histogram binning uses bins. For a feature with only 10 unique values (like bedrooms in the house price dataset), the histogram has only 9 boundaries regardless of . In this case, histogram XGBoost and exact vanilla GB give identical splits. For which types of features does histogram approximation actually matter, and for which is it irrelevant?

  5. In leaf-wise tree growth (LightGBM), the model always splits the leaf with the highest gain. This can create one very deep branch while other branches stay as leaves. Why does LightGBM use num_leaves (total number of leaf nodes) instead of max_depth to control model complexity? What's a model with num_leaves=31 and max_depth=6 vs one with num_leaves=31 and max_depth=None — could they have the same structure?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment