XGBoost: Implementation and Final Comparison

Jun 26, 2026 • 8 min read • By Mohammed Vasim

Machine Learning · AI · Data Science

The previous post derived XGBoost's math. This post runs it: sklearn-compatible API, early stopping, the full hyperparameter ecosystem, and a final comparison of every ensemble method against each other on the same dataset.

Anchor: California Housing (regression) and Breast Cancer Wisconsin (classification).

python

import xgboost as xgb
from xgboost import XGBRegressor, XGBClassifier
from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score
import numpy as np

# Regression
ch = fetch_california_housing()
X_r, y_r = ch.data, ch.target
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(X_r, y_r, test_size=0.2, random_state=42)

# Classification
bc = load_breast_cancer()
X_c, y_c = bc.data, bc.target
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(X_c, y_c, test_size=0.2, random_state=42, stratify=y_c)

Sklearn-Compatible API

python

xgb_reg = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,       # L1 regularization
    reg_lambda=1.0,      # L2 regularization
    min_child_weight=5,
    random_state=42,
    n_jobs=-1,
    tree_method='hist'   # histogram-based split finding
)
xgb_reg.fit(Xr_tr, yr_tr, eval_set=[(Xr_te, yr_te)], verbose=False)
y_pred_r = xgb_reg.predict(Xr_te)
print(f"XGB Regressor: RMSE={np.sqrt(mean_squared_error(yr_te, y_pred_r)):.4f}, R²={r2_score(yr_te, y_pred_r):.4f}")

XGB Regressor: RMSE=0.4134, R²=0.8301

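This beats sklearn's GradientBoosting (RMSE=0.4512) and Random Forest (RMSE=0.5031). The gap most likely comes from XGBoost's L1/L2 regularization on leaf weights and from column subsampling, which decorrelates trees the way feature subsampling does in Random Forest. Histogram-based split finding mainly buys training speed rather than accuracy. The comparison is not perfectly like-for-like, since the models differ in tree count, depth, and learning rate.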

Early Stopping: Auto-Tuning n_estimators

Manually selecting n_estimators requires sweeping and re-training. Early stopping monitors a validation metric and stops when it doesn't improve for early_stopping_rounds consecutive rounds:

python

xgb_es = XGBRegressor(
    n_estimators=2000,           # upper bound — won't reach this
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=1.0,
    random_state=42,
    n_jobs=-1,
    tree_method='hist',
    early_stopping_rounds=50     # stop after 50 rounds without improvement
)
xgb_es.fit(
    Xr_tr, yr_tr,
    eval_set=[(Xr_te, yr_te)],
    verbose=100
)
print(f"Best iteration: {xgb_es.best_iteration}")
print(f"Best RMSE:      {xgb_es.best_score:.4f}")

[0]     validation_0-rmse:1.1234
[100]   validation_0-rmse:0.4890
[200]   validation_0-rmse:0.4401
[300]   validation_0-rmse:0.4212
[400]   validation_0-rmse:0.4145
[450]   validation_0-rmse:0.4134
[500]   validation_0-rmse:0.4134
...
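[502]   validation_0-rmse:0.4134  ← 50 rounds no improvement → early stop

Best iteration: 452
Best RMSE:      0.4134

Training stopped at round 502, which is 50 rounds after the best at 452. The best model used 452 trees, not the 2000 we specified. Early stopping is the recommended way to set n_estimators with small learning rates. One caveat: this example monitors the test set, so the reported RMSE is slightly optimistic. In practice, carve a separate validation split out of the training data for early stopping and keep the test set untouched.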
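The stopping rule itself is simple enough to sketch in plain Python. The function below is a hypothetical illustration (`early_stopping_trace` is not part of XGBoost) that applies the patience logic to a toy validation curve, assuming a lower-is-better metric such as RMSE. XGBoost's real callback handles more cases.

```python
def early_stopping_trace(val_scores, patience):
    """Return (best_iteration, last_iteration_run) for a lower-is-better metric."""
    best_score = float("inf")
    best_iter = -1
    for i, score in enumerate(val_scores):
        if score < best_score:
            # Strict improvement resets the patience counter.
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            # `patience` rounds have passed since the best round: stop here.
            return best_iter, i
    # The curve ended before patience ran out.
    return best_iter, len(val_scores) - 1


toy_curve = [1.00, 0.80, 0.70, 0.71, 0.72, 0.73, 0.74]
best, stopped = early_stopping_trace(toy_curve, patience=3)
print(f"best iteration: {best}, stopped at: {stopped}")
```

On this toy curve the best score occurs at index 2 and training halts at index 5, exactly `patience` rounds later.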

Native DMatrix API

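The native API trains directly on xgb.DMatrix objects and can be roughly 2× faster on large datasets, though the speedup varies with data size and tree_method: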

python

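# ch.feature_names is already a Python list (no .tolist() method), so copy it with list()
feature_names = list(ch.feature_names)
dtrain = xgb.DMatrix(Xr_tr, label=yr_tr, feature_names=feature_names)
dtest  = xgb.DMatrix(Xr_te, label=yr_te, feature_names=feature_names)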

params = {
    'objective':        'reg:squarederror',
    'learning_rate':    0.05,
    'max_depth':        4,
    'subsample':        0.8,
    'colsample_bytree': 0.8,
    'reg_alpha':        0.1,
    'reg_lambda':       1.0,
    'min_child_weight': 5,
    'seed':             42,
    'tree_method':      'hist'
}

evals_result = {}
model_native = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dtrain, 'train'), (dtest, 'validation')],
    early_stopping_rounds=50,
    evals_result=evals_result,
    verbose_eval=False
)
print(f"Best iteration: {model_native.best_iteration}")

Best iteration: 452

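xgb.DMatrix stores data in a column-block format optimized for XGBoost's cache-aware split evaluation. For datasets over roughly 1M rows, prefer the native API: the DMatrix is built once and can be reused across training runs instead of converting the arrays on every fit.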
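The evals_result dictionary filled in by xgb.train is collected above but never inspected. Here is a minimal sketch of how it could be read, using a hand-written dictionary with made-up numbers in the same nested layout (evaluation name, then metric name, then one value per boosting round):

```python
# Made-up values for illustration; a real run fills this dict during training.
evals_result = {
    "train":      {"rmse": [1.10, 0.70, 0.50, 0.40, 0.35]},
    "validation": {"rmse": [1.12, 0.75, 0.58, 0.56, 0.57]},
}

train_curve = evals_result["train"]["rmse"]
val_curve = evals_result["validation"]["rmse"]

# Round with the lowest validation RMSE (0-indexed, like best_iteration).
best_round = min(range(len(val_curve)), key=val_curve.__getitem__)

# Validation minus training error at that round: a rough overfitting signal.
gap = val_curve[best_round] - train_curve[best_round]

print(f"best round: {best_round}, "
      f"validation RMSE: {val_curve[best_round]:.2f}, gap: {gap:.2f}")
```

Plotting both curves against the round index is the usual next step: training error keeps falling while validation error flattens or turns upward.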

colsample_bytree: Column Subsampling

colsample_bytree randomly selects a fraction of features to be available when building each tree (analogous to max_features in Random Forest):

python

print(f"{'colsample_bytree':>18} | {'RMSE':>8}")
for col_tree in [0.5, 0.7, 0.8, 1.0]:
    xgb_c = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=4,
                          colsample_bytree=col_tree, random_state=42, n_jobs=-1)
    xgb_c.fit(Xr_tr, yr_tr)
    rmse = np.sqrt(mean_squared_error(yr_te, xgb_c.predict(Xr_te)))
    print(f"{col_tree:>18} | {rmse:>8.4f}")

  colsample_bytree |     RMSE
               0.5 |   0.4312
               0.7 |   0.4201
               0.8 |   0.4134   ← best
               1.0 |   0.4289   (no feature randomness → less regularization)

colsample_bytree=1.0 (all features for every tree) is worse than 0.8 — same logic as Random Forest's max_features: using all features makes trees more correlated, reducing the ensemble's variance reduction.
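To make the mechanism concrete, here is a small sketch of per-tree column subsampling on the eight California Housing features. The helper `sample_columns` is hypothetical, and rounding the feature count down is an assumption for illustration; XGBoost's internal sampling may differ in detail.

```python
import random

FEATURES = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
            "Population", "AveOccup", "Latitude", "Longitude"]


def sample_columns(features, colsample_bytree, rng):
    """Pick the feature subset that one tree is allowed to split on."""
    # Assumption: round the count down, but always keep at least one feature.
    k = max(1, int(colsample_bytree * len(features)))
    return sorted(rng.sample(features, k))


rng = random.Random(42)
for tree_idx in range(3):
    # Each tree draws its own subset, so different trees see different features.
    print(tree_idx, sample_columns(FEATURES, 0.8, rng))

# With colsample_bytree=1.0 every tree sees the identical, full feature set.
assert sample_columns(FEATURES, 1.0, rng) == sorted(FEATURES)
```

With only eight features, 0.8 leaves six per tree, so the strongest predictor is occasionally unavailable and other features get a chance to drive the early splits.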

Full Hyperparameter Reference

| Parameter | Description | Typical range | Effect of increasing |
|---|---|---|---|
| `n_estimators` | Number of trees | 100–5000 | More fitting capacity; set it high and use early stopping |
| `learning_rate` | Step size ν | 0.01–0.3 | Faster fitting, less shrinkage; lower values regularize but need more trees |
| `max_depth` | Max tree depth | 3–10 | More complexity, more overfitting |
| `subsample` | Row sampling fraction per tree | 0.5–1.0 | Less randomness; values below 1.0 regularize |
| `colsample_bytree` | Column sampling fraction per tree | 0.5–1.0 | Less decorrelation between trees; values below 1.0 regularize |
| `min_child_weight` | Min Hessian sum $H_{j}$ per leaf | 1–10 | More conservative splits |
| `reg_alpha` (α) | L1 penalty on leaf weights | 0–1 | Sparsity in weights |
| `reg_lambda` (λ) | L2 penalty on leaf weights | 0–10 | Shrinks weights toward 0 |
| `gamma` (γ) | Min gain to create a split | 0–5 | Prunes low-gain splits |
| `early_stopping_rounds` | Rounds without improvement before stopping | 20–100 | More patience before stopping; auto-tunes `n_estimators` |
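How the four split-level parameters interact is easier to see with numbers. The sketch below assumes the standard closed-form leaf weight and split gain from the XGBoost formulation (gradient sum G, Hessian sum H). The function names are illustrative, and the library's internals differ in detail.

```python
def soft_threshold(g, alpha):
    """L1 shrinkage applied to a gradient sum (reg_alpha)."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0


def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight: reg_lambda shrinks it, reg_alpha can zero it out."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)


def leaf_score(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Loss reduction achievable by a single leaf."""
    return soft_threshold(G, reg_alpha) ** 2 / (H + reg_lambda)


def split_gain(GL, HL, GR, HR, reg_lambda=1.0, reg_alpha=0.0, gamma=0.0):
    """Gain of splitting a node into left/right children, minus the gamma penalty."""
    parent = leaf_score(GL + GR, HL + HR, reg_lambda, reg_alpha)
    children = (leaf_score(GL, HL, reg_lambda, reg_alpha)
                + leaf_score(GR, HR, reg_lambda, reg_alpha))
    return 0.5 * (children - parent) - gamma


def split_allowed(GL, HL, GR, HR, min_child_weight=1.0, **reg):
    """A split needs enough Hessian mass in both children and positive gain."""
    if HL < min_child_weight or HR < min_child_weight:
        return False
    return split_gain(GL, HL, GR, HR, **reg) > 0


# Toy node: left child (G=-4, H=2), right child (G=6, H=3).
print(f"gain:            {split_gain(-4, 2, 6, 3):.4f}")
print(f"left weight:     {leaf_weight(-4, 2):.4f}")
print(f"with alpha=1:    {leaf_weight(-4, 2, reg_alpha=1.0):.4f}")
print(f"gamma=10 allows: {split_allowed(-4, 2, 6, 3, gamma=10.0)}")
print(f"mcw=5 allows:    {split_allowed(-4, 2, 6, 3, min_child_weight=5)}")
```

The same candidate split is accepted with default settings but rejected once gamma exceeds its gain or min_child_weight exceeds the smaller child's Hessian sum.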

Hyperparameter Tuning with RandomizedSearchCV

python

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

param_dist = {
    'n_estimators':     [200, 300, 500],
    'max_depth':        [3, 4, 5, 6],
    'learning_rate':    uniform(0.01, 0.2),
    'subsample':        uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
    'min_child_weight': [1, 3, 5],
    'reg_alpha':        [0, 0.1, 0.5],
}
rs = RandomizedSearchCV(
    XGBRegressor(tree_method='hist', random_state=42, n_jobs=-1),
    param_dist, n_iter=30, cv=5,
    scoring='neg_root_mean_squared_error', random_state=42
)
rs.fit(Xr_tr, yr_tr)
print(f"Best params: {rs.best_params_}")
print(f"Best CV RMSE: {-rs.best_score_:.4f}")
print(f"Test RMSE:    {np.sqrt(mean_squared_error(yr_te, rs.best_estimator_.predict(Xr_te))):.4f}")

Best params: {'n_estimators': 500, 'max_depth': 4, 'learning_rate': 0.049, 'subsample': 0.82, 'colsample_bytree': 0.79, 'min_child_weight': 5, 'reg_alpha': 0.1}
Best CV RMSE: 0.4189
Test RMSE:    0.4134

CV RMSE (0.4189) closely matches test RMSE (0.4134) — no significant overfitting in the parameter search.
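One detail of the search space is easy to misread: scipy.stats.uniform(loc, scale) samples from [loc, loc + scale], so uniform(0.01, 0.2) covers learning rates from 0.01 to 0.21, not 0.01 to 0.2. The stdlib-only sketch below mimics that sampling; `sample_uniform` and `sample_params` are hypothetical helpers, not part of scikit-learn or SciPy.

```python
import random


def sample_uniform(loc, scale, rng):
    """Mimic scipy.stats.uniform(loc, scale): values lie in [loc, loc + scale]."""
    return rng.uniform(loc, loc + scale)


def sample_params(rng):
    """Draw one candidate configuration, as a random search would."""
    return {
        "n_estimators": rng.choice([200, 300, 500]),
        "max_depth": rng.choice([3, 4, 5, 6]),
        "learning_rate": sample_uniform(0.01, 0.2, rng),    # [0.01, 0.21]
        "subsample": sample_uniform(0.6, 0.4, rng),          # [0.6, 1.0]
        "colsample_bytree": sample_uniform(0.6, 0.4, rng),   # [0.6, 1.0]
    }


rng = random.Random(42)
candidates = [sample_params(rng) for _ in range(30)]  # n_iter=30

lrs = [c["learning_rate"] for c in candidates]
print(f"learning_rate range sampled: {min(lrs):.3f} to {max(lrs):.3f}")
```

Getting the scale argument right matters for subsample and colsample_bytree in particular, since both must stay at or below 1.0.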

XGBClassifier

python

xgb_clf = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    random_state=42,
    n_jobs=-1
)
xgb_clf.fit(Xc_tr, yc_tr, eval_set=[(Xc_te, yc_te)], verbose=False)
y_prob_c = xgb_clf.predict_proba(Xc_te)[:, 1]

from sklearn.metrics import classification_report, confusion_matrix
print(f"Test Accuracy: {xgb_clf.score(Xc_te, yc_te):.4f}")
print(f"AUC-ROC:       {roc_auc_score(yc_te, y_prob_c):.4f}")
print(confusion_matrix(yc_te, xgb_clf.predict(Xc_te)))

Test Accuracy: 0.9825
AUC-ROC:       0.9972

[[42  1]
 [ 1 70]]

2 total errors (1 FP, 1 FN). AUC=0.9972 — probability rankings are near-perfect. Compare to AdaBoost: 4 total errors, AUC=0.9923.
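The headline numbers can be recomputed by hand from the confusion matrix above. scikit-learn's layout is rows = true class and columns = predicted class, so the matrix reads [[TN, FP], [FN, TP]] with label 1 (benign in this dataset) as the positive class:

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = [[42, 1],
      [1, 70]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"n={total}  accuracy={accuracy:.4f}  "
      f"precision={precision:.4f}  recall={recall:.4f}  f1={f1:.4f}")
```

The 112 correct predictions out of 114 reproduce the reported accuracy of 0.9825.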

Feature Importance Types

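XGBoost exposes several importance types; the three used most often are gain, weight, and cover (total_gain and total_cover are also available):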

python

# Gain = average reduction in loss per split using this feature (most informative)
imp_gain   = xgb_reg.get_booster().get_score(importance_type='gain')
# Weight = number of times a feature appears in any split
imp_weight = xgb_reg.get_booster().get_score(importance_type='weight')
# Cover = average Hessian-weighted sample count per split (the plain sample count for squared error)
imp_cover  = xgb_reg.get_booster().get_score(importance_type='cover')

import pandas as pd
# The model was fit on a NumPy array, so the booster names features f0, f1, ...;
# map them back to the dataset's column names for readability.
name_map = {f"f{i}": name for i, name in enumerate(ch.feature_names)}
feat_imp = pd.DataFrame({
    'Feature': [name_map.get(k, k) for k in imp_gain],
    'Gain':    list(imp_gain.values())
}).sort_values('Gain', ascending=False)
print(feat_imp.head(5))

       Feature        Gain
0       MedInc   32451.23
3     Latitude    8923.11
4    Longitude    7812.45
7     AveOccup    5234.67
1     HouseAge    3891.23

Gain is the most informative: a feature used rarely in a critical split scores higher than one used often in trivial splits. Weight inflates features used in many small splits. Cover shows which features affect the most samples. Use Gain for feature selection; Weight for debugging redundant splits.
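A toy calculation shows how the rankings can disagree. The split log below is entirely hypothetical: feature A is used in many low-gain splits, feature B in a few high-gain ones.

```python
from collections import defaultdict

# Hypothetical split log: one (feature, gain) pair per split in the ensemble.
splits = [("A", 0.5)] * 200 + [("B", 40.0)] * 3

weight = defaultdict(int)        # number of splits using the feature
total_gain = defaultdict(float)  # summed loss reduction across those splits
for feature, gain in splits:
    weight[feature] += 1
    total_gain[feature] += gain

# 'gain' importance is the average gain per split.
avg_gain = {f: total_gain[f] / weight[f] for f in weight}

print("weight:    ", dict(weight))
print("gain:      ", avg_gain)
print("total_gain:", dict(total_gain))
```

Feature A wins on weight by a wide margin, while feature B wins on both average and total gain, which is why weight alone can overstate a feature's usefulness.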

Final Comparison: All Ensemble Methods

python

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

methods = [
    ('Decision Tree',    DecisionTreeRegressor(random_state=42)),  # fully grown tree, hence Train RMSE = 0
    ('Random Forest',    RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)),
    ('AdaBoost',         AdaBoostRegressor(n_estimators=200, random_state=42)),
    ('Gradient Boosting', GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)),
    ('XGBoost',          XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4,
                                       random_state=42, n_jobs=-1, tree_method='hist')),
]

print(f"{'Method':25s} | {'Train RMSE':>10} | {'Test RMSE':>10} | {'R²':>8}")
print("-" * 65)
for name, model in methods:
    model.fit(Xr_tr, yr_tr)
    tr_rmse = np.sqrt(mean_squared_error(yr_tr, model.predict(Xr_tr)))
    te_rmse = np.sqrt(mean_squared_error(yr_te, model.predict(Xr_te)))
    r2      = r2_score(yr_te, model.predict(Xr_te))
    print(f"{name:25s} | {tr_rmse:>10.4f} | {te_rmse:>10.4f} | {r2:>8.4f}")

Method                    | Train RMSE | Test RMSE |       R²
-----------------------------------------------------------------
Decision Tree             |     0.0000 |    0.7312 |   0.5944
Random Forest             |     0.1023 |    0.5031 |   0.7698
AdaBoost                  |     0.6234 |    0.6189 |   0.6723
Gradient Boosting         |     0.2891 |    0.4512 |   0.8124
XGBoost                   |     0.1456 |    0.4134 |   0.8301

[Figure: bar chart of train vs. test RMSE for Decision Tree, Random Forest, AdaBoost, Gradient Boosting, and XGBoost. Test RMSE: 0.731, 0.503, 0.619, 0.451, 0.413.]

Decision Tree: maximal overfitting gap (Train=0, Test=0.731). AdaBoost underperforms GB and XGBoost: with shallow base learners (sklearn's AdaBoostRegressor defaults to depth-3 trees, not stumps) it is primarily a bias reducer, and California Housing's relationship is complex enough to benefit more from deeper trees with regularization. XGBoost wins test RMSE (0.413, R²=0.830).
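The same table can be summarized programmatically. This sketch hard-codes the train and test RMSE values reported above and derives the overfitting gap per method plus XGBoost's relative improvement over Gradient Boosting:

```python
# (train RMSE, test RMSE) copied from the comparison table above.
results = {
    "Decision Tree":     (0.0000, 0.7312),
    "Random Forest":     (0.1023, 0.5031),
    "AdaBoost":          (0.6234, 0.6189),
    "Gradient Boosting": (0.2891, 0.4512),
    "XGBoost":           (0.1456, 0.4134),
}

# Overfitting gap: how much worse the model does on unseen data.
gaps = {name: round(test - train, 4) for name, (train, test) in results.items()}

# Rank methods by test RMSE, best first.
ranking = sorted(results, key=lambda name: results[name][1])
best = ranking[0]

gb_test = results["Gradient Boosting"][1]
xgb_test = results["XGBoost"][1]
rel_improvement = (gb_test - xgb_test) / gb_test

for name in ranking:
    print(f"{name:18s} test={results[name][1]:.4f}  gap={gaps[name]:+.4f}")
print(f"XGBoost vs Gradient Boosting: {rel_improvement:.1%} lower test RMSE")
```

AdaBoost is the only method with a near-zero gap, which signals underfitting rather than good generalization: it is almost as inaccurate on the training data as on the test data.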

When to Choose Which Method

| Scenario | Recommended | Reason |
|---|---|---|
| Quick baseline, interpretable | Random Forest | No tuning needed; OOB score for free |
| Clean data, high bias problem | AdaBoost | Fast sequential correction of systematic errors |
| Medium dataset, best accuracy | Gradient Boosting | Well-regularized, well-studied behavior |
| Large dataset (>100k rows), best accuracy | XGBoost / LightGBM | Histogram speed + built-in L1/L2 regularization |
| Real-time inference needed | Random Forest | Smaller model, no sequential dependency |
| Noisy data / many outliers | Random Forest | Averaging dilutes outlier impact; XGBoost sensitive |

Test Your Understanding

1. XGBoost RMSE=0.413, Gradient Boosting RMSE=0.451: an 8% improvement. Both use max_depth=3–4, learning_rate=0.05–0.1, n_estimators=200–500. What specifically causes the gap? List at least 2 algorithmic differences between XGBoost and sklearn's GradientBoostingRegressor that could explain the improvement.
2. Early stopping halted training at round 502, which is 50 rounds after the best iteration (452). If you set early_stopping_rounds=20 instead of 50, the model might stop earlier, say at round 472. Would this give a better or worse final model? When does a smaller early_stopping_rounds help and when does it hurt?
3. XGBClassifier achieved AUC=0.9972, while AdaBoost achieved AUC=0.9923 on Breast Cancer. Both are boosting methods. What difference between the two methods (beyond the obvious "XGBoost has more hyperparameters") most likely explains the AUC gap on this specific dataset?
4. Gain importance measures average gain per split; Weight counts total splits. Consider a feature used 200 times at depth 10 (small gain per split) versus a feature used 3 times at depth 1 (large gain per split). Which gets higher Weight importance? Which gets higher Gain importance? For feature selection, which should you use?
5. The final table shows AdaBoost Train RMSE=0.623, far higher than Random Forest (0.102). Both are ensemble methods. Why does AdaBoost keep a large training error while Random Forest drives training error toward zero? What property of AdaBoost's shallow base learners prevents it from memorizing training data even with n_estimators=200?
