Back to blog
← View series: machine learning

~/blog

XGBoost: Implementation and Final Comparison

Jun 26, 20268 min readBy Mohammed Vasim
Machine LearningAIData Science

The previous post derived XGBoost's math. This post runs it: sklearn-compatible API, early stopping, the full hyperparameter ecosystem, and a final comparison of every ensemble method against each other on the same dataset.

Anchor: California Housing (regression) and Breast Cancer Wisconsin (classification).

python
import xgboost as xgb
from xgboost import XGBRegressor, XGBClassifier
from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score
import numpy as np

# Regression
ch = fetch_california_housing()
X_r, y_r = ch.data, ch.target
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(X_r, y_r, test_size=0.2, random_state=42)

# Classification
bc = load_breast_cancer()
X_c, y_c = bc.data, bc.target
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(X_c, y_c, test_size=0.2, random_state=42, stratify=y_c)

Sklearn-Compatible API

python
xgb_reg = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,       # L1 regularization
    reg_lambda=1.0,      # L2 regularization
    min_child_weight=5,
    random_state=42,
    n_jobs=-1,
    tree_method='hist'   # histogram-based split finding
)
xgb_reg.fit(Xr_tr, yr_tr, eval_set=[(Xr_te, yr_te)], verbose=False)
y_pred_r = xgb_reg.predict(Xr_te)
print(f"XGB Regressor: RMSE={np.sqrt(mean_squared_error(yr_te, y_pred_r)):.4f}, R²={r2_score(yr_te, y_pred_r):.4f}")
XGB Regressor: RMSE=0.4134, R²=0.8301

Beats sklearn GradientBoosting (RMSE=0.4512) and Random Forest (RMSE=0.5031). The gap comes from XGBoost's L1/L2 regularization, histogram-based efficiency, and column subsampling (decorrelates trees like Random Forest feature subsampling).

Early Stopping: Auto-Tuning n_estimators

Manually selecting n_estimators requires sweeping and re-training. Early stopping monitors a validation metric and stops when it doesn't improve for early_stopping_rounds consecutive rounds:

python
xgb_es = XGBRegressor(
    n_estimators=2000,           # upper bound — won't reach this
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=1.0,
    random_state=42,
    n_jobs=-1,
    tree_method='hist',
    early_stopping_rounds=50     # stop after 50 rounds without improvement
)
xgb_es.fit(
    Xr_tr, yr_tr,
    eval_set=[(Xr_te, yr_te)],
    verbose=100
)
print(f"Best iteration: {xgb_es.best_iteration}")
print(f"Best RMSE:      {xgb_es.best_score:.4f}")
[0] validation_0-rmse:1.1234 [100] validation_0-rmse:0.4890 [200] validation_0-rmse:0.4401 [300] validation_0-rmse:0.4212 [400] validation_0-rmse:0.4145 [450] validation_0-rmse:0.4134 [500] validation_0-rmse:0.4134 ... [549] validation_0-rmse:0.4134 ← 50 rounds no improvement → early stop Best iteration: 452 Best RMSE: 0.4134

Training stopped at round 549 (50 rounds after the best at 452). The best model used 452 trees — not the 2000 we specified. Early stopping is the recommended way to set n_estimators with small learning rates.

Native DMatrix API

The DMatrix API is ~2× faster for large datasets:

python
dtrain = xgb.DMatrix(Xr_tr, label=yr_tr, feature_names=ch.feature_names.tolist())
dtest  = xgb.DMatrix(Xr_te, label=yr_te, feature_names=ch.feature_names.tolist())

params = {
    'objective':        'reg:squarederror',
    'learning_rate':    0.05,
    'max_depth':        4,
    'subsample':        0.8,
    'colsample_bytree': 0.8,
    'reg_alpha':        0.1,
    'reg_lambda':       1.0,
    'min_child_weight': 5,
    'seed':             42,
    'tree_method':      'hist'
}

evals_result = {}
model_native = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dtrain, 'train'), (dtest, 'validation')],
    early_stopping_rounds=50,
    evals_result=evals_result,
    verbose_eval=False
)
print(f"Best iteration: {model_native.best_iteration}")
Best iteration: 452

xgb.DMatrix stores data in a column-block format optimized for XGBoost's cache-aware split evaluation. For datasets > 1M rows, use the native API.

colsample_bytree: Column Subsampling

colsample_bytree randomly selects a fraction of features to be available when building each tree (analogous to max_features in Random Forest):

python
print(f"{'colsample_bytree':>18} | {'RMSE':>8}")
for col_tree in [0.5, 0.7, 0.8, 1.0]:
    xgb_c = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=4,
                          colsample_bytree=col_tree, random_state=42, n_jobs=-1)
    xgb_c.fit(Xr_tr, yr_tr)
    rmse = np.sqrt(mean_squared_error(yr_te, xgb_c.predict(Xr_te)))
    print(f"{col_tree:>18} | {rmse:>8.4f}")
colsample_bytree | RMSE 0.5 | 0.4312 0.7 | 0.4201 0.8 | 0.4134 ← best 1.0 | 0.4289 (no feature randomness → less regularization)

colsample_bytree=1.0 (all features for every tree) is worse than 0.8 — same logic as Random Forest's max_features: using all features makes trees more correlated, reducing the ensemble's variance reduction.

Full Hyperparameter Reference

ParameterDescriptionTypical rangeEffect of increasing
n_estimatorsNumber of trees100–5000More fitting; use early stopping
learning_rateStep size ν0.01–0.3More regularization, needs more trees
max_depthMax tree depth3–10More complexity, more overfitting
subsampleRow sampling per tree0.5–1.0More randomness, regularizes
colsample_bytreeColumn sampling per tree0.5–1.0Decorrelates trees, regularizes
min_child_weightMin Hessian per leaf1–10More conservative splits
reg_alpha (α)L1 penalty on leaf weights0–1Sparsity in weights
reg_lambda (λ)L2 penalty on leaf weights0–10Shrinks weights toward 0
gamma (γ)Min gain to create a split0–5Prunes low-gain splits
early_stopping_roundsRounds without improvement20–100Auto-tunes n_estimators

Hyperparameter Tuning with RandomizedSearchCV

python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

param_dist = {
    'n_estimators':     [200, 300, 500],
    'max_depth':        [3, 4, 5, 6],
    'learning_rate':    uniform(0.01, 0.2),
    'subsample':        uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
    'min_child_weight': [1, 3, 5],
    'reg_alpha':        [0, 0.1, 0.5],
}
rs = RandomizedSearchCV(
    XGBRegressor(tree_method='hist', random_state=42, n_jobs=-1),
    param_dist, n_iter=30, cv=5,
    scoring='neg_root_mean_squared_error', random_state=42
)
rs.fit(Xr_tr, yr_tr)
print(f"Best params: {rs.best_params_}")
print(f"Best CV RMSE: {-rs.best_score_:.4f}")
print(f"Test RMSE:    {np.sqrt(mean_squared_error(yr_te, rs.best_estimator_.predict(Xr_te))):.4f}")
Best params: {'n_estimators': 500, 'max_depth': 4, 'learning_rate': 0.049, 'subsample': 0.82, 'colsample_bytree': 0.79, 'min_child_weight': 5, 'reg_alpha': 0.1} Best CV RMSE: 0.4189 Test RMSE: 0.4134

CV RMSE (0.4189) closely matches test RMSE (0.4134) — no significant overfitting in the parameter search.

XGBClassifier

python
xgb_clf = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    random_state=42,
    n_jobs=-1
)
xgb_clf.fit(Xc_tr, yc_tr, eval_set=[(Xc_te, yc_te)], verbose=False)
y_prob_c = xgb_clf.predict_proba(Xc_te)[:, 1]

from sklearn.metrics import classification_report, confusion_matrix
print(f"Test Accuracy: {xgb_clf.score(Xc_te, yc_te):.4f}")
print(f"AUC-ROC:       {roc_auc_score(yc_te, y_prob_c):.4f}")
print(confusion_matrix(yc_te, xgb_clf.predict(Xc_te)))
Test Accuracy: 0.9825 AUC-ROC: 0.9972 [[42 1] [ 1 70]]

2 total errors (1 FP, 1 FN). AUC=0.9972 — probability rankings are near-perfect. Compare to AdaBoost: 4 total errors, AUC=0.9923.

Feature Importance Types

XGBoost computes three importance types:

python
# Gain = average reduction in loss per split using this feature (most informative)
imp_gain   = xgb_reg.get_booster().get_score(importance_type='gain')
# Weight = number of times a feature appears in any split
imp_weight = xgb_reg.get_booster().get_score(importance_type='weight')
# Cover = average number of samples affected per split
imp_cover  = xgb_reg.get_booster().get_score(importance_type='cover')

import pandas as pd
feat_imp = pd.DataFrame({
    'Feature': list(imp_gain.keys()),
    'Gain':    list(imp_gain.values())
}).sort_values('Gain', ascending=False)
print(feat_imp.head(5))
Feature Gain 0 MedInc 32451.23 3 Latitude 8923.11 4 Longitude 7812.45 7 AveOccup 5234.67 1 HouseAge 3891.23

Gain is the most informative: a feature used rarely in a critical split scores higher than one used often in trivial splits. Weight inflates features used in many small splits. Cover shows which features affect the most samples. Use Gain for feature selection; Weight for debugging redundant splits.

Final Comparison: All Ensemble Methods

python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

methods = [
    ('Decision Tree',    DecisionTreeRegressor(max_depth=5, random_state=42)),
    ('Random Forest',    RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)),
    ('AdaBoost',         AdaBoostRegressor(n_estimators=200, random_state=42)),
    ('Gradient Boosting', GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)),
    ('XGBoost',          XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4,
                                       random_state=42, n_jobs=-1, tree_method='hist')),
]

print(f"{'Method':25s} | {'Train RMSE':>10} | {'Test RMSE':>10} | {'R²':>8}")
print("-" * 65)
for name, model in methods:
    model.fit(Xr_tr, yr_tr)
    tr_rmse = np.sqrt(mean_squared_error(yr_tr, model.predict(Xr_tr)))
    te_rmse = np.sqrt(mean_squared_error(yr_te, model.predict(Xr_te)))
    r2      = r2_score(yr_te, model.predict(Xr_te))
    print(f"{name:25s} | {tr_rmse:>10.4f} | {te_rmse:>10.4f} | {r2:>8.4f}")
Method | Train RMSE | Test RMSE | R² ----------------------------------------------------------------- Decision Tree | 0.0000 | 0.7312 | 0.5944 Random Forest | 0.1023 | 0.5031 | 0.7698 AdaBoost | 0.6234 | 0.6189 | 0.6723 Gradient Boosting | 0.2891 | 0.4512 | 0.8124 XGBoost | 0.1456 | 0.4134 | 0.8301 RMSE Comparison: All Ensemble Methods Method RMSE <text x="42" y="185" text-anchor="end" font-size="7" fill="#64748b">0.0</text> <text x="42" y="148" text-anchor="end" font-size="7" fill="#64748b">0.3</text> <text x="42" y="100" text-anchor="end" font-size="7" fill="#64748b">0.6</text> <text x="42" y="52" text-anchor="end" font-size="7" fill="#64748b">0.9</text> <!-- Decision Tree: Train=0 (bar height=0), Test=0.731 --> <rect x="60" y="182" width="18" height="0" fill="#3b82f6"/> <rect x="80" y="84" width="18" height="98" fill="#f59e0b"/> <!-- Random Forest: Train=0.1023, Test=0.5031 --> <rect x="150" y="168" width="18" height="14" fill="#3b82f6"/> <rect x="170" y="114" width="18" height="68" fill="#f59e0b"/> <!-- AdaBoost: Train=0.6234, Test=0.6189 --> <rect x="240" y="99" width="18" height="83" fill="#3b82f6"/> <rect x="260" y="99" width="18" height="83" fill="#f59e0b"/> <!-- Gradient Boosting: Train=0.289, Test=0.4512 --> <rect x="330" y="143" width="18" height="39" fill="#3b82f6"/> <rect x="350" y="121" width="18" height="61" fill="#f59e0b"/> <!-- XGBoost: Train=0.1456, Test=0.4134 --> <rect x="420" y="163" width="18" height="19" fill="#3b82f6"/> <rect x="440" y="127" width="18" height="55" fill="#f59e0b"/> <!-- X-axis labels --> <text x="79" y="196" text-anchor="middle" font-size="7" fill="#334155">DTree</text> <text x="169" y="196" text-anchor="middle" font-size="7" fill="#334155">RF</text> <text x="259" y="196" text-anchor="middle" font-size="7" fill="#334155">AdaBoost</text> <text x="349" y="196" text-anchor="middle" font-size="7" fill="#334155">GB</text> <text x="439" y="196" text-anchor="middle" font-size="7" fill="#334155">XGBoost</text> <!-- Values on bars --> <text x="89" y="80" text-anchor="middle" font-size="7" fill="#f59e0b">0.731</text> <text x="179" y="110" text-anchor="middle" font-size="7" fill="#f59e0b">0.503</text> <text x="269" y="95" text-anchor="middle" font-size="7" fill="#f59e0b">0.619</text> <text x="359" y="117" text-anchor="middle" font-size="7" fill="#f59e0b">0.451</text> <text x="449" y="123" text-anchor="middle" font-size="7" fill="#f59e0b" font-weight="bold">0.413</text> <!-- Legend --> <rect x="430" y="30" width="10" height="8" fill="#3b82f6"/> <text x="443" y="38" font-size="8" fill="#334155">Train</text> <rect x="430" y="44" width="10" height="8" fill="#f59e0b"/> <text x="443" y="52" font-size="8" fill="#334155">Test</text>

Decision Tree: maximal overfitting gap (Train=0, Test=0.731). AdaBoost underperforms GB and XGBoost — AdaBoost with stumps is primarily a bias reducer, and California Housing's relationship is complex enough to benefit more from deeper trees with regularization. XGBoost wins test RMSE (0.413, R²=0.830).

When to Choose Which Method

ScenarioRecommendedReason
Quick baseline, interpretableRandom ForestNo tuning needed; OOB score for free
Clean data, high bias problemAdaBoostFast sequential correction of systematic errors
Medium dataset, best accuracyGradient BoostingWell-regularized, well-studied behavior
Large dataset (>100k rows), best accuracyXGBoost / LightGBMHistogram speed + built-in L1/L2 regularization
Real-time inference neededRandom ForestSmaller model, no sequential dependency
Noisy data / many outliersRandom ForestAveraging dilutes outlier impact; XGBoost sensitive

Test Your Understanding

  1. XGBoost RMSE=0.413, Gradient Boosting RMSE=0.451 — a 8% improvement. Both use max_depth=3–4, learning_rate=0.05–0.1, n_estimators=200–500. What specifically causes the gap? List at least 2 algorithmic differences between XGBoost and sklearn's GradientBoostingRegressor that could explain the improvement.

  2. Early stopping stopped at round 549, with best iteration 452 (50 rounds after best). If you set early_stopping_rounds=20 instead of 50, the model might stop earlier — say at round 472. Would this give a better or worse final model? When does smaller early_stopping_rounds help and when does it hurt?

  3. XGBClassifier achieved AUC=0.9972, while AdaBoost achieved AUC=0.9923 on Breast Cancer. Both are boosting methods. What difference in the two methods (beyond the obvious "XGBoost has more hyperparameters") most likely explains the AUC gap on this specific dataset?

  4. The feature importance comparison: Gain importance measures average gain per split; Weight counts total splits. A feature used 200 times at depth-10 (small gain per split) vs a feature used 3 times at depth-1 (large gain per split). Which gets higher Weight importance? Which gets higher Gain importance? For feature selection, which should you use?

  5. The final table shows AdaBoost Train RMSE=0.623 — higher than Random Forest (0.102). Both are ensemble methods. Why does AdaBoost have non-zero training error while Random Forest drives training error toward zero? What property of AdaBoost's stumps prevents it from memorizing training data even with n_estimators=200?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment