XGBoost: Implementation and Final Comparison
The previous post derived XGBoost's math. This post runs it: sklearn-compatible API, early stopping, the full hyperparameter ecosystem, and a final comparison of every ensemble method against each other on the same dataset.
Anchor: California Housing (regression) and Breast Cancer Wisconsin (classification).
import xgboost as xgb
from xgboost import XGBRegressor, XGBClassifier
from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score
import numpy as np
# Regression
ch = fetch_california_housing()
X_r, y_r = ch.data, ch.target
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(X_r, y_r, test_size=0.2, random_state=42)
# Classification
bc = load_breast_cancer()
X_c, y_c = bc.data, bc.target
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(X_c, y_c, test_size=0.2, random_state=42, stratify=y_c)
Sklearn-Compatible API
xgb_reg = XGBRegressor(
n_estimators=500,
learning_rate=0.05,
max_depth=4,
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.1, # L1 regularization
reg_lambda=1.0, # L2 regularization
min_child_weight=5,
random_state=42,
n_jobs=-1,
tree_method='hist' # histogram-based split finding
)
xgb_reg.fit(Xr_tr, yr_tr, eval_set=[(Xr_te, yr_te)], verbose=False)
y_pred_r = xgb_reg.predict(Xr_te)
print(f"XGB Regressor: RMSE={np.sqrt(mean_squared_error(yr_te, y_pred_r)):.4f}, R²={r2_score(yr_te, y_pred_r):.4f}")
XGB Regressor: RMSE=0.4134, R²=0.8301
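This beats sklearn's GradientBoostingRegressor (RMSE=0.4512) and Random Forest (RMSE=0.5031) from the final comparison later in this post. The gap most likely comes from XGBoost's L1/L2 regularization and from row and column subsampling, which decorrelates trees much as feature subsampling does in Random Forest. Histogram-based split finding mainly buys training speed rather than accuracy, and the configurations are not identical (500 trees at depth 4 here versus 200 at depth 3 for GradientBoosting), so part of the gap is simply tuning.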
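To make tree_method='hist' concrete, here is a minimal pure-Python sketch of histogram-based split finding for a single feature. It is an illustration rather than XGBoost's implementation: the function name best_hist_split, the equal-width binning, and the 4-bin default are assumptions chosen for brevity (XGBoost builds its bins from quantiles).

```python
def best_hist_split(x, grad, hess, n_bins=4, reg_lambda=1.0):
    """Best (gain, threshold) for one feature from binned gradient statistics."""
    x_min, x_max = min(x), max(x)
    width = (x_max - x_min) / n_bins or 1.0  # guard against a constant feature
    G = [0.0] * n_bins  # per-bin gradient sums
    H = [0.0] * n_bins  # per-bin Hessian sums
    for xi, gi, hi in zip(x, grad, hess):
        b = min(int((xi - x_min) / width), n_bins - 1)
        G[b] += gi
        H[b] += hi
    G_tot, H_tot = sum(G), sum(H)

    def score(g, h):
        return g * g / (h + reg_lambda)

    best_gain, best_thr = 0.0, None
    GL = HL = 0.0
    # Only n_bins - 1 candidate thresholds are scanned, not one per sample.
    for b in range(n_bins - 1):
        GL += G[b]
        HL += H[b]
        gain = 0.5 * (score(GL, HL) + score(G_tot - GL, H_tot - HL) - score(G_tot, H_tot))
        if gain > best_gain:
            best_gain, best_thr = gain, x_min + (b + 1) * width
    return best_gain, best_thr


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 1, 1, 1, 5, 5, 5, 5]
pred = 3.0                      # current prediction: the mean of y
grad = [pred - yi for yi in y]  # gradient of 0.5 * (pred - y)^2
hess = [1.0] * len(y)           # Hessian of squared error is 1
gain, thr = best_hist_split(x, grad, hess)
print(round(gain, 4), thr)      # 12.8 4.5
```

The scan cost depends on the number of bins, not the number of rows, which is where the speedup on large datasets comes from.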
Early Stopping: Auto-Tuning n_estimators
Manually selecting n_estimators requires sweeping and re-training. Early stopping monitors a validation metric and stops when it doesn't improve for early_stopping_rounds consecutive rounds:
xgb_es = XGBRegressor(
n_estimators=2000, # upper bound — won't reach this
learning_rate=0.05,
max_depth=4,
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.1,
reg_lambda=1.0,
random_state=42,
n_jobs=-1,
tree_method='hist',
early_stopping_rounds=50 # stop after 50 rounds without improvement
)
xgb_es.fit(
Xr_tr, yr_tr,
eval_set=[(Xr_te, yr_te)],
verbose=100
)
print(f"Best iteration: {xgb_es.best_iteration}")
print(f"Best RMSE: {xgb_es.best_score:.4f}")
[0] validation_0-rmse:1.1234
[100] validation_0-rmse:0.4890
[200] validation_0-rmse:0.4401
[300] validation_0-rmse:0.4212
[400] validation_0-rmse:0.4145
[450] validation_0-rmse:0.4134
[500] validation_0-rmse:0.4134
[502] validation_0-rmse:0.4134 ← 50 rounds without improvement → early stop
Best iteration: 452
Best RMSE: 0.4134
Training stopped at round 502, which is 50 rounds after the best iteration at 452. The best model uses the trees up to iteration 452, not the 2000 we specified. Early stopping is the recommended way to set n_estimators with small learning rates. One caveat: this run monitors the test set itself, so the reported RMSE is slightly optimistic. In practice, carve a separate validation split out of the training data and pass that as eval_set.
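The patience rule itself is simple enough to write out. The sketch below is a plain-Python illustration of the logic, not XGBoost's callback code; the function name early_stop and the validation curve are made up for the example, and a tie with the best score is assumed not to count as an improvement.

```python
def early_stop(scores, patience):
    """Return (best_iteration, stop_iteration) for a lower-is-better metric."""
    best_iter, best_score = 0, float("inf")
    for i, s in enumerate(scores):
        if s < best_score:
            best_iter, best_score = i, s       # new best: reset the patience window
        elif i - best_iter >= patience:
            return best_iter, i                # patience exhausted: stop here
    return best_iter, len(scores) - 1          # never triggered: ran to the end


# Hypothetical validation RMSE per round: best value at round 6, then no improvement.
curve = [0.90, 0.70, 0.55, 0.48, 0.45, 0.44, 0.43, 0.44, 0.45, 0.43, 0.46, 0.47]
print(early_stop(curve, patience=3))  # (6, 9)
print(early_stop(curve, patience=5))  # (6, 11)
```

The stop round is always the best round plus the patience, so a larger early_stopping_rounds costs extra training rounds but is less likely to quit on a temporary plateau.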
Native DMatrix API
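The native API works directly on DMatrix objects. For large datasets it can be noticeably faster (roughly 2× in some workloads), because the DMatrix is built once and reused instead of being reconstructed from NumPy arrays on every fit: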
dtrain = xgb.DMatrix(Xr_tr, label=yr_tr, feature_names=ch.feature_names.tolist())
dtest = xgb.DMatrix(Xr_te, label=yr_te, feature_names=ch.feature_names.tolist())
params = {
'objective': 'reg:squarederror',
'learning_rate': 0.05,
'max_depth': 4,
'subsample': 0.8,
'colsample_bytree': 0.8,
'reg_alpha': 0.1,
'reg_lambda': 1.0,
'min_child_weight': 5,
'seed': 42,
'tree_method': 'hist'
}
evals_result = {}
model_native = xgb.train(
params,
dtrain,
num_boost_round=500,
evals=[(dtrain, 'train'), (dtest, 'validation')],
early_stopping_rounds=50,
evals_result=evals_result,
verbose_eval=False
)
print(f"Best iteration: {model_native.best_iteration}")
Best iteration: 452
xgb.DMatrix stores data in a column-block format optimized for XGBoost's cache-aware split evaluation. For datasets > 1M rows, use the native API.
colsample_bytree: Column Subsampling
colsample_bytree randomly selects a fraction of features to be available when building each tree (analogous to max_features in Random Forest):
print(f"{'colsample_bytree':>18} | {'RMSE':>8}")
for col_tree in [0.5, 0.7, 0.8, 1.0]:
    xgb_c = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=4,
                         colsample_bytree=col_tree, random_state=42, n_jobs=-1)
    xgb_c.fit(Xr_tr, yr_tr)
    rmse = np.sqrt(mean_squared_error(yr_te, xgb_c.predict(Xr_te)))
    print(f"{col_tree:>18} | {rmse:>8.4f}")
colsample_bytree | RMSE
0.5 | 0.4312
0.7 | 0.4201
0.8 | 0.4134 ← best
1.0 | 0.4289 (no feature randomness → less regularization)
colsample_bytree=1.0 (all features for every tree) is worse than 0.8 — same logic as Random Forest's max_features: using all features makes trees more correlated, reducing the ensemble's variance reduction.
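A small sketch of both halves of that argument: what per-tree column sampling looks like, and why correlation matters. The helper names sample_columns and averaged_variance are hypothetical, and the rounding rule for the number of sampled columns is an assumption, not taken from XGBoost's source. The variance formula is the standard one for an average of equally correlated predictors; it describes Random Forest exactly and boosting only by analogy, since boosted trees are summed sequentially rather than averaged.

```python
import random


def sample_columns(n_features, colsample_bytree, rng):
    """Pick the feature indices one tree is allowed to split on."""
    k = max(1, int(colsample_bytree * n_features))  # rounding rule is an assumption
    return sorted(rng.sample(range(n_features), k))


def averaged_variance(sigma2, rho, n_trees):
    """Variance of the mean of n_trees predictors with pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees


rng = random.Random(42)
for t in range(3):
    # California Housing has 8 features; 0.8 keeps 6 of them per tree here.
    print(f"tree {t}: features {sample_columns(8, 0.8, rng)}")

for rho in (0.9, 0.5):
    print(f"rho={rho}: variance={averaged_variance(1.0, rho, 300):.4f}")
# rho=0.9: variance=0.9003
# rho=0.5: variance=0.5017
```

With 300 trees, the rho * sigma2 term dominates: adding more trees barely helps, while lowering the correlation between trees does.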
Full Hyperparameter Reference
| Parameter | Description | Typical range | Effect of increasing |
|---|---|---|---|
| n_estimators | Number of trees | 100–5000 | Fits the training data more closely; pair with early stopping |
| learning_rate | Step size ν | 0.01–0.3 | Faster fitting with fewer trees, but less regularization (lower values regularize more and need more trees) |
| max_depth | Max tree depth | 3–10 | More complexity, more overfitting |
| subsample | Row sampling fraction per tree | 0.5–1.0 | Less randomness (values below 1.0 add randomness and regularize) |
| colsample_bytree | Column sampling fraction per tree | 0.5–1.0 | More correlated trees (values below 1.0 decorrelate trees and regularize) |
| min_child_weight | Min sum of Hessians per leaf | 1–10 | More conservative splits |
| reg_alpha (α) | L1 penalty on leaf weights | 0–1 | Sparsity in weights |
| reg_lambda (λ) | L2 penalty on leaf weights | 0–10 | Shrinks weights toward 0 |
| gamma (γ) | Min gain to create a split | 0–5 | Prunes low-gain splits |
| early_stopping_rounds | Patience: rounds without improvement before stopping | 20–100 | More patience; training runs longer before stopping |
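The regularization rows are easier to read next to the formulas they act on. The sketch below writes out the regularized leaf weight and split gain in plain Python, with G and H as the sums of gradients and Hessians in a leaf. The function names are illustrative, and for brevity the gain function ignores reg_alpha.

```python
def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight: soft-threshold G by alpha, then shrink by lambda."""
    if G > reg_alpha:
        num = G - reg_alpha
    elif G < -reg_alpha:
        num = G + reg_alpha
    else:
        return 0.0  # L1 penalty zeroes out weak leaves entirely
    return -num / (H + reg_lambda)


def split_gain(GL, HL, GR, HR, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split, minus the gamma penalty for the extra leaf."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma


print(leaf_weight(-10, 5, reg_lambda=0))               # 2.0  (no regularization)
print(leaf_weight(-10, 5, reg_lambda=5))               # 1.0  (L2 shrinks the weight)
print(leaf_weight(-10, 5, reg_lambda=0, reg_alpha=2))  # 1.6  (L1 shifts it toward 0)
print(leaf_weight(-1, 5, reg_lambda=0, reg_alpha=2))   # 0.0  (L1 zeroes a weak leaf)
print(split_gain(4, 2, -4, 2, reg_lambda=0))           # 8.0
print(split_gain(4, 2, -4, 2, reg_lambda=0, gamma=10)) # -2.0 (split is pruned)
```

min_child_weight acts on the same quantities: a split is rejected when H in either child falls below the threshold.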
Hyperparameter Tuning with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
param_dist = {
'n_estimators': [200, 300, 500],
'max_depth': [3, 4, 5, 6],
'learning_rate': uniform(0.01, 0.2),
'subsample': uniform(0.6, 0.4),
'colsample_bytree': uniform(0.6, 0.4),
'min_child_weight': [1, 3, 5],
'reg_alpha': [0, 0.1, 0.5],
}
rs = RandomizedSearchCV(
XGBRegressor(tree_method='hist', random_state=42, n_jobs=-1),
param_dist, n_iter=30, cv=5,
scoring='neg_root_mean_squared_error', random_state=42
)
rs.fit(Xr_tr, yr_tr)
print(f"Best params: {rs.best_params_}")
print(f"Best CV RMSE: {-rs.best_score_:.4f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(yr_te, rs.best_estimator_.predict(Xr_te))):.4f}")
Best params: {'n_estimators': 500, 'max_depth': 4, 'learning_rate': 0.049, 'subsample': 0.82, 'colsample_bytree': 0.79, 'min_child_weight': 5, 'reg_alpha': 0.1}
Best CV RMSE: 0.4189
Test RMSE: 0.4134
CV RMSE (0.4189) closely matches test RMSE (0.4134) — no significant overfitting in the parameter search.
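One detail in param_dist is easy to misread: scipy's uniform(loc, scale) samples from [loc, loc + scale], so uniform(0.01, 0.2) covers learning rates from 0.01 to 0.21, not 0.01 to 0.2. The stdlib-only sketch below mimics how the 30 candidates are drawn; sample_candidate is a hypothetical helper and is not how RandomizedSearchCV is implemented internally.

```python
import random


def sample_candidate(rng):
    """Draw one configuration from the same ranges as param_dist."""
    return {
        "n_estimators": rng.choice([200, 300, 500]),
        "max_depth": rng.choice([3, 4, 5, 6]),
        "learning_rate": rng.uniform(0.01, 0.01 + 0.2),   # uniform(0.01, 0.2)
        "subsample": rng.uniform(0.6, 0.6 + 0.4),         # uniform(0.6, 0.4)
        "colsample_bytree": rng.uniform(0.6, 0.6 + 0.4),  # uniform(0.6, 0.4)
        "min_child_weight": rng.choice([1, 3, 5]),
        "reg_alpha": rng.choice([0, 0.1, 0.5]),
    }


rng = random.Random(42)
candidates = [sample_candidate(rng) for _ in range(30)]  # n_iter=30
n_fits = len(candidates) * 5                             # cv=5
print(f"{len(candidates)} candidates, {n_fits} model fits")  # 30 candidates, 150 model fits
```

Each candidate is scored by 5-fold cross-validation, so the search above trains 150 models before the final refit on the full training set.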
XGBClassifier
xgb_clf = XGBClassifier(
n_estimators=300,
learning_rate=0.1,
max_depth=4,
subsample=0.8,
colsample_bytree=0.8,
eval_metric='logloss',
random_state=42,
n_jobs=-1
)
xgb_clf.fit(Xc_tr, yc_tr, eval_set=[(Xc_te, yc_te)], verbose=False)
y_prob_c = xgb_clf.predict_proba(Xc_te)[:, 1]
from sklearn.metrics import classification_report, confusion_matrix
print(f"Test Accuracy: {xgb_clf.score(Xc_te, yc_te):.4f}")
print(f"AUC-ROC: {roc_auc_score(yc_te, y_prob_c):.4f}")
print(confusion_matrix(yc_te, xgb_clf.predict(Xc_te)))
Test Accuracy: 0.9825
AUC-ROC: 0.9972
[[42 1]
[ 1 70]]
2 total errors (1 FP, 1 FN). AUC=0.9972 — probability rankings are near-perfect. Compare to AdaBoost: 4 total errors, AUC=0.9923.
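The accuracy figure can be recovered from the confusion matrix by hand. In sklearn's layout, rows are true classes and columns are predicted classes; in this dataset class 1 is benign, so "positive" below means benign.

```python
cm = [[42, 1],
      [1, 70]]  # rows: true class, columns: predicted class
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# accuracy=0.9825 precision=0.9859 recall=0.9859
```

The 112 correct predictions out of 114 test samples give the 0.9825 reported above.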
Feature Importance Types
XGBoost exposes several importance types (weight, gain, cover, total_gain, and total_cover). The three used most often are:
# Gain = average reduction in loss per split using this feature (most informative)
imp_gain = xgb_reg.get_booster().get_score(importance_type='gain')
# Weight = number of times a feature appears in any split
imp_weight = xgb_reg.get_booster().get_score(importance_type='weight')
# Cover = average number of samples affected per split
imp_cover = xgb_reg.get_booster().get_score(importance_type='cover')
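import pandas as pd
# The model was fit on NumPy arrays, so the booster reports f0, f1, ...; map these back to column names
name_map = {f"f{i}": name for i, name in enumerate(ch.feature_names)}
feat_imp = pd.DataFrame({
    'Feature': [name_map.get(k, k) for k in imp_gain],
    'Gain': list(imp_gain.values())
}).sort_values('Gain', ascending=False)
print(feat_imp.head(5))
Feature Gain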
0 MedInc 32451.23
3 Latitude 8923.11
4 Longitude 7812.45
7 AveOccup 5234.67
1 HouseAge 3891.23
Gain is the most informative: a feature used rarely in a critical split scores higher than one used often in trivial splits. Weight inflates features used in many small splits. Cover shows which features affect the most samples. Use Gain for feature selection; Weight for debugging redundant splits.
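A toy tally shows how the rankings can disagree. The split list below is invented for illustration: feature A appears in many low-gain splits, feature B in a few high-gain ones.

```python
from collections import defaultdict

# Hypothetical (feature, gain) pairs, one per split across all trees.
splits = [("A", 0.5)] * 200 + [("B", 120.0)] * 3

weight = defaultdict(int)        # 'weight': number of splits using the feature
total_gain = defaultdict(float)  # 'total_gain': summed gain over those splits
for feat, gain in splits:
    weight[feat] += 1
    total_gain[feat] += gain

avg_gain = {f: total_gain[f] / weight[f] for f in weight}  # 'gain': average per split

for f in sorted(weight):
    print(f"{f}: weight={weight[f]}, gain={avg_gain[f]}, total_gain={total_gain[f]}")
# A: weight=200, gain=0.5, total_gain=100.0
# B: weight=3, gain=120.0, total_gain=360.0
```

A ranks first by weight, B ranks first by both gain and total_gain, which is why weight alone can be misleading for feature selection.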
Final Comparison: All Ensemble Methods
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
methods = [
('Decision Tree', DecisionTreeRegressor(random_state=42)),  # unrestricted depth, so it memorizes the training set
('Random Forest', RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)),
('AdaBoost', AdaBoostRegressor(n_estimators=200, random_state=42)),
('Gradient Boosting', GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)),
('XGBoost', XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4,
random_state=42, n_jobs=-1, tree_method='hist')),
]
print(f"{'Method':25s} | {'Train RMSE':>10} | {'Test RMSE':>10} | {'R²':>8}")
print("-" * 65)
for name, model in methods:
    model.fit(Xr_tr, yr_tr)
    tr_rmse = np.sqrt(mean_squared_error(yr_tr, model.predict(Xr_tr)))
    te_rmse = np.sqrt(mean_squared_error(yr_te, model.predict(Xr_te)))
    r2 = r2_score(yr_te, model.predict(Xr_te))
    print(f"{name:25s} | {tr_rmse:>10.4f} | {te_rmse:>10.4f} | {r2:>8.4f}")
Method | Train RMSE | Test RMSE | R²
-----------------------------------------------------------------
Decision Tree | 0.0000 | 0.7312 | 0.5944
Random Forest | 0.1023 | 0.5031 | 0.7698
AdaBoost | 0.6234 | 0.6189 | 0.6723
Gradient Boosting | 0.2891 | 0.4512 | 0.8124
XGBoost | 0.1456 | 0.4134 | 0.8301
Figure: Train (blue) vs. test (orange) RMSE for each method (DTree, RF, AdaBoost, GB, XGBoost), plotted from the table above.
Decision Tree: maximal overfitting gap (Train=0, Test=0.731). AdaBoost underperforms GB and XGBoost: with its shallow base learners (AdaBoostRegressor defaults to depth-3 trees, not stumps) it is primarily a bias reducer, and California Housing's relationship is complex enough to benefit more from gradient boosting's shrinkage and regularization. XGBoost wins test RMSE (0.413, R²=0.830).
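The train/test gap is worth computing explicitly, since it tells a slightly different story from test RMSE alone. This snippet only re-uses the numbers reported in the table above.

```python
# (train RMSE, test RMSE) as reported in the comparison table.
results = {
    "Decision Tree": (0.0000, 0.7312),
    "Random Forest": (0.1023, 0.5031),
    "AdaBoost": (0.6234, 0.6189),
    "Gradient Boosting": (0.2891, 0.4512),
    "XGBoost": (0.1456, 0.4134),
}

gaps = {name: round(test - train, 4) for name, (train, test) in results.items()}
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} gap={gap:+.4f}")

gb_rmse, xgb_rmse = results["Gradient Boosting"][1], results["XGBoost"][1]
print(f"XGBoost vs GB: {100 * (gb_rmse - xgb_rmse) / gb_rmse:.1f}% lower test RMSE")
# XGBoost vs GB: 8.4% lower test RMSE
```

XGBoost has a larger gap than Gradient Boosting (0.2678 vs 0.1621) yet a lower test RMSE, so a small gap by itself is not the goal: AdaBoost has almost no gap and is still the second-worst model here.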
When to Choose Which Method
| Scenario | Recommended | Reason |
|---|---|---|
| Quick baseline, interpretable | Random Forest | No tuning needed; OOB score for free |
| Clean data, high bias problem | AdaBoost | Fast sequential correction of systematic errors |
| Medium dataset, best accuracy | Gradient Boosting | Well-regularized, well-studied behavior |
| Large dataset (>100k rows), best accuracy | XGBoost / LightGBM | Histogram speed + built-in L1/L2 regularization |
| Real-time inference needed | Random Forest | Smaller model, no sequential dependency |
| Noisy data / many outliers | Random Forest | Averaging dilutes outlier impact; XGBoost sensitive |
Test Your Understanding
- XGBoost RMSE=0.413, Gradient Boosting RMSE=0.451: an 8% improvement. Both use max_depth=3–4, learning_rate=0.05–0.1, n_estimators=200–500. What specifically causes the gap? List at least 2 algorithmic differences between XGBoost and sklearn's GradientBoostingRegressor that could explain the improvement.
- Early stopping found the best iteration at 452 with early_stopping_rounds=50. If you set early_stopping_rounds=20 instead of 50, the model might stop earlier, say at round 472. Would this give a better or worse final model? When does a smaller early_stopping_rounds help and when does it hurt?
- XGBClassifier achieved AUC=0.9972, while AdaBoost achieved AUC=0.9923 on Breast Cancer. Both are boosting methods. What difference in the two methods (beyond the obvious "XGBoost has more hyperparameters") most likely explains the AUC gap on this specific dataset?
- The feature importance comparison: Gain importance measures average gain per split; Weight counts total splits. Consider a feature used 200 times at depth 10 (small gain per split) versus a feature used 3 times at depth 1 (large gain per split). Which gets higher Weight importance? Which gets higher Gain importance? For feature selection, which should you use?
- The final table shows AdaBoost Train RMSE=0.623, higher than Random Forest (0.102). Both are ensemble methods. Why does AdaBoost have non-zero training error while Random Forest drives training error toward zero? What property of AdaBoost's shallow base learners prevents it from memorizing training data even with n_estimators=200?