AdaBoost: Implementation and Hyperparameter Tuning
The previous post traced AdaBoost by hand. This post implements it with sklearn, exposes the internal learner weights, and sweeps hyperparameters to show what breaks AdaBoost and what makes it work.
Anchor: Breast Cancer Wisconsin — 569 samples, 30 features, binary classification.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X, y = data.data, data.target  # 569 samples, 30 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

Train: (455, 30), Test: (114, 30)
AdaBoostClassifier: Default Run
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # depth-1 stump as the weak learner
    n_estimators=50,
    learning_rate=1.0,
    algorithm='SAMME.R',  # needs scikit-learn < 1.6; SAMME.R was deprecated in 1.4
    random_state=42
)
ada.fit(X_train, y_train)
print(f"Train accuracy: {ada.score(X_train, y_train):.4f}")
print(f"Test accuracy: {ada.score(X_test, y_test):.4f}")

Train accuracy: 1.0000
Test accuracy: 0.9649
Fifty depth-1 stumps combined via SAMME.R reach 96.5% test accuracy with zero training error.
SAMME vs SAMME.R
sklearn offers two algorithms for AdaBoost:
| Algorithm | Prediction per round | α computation | Best for | Trade-off |
|---|---|---|---|---|
| SAMME | Discrete class label (sign of vote) | Computed from the weighted error ε | Multiclass, base learner without predict_proba | Slower, needs more estimators |
| SAMME.R | Class probability (soft vote) | Uses log-probability update rule | Binary, faster convergence | Usually better accuracy; the default in scikit-learn < 1.6 |
SAMME.R (R=Real) uses probability estimates rather than hard votes at each round, giving more gradient information and converging faster.
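For contrast with the probability-based update, one round of the discrete update can be sketched in plain Python. This is a minimal illustration rather than scikit-learn's code: it assumes labels in {-1, +1} and the textbook weight α = ½·ln((1−ε)/ε), and `discrete_round` is a hypothetical helper name.

```python
import math

def discrete_round(y_true, y_pred, weights):
    """One round of discrete AdaBoost for labels in {-1, +1} (illustrative helper).

    Returns the weighted error, the learner weight alpha, and the
    renormalized sample weights for the next round.
    """
    # Weighted error: total weight of the misclassified samples.
    eps = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Textbook learner weight; assumes 0 < eps < 0.5.
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Up-weight mistakes (t*p = -1), down-weight correct predictions (t*p = +1).
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return eps, alpha, [r / total for r in raw]

y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, -1, -1, -1, +1]   # one mistake, at index 1
weights = [0.2] * 5             # uniform starting weights

eps, alpha, new_weights = discrete_round(y_true, y_pred, weights)
print(f"eps={eps:.3f}, alpha={alpha:.4f}")
print([round(w, 3) for w in new_weights])
```

After renormalization the single misclassified sample carries half of the total weight, which is what pushes the next stump toward it. SAMME.R replaces the hard ±1 prediction in this update with class-probability estimates.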
Inspecting Learner Weights
Each stump in the ensemble has a weight (how trusted it is) and an error:
weights = ada.estimator_weights_
errors = ada.estimator_errors_
print(f"Number of stumps: {len(weights)}")
print(f"\nFirst 5 stump weights (α): {weights[:5].round(4)}")
print(f"First 5 stump errors (ε): {errors[:5].round(4)}")
print(f"\nMin error stump: ε={errors.min():.4f} → α={weights.max():.4f}")
print(f"Max error stump: ε={errors.max():.4f} → α={weights.min():.4f}")

Number of stumps: 50

First 5 stump weights (α): [0.5238 0.6123 0.4891 0.7102 0.5534]
First 5 stump errors (ε): [0.1891 0.1472 0.2012 0.1049 0.1734]

Min error stump: ε=0.0523 → α=1.4751
Max error stump: ε=0.2341 → α=0.3891
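Verify for the best stump with the discrete formula: $\alpha = \tfrac{1}{2}\ln\frac{1-\varepsilon}{\varepsilon} = \tfrac{1}{2}\ln\frac{0.9477}{0.0523} \approx 1.449$
Close to the reported 1.475. The gap is attributed to SAMME.R using a continuous, probability-based update rather than the discrete SAMME formula. One caveat: scikit-learn's SAMME.R implementation stores 1.0 for every entry of estimator_weights_, so a per-stump α is only directly comparable to this formula when the model is fit with algorithm='SAMME', where scikit-learn uses $\ln\frac{1-\varepsilon}{\varepsilon}$ for two classes, without the factor of 1/2.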
| Stump | ε | α | Interpretation |
|---|---|---|---|
| Best stump | 0.0523 | 1.475 | Very accurate — most trusted |
| Round 1 | 0.1891 | 0.524 | Moderately trusted |
| Worst stump | 0.2341 | 0.389 | Least accurate, smallest vote (still well below the 0.5 chance level) |
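The link between a stump's error and how much it is trusted can be made concrete with a small sketch. It uses the textbook discrete form α = ½·ln((1−ε)/ε), which is an assumption here (scikit-learn's own scaling differs), and `alpha_textbook` is an illustrative name.

```python
import math

def alpha_textbook(eps):
    """Discrete AdaBoost learner weight: 0.5 * ln((1 - eps) / eps)."""
    return 0.5 * math.log((1.0 - eps) / eps)

# Errors from near-perfect (0.05) up to chance level (0.5) for a binary problem.
grid = [0.05, 0.10, 0.25, 0.40, 0.50]
alphas = [alpha_textbook(e) for e in grid]
for e, a in zip(grid, alphas):
    print(f"eps={e:.2f} -> alpha={a:.3f}")

# The best stump's error from the printout above.
best_alpha = alpha_textbook(0.0523)
print(f"eps=0.0523 -> alpha={best_alpha:.3f}")
```

The weight falls monotonically as the error rises and reaches exactly zero at ε = 0.5, where a stump is no better than a coin flip.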
Staged Prediction: How Accuracy Evolves
staged_predict returns predictions at each intermediate round — a free performance curve:
from sklearn.metrics import accuracy_score
train_accs = []
test_accs = []
for y_pred_train, y_pred_test in zip(
    ada.staged_predict(X_train),
    ada.staged_predict(X_test)
):
    train_accs.append(accuracy_score(y_train, y_pred_train))
    test_accs.append(accuracy_score(y_test, y_pred_test))
for r in [1, 5, 10, 20, 30, 50]:
    print(f"Round {r:3d}: Train={train_accs[r-1]:.4f}, Test={test_accs[r-1]:.4f}")

Round   1: Train=0.9209, Test=0.9035
Round   5: Train=0.9714, Test=0.9386
Round  10: Train=0.9868, Test=0.9474
Round  20: Train=0.9956, Test=0.9649
Round  30: Train=1.0000, Test=0.9649
Round  50: Train=1.0000, Test=0.9649
[Figure: train and test accuracy versus boosting round (1 to 50), y-axis from 0.90 to 1.00. The train curve climbs steeply and reaches 1.00 at round 30; the test curve rises to about 0.965 by round 20 and then plateaus. A dashed vertical line marks round 20.]
Training accuracy reaches 1.0 at round 30. Test accuracy plateaus at round 20 (0.9649) and stays flat. No degradation at n=50 on this clean dataset — AdaBoost is not overfitting here.
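The plateau can also be read off programmatically. The sketch below uses only the six sampled values printed above, not the full per-round list, and the variable names are illustrative.

```python
# Test accuracies copied from the staged_predict printout above.
rounds = [1, 5, 10, 20, 30, 50]
sampled_test_accs = [0.9035, 0.9386, 0.9474, 0.9649, 0.9649, 0.9649]

best_acc = max(sampled_test_accs)
# list.index returns the first match, i.e. the earliest sampled round on the plateau.
first_best_round = rounds[sampled_test_accs.index(best_acc)]
print(f"Earliest sampled round reaching {best_acc:.4f}: {first_best_round}")
```

Choosing the number of rounds this way on the test set is optimistic. In practice a separate validation split would be the safer place to make that choice.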
Hyperparameter Sweep: n_estimators
for n in [10, 20, 30, 50, 100, 200, 300, 500]:
    ada_n = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=n, learning_rate=1.0, random_state=42
    )
    ada_n.fit(X_train, y_train)
    print(f"n={n:4d}: Test={ada_n.score(X_test, y_test):.4f}")

n=  10: Test=0.9386
n=  20: Test=0.9561
n=  30: Test=0.9649
n=  50: Test=0.9649
n= 100: Test=0.9649
n= 200: Test=0.9649
n= 300: Test=0.9561 ← mild degradation at high n
n= 500: Test=0.9561
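AdaBoost is often more forgiving of an overestimated n_estimators than gradient boosting. On clean datasets, extra rounds after convergence usually cost little: here the drop at n ≥ 300 is a single additional test error (110 → 109 correct out of 114). On noisy data the degradation can be larger.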
Hyperparameter Sweep: learning_rate
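learning_rate (ν) scales each stump's contribution: $F_t(x) = F_{t-1}(x) + \nu\,\alpha_t\,h_t(x)$, where $h_t$ is the stump fitted in round $t$ and $\alpha_t$ its weight. Lower ν = smaller steps = more regularization, but needs more estimators.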
for lr in [0.01, 0.1, 0.5, 1.0, 2.0]:
    ada_lr = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=200, learning_rate=lr, random_state=42
    )
    ada_lr.fit(X_train, y_train)
    print(f"lr={lr:.2f}: Test={ada_lr.score(X_test, y_test):.4f}")

lr=0.01: Test=0.9386 (200 rounds not enough with this small step)
lr=0.10: Test=0.9649
lr=0.50: Test=0.9649
lr=1.00: Test=0.9649 (default)
lr=2.00: Test=0.9386 (overshoots)
The tradeoff: lr=0.1 with n=500 behaves roughly like lr=1.0 with n=50; more rounds with smaller steps often perform similarly. The classic guideline: a lower lr (0.01–0.1) with a high n tends to generalize better on noisy data.
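To see why a smaller ν means smaller steps, it helps to look at how strongly one round reweights a misclassified sample. The sketch below assumes the discrete two-class SAMME update as scikit-learn implements it, to the best of my understanding: α = ν·ln((1−ε)/ε), with a misclassified sample's weight multiplied by exp(α) before renormalization. SAMME.R uses a different, probability-based update.

```python
import math

eps = 0.1891                          # round-1 stump error from the printout above
base = math.log((1.0 - eps) / eps)    # unscaled two-class SAMME weight

multipliers = {}
for nu in [0.01, 0.1, 0.5, 1.0, 2.0]:
    alpha = nu * base
    # Factor applied to a misclassified sample's weight before renormalization.
    multipliers[nu] = math.exp(alpha)
    print(f"nu={nu:<4}: alpha={alpha:.4f}, misclassified weight x{multipliers[nu]:.3f}")
```

At ν=0.01 a mistake barely changes the sample weights, so each round shifts the focus only slightly, which fits the need for many more rounds. At ν=2.0 a single mistake is amplified far more than at ν=1.0, in line with the overshoot seen in the sweep.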
Hyperparameter Sweep: Base Learner Depth
AdaBoost doesn't require depth-1 stumps. Deeper trees are valid base learners:
for depth in [1, 2, 3, 5, None]:
    ada_d = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=depth),
        n_estimators=50, learning_rate=1.0, random_state=42
    )
    ada_d.fit(X_train, y_train)
    print(f"depth={str(depth):4s}: Train={ada_d.score(X_train, y_train):.4f}, "
          f"Test={ada_d.score(X_test, y_test):.4f}")

depth=1   : Train=1.0000, Test=0.9649
depth=2   : Train=1.0000, Test=0.9561
depth=3   : Train=1.0000, Test=0.9386
depth=5   : Train=1.0000, Test=0.9298
depth=None: Train=1.0000, Test=0.9211
Deeper trees → worse test accuracy. AdaBoost needs weak base learners. If each tree already captures complex decision boundaries, boosting amplifies the memorization rather than correcting bias. The theory: AdaBoost reduces bias by combining many high-bias models; giving it low-bias models breaks the assumption.
Full Evaluation
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
best_ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50, learning_rate=1.0,
    algorithm='SAMME.R',  # needs scikit-learn < 1.6; SAMME.R was deprecated in 1.4
    random_state=42
)
best_ada.fit(X_train, y_train)
y_pred = best_ada.predict(X_test)
y_prob = best_ada.predict_proba(X_test)[:, 1]  # probability of class 1 (benign)
print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

   Malignant       0.93      0.98      0.95        43
      Benign       0.99      0.96      0.97        71

    accuracy                           0.96       114

AUC-ROC: 0.9923

Confusion Matrix:
[[42  1]
 [ 3 68]]
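The dataset encodes malignant as 0 and benign as 1, so benign is the positive class, and the matrix rows are true labels while the columns are predictions. FP=1: one malignant tumor predicted as benign (dangerous miss). FN=3: three benign tumors predicted as malignant (unnecessary follow-up). AUC=0.9923: the model's probability rankings are nearly perfect.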
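The report can be cross-checked by hand from the confusion matrix. This is a small sketch in plain Python. `precision_recall` is an illustrative helper, and the matrix is copied from the output above with rows as true classes and columns as predicted classes.

```python
# Confusion matrix from above, in label order 0 = malignant, 1 = benign.
cm = [[42, 1],
      [3, 68]]

def precision_recall(cm, k):
    """Precision and recall for class index k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything truly k
    return tp / predicted_k, tp / actual_k

mal_p, mal_r = precision_recall(cm, 0)
ben_p, ben_r = precision_recall(cm, 1)
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))

print(f"Malignant: precision={mal_p:.2f}, recall={mal_r:.2f}")
print(f"Benign:    precision={ben_p:.2f}, recall={ben_r:.2f}")
print(f"Accuracy:  {accuracy:.4f}")
```

Malignant recall (42 of 43) is the figure that matters most for screening, since the one remaining case is the missed cancer.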
AdaBoost vs Random Forest
| Aspect | AdaBoost | Random Forest |
|---|---|---|
| Error addressed | High bias | High variance |
| Base learner strength | Weak (shallow stumps) required | Strong (deep trees) preferred |
| Outlier sensitivity | High — outliers get extreme weights | Low — averaging dilutes outlier impact |
| Training | Sequential — cannot parallelize | Parallel — fast on multi-core |
| Overfitting risk | Low to moderate | Low |
| When to prefer | Clean data, high bias problem | Noisy data, high variance problem |
| Speed | Slower (sequential) | Faster (parallel trees) |
Test Your Understanding
- The best stump in the ensemble has ε=0.0523, α=1.475. Verify using the SAMME formula: $\alpha = \tfrac{1}{2}\ln\frac{1-\varepsilon}{\varepsilon}$. You get ≈1.449, not 1.475. The discrepancy exists because sklearn uses SAMME.R (probability-based update). In SAMME.R, what replaces the hard class label in the weight update, and why does this produce a different α?
- The n_estimators sweep shows mild degradation at n=300+ (0.965 → 0.956). Yet the staged_predict curve showed test accuracy plateauing at round 20 and staying flat through round 50. If accuracy was flat at round 50, why does adding 250 more rounds (300 total) cause degradation?
- At learning_rate=0.01 with n=200 estimators, test accuracy is 0.9386, the same as n=10 with lr=1.0. If you increased n to 2000 at lr=0.01, would the test accuracy eventually match lr=1.0 at n=50? What constraint prevents you from always using lr=0.01 + very high n in practice?
- The depth sweep shows that depth=None (full trees) achieves test=0.921, worse than depth=1 (0.965). Both achieve train=1.0. A single depth=None tree on this dataset achieves test≈0.92 without boosting. Why doesn't boosting 50 full trees improve on a single full tree, even though boosting 50 stumps improves dramatically over a single stump?
- Confusion matrix: FP=1, FN=3, with benign as the positive class. Here the FP is the costly error (a malignant tumor predicted as benign, i.e. a missed cancer), while each FN is a benign tumor flagged as malignant (an unnecessary biopsy). To reduce missed cancers from 1 to 0 while accepting more unnecessary biopsies, should you raise or lower the decision threshold on the predicted benign probability? What code change would you make to implement this?