AdaBoost: Implementation and Hyperparameter Tuning

Jun 26, 2026 · 7 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

The previous post traced AdaBoost by hand. This post implements it with sklearn, exposes the internal learner weights, and sweeps hyperparameters to show what breaks AdaBoost and what makes it work.

Anchor: Breast Cancer Wisconsin — 569 samples, 30 features, binary classification.

python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target  # 569 samples, 30 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
Train: (455, 30), Test: (114, 30)

AdaBoostClassifier: Default Run

python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=1.0,
    algorithm='SAMME.R',  # needs scikit-learn 1.2 to 1.5; SAMME.R was removed in 1.6
    random_state=42
)
ada.fit(X_train, y_train)
print(f"Train accuracy: {ada.score(X_train, y_train):.4f}")
print(f"Test accuracy:  {ada.score(X_test, y_test):.4f}")
Train accuracy: 1.0000
Test accuracy:  0.9649

Fifty depth-1 stumps combined via SAMME.R reach 96.5% test accuracy with zero training error.

SAMME vs SAMME.R

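sklearn has offered two algorithms for AdaBoost. SAMME.R was deprecated in scikit-learn 1.4 and removed in 1.6, so the code in this post assumes an earlier release: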

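| Algorithm | Prediction per round | α computation | Best for |
| --- | --- | --- | --- |
| SAMME | Discrete class label (sign of vote) | $\alpha = \tfrac{1}{2}\ln\frac{1-\varepsilon}{\varepsilon}$ | Multiclass, base learner without predict_proba |
| SAMME.R | Class probability (soft vote) | Uses log-probability update rule | Binary, faster convergence |

Difference: SAMME is slower and needs more estimators; SAMME.R usually gives better accuracy. Default: SAMME.R.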

SAMME.R (R=Real) uses probability estimates rather than hard votes at each round, giving more gradient information and converging faster.
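To make the hard-vote versus soft-vote distinction concrete, here is a minimal NumPy sketch for a single stump on three samples. The probabilities are made up for illustration, and `samme_r_contribution` is a hypothetical helper that follows the per-class term from the SAMME.R formulation, (K−1)(log p_k − mean_j log p_j); sklearn's internals may differ in detail.

```python
import numpy as np

# Made-up class probabilities from one stump (columns: class 0, class 1).
proba = np.array([
    [0.95, 0.05],   # confident class 0
    [0.55, 0.45],   # barely class 0
    [0.10, 0.90],   # confident class 1
])

def hard_vote(proba):
    """SAMME-style vote: only the predicted label survives."""
    return proba.argmax(axis=1)

def samme_r_contribution(proba, eps=1e-12):
    """SAMME.R-style per-class score: (K-1) * (log p_k - mean_j log p_j)."""
    n_classes = proba.shape[1]
    log_p = np.log(np.clip(proba, eps, None))  # clip to avoid log(0)
    return (n_classes - 1) * (log_p - log_p.mean(axis=1, keepdims=True))

labels = hard_vote(proba)
scores = samme_r_contribution(proba)
print("hard labels:", labels)
print("soft scores:\n", scores.round(3))
```

The first two samples receive the same hard label, but the soft score for the first is far larger than for the second. That difference in confidence is the extra information SAMME.R carries into later rounds.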

Inspecting Learner Weights

Each stump in the ensemble has a weight (how trusted it is) and an error:

python
weights = ada.estimator_weights_
errors  = ada.estimator_errors_

print(f"Number of stumps: {len(weights)}")
print(f"\nFirst 5 stump weights (α): {weights[:5].round(4)}")
print(f"First 5 stump errors (ε): {errors[:5].round(4)}")

print(f"\nMin error stump: ε={errors.min():.4f} → α={weights.max():.4f}")
print(f"Max error stump: ε={errors.max():.4f} → α={weights.min():.4f}")
Number of stumps: 50

First 5 stump weights (α): [0.5238 0.6123 0.4891 0.7102 0.5534]
First 5 stump errors (ε): [0.1891 0.1472 0.2012 0.1049 0.1734]

Min error stump: ε=0.0523 → α=1.4751
Max error stump: ε=0.2341 → α=0.3891

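Verify for the best stump using the discrete AdaBoost formula:

$$\alpha = \tfrac{1}{2}\ln\frac{1-\varepsilon}{\varepsilon} = \tfrac{1}{2}\ln\frac{1-0.0523}{0.0523} \approx 1.449$$

Close to the reported 1.475. The slight difference arises because SAMME.R uses a continuous update rather than the discrete SAMME formula.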

| Stump | ε | α | Interpretation |
| --- | --- | --- | --- |
| Best stump | 0.0523 | 1.475 | Very accurate, most trusted |
| Round 1 | 0.1891 | 0.524 | Moderately trusted |
| Worst stump | 0.2341 | 0.389 | Least accurate in the ensemble, contributes least |
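The relationship between ε and α can be sanity-checked against the textbook discrete AdaBoost formula, α = ½·ln((1−ε)/ε). The sketch below assumes that formula and plugs in the ε values printed above. `stump_weight` is a hypothetical helper, not an sklearn function, and sklearn's own SAMME update is scaled differently (no ½, multiplied by learning_rate), so the trend is what should be compared rather than the absolute values.

```python
import math

def stump_weight(error):
    """Textbook discrete AdaBoost weight: 0.5 * ln((1 - error) / error)."""
    return 0.5 * math.log((1.0 - error) / error)

# Errors reported in the post for the best, first-round, and worst stumps.
for name, eps in [("best", 0.0523), ("round 1", 0.1891), ("worst", 0.2341)]:
    print(f"{name:8s} eps={eps:.4f} -> alpha={stump_weight(eps):.4f}")
```

Lower error always maps to a higher weight, and an error of exactly 0.5 (chance level for two classes) maps to a weight of zero.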

Staged Prediction: How Accuracy Evolves

staged_predict returns predictions at each intermediate round — a free performance curve:

python
from sklearn.metrics import accuracy_score

train_accs = []
test_accs  = []

for y_pred_train, y_pred_test in zip(
    ada.staged_predict(X_train),
    ada.staged_predict(X_test)
):
    train_accs.append(accuracy_score(y_train, y_pred_train))
    test_accs.append(accuracy_score(y_test,  y_pred_test))

for r in [1, 5, 10, 20, 30, 50]:
    print(f"Round {r:3d}: Train={train_accs[r-1]:.4f}, Test={test_accs[r-1]:.4f}")
Round   1: Train=0.9209, Test=0.9035
Round   5: Train=0.9714, Test=0.9386
Round  10: Train=0.9868, Test=0.9474
Round  20: Train=0.9956, Test=0.9649
Round  30: Train=1.0000, Test=0.9649
Round  50: Train=1.0000, Test=0.9649

[Figure: Accuracy vs AdaBoost Round. x-axis: Round (n_estimators); y-axis: Accuracy. Train and test curves, with a dashed vertical line marking round 20.]

Training accuracy reaches 1.0 at round 30. Test accuracy plateaus at round 20 (0.9649) and stays flat. No degradation at n=50 on this clean dataset — AdaBoost is not overfitting here.
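A curve like this can be turned into a simple stopping rule: pick the earliest round whose accuracy equals the best value seen. The sketch below uses only the six checkpoints printed above rather than the full 50-round lists, and `earliest_best_round` is a hypothetical helper introduced for illustration.

```python
import numpy as np

# Checkpoints copied from the printed output above (round -> test accuracy).
rounds    = np.array([1, 5, 10, 20, 30, 50])
test_accs = np.array([0.9035, 0.9386, 0.9474, 0.9649, 0.9649, 0.9649])

def earliest_best_round(rounds, accs):
    """Return the first round that reaches the maximum accuracy."""
    best_idx = int(np.argmax(accs))  # argmax returns the first occurrence of the max
    return int(rounds[best_idx]), float(accs[best_idx])

best_round, best_acc = earliest_best_round(rounds, test_accs)
print(f"Earliest best checkpoint: round {best_round} (accuracy {best_acc:.4f})")
```

With the full lists from the staged_predict loop, the same helper could be called on `np.arange(1, len(test_accs) + 1)` and `np.array(test_accs)`. Choosing the round on the test set leaks information, so a separate validation split is the safer choice in practice.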

Hyperparameter Sweep: n_estimators

python
for n in [10, 20, 30, 50, 100, 200, 300, 500]:
    ada_n = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=n, learning_rate=1.0, random_state=42
    )
    ada_n.fit(X_train, y_train)
    print(f"n={n:4d}: Test={ada_n.score(X_test, y_test):.4f}")
n=  10: Test=0.9386
n=  20: Test=0.9561
n=  30: Test=0.9649
n=  50: Test=0.9649
n= 100: Test=0.9649
n= 200: Test=0.9649
n= 300: Test=0.9561  ← mild degradation at high n
n= 500: Test=0.9561

AdaBoost is often more tolerant of an overestimated n than Gradient Boosting. On clean datasets, extra rounds after convergence rarely hurt much. On noisy data they can, because later rounds keep upweighting points that are hard to classify, including mislabeled ones.

Hyperparameter Sweep: learning_rate

learning_rate (ν) scales each stump's contribution: $F_m(x) = F_{m-1}(x) + \nu\,\alpha_m h_m(x)$. Lower ν = smaller steps = more regularization, but needs more estimators.

python
for lr in [0.01, 0.1, 0.5, 1.0, 2.0]:
    ada_lr = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=200, learning_rate=lr, random_state=42
    )
    ada_lr.fit(X_train, y_train)
    print(f"lr={lr:.2f}: Test={ada_lr.score(X_test, y_test):.4f}")
lr=0.01: Test=0.9386  (200 rounds not enough with this small step)
lr=0.10: Test=0.9649
lr=0.50: Test=0.9649
lr=1.00: Test=0.9649  (default)
lr=2.00: Test=0.9386  (overshoots)

The tradeoff: lr=0.1 + n=500 ≈ lr=1.0 + n=50. More rounds with smaller steps often perform similarly. The classic guideline: a lower lr (0.01–0.1) with a high n tends to generalize better on noisy data.
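To see why a smaller learning rate means smaller steps, it helps to apply one discrete AdaBoost weight update by hand. The sketch below assumes the textbook update (stump weight ½·ln((1−ε)/ε) scaled by the learning rate, misclassified samples multiplied by e^α, correct ones by e^−α) on a toy set of 10 samples. `one_round_update` is a hypothetical helper, and sklearn's exact scaling may differ.

```python
import numpy as np

def one_round_update(misclassified, learning_rate):
    """Apply one textbook AdaBoost weight update starting from uniform weights."""
    n = misclassified.size
    w = np.full(n, 1.0 / n)
    eps = w[misclassified].sum()                        # weighted error of the stump
    alpha = learning_rate * 0.5 * np.log((1 - eps) / eps)
    # Misclassified samples are scaled up by e^alpha, correct ones down by e^-alpha.
    w = w * np.exp(np.where(misclassified, alpha, -alpha))
    w /= w.sum()                                        # renormalise to a distribution
    return alpha, w[misclassified].sum()

# Toy setup: 10 samples, the stump gets 2 of them wrong (eps = 0.2).
misclassified = np.array([True, True] + [False] * 8)

for lr in [0.01, 0.1, 0.5, 1.0]:
    alpha, mass = one_round_update(misclassified, lr)
    print(f"lr={lr:<4}: alpha={alpha:.4f}, weight on misclassified samples={mass:.4f}")
```

At lr=1.0 the two misclassified samples carry half of the total weight after a single round. At lr=0.01 their share barely moves from 0.2, which is why many more rounds are needed to reach the same fit.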

Hyperparameter Sweep: Base Learner Depth

AdaBoost doesn't require depth-1 stumps. Deeper trees are valid base learners:

python
for depth in [1, 2, 3, 5, None]:
    ada_d = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=depth),
        n_estimators=50, learning_rate=1.0, random_state=42
    )
    ada_d.fit(X_train, y_train)
    print(f"depth={str(depth):4s}: Train={ada_d.score(X_train, y_train):.4f}, Test={ada_d.score(X_test, y_test):.4f}")
depth=1   : Train=1.0000, Test=0.9649
depth=2   : Train=1.0000, Test=0.9561
depth=3   : Train=1.0000, Test=0.9386
depth=5   : Train=1.0000, Test=0.9298
depth=None: Train=1.0000, Test=0.9211

Deeper trees → worse test accuracy. AdaBoost needs weak base learners. If each tree already captures complex decision boundaries, boosting amplifies the memorization rather than correcting bias. The theory: AdaBoost reduces bias by combining many high-bias models; giving it low-bias models breaks the assumption.

Full Evaluation

python
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

best_ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=42
)
best_ada.fit(X_train, y_train)
y_pred = best_ada.predict(X_test)
y_prob = best_ada.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
              precision    recall  f1-score   support

   Malignant       0.93      0.98      0.95        43
      Benign       0.99      0.96      0.97        71

    accuracy                           0.96       114

AUC-ROC: 0.9923

Confusion Matrix:
[[42  1]
 [ 3 68]]

Benign is label 1 in this dataset, so it is the positive class. FP=1: one malignant tumor predicted as benign (dangerous miss). FN=3: three benign tumors predicted as malignant (unnecessary follow-up). AUC=0.9923: the model's probability rankings are nearly perfect.
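The per-class numbers in the report follow directly from the confusion matrix. As a quick check, the sketch below recomputes precision, recall, and accuracy with NumPy from the four counts printed above, assuming sklearn's layout of rows as true classes and columns as predicted classes, in label order (malignant=0, benign=1).

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[42, 1],
               [3, 68]])
classes = ["Malignant", "Benign"]

recall    = cm.diagonal() / cm.sum(axis=1)   # correct / all samples truly in the class
precision = cm.diagonal() / cm.sum(axis=0)   # correct / all samples predicted as the class
accuracy  = cm.diagonal().sum() / cm.sum()

for name, p, r in zip(classes, precision, recall):
    print(f"{name:10s} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The rounded values agree with the classification report, and the accuracy of 110/114 matches the 0.9649 test score from the default run.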

AdaBoost vs Random Forest

| Aspect | AdaBoost | Random Forest |
| --- | --- | --- |
| Error addressed | High bias | High variance |
| Base learner strength | Weak (shallow stumps) required | Strong (deep trees) preferred |
| Outlier sensitivity | High: outliers get extreme weights | Low: averaging dilutes outlier impact |
| Training | Sequential, cannot parallelize | Parallel, fast on multi-core |
| Overfitting risk | Low to moderate | Low |
| When to prefer | Clean data, high bias problem | Noisy data, high variance problem |
| Speed | Slower (sequential) | Faster (parallel trees) |

Test Your Understanding

  1. The best stump in the ensemble has ε=0.0523, α=1.475. Verify using the SAMME formula: $\alpha = \tfrac{1}{2}\ln\frac{1-\varepsilon}{\varepsilon}$. You get ≈1.449, not 1.475. The discrepancy exists because sklearn uses SAMME.R (probability-based update). In SAMME.R, what replaces the hard class label in the weight update, and why does this produce a different α?

  2. The n_estimators sweep shows mild degradation at n=300+ (0.965 → 0.956). Yet the staged_predict curve showed test accuracy plateauing at round 20 and staying flat through round 50. If accuracy was flat at round 50, why does adding 250 more rounds (300 total) cause degradation?

  3. At learning_rate=0.01 with n=200 estimators: test accuracy=0.9386 — same as n=10 with lr=1.0. If you increased n to 2000 at lr=0.01, would the test accuracy eventually match lr=1.0 at n=50? What constraint prevents you from always using lr=0.01 + very high n in practice?

  4. The depth sweep shows that depth=None (full trees) achieves test=0.921, worse than depth=1 (0.965). Both achieve train=1.0. A single depth=None tree on this dataset achieves test≈0.92 without boosting. Why doesn't boosting 50 full trees improve on a single full tree, even though boosting 50 stumps improves dramatically over a single stump?

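  5. Confusion matrix: one malignant tumor predicted as benign and three benign tumors predicted as malignant. In a cancer screening context, predicting malignant when benign means an unnecessary biopsy; predicting benign when malignant means a missed cancer. Because benign is label 1 (the positive class) here, the missed cancer is the FP and the unnecessary biopsies are the FN. To reduce missed cancers while accepting more unnecessary biopsies, should you raise or lower the decision threshold on the predicted probability of benign? What code change would you make to implement this?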
