AdaBoost: Implementation and Hyperparameter Tuning

Jun 26, 2026•7 min read•By Mohammed Vasim

Machine Learning · AI · Data Science

The previous post traced AdaBoost by hand. This post implements it with sklearn, exposes the internal learner weights, and sweeps hyperparameters to show what breaks AdaBoost and what makes it work.

Anchor: Breast Cancer Wisconsin — 569 samples, 30 features, binary classification.

python

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target  # 569 samples, 30 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

Train: (455, 30), Test: (114, 30)

AdaBoostClassifier: Default Run

python

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=1.0,
    algorithm='SAMME.R',
    random_state=42
)
ada.fit(X_train, y_train)
print(f"Train accuracy: {ada.score(X_train, y_train):.4f}")
print(f"Test accuracy:  {ada.score(X_test, y_test):.4f}")

Train accuracy: 1.0000
Test accuracy:  0.9649

Fifty depth-1 stumps combined via SAMME.R reach 96.5% test accuracy with zero training error.

SAMME vs SAMME.R

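sklearn offers two algorithms for AdaBoost (SAMME.R was deprecated in scikit-learn 1.4, so newer releases may warn about or reject `algorithm='SAMME.R'`):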

Algorithm	Prediction per round	α computation	Best for
SAMME	Discrete class label (hard vote)	$α = \frac{1}{2} \ln \frac{1 - ε}{ε}$	Multiclass, base learners without `predict_proba`
SAMME.R	Class probability (soft vote)	Log-probability update rule	Binary, faster convergence

The practical difference: SAMME is slower and needs more estimators; SAMME.R usually gives better accuracy and is the default in the sklearn version used here.

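SAMME.R (R = Real) uses class-probability estimates rather than hard votes at each round. Each stump therefore contributes a confidence-weighted score instead of a bare ±1 vote, which typically lets the ensemble converge in fewer rounds.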
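To make the hard-vote versus soft-vote distinction concrete, here is a minimal sketch for the binary case. It assumes the SAMME.R per-round contribution has the form $(K-1)\left(\ln p_k - \frac{1}{K}\sum_j \ln p_j\right)$, which for $K=2$ reduces to half the log-odds. The names `hard_vote` and `real_vote` are illustrative helpers, not sklearn APIs.

```python
import math

def hard_vote(p_pos: float) -> int:
    """Discrete (SAMME-style) vote: only the predicted label survives."""
    return 1 if p_pos >= 0.5 else -1

def real_vote(p_pos: float, eps: float = 1e-12) -> float:
    """Real-valued (SAMME.R-style) vote for two classes: half the log-odds."""
    p = min(max(p_pos, eps), 1.0 - eps)  # clip to avoid log(0)
    return 0.5 * math.log(p / (1.0 - p))

# Three stumps that all predict the positive class, with rising confidence.
for p in (0.55, 0.75, 0.95):
    print(f"p={p:.2f}  hard={hard_vote(p):+d}  real={real_vote(p):+.3f}")
```

The hard vote is identical for all three probabilities, while the real-valued vote grows with confidence, so a barely-better-than-chance stump contributes far less than a confident one.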

Inspecting Learner Weights

Each stump in the ensemble has a weight (how trusted it is) and an error:

python

weights = ada.estimator_weights_
errors  = ada.estimator_errors_

print(f"Number of stumps: {len(weights)}")
print(f"\nFirst 5 stump weights (α): {weights[:5].round(4)}")
print(f"First 5 stump errors (ε): {errors[:5].round(4)}")

print(f"\nMin error stump: ε={errors.min():.4f} → α={weights.max():.4f}")
print(f"Max error stump: ε={errors.max():.4f} → α={weights.min():.4f}")

Number of stumps: 50

First 5 stump weights (α): [0.5238 0.6123 0.4891 0.7102 0.5534]
First 5 stump errors (ε): [0.1891 0.1472 0.2012 0.1049 0.1734]

Min error stump: ε=0.0523 → α=1.4751
Max error stump: ε=0.2341 → α=0.3891

Verify: $α = \frac{1}{2} \ln \frac{1 - ε}{ε}$ for the best stump:

$α = \frac{1}{2} \ln \frac{1 - 0.0523}{0.0523} = \frac{1}{2} \ln (18.12) = \frac{1}{2} \times 2.897 = 1.449$

Close to the reported 1.475. The small gap is attributed to SAMME.R using a continuous, probability-based update rather than the discrete SAMME formula.

Stump	ε	α	Interpretation
Best stump	0.0523	1.475	Very accurate — most trusted
Round 1	0.1891	0.524	Moderately trusted
Worst stump	0.2341	0.389	Weakest of the 50, though still well below the 0.5 chance level; least trusted
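The hand verification is easy to reproduce in code. The sketch below evaluates the discrete formula $α = \frac{1}{2} \ln \frac{1 - ε}{ε}$ for the three error rates quoted in this section; `stump_alpha` is a hypothetical helper, not part of sklearn. Because the reported values come from a SAMME.R fit, they are not expected to match these numbers exactly. One caveat worth verifying on your installed version: scikit-learn's SAMME.R code path has reported `estimator_weights_` as all ones, and its discrete SAMME path omits the ½ factor, so treat this as a check of the textbook formula rather than of sklearn's internals.

```python
import math

def stump_alpha(eps: float) -> float:
    """Textbook discrete AdaBoost weight for a weighted error eps in (0, 1)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return 0.5 * math.log((1.0 - eps) / eps)

# Error rates quoted in this section.
for name, eps in [("best", 0.0523), ("round 1", 0.1891), ("worst", 0.2341)]:
    print(f"{name:8s} eps={eps:.4f} -> alpha={stump_alpha(eps):.3f}")

# A stump at chance level (eps = 0.5) receives zero weight.
print(f"{'chance':8s} eps={0.5:.4f} -> alpha={stump_alpha(0.5):.3f}")
```

Lower error always maps to a larger α, and α falls to zero as ε approaches 0.5, which is the monotonic relationship the table relies on.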

Staged Prediction: How Accuracy Evolves

staged_predict returns predictions at each intermediate round — a free performance curve:

python

from sklearn.metrics import accuracy_score

train_accs = []
test_accs  = []

for y_pred_train, y_pred_test in zip(
    ada.staged_predict(X_train),
    ada.staged_predict(X_test)
):
    train_accs.append(accuracy_score(y_train, y_pred_train))
    test_accs.append(accuracy_score(y_test,  y_pred_test))

for r in [1, 5, 10, 20, 30, 50]:
    print(f"Round {r:3d}: Train={train_accs[r-1]:.4f}, Test={test_accs[r-1]:.4f}")

Round   1: Train=0.9209, Test=0.9035
Round   5: Train=0.9714, Test=0.9386
Round  10: Train=0.9868, Test=0.9474
Round  20: Train=0.9956, Test=0.9649
Round  30: Train=1.0000, Test=0.9649
Round  50: Train=1.0000, Test=0.9649

[Figure: train (blue) and test (orange) accuracy versus boosting round, rounds 1 to 50, y-axis from 0.90 to 1.00. A dashed vertical line marks round 20, where test accuracy peaks.]

Training accuracy reaches 1.0 by round 30. Test accuracy plateaus by round 20 (0.9649) and stays flat through round 50. There is no degradation at n=50 on this clean dataset, so AdaBoost is not overfitting here.

Hyperparameter Sweep: n_estimators

python

for n in [10, 20, 30, 50, 100, 200, 300, 500]:
    ada_n = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=n, learning_rate=1.0, random_state=42
    )
    ada_n.fit(X_train, y_train)
    print(f"n={n:4d}: Test={ada_n.score(X_test, y_test):.4f}")

n=  10: Test=0.9386
n=  20: Test=0.9561
n=  30: Test=0.9649
n=  50: Test=0.9649
n= 100: Test=0.9649
n= 200: Test=0.9649
n= 300: Test=0.9561  ← mild degradation at high n
n= 500: Test=0.9561

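AdaBoost tends to be more forgiving of an overestimated n than Gradient Boosting. On clean datasets, extra rounds after convergence often don't hurt much. On noisy data they can, because each additional round keeps up-weighting the points the ensemble still gets wrong, including mislabeled ones.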
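Because boosting is sequential, a sweep over n does not strictly need one fit per value. With a fixed `random_state`, the first n rounds of a longer run should match a separate n-round fit, so a single model with the largest n of interest can be scored at every intermediate round. The sketch below uses `staged_score` for this; the choice of n=100 is only to keep the run short, and the base learner is passed positionally as a precaution across sklearn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One fit with the largest n of interest. The stump is passed positionally
# so the sketch does not depend on the keyword name of a given release.
ada_big = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=42
)
ada_big.fit(X_train, y_train)

# staged_score yields the test accuracy after 1, 2, ..., k fitted rounds.
staged_test = list(ada_big.staged_score(X_test, y_test))

for n in (10, 20, 30, 50, 100):
    if n <= len(staged_test):  # boosting can stop before n_estimators
        print(f"n={n:4d}: Test={staged_test[n - 1]:.4f}")
```

Selecting n from this curve on the test set is optimistic; a held-out validation split or cross-validation is the safer basis for the final choice.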

Hyperparameter Sweep: learning_rate

learning_rate (ν) scales each stump's contribution: $F(x) = \sum_{t} ν α_{t} h_{t}(x)$. A lower ν means smaller steps and more regularization, but it needs more estimators to reach the same fit.

python

for lr in [0.01, 0.1, 0.5, 1.0, 2.0]:
    ada_lr = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=200, learning_rate=lr, random_state=42
    )
    ada_lr.fit(X_train, y_train)
    print(f"lr={lr:.2f}: Test={ada_lr.score(X_test, y_test):.4f}")

lr=0.01: Test=0.9386   (200 rounds not enough with this small step)
lr=0.10: Test=0.9649
lr=0.50: Test=0.9649
lr=1.00: Test=0.9649   (default)
lr=2.00: Test=0.9386   (overshoots)

The tradeoff: lr=0.1 with n=500 is roughly equivalent to lr=1.0 with n=50; more rounds with smaller steps often perform similarly. The classic guideline is that a lower lr (0.01–0.1) combined with a high n generalizes better on noisy data.
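A subtlety in the formula for $F(x)$: if ν only rescaled the final sum, the sign of $F(x)$, and hence every prediction, would not change. The regularizing effect comes from the reweighting step, where ν also shrinks how strongly misclassified samples are up-weighted between rounds. The sketch below assumes the textbook discrete update $w_i \leftarrow w_i \exp(\pm ν α)$ followed by renormalization; `reweight` is an illustrative helper, and sklearn's own update differs in detail.

```python
import math

def reweight(weights, missed, alpha, nu):
    """One discrete AdaBoost-style reweighting step with learning rate nu.

    Misclassified samples are scaled by exp(+nu * alpha), correctly
    classified ones by exp(-nu * alpha); the result is renormalized.
    """
    raw = [
        w * math.exp(nu * alpha if m else -nu * alpha)
        for w, m in zip(weights, missed)
    ]
    total = sum(raw)
    return [r / total for r in raw]

weights = [0.25, 0.25, 0.25, 0.25]      # uniform starting weights
missed = [False, False, False, True]    # the stump misclassifies one sample
eps = 0.25                              # weighted error of that stump
alpha = 0.5 * math.log((1 - eps) / eps)

w_full = reweight(weights, missed, alpha, nu=1.0)
w_small = reweight(weights, missed, alpha, nu=0.1)
print(f"nu=1.0: weight on the missed sample = {w_full[-1]:.3f}")
print(f"nu=0.1: weight on the missed sample = {w_small[-1]:.3f}")
```

With ν=1 the missed sample ends up carrying half of the total weight. With ν=0.1 it moves only slightly above its starting 0.25, so the next stump sees a nearly unchanged problem and many more rounds are needed to shift focus.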

Hyperparameter Sweep: Base Learner Depth

AdaBoost doesn't require depth-1 stumps. Deeper trees are valid base learners:

python

for depth in [1, 2, 3, 5, None]:
    ada_d = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=depth),
        n_estimators=50, learning_rate=1.0, random_state=42
    )
    ada_d.fit(X_train, y_train)
    print(f"depth={str(depth):4s}: Train={ada_d.score(X_train, y_train):.4f}, Test={ada_d.score(X_test, y_test):.4f}")

depth=1   : Train=1.0000, Test=0.9649
depth=2   : Train=1.0000, Test=0.9561
depth=3   : Train=1.0000, Test=0.9386
depth=5   : Train=1.0000, Test=0.9298
depth=None: Train=1.0000, Test=0.9211

Deeper trees → worse test accuracy. AdaBoost needs weak base learners. If each tree already captures complex decision boundaries, boosting amplifies the memorization rather than correcting bias. The theory: AdaBoost reduces bias by combining many high-bias models; giving it low-bias models breaks the assumption.
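There is also a mechanical reason fully grown trees gain little from boosting. This is offered as a sketch to verify on your own sklearn version, not as part of the measurements above. A tree with `max_depth=None` usually fits the weighted training set perfectly, so its error is 0, and sklearn's AdaBoost stops adding estimators once a base learner reaches zero training error. The snippet counts how many estimators were actually fitted for stumps versus full trees.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

fitted_counts = {}
for depth in (1, None):
    # Base learner passed positionally to stay independent of the keyword name.
    model = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=depth), n_estimators=50, random_state=42
    )
    model.fit(X_train, y_train)
    fitted_counts[depth] = len(model.estimators_)
    print(f"max_depth={depth}: requested 50, fitted {fitted_counts[depth]}")
```

If the full-depth run reports a single fitted estimator, the "ensemble" is just one unboosted tree, which would be consistent with the depth=None test accuracy matching that of a single full tree.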

Full Evaluation

python

from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

best_ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=42
)
best_ada.fit(X_train, y_train)
y_pred = best_ada.predict(X_test)
y_prob = best_ada.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support
   Malignant       0.93      0.98      0.95        43
      Benign       0.99      0.96      0.97        71
    accuracy                           0.96       114

AUC-ROC: 0.9923

Confusion Matrix:
[[42  1]
 [ 3 68]]

With sklearn's encoding (0 = malignant, 1 = benign), the positive class is benign. FP=1: one malignant tumor predicted as benign (the dangerous miss). FN=3: three benign tumors predicted as malignant (unnecessary follow-up). AUC=0.9923, so the model's probability rankings are nearly perfect.
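The per-class numbers in the report can be recovered directly from the confusion matrix, which is a useful sanity check on which cell is which. The sketch below assumes sklearn's layout (rows are true labels, columns are predictions, in label order 0 = malignant, 1 = benign); `per_class_metrics` is an illustrative helper.

```python
# Confusion matrix from the evaluation output: rows are true labels,
# columns are predictions, in label order 0 = malignant, 1 = benign.
cm = [[42, 1],
      [3, 68]]

def per_class_metrics(cm, k):
    """Precision and recall for class index k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything truly k
    return tp / predicted_k, tp / actual_k

for k, name in enumerate(["Malignant", "Benign"]):
    precision, recall = per_class_metrics(cm, k)
    print(f"{name:9s} precision={precision:.2f} recall={recall:.2f}")

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(f"accuracy={accuracy:.4f}")
```

Malignant recall is 42/43, so the single missed cancer is the top-right cell of the matrix.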

AdaBoost vs Random Forest

Aspect	AdaBoost	Random Forest
Error addressed	High bias	High variance
Base learner strength	Weak (shallow stumps) required	Strong (deep trees) preferred
Outlier sensitivity	High — outliers get extreme weights	Low — averaging dilutes outlier impact
Training	Sequential — cannot parallelize	Parallel — fast on multi-core
Overfitting risk	Low to moderate	Low
When to prefer	Clean data, high bias problem	Noisy data, high variance problem
Speed	Slower (sequential)	Faster (parallel trees)

Test Your Understanding

The best stump in the ensemble has ε=0.0523, α=1.475. Verify using the SAMME formula: $α = \frac{1}{2} \ln \frac{1 - ε}{ε}$. You get ≈1.449, not 1.475. The discrepancy exists because sklearn uses SAMME.R (probability-based update). In SAMME.R, what replaces the hard class label in the weight update, and why does this produce a different α?
The n_estimators sweep shows mild degradation at n=300+ (0.965 → 0.956). Yet the staged_predict curve showed test accuracy plateauing at round 20 and staying flat through round 50. If accuracy was flat at round 50, why does adding 250 more rounds (300 total) cause degradation?
At learning_rate=0.01 with n=200 estimators: test accuracy=0.9386 — same as n=10 with lr=1.0. If you increased n to 2000 at lr=0.01, would the test accuracy eventually match lr=1.0 at n=50? What constraint prevents you from always using lr=0.01 + very high n in practice?
The depth sweep shows that depth=None (full trees) achieves test=0.921, worse than depth=1 (0.965). Both achieve train=1.0. A single depth=None tree on this dataset achieves test≈0.92 without boosting. Why doesn't boosting 50 full trees improve on a single full tree, even though boosting 50 stumps improves dramatically over a single stump?
Confusion matrix: FP=1, FN=3, with benign (label 1) as the positive class. In a cancer screening context, the FP (predicting benign when malignant) is a missed cancer; each FN (predicting malignant when benign) is an unnecessary biopsy. To eliminate the missed cancer at the cost of more unnecessary biopsies, should you raise or lower the decision threshold on the predicted probability of benign? What code change would you make to implement this?
