← View series: machine learning
~/blog
AdaBoost: Algorithm Intuition
AdaBoost (Adaptive Boosting) builds a strong classifier from many weak ones — each a decision stump (depth-1 tree). The key: after each round, samples that were misclassified receive higher weight, forcing the next stump to focus on them. The final prediction is a weighted vote of all stumps.
Anchor dataset: 8-sample loan default (y = ±1 for AdaBoost).
import numpy as np
# 8 samples: [income_$k, credit_score]
X = np.array([[25,580],[32,610],[45,650],[60,680],
[70,710],[80,730],[90,750],[110,780]])
y = np.array([-1,-1,-1,-1,1,1,1,1]) # −1=default, +1=no_defaultSetup
- Base learner: decision stump (max_depth=1, one feature, one threshold)
- Rounds: T (typically 50–500)
- Initial weights: all equal,
- Final predictor:
The sign of the weighted sum is the prediction; the magnitude is the confidence.
Round 1
Step 1a: Find Best Weighted Stump
For each threshold, compute weighted classification error:
Test stump: income ≤ 55k → predict −1, income > 55k → predict +1.
| Sample | income | y | Correct? | if wrong | ||
|---|---|---|---|---|---|---|
| 1 | 25 | −1 | 0.125 | −1 | ✓ | 0 |
| 2 | 32 | −1 | 0.125 | −1 | ✓ | 0 |
| 3 | 45 | −1 | 0.125 | −1 | ✓ | 0 |
| 4 | 60 | −1 | 0.125 | +1 | ✗ | 0.125 |
| 5 | 70 | +1 | 0.125 | +1 | ✓ | 0 |
| 6 | 80 | +1 | 0.125 | +1 | ✓ | 0 |
| 7 | 90 | +1 | 0.125 | +1 | ✓ | 0 |
| 8 | 110 | +1 | 0.125 | +1 | ✓ | 0 |
Step 1b: Compute Learner Weight
High because this stump was nearly perfect (only 1 of 8 wrong).
Step 1c: Update Sample Weights
When the stump is correct: → multiply by (downweight).
When the stump is wrong: → multiply by (upweight).
- Correct samples (1,2,3,5,6,7,8):
- Wrong sample 4:
Normalize (sum = ):
Updated weights:
Sample 4 now holds 50% of the total weight. Any stump that misclassifies it incurs a weighted error ≥ 0.5 — worse than random guessing.
Round 2
Step 2a: Best Stump with New Weights
Sample 4 (income=60, credit=680, y=−1) dominates. The algorithm searches for a stump that classifies it correctly.
Candidate: credit_score ≤ 695 → predict −1, credit_score > 695 → predict +1.
| Sample | credit | y | Correct? | ||
|---|---|---|---|---|---|
| 1 | 580 | −1 | 0.0714 | −1 | ✓ |
| 2 | 610 | −1 | 0.0714 | −1 | ✓ |
| 3 | 650 | −1 | 0.0714 | −1 | ✓ |
| 4 | 680 | −1 | 0.4993 | −1 | ✓ |
| 5 | 710 | +1 | 0.0714 | +1 | ✓ |
| 6 | 730 | +1 | 0.0714 | +1 | ✓ |
| 7 | 750 | +1 | 0.0714 | +1 | ✓ |
| 8 | 780 | +1 | 0.0714 | +1 | ✓ |
All samples classified correctly! → — the stump is perfect. In practice, a perfect stump terminates AdaBoost early (training complete). To illustrate Round 3, use a harder boundary:
Stump 2: credit_score ≤ 720 → predict −1, credit_score > 720 → predict +1.
Sample 6 (credit=730, y=+1): predict −1 → wrong.
After weight update: sample 6 is upweighted significantly (weight ≈ 0.30 after normalization), at the expense of the other correctly classified samples.
Round 3
After Round 2, the hard samples are 4 (high weight from Round 1) and 6 (high weight from Round 2). Stump 3 must handle them simultaneously.
Candidate: income ≤ 75k → predict −1, income > 75k → predict +1.
Sample 5 (income=70, y=+1): predict −1 → wrong.
With current weights (sample 4 partially normalized, sample 6 upweighted), say :
Final Prediction
For :
| Stump | Threshold | |||
|---|---|---|---|---|
| income ≤ 55 → −1 | −1 (boundary case: 55 ≤ 55) | 0.973 | −0.973 | |
| credit ≤ 720 → −1 | −1 (685 ≤ 720) | 1.282 | −1.282 | |
| income ≤ 75 → −1 | −1 (55 ≤ 75) | 0.995 | −0.995 |
All three stumps agree. — strong confidence.
<!-- Round 1 -->
<text x="72" y="28" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">Round 1</text>
<rect x="12" y="32" width="120" height="100" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
<!-- Income axis label -->
<text x="72" y="140" text-anchor="middle" font-size="7" fill="#64748b">income →</text>
<!-- All samples same size (equal weights) -->
<circle cx="22" cy="108" r="5" fill="#ef4444" opacity="0.8"/> <!-- s1 -->
<circle cx="34" cy="102" r="5" fill="#ef4444" opacity="0.8"/> <!-- s2 -->
<circle cx="52" cy="88" r="5" fill="#ef4444" opacity="0.8"/> <!-- s3 -->
<circle cx="64" cy="76" r="5" fill="#22c55e" opacity="0.8"/> <!-- s4 misclassified -->
<circle cx="82" cy="60" r="5" fill="#22c55e" opacity="0.8"/> <!-- s5 -->
<circle cx="96" cy="52" r="5" fill="#22c55e" opacity="0.8"/> <!-- s6 -->
<circle cx="108" cy="46" r="5" fill="#22c55e" opacity="0.8"/> <!-- s7 -->
<circle cx="120" cy="38" r="5" fill="#22c55e" opacity="0.8"/> <!-- s8 -->
<!-- Boundary at income=55 (x≈67) -->
<line x1="67" y1="32" x2="67" y2="132" stroke="#3b82f6" stroke-width="2" stroke-dasharray="3,2"/>
<!-- Circle the misclassified point -->
<circle cx="64" cy="76" r="10" fill="none" stroke="#f59e0b" stroke-width="2"/>
<text x="72" y="155" text-anchor="middle" font-size="7" fill="#64748b">α₁=0.973</text>
<text x="72" y="165" text-anchor="middle" font-size="7" fill="#f59e0b">⊙ sample 4 wrong</text>
<!-- Round 2 -->
<text x="218" y="28" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">Round 2</text>
<rect x="158" y="32" width="120" height="100" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
<circle cx="168" cy="108" r="5" fill="#ef4444" opacity="0.8"/>
<circle cx="180" cy="102" r="5" fill="#ef4444" opacity="0.8"/>
<circle cx="198" cy="88" r="5" fill="#ef4444" opacity="0.8"/>
<circle cx="210" cy="76" r="14" fill="#22c55e" opacity="0.9"/> <!-- s4 very large -->
<circle cx="228" cy="60" r="5" fill="#22c55e" opacity="0.8"/>
<circle cx="242" cy="52" r="5" fill="#22c55e" opacity="0.8"/> <!-- s6 misclassified -->
<circle cx="254" cy="46" r="5" fill="#22c55e" opacity="0.8"/>
<circle cx="266" cy="38" r="5" fill="#22c55e" opacity="0.8"/>
<!-- Boundary: credit≤720 (shown as horizontal conceptual line) -->
<line x1="158" y1="50" x2="278" y2="50" stroke="#3b82f6" stroke-width="2" stroke-dasharray="3,2"/>
<circle cx="242" cy="52" r="10" fill="none" stroke="#f59e0b" stroke-width="2"/>
<text x="218" y="155" text-anchor="middle" font-size="7" fill="#64748b">α₂=1.282</text>
<text x="218" y="165" text-anchor="middle" font-size="7" fill="#f59e0b">⊙ sample 6 wrong</text>
<!-- Round 3 -->
<text x="364" y="28" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">Round 3</text>
<rect x="304" y="32" width="120" height="100" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
<circle cx="314" cy="108" r="5" fill="#ef4444" opacity="0.8"/>
<circle cx="326" cy="102" r="5" fill="#ef4444" opacity="0.8"/>
<circle cx="344" cy="88" r="5" fill="#ef4444" opacity="0.8"/>
<circle cx="356" cy="76" r="11" fill="#22c55e" opacity="0.9"/> <!-- s4 medium-large -->
<circle cx="374" cy="60" r="5" fill="#22c55e" opacity="0.8"/> <!-- s5 misclassified -->
<circle cx="388" cy="52" r="10" fill="#22c55e" opacity="0.9"/> <!-- s6 medium-large -->
<circle cx="400" cy="46" r="5" fill="#22c55e" opacity="0.8"/>
<circle cx="412" cy="38" r="5" fill="#22c55e" opacity="0.8"/>
<line x1="372" y1="32" x2="372" y2="132" stroke="#3b82f6" stroke-width="2" stroke-dasharray="3,2"/>
<circle cx="374" cy="60" r="10" fill="none" stroke="#f59e0b" stroke-width="2"/>
<text x="364" y="155" text-anchor="middle" font-size="7" fill="#64748b">α₃=0.995</text>
<text x="364" y="165" text-anchor="middle" font-size="7" fill="#f59e0b">⊙ sample 5 wrong</text>
<!-- Final prediction panel -->
<text x="510" y="28" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">Combined</text>
<rect x="450" y="32" width="120" height="100" fill="#f8fafc" stroke="#e2e8f0" stroke-width="1"/>
<text x="510" y="60" text-anchor="middle" font-size="8" fill="#334155">F(x) = α₁h₁+α₂h₂+α₃h₃</text>
<text x="510" y="76" text-anchor="middle" font-size="8" fill="#334155">sign(sum) = ŷ</text>
<rect x="455" y="84" width="110" height="40" rx="4" fill="#dbeafe" stroke="#3b82f6" stroke-width="1.5"/>
<text x="510" y="100" text-anchor="middle" font-size="9" font-weight="bold" fill="#1e40af">Weighted Vote</text>
<text x="510" y="115" text-anchor="middle" font-size="8" fill="#1e40af">3 boundaries combined</text>
<text x="510" y="155" text-anchor="middle" font-size="7" fill="#64748b">Training error → 0</text>
<text x="510" y="165" text-anchor="middle" font-size="7" fill="#64748b">exponentially</text>
<text x="290" y="195" text-anchor="middle" font-size="9" fill="#64748b">Circle size ∝ sample weight. Yellow circle = misclassified → upweighted next round.</text>
<text x="290" y="210" text-anchor="middle" font-size="8" fill="#64748b">● = default (y=−1) ■ = no_default (y=+1) [simplified as circles]</text>
Why AdaBoost Works — The Margin View
For each sample, the margin is:
- : correctly classified, larger margin = higher confidence
- : misclassified
Training error bound: If each weak learner achieves (better than random):
This bound is exponentially decreasing in T. As long as each stump does better than a coin flip, AdaBoost's training error approaches zero. In practice, test error also decreases for many rounds — often continuing after training error reaches 0, due to growing margins on correctly classified samples.
Key Properties
| Property | Detail |
|---|---|
| Training error | Provably decreases exponentially with rounds |
| Weak learner requirement | Only needs accuracy > 50% (better than chance) |
| Bias vs variance | Reduces bias (stumps have high bias; boosting corrects systematic errors) |
| Outlier sensitivity | Outliers get exponentially high weights → single outlier can dominate |
| Overfitting risk | Low with early stopping (unlike deep trees); starts after test error plateaus |
| Noise sensitivity | Noisy labels → one sample's weight → ∞, corrupting later rounds |
AdaBoost reduces bias where Random Forest reduces variance. If your base model already has low bias (deep tree), use RF. If base models are shallow (stumps), use AdaBoost.
Test Your Understanding
-
After Round 1: . After Round 2: . Round 2's stump (credit ≤ 720) has a lower weighted error (0.0714) than Round 1 (0.125), yet . Confirm this by computing from the formula. What is the relationship between and ? At what is ?
-
If (perfect stump), . AdaBoost immediately terminates in this case because the model is already perfect (training error = 0). What does this imply about the usefulness of AdaBoost when the base learner is already a high-capacity model (e.g., a deep decision tree with low training error)?
-
The training error bound is . For T=10 rounds all with : compute the bound. For T=10 rounds all with : compute the bound. How much more lenient is AdaBoost when each weak learner is slightly weaker?
-
An outlier in the training set (mislabeled: income=25k, y=+1 — should be default but labeled no_default) will be misclassified by every reasonable stump. In Round 1 it gets upweighted. In Round 2, the stump focuses on it and might be right by luck, but then another stump misses it again. What happens to this sample's weight after T=10 rounds where it's misclassified 8 times and correctly classified 2 times?
-
AdaBoost minimizes exponential loss , while logistic regression minimizes log loss . For a correctly classified sample with : compute both losses. For a misclassified sample with : compute both. Which loss penalizes misclassification more severely? What does this explain about AdaBoost's outlier sensitivity?