Bagging and Boosting: Ensemble Intuition
No single model is perfect. A decision tree with the wrong depth misses the boundary. A linear model can't bend around nonlinear data. Ensemble methods don't solve these problems by building a better model — they build many imperfect models and combine them so their errors cancel.
Anchor dataset: 8-sample loan default dataset.
import numpy as np
import pandas as pd
# 8 samples: [income_$k, credit_score]
X = np.array([
[25, 580], [32, 610], [45, 650], [60, 680],
[70, 710], [80, 730], [90, 750], [110, 780]
])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0]) # 1=default
# Class boundary near income≈55k
Why Single Learners Fail
Suppose a decision stump (depth=1 tree) splits at income ≤ 65k → predict default, income > 65k → predict no_default. This correctly classifies 7 of 8 samples. The one mistake: sample 3 (income=60k, y=0) sits just under the threshold and is predicted as default. A threshold nearer 55k would separate all eight samples, but where the split lands depends on the data the stump was trained on.
Changing the training data slightly shifts the boundary and changes which samples are wrong. This sensitivity to small data changes is called high variance. Bagging directly addresses this.
A different failure mode: a model that's too simple to represent the true boundary — high bias. Boosting directly addresses this.
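To make the sensitivity concrete, here is a minimal sketch that scores a hand-written stump on the anchor dataset at a few thresholds. The thresholds (40, 50, 65) and the helper names are hypothetical, chosen only to show how the misclassified sample changes as the boundary moves.

```python
# Anchor dataset from above: (income in $k, credit score) and default label.
X = [(25, 580), (32, 610), (45, 650), (60, 680),
     (70, 710), (80, 730), (90, 750), (110, 780)]
y = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 = default


def stump_predict(income, threshold):
    """Hand-written depth-1 rule: income <= threshold -> default (1)."""
    return 1 if income <= threshold else 0


def misclassified(threshold):
    """Indices of samples the stump gets wrong at this threshold."""
    return [i for i, ((income, _), label) in enumerate(zip(X, y))
            if stump_predict(income, threshold) != label]


# Hypothetical thresholds, chosen only to show the effect of shifting the split.
for threshold in (40, 50, 65):
    wrong = misclassified(threshold)
    print(f"income <= {threshold}k: {len(y) - len(wrong)}/8 correct, wrong = {wrong}")
```

A small move of the threshold changes which sample is misclassified, which is the behavior that averaging over many models is meant to smooth out.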
Bagging — Bootstrap AGGregating
Bagging creates diversity by training each model on a different random sample of the data. Each sample is drawn with replacement (bootstrap), so each model sees a slightly different view of the training set.
Step 1: Bootstrap Sampling
Draw n=8 samples WITH REPLACEMENT from the 8 training samples. For large n, a bootstrap contains about 63.2% of the distinct training samples on average; the remaining draws are duplicates. The roughly 36.8% of samples that are never drawn are called out-of-bag (OOB) samples.
| Bootstrap | Indices (with repeats) | Samples | OOB indices |
|---|---|---|---|
| B1 | [0,0,1,2,2,4,5,6] | [25,580]×2, [32,610], [45,650]×2, [70,710], [80,730], [90,750] | 3, 7 |
| B2 | [0,1,3,4,5,5,6,7] | [25,580], [32,610], [60,680], [70,710], [80,730]×2, [90,750], [110,780] | 2 |
| B3 | [1,2,3,4,6,7,7,7] | [32,610], [45,650], [60,680], [70,710], [90,750], [110,780]×3 | 0, 5 |
Each bootstrap produces a training set where some samples appear 2–3 times and others are absent. The duplicated samples get more influence over the model trained on that bootstrap.
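The OOB column can be recomputed directly from the index lists. The sketch below does that for the three bootstraps in the table and then draws one fresh bootstrap; the helper names and the seed are assumptions made for illustration.

```python
import random

n = 8  # size of the anchor dataset

# Bootstrap index lists copied from the table above.
bootstraps = {
    "B1": [0, 0, 1, 2, 2, 4, 5, 6],
    "B2": [0, 1, 3, 4, 5, 5, 6, 7],
    "B3": [1, 2, 3, 4, 6, 7, 7, 7],
}


def oob_indices(indices, n):
    """Samples that never appear in a bootstrap are out-of-bag."""
    return sorted(set(range(n)) - set(indices))


for name, indices in bootstraps.items():
    unique = len(set(indices))
    print(f"{name}: {unique}/{n} unique, OOB = {oob_indices(indices, n)}")


def draw_bootstrap(n, rng):
    """Draw n indices uniformly with replacement."""
    return sorted(rng.randrange(n) for _ in range(n))


rng = random.Random(0)  # the seed is an arbitrary choice for repeatability
fresh = draw_bootstrap(n, rng)
print("fresh draw:", fresh, "OOB =", oob_indices(fresh, n))
```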
Step 2: Train One Model Per Bootstrap
Train a decision stump (depth=1) on each bootstrap. Bootstrap 1 has [25,580] twice and [45,650] twice, so these low-income defaulters dominate the weighted split. Each tree learns a slightly different boundary because it saw a different data distribution.
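As a rough illustration of this step, the sketch below fits a stump to each bootstrap. It assumes a deliberately simple learner: it looks at income only and places the threshold at the midpoint between adjacent distinct values, picking the one with the fewest training errors. Real libraries may break ties or choose features differently, so the thresholds it finds need not match the illustrative splits used in the aggregation table.

```python
# Anchor dataset: income in $k and default label (credit score omitted here).
income = [25, 32, 45, 60, 70, 80, 90, 110]
y = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 = default

bootstraps = {
    "B1": [0, 0, 1, 2, 2, 4, 5, 6],
    "B2": [0, 1, 3, 4, 5, 5, 6, 7],
    "B3": [1, 2, 3, 4, 6, 7, 7, 7],
}


def fit_stump(values, labels):
    """Return the income threshold with the fewest training errors.

    Candidate thresholds are midpoints between adjacent distinct values;
    the rule is: value <= threshold -> default (1).
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

    def errors(threshold):
        return sum((1 if v <= threshold else 0) != label
                   for v, label in zip(values, labels))

    return min(candidates, key=errors)


thresholds = {}
for name, indices in bootstraps.items():
    values = [income[i] for i in indices]
    labels = [y[i] for i in indices]
    thresholds[name] = fit_stump(values, labels)
    print(f"{name}: split at income <= {thresholds[name]}k")
```

The point of the sketch is only that each bootstrap yields a different threshold, because each one leaves a different gap between the highest-income defaulter and the lowest-income non-defaulter it contains.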
Step 3: Aggregate Predictions
For x_new = [55k income, 670 credit score]:
| Bootstrap | Tree split | Prediction for x_new |
|---|---|---|
| B1 | income ≤ 52.5 | no_default (0) |
| B2 | income ≤ 46.5 | no_default (0) |
| B3 | income ≤ 65 | default (1) |
| Ensemble | majority vote | no_default (0), 2/3 trees |
Two trees predict no_default (55k is above their thresholds), one predicts default. Majority vote → no_default. The ensemble overrides the outlier tree.
[Figure: Training Data (n=8) → three bootstraps (Bootstrap 1, OOB: {3,7}; Bootstrap 2, OOB: {2}; Bootstrap 3, OOB: {0,5}) → Tree 1 → 0, Tree 2 → 0, Tree 3 → 1 → Majority Vote → 0]
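The aggregation step can be sketched in a few lines. The code applies the rule used throughout (income ≤ threshold → default) to the three splits from the table and takes a majority vote; the function and variable names are assumptions for illustration.

```python
from collections import Counter

# Split thresholds (income in $k) taken from the table above.
stump_thresholds = {"B1": 52.5, "B2": 46.5, "B3": 65.0}


def stump_predict(income, threshold):
    """income <= threshold -> default (1), otherwise no_default (0)."""
    return 1 if income <= threshold else 0


def majority_vote(votes):
    """Most common label; with an odd number of trees there are no ties."""
    return Counter(votes).most_common(1)[0][0]


x_new_income = 55  # x_new = [55k income, 670 credit score]
votes = {name: stump_predict(x_new_income, t)
         for name, t in stump_thresholds.items()}
ensemble = majority_vote(list(votes.values()))
print("votes:", votes)
print("ensemble prediction:", ensemble)
```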
OOB Error Estimation
Each sample is OOB for some trees but not others. Use those trees to estimate its class — no separate validation set needed:
- Sample 3 (income=60k, y=0): OOB for Bootstrap 1 only → Tree 1 (income ≤ 52.5) predicted no_default (0) → correct
- Sample 7 (income=110k, y=0): OOB for Bootstrap 1 only → Tree 1 predicted no_default (0) → correct
OOB accuracy from these 2 samples: 100%. With only three trees each estimate rests on a single vote and is very noisy; it becomes reliable with more trees (typically 50–200). The OOB error approximates the leave-one-out cross-validation error, giving a validation score without a dedicated validation split.
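The same bookkeeping extends to every sample that is out-of-bag for at least one tree. This sketch reuses the bootstrap indices and the illustrative stump thresholds from the tables above; the function name and data layout are assumptions.

```python
from collections import Counter

income = [25, 32, 45, 60, 70, 80, 90, 110]
y = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 = default

# Bootstrap indices and illustrative stump thresholds from the tables above.
bootstraps = [
    [0, 0, 1, 2, 2, 4, 5, 6],
    [0, 1, 3, 4, 5, 5, 6, 7],
    [1, 2, 3, 4, 6, 7, 7, 7],
]
thresholds = [52.5, 46.5, 65.0]


def oob_predictions(income, bootstraps, thresholds):
    """Majority vote per sample, using only trees that never saw it."""
    preds = {}
    for i, value in enumerate(income):
        votes = [1 if value <= t else 0
                 for indices, t in zip(bootstraps, thresholds)
                 if i not in indices]
        if votes:  # samples that are never OOB get no estimate
            preds[i] = Counter(votes).most_common(1)[0][0]
    return preds


preds = oob_predictions(income, bootstraps, thresholds)
correct = sum(preds[i] == y[i] for i in preds)
print("OOB predictions:", preds)
print(f"OOB accuracy: {correct}/{len(preds)}")
```

Samples 1, 4 and 6 appear in all three bootstraps, so they receive no OOB estimate here. With more trees, nearly every sample ends up out-of-bag for some of them.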
Why Bagging Reduces Variance
A single deep decision tree has high variance — small changes to training data produce very different trees. Bootstrap samples create T different "versions" of the training data → T different trees. When these trees disagree (as Tree 3 did above), the majority vote suppresses the outlier. Mathematically:
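$$\mathrm{Var}\left(\bar{f}\right) = \rho\,\sigma^2 + \frac{1-\rho}{T}\,\sigma^2$$
Where $\rho$ is the average pairwise correlation between trees, $\sigma^2$ is the per-tree variance, and $T$ is the number of trees. As $T \to \infty$: $\mathrm{Var}(\bar{f}) \to \rho\,\sigma^2$. Trees with lower correlation give bigger variance reductions, which is why Random Forest adds feature subsampling on top of bagging.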
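A quick numeric check shows how the two terms behave. The sketch evaluates the standard expression for the variance of an average of T equally correlated estimators, ρσ² + (1 − ρ)σ²/T; the values of ρ, σ² and T are hypothetical.

```python
def ensemble_variance(rho, sigma2, T):
    """Variance of the mean of T estimators with per-estimator variance
    sigma2 and average pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) / T * sigma2


sigma2 = 1.0  # hypothetical per-tree variance
for rho in (0.0, 0.3, 0.9):  # hypothetical correlations
    row = [round(ensemble_variance(rho, sigma2, T), 3) for T in (1, 10, 100)]
    print(f"rho={rho}: T=1,10,100 -> {row}")
```

Adding trees only shrinks the second term. The floor set by ρσ² stays, so past a certain point lowering the correlation helps more than adding trees.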
Boosting — Sequential Error Correction
Boosting is architecturally opposite to bagging. Instead of parallel independent models, boosting trains models sequentially — each model focuses on the errors the previous models made.
Step 1: Initialize Weights
All 8 samples start with equal weight: $w_i = 1/8 = 0.125$.
Step 2: Train Weak Learner 1
A decision stump (depth=1) that splits at income ≤ 65k → default. It correctly classifies 7 of 8 samples. Sample 3 (income=60k, y=0) falls under the threshold, so it is predicted as default but is actually no_default.
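Weighted error rate: $\varepsilon_1 = 0.125$ (only sample 3 is wrong, weight = 0.125).
Learner weight: $\alpha_1 = \tfrac{1}{2}\ln\frac{1-\varepsilon_1}{\varepsilon_1} = \tfrac{1}{2}\ln 7 \approx 0.973$
A high $\alpha_1$ (here close to 1) means this stump is trusted heavily: it was nearly perfect.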
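The learner weight is easy to check numerically. This sketch assumes the standard discrete AdaBoost formula, α = ½ ln((1 − ε)/ε), which reproduces the 0.973 quoted for stump 1; the extra error rates in the loop are hypothetical.

```python
import math


def learner_weight(error):
    """AdaBoost learner weight: alpha = 0.5 * ln((1 - error) / error)."""
    return 0.5 * math.log((1 - error) / error)


eps_1 = 1 / 8  # only sample 3 is wrong, and it carries weight 0.125
alpha_1 = learner_weight(eps_1)
print(f"alpha_1 = {alpha_1:.3f}")

# Hypothetical error rates, to show how alpha reacts:
for eps in (0.05, 0.25, 0.5):
    print(f"error {eps:.2f} -> alpha {learner_weight(eps):.3f}")
```

A stump at 50% weighted error gets α = 0 and contributes nothing to the final vote. α grows without bound as the error approaches zero.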
Step 3: Update Sample Weights
Upweight the misclassified sample (sample 3) so the next model is forced to get it right:
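- Misclassified (sample 3): $w_3 \leftarrow 0.125 \times e^{\alpha_1} = 0.125 \times e^{0.973} \approx 0.331$
- Correct (samples 0–2, 4–7): $w_i \leftarrow 0.125 \times e^{-\alpha_1} = 0.125 \times e^{-0.973} \approx 0.047$
Normalize (sum $\approx 0.331 + 7 \times 0.047 \approx 0.66$):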
| Sample | income | y | New weight (normalized) |
|---|---|---|---|
| 0 | 25k | 1 | 0.071 |
| 1 | 32k | 1 | 0.071 |
| 2 | 45k | 1 | 0.071 |
| 3 | 60k | 0 | 0.501 |
| 4 | 70k | 0 | 0.071 |
| 5 | 80k | 0 | 0.071 |
| 6 | 90k | 0 | 0.071 |
| 7 | 110k | 0 | 0.071 |
Sample 3 now carries 50.1% of the total weight. Any stump that misclassifies sample 3 will have at least 50% weighted error, which is no better than random guessing.
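The update and normalization can be reproduced in a few lines. This is a minimal sketch assuming the standard AdaBoost rule (multiply by e^α for mistakes, by e^−α for correct samples, then renormalize); the variable names are illustrative. Because it keeps full precision, it gives 0.5 for sample 3 and 1/14 ≈ 0.071 for the others, so the three-decimal figures in the text differ slightly through rounding of intermediate values.

```python
import math

n = 8
weights = [1 / n] * n          # Step 1: equal weights
misclassified = {3}            # Stump 1 gets only sample 3 wrong
alpha_1 = 0.5 * math.log((1 - 1 / n) / (1 / n))  # about 0.973

# Scale up the weight of mistakes, scale down the weight of correct samples.
unnormalized = [w * math.exp(alpha_1 if i in misclassified else -alpha_1)
                for i, w in enumerate(weights)]
total = sum(unnormalized)
new_weights = [w / total for w in unnormalized]

for i, w in enumerate(new_weights):
    print(f"sample {i}: {w:.3f}")
```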
Step 4: Train Weak Learner 2
The new stump must focus on classifying sample 3 correctly. The best split considering the new weights might be: credit_score ≤ 695 → no_default (sample 3 has credit=680), credit_score > 695 → default. This correctly labels sample 3 (no_default) at the cost of misclassifying samples that now carry low weights.
Step 5: Final Prediction
Each stump contributes proportionally to its weight $\alpha_t$. The final model is a weighted combination of all stumps, a "strong learner" built from many "weak learners."
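A weighted vote of this kind can be sketched as follows. The sketch assumes the usual AdaBoost convention of encoding votes as +1 (default) and −1 (no_default) and predicting from the sign of the weighted sum. The α values are the ones shown for the two stumps in the article's figure; the vote patterns are hypothetical.

```python
def ensemble_score(alphas, votes):
    """Weighted sum of stump votes; votes use +1 (default) / -1 (no_default)."""
    return sum(a * v for a, v in zip(alphas, votes))


def ensemble_predict(alphas, votes):
    """Sign of the weighted sum, mapped back to the 1/0 labels."""
    return 1 if ensemble_score(alphas, votes) > 0 else 0


alphas = [0.973, 0.641]  # learner weights for stump 1 and stump 2

# Hypothetical vote patterns for a new sample:
for votes in ([+1, +1], [+1, -1], [-1, +1]):
    print(votes, round(ensemble_score(alphas, votes), 3),
          ensemble_predict(alphas, votes))
```

When the two stumps disagree, the one with the larger α decides the outcome, which is what "contributes proportionally to its weight" means in practice.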
[Figure: Data (equal weights, w=0.125) → Stump 1 (income split, α₁=0.973, misses sample 3) → Update (w₃=0.501, sample 3 ↑↑) → Stump 2 (credit≤695, α₂=0.641, fixes sample 3) → Weighted Sum → ŷ]
Bagging vs Boosting — When Each Wins
| Aspect | Bagging | Boosting |
|---|---|---|
| Model combination | Parallel (independent) | Sequential (dependent) |
| Error targeted | High VARIANCE | High BIAS |
| Base learner | Strong (deep tree) | Weak (stump) |
| Effect on bias | No change | Reduces bias |
| Effect on variance | Reduces variance | May increase variance |
| Risk of overfitting | Low | Higher (later stages overfit noise) |
| Sensitive to outliers | No (averaging dilutes) | Yes (outliers get high weight) |
| Examples | Random Forest | AdaBoost, Gradient Boosting, XGBoost |
Ensemble Vocabulary
| Term | Definition |
|---|---|
| Ensemble | Combining multiple models for better performance |
| Bagging | Bootstrap + AGGregation of parallel models |
| Boosting | Sequential correction of residuals or sample weights |
| Bootstrap | Sample n items with replacement from n-item dataset |
| OOB | ~36.8% of samples not in each bootstrap — free validation |
| Weak learner | Model with accuracy just above chance (depth-1 stump) |
| Strong learner | High-accuracy combination of many weak learners |
Test Your Understanding
- The 63.2% unique sample rate for bootstrap is an asymptotic result: as $n \to \infty$, the probability that a specific sample is never drawn is $(1 - 1/n)^n \to 1/e \approx 0.368$. For n=8 (our dataset), the exact probability that sample 0 is OOB is $(1 - 1/8)^8 = (7/8)^8$. Compute this. Is it close to 36.8%?
- In Step 3 of Boosting, we said sample 3 gets weight 0.501 after normalization. The 7 correct samples each get weight 0.071. Verify: $0.501 + 7 \times 0.071$. Does it sum to 1? If there's a small discrepancy, where does it come from?
- Bagging reduces variance but not bias. If the base learner is a depth-1 stump (already high bias), does bagging 100 stumps reduce the bias? Why or why not, and which method would you use instead?
- Boosting can overfit: with enough rounds (say, by the 100th stump), later stumps keep upweighting noisy samples and start memorizing them. What hyperparameter controls this in practice? What happens to training accuracy vs test accuracy as this hyperparameter increases?
- The ensemble formula is $F(x) = \sum_t \alpha_t h_t(x)$, with the prediction given by $\operatorname{sign}(F(x))$. Stump 1 has $\alpha_1 = 0.973$, stump 2 has $\alpha_2 = 0.641$. For a new sample where both stumps agree (both predict default=1), what is $F(x)$? What would $F(x)$ need to exceed for sign to predict default?