Bagging and Boosting: Ensemble Intuition

Jun 26, 2026 · 9 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

No single model is perfect. A decision tree with the wrong depth misses the boundary. A linear model can't bend around nonlinear data. Ensemble methods don't solve these problems by building a better model — they build many imperfect models and combine them so their errors cancel.

Anchor dataset: 8-sample loan default dataset.

```python
import numpy as np
import pandas as pd

# 8 samples: [income_$k, credit_score]
X = np.array([
    [25, 580], [32, 610], [45, 650], [60, 680],
    [70, 710], [80, 730], [90, 750], [110, 780]
])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # 1=default
# Class boundary near income≈55k
```

Why Single Learners Fail

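A decision stump (depth=1 tree) on this data splits at income ≤ 55k → predict default, income > 55k → predict no_default. On these 8 samples that split happens to classify every point correctly, but it is fragile: any threshold between 45k and 60k fits the training data equally well, and a stump trained on a slightly different sample can land outside that range. A threshold of 65k, for example, would predict sample 3 (income=60k, y=0) as default.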

Changing the training data slightly shifts the boundary and changes which samples are wrong. This sensitivity to small data changes is called high variance. Bagging directly addresses this.
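To see how much the mistakes depend on where the threshold lands, here is a minimal sketch that counts the misclassified samples for a few candidate thresholds. The thresholds (40k, 55k, 65k) are illustrative assumptions, not learned values, and the stump is assumed to predict default at or below the threshold.

```python
# The 8 anchor samples: income in $k and labels (1 = default).
incomes = [25, 32, 45, 60, 70, 80, 90, 110]
labels = [1, 1, 1, 0, 0, 0, 0, 0]


def stump_predict(income, threshold):
    """Predict default (1) when income is at or below the threshold."""
    return 1 if income <= threshold else 0


def misclassified(threshold):
    """Return the indices of the samples this stump gets wrong."""
    return [
        i for i, (x, y) in enumerate(zip(incomes, labels))
        if stump_predict(x, threshold) != y
    ]


# Hypothetical thresholds chosen by hand to show the effect of a shift.
errors_by_threshold = {t: misclassified(t) for t in (40, 55, 65)}
for t, wrong in errors_by_threshold.items():
    print(f"income <= {t}k -> default: wrong on samples {wrong}")
```

A shift of 10–15k in either direction changes which sample is misclassified, which is the instability that bagging averages away.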

A different failure mode: a model that's too simple to represent the true boundary — high bias. Boosting directly addresses this.

Bagging — Bootstrap AGGregating

Bagging creates diversity by training each model on a different random sample of the data. Each sample is drawn with replacement (bootstrap), so each model sees a slightly different view of the training set.

Step 1: Bootstrap Sampling

Draw n=8 samples WITH REPLACEMENT from the 8 training samples. On average, a bootstrap contains about 63.2% of the distinct original samples (the large-n limit); the remaining draws are duplicates. The roughly 36.8% of samples that are never drawn are called out-of-bag (OOB) samples.

| Bootstrap | Indices (with repeats) | Samples | OOB indices |
|---|---|---|---|
| B1 | [0,0,1,2,2,4,5,6] | [25,580]×2, [32,610], [45,650]×2, [70,710], [80,730], [90,750] | 3, 7 |
| B2 | [0,1,3,4,5,5,6,7] | [25,580], [32,610], [60,680], [70,710], [80,730]×2, [90,750], [110,780] | 2 |
| B3 | [1,2,3,4,6,7,7,7] | [32,610], [45,650], [60,680], [70,710], [90,750], [110,780]×3 | 0, 5 |

Each bootstrap produces a training set where some samples appear 2–3 times and others are absent. The duplicated samples get more influence over the model trained on that bootstrap.
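The bookkeeping for a bootstrap can be sketched in a few lines. The snippet below recomputes the OOB indices for the three index lists in the table and then draws one fresh bootstrap; the helper names and the seed are assumptions made for illustration.

```python
import random

N_SAMPLES = 8


def draw_bootstrap(n, rng):
    """Draw n indices with replacement from range(n), sorted for readability."""
    return sorted(rng.choices(range(n), k=n))


def oob_indices(indices, n):
    """Indices of the original samples that never appear in the bootstrap."""
    drawn = set(indices)
    return [i for i in range(n) if i not in drawn]


# The three bootstraps listed in the table.
bootstraps = {
    "B1": [0, 0, 1, 2, 2, 4, 5, 6],
    "B2": [0, 1, 3, 4, 5, 5, 6, 7],
    "B3": [1, 2, 3, 4, 6, 7, 7, 7],
}
oob = {name: oob_indices(idx, N_SAMPLES) for name, idx in bootstraps.items()}
for name, idx in bootstraps.items():
    print(name, "unique samples:", len(set(idx)), "OOB:", oob[name])

# A fresh random draw; the seed is arbitrary and only makes the run repeatable.
rng = random.Random(0)
fresh = draw_bootstrap(N_SAMPLES, rng)
print("fresh draw:", fresh, "OOB:", oob_indices(fresh, N_SAMPLES))
```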

Step 2: Train One Model Per Bootstrap

Train a decision stump (depth=1) on each bootstrap. Bootstrap 1 has [25,580] twice and [45,650] twice, so these low-income defaulters dominate the weighted split. Each tree learns a slightly different boundary because it saw a different data distribution.
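As a rough sketch of this step, the hand-rolled stump trainer below fits one income threshold per bootstrap. It assumes a simple convention: only the income feature is scanned, candidate thresholds are midpoints between adjacent distinct incomes, and ties go to the lowest threshold. A real library may use a different criterion or split on credit_score instead, so the thresholds it finds are not expected to match illustrative values quoted elsewhere in this post.

```python
incomes = [25, 32, 45, 60, 70, 80, 90, 110]
labels = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 = default

bootstraps = {
    "B1": [0, 0, 1, 2, 2, 4, 5, 6],
    "B2": [0, 1, 3, 4, 5, 5, 6, 7],
    "B3": [1, 2, 3, 4, 6, 7, 7, 7],
}


def fit_income_stump(xs, ys):
    """Return (threshold, n_errors) with the fewest training errors.

    Rule: predict default (1) when income <= threshold.
    Candidates are midpoints between adjacent distinct incomes.
    """
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        errors = sum((1 if x <= t else 0) != y for x, y in zip(xs, ys))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best


fitted = {}
for name, idx in bootstraps.items():
    xs = [incomes[i] for i in idx]
    ys = [labels[i] for i in idx]
    fitted[name] = fit_income_stump(xs, ys)
    print(name, "threshold:", fitted[name][0], "training errors:", fitted[name][1])
```

Each bootstrap yields a different threshold even though the underlying data is the same, which is the diversity bagging relies on.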

Step 3: Aggregate Predictions

For x_new = [55k income, 670 credit score], with each stump predicting default when income is at or below its threshold:

| Bootstrap | Tree split | Prediction for x_new |
|---|---|---|
| B1 | income ≤ 52.5 | no_default (0) |
| B2 | income ≤ 46.5 | no_default (0) |
| B3 | income ≤ 65 | default (1) |
| Ensemble | majority vote | no_default (0), 2/3 trees |

Two trees predict no_default, one predicts default. Majority vote → no_default. The ensemble overrides the outlier tree.
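The aggregation step can be sketched as follows, using the three thresholds from the table. It assumes each stump predicts default when income is at or below its threshold, which is the direction the data implies since the low incomes are the defaulters. The tie-breaking rule is an arbitrary choice for illustration.

```python
from collections import Counter


def stump_predict(income, threshold):
    """Predict default (1) when income is at or below the threshold."""
    return 1 if income <= threshold else 0


def majority_vote(votes):
    """Most common label; ties resolve to the smaller label (arbitrary)."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)


thresholds = {"B1": 52.5, "B2": 46.5, "B3": 65.0}  # tree splits from the table
x_new_income = 55

votes = {name: stump_predict(x_new_income, t) for name, t in thresholds.items()}
ensemble = majority_vote(list(votes.values()))

print("votes:", votes)
print("ensemble prediction:", ensemble)
```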

[Figure: Bagging: Parallel Bootstrap Trees. Training data (n=8) is bootstrapped into Bootstrap 1 (OOB: {3,7}), Bootstrap 2 (OOB: {2}), and Bootstrap 3 (OOB: {0,5}); one tree is trained per bootstrap and the three predictions are combined by majority vote.]

OOB Error Estimation

Each sample is OOB for some trees but not others. Use those trees to estimate its class — no separate validation set needed:

  • Sample 3 (income=60k, y=0): OOB for Bootstrap 1 only → Tree 1 (income ≤ 52.5) predicts no_default (0) → correct
  • Sample 7 (income=110k, y=0): OOB for Bootstrap 1 only → Tree 1 predicts no_default (0) → correct

OOB accuracy from these 2 samples: 100%, but two samples are far too few to trust. The estimate becomes reliable as the number of trees grows (typically 50–200 trees), because every sample is then OOB for many trees. The OOB error roughly approximates the leave-one-out cross-validation error: a free validation score without a dedicated validation split.
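A minimal sketch of the full OOB computation over all three bootstraps, not just the two samples discussed above. It reuses the index lists and tree thresholds from the earlier tables and assumes each stump predicts default at or below its threshold; samples that are never OOB are simply left unscored.

```python
incomes = [25, 32, 45, 60, 70, 80, 90, 110]
labels = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 = default

bootstraps = {
    "B1": [0, 0, 1, 2, 2, 4, 5, 6],
    "B2": [0, 1, 3, 4, 5, 5, 6, 7],
    "B3": [1, 2, 3, 4, 6, 7, 7, 7],
}
thresholds = {"B1": 52.5, "B2": 46.5, "B3": 65.0}


def stump_predict(income, threshold):
    return 1 if income <= threshold else 0


n = len(incomes)
# Collect, for every sample, the votes of the trees that never saw it.
oob_votes = {i: [] for i in range(n)}
for name, idx in bootstraps.items():
    drawn = set(idx)
    for i in range(n):
        if i not in drawn:
            oob_votes[i].append(stump_predict(incomes[i], thresholds[name]))

# Score only samples with at least one OOB vote (ties go to no_default).
scored = {}
for i, votes in oob_votes.items():
    if votes:
        prediction = 1 if 2 * sum(votes) > len(votes) else 0
        scored[i] = prediction == labels[i]

oob_accuracy = sum(scored.values()) / len(scored)
print("samples with an OOB vote:", sorted(scored))
print("OOB accuracy:", oob_accuracy)
```

With only three trees, three of the eight samples never land out of bag, which is one reason OOB estimates need many more trees.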

Why Bagging Reduces Variance

A single deep decision tree has high variance — small changes to training data produce very different trees. Bootstrap samples create T different "versions" of the training data → T different trees. When these trees disagree (as Tree 3 did above), the majority vote suppresses the outlier. Mathematically:

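$$\text{Var}(\text{ensemble}) = \rho\,\sigma^2 + \frac{1-\rho}{T}\,\sigma^2$$

where $\rho$ is the average pairwise correlation between trees, $\sigma^2$ is the per-tree variance, and $T$ is the number of trees. As $T \to \infty$: $\text{Var} \to \rho\,\sigma^2$. Trees with lower correlation give bigger variance reductions, which is why Random Forest adds feature subsampling on top of bagging.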
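To get a feel for the numbers, the sketch below evaluates the standard variance expression for an average of T identically distributed, pairwise-correlated estimators, ρσ² + (1 − ρ)σ²/T. The values of ρ and σ² are hypothetical, picked only to show the trend.

```python
def ensemble_variance(rho, sigma2, n_trees):
    """Variance of the average of n_trees estimators with pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


sigma2 = 1.0  # hypothetical per-tree variance
for rho in (0.0, 0.3, 0.8):
    row = [round(ensemble_variance(rho, sigma2, t), 3) for t in (1, 10, 100, 1000)]
    print(f"rho={rho}: T=1,10,100,1000 -> {row}")
```

With ρ = 0 the variance keeps shrinking as 1/T, while with ρ = 0.8 it flattens out near 0.8σ² almost immediately: adding trees cannot remove the correlated part.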

Boosting — Sequential Error Correction

Boosting is architecturally opposite to bagging. Instead of parallel independent models, boosting trains models sequentially — each model focuses on the errors the previous models made.

Step 1: Initialize Weights

All 8 samples start with equal weight: $w_i = 1/8 = 0.125$.

Step 2: Train Weak Learner 1

Suppose the first decision stump (depth=1) splits at income ≤ 65k → default, a deliberately imperfect threshold so that there is an error to correct. It correctly classifies 7 of 8 samples: sample 3 (income=60k, y=0) is predicted as default but is actually no_default.

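Weighted error rate: $\varepsilon_1 = 0.125$ (only sample 3 is wrong, weight = 0.125).

Learner weight: $\alpha_1 = \frac{1}{2}\ln\frac{1-\varepsilon_1}{\varepsilon_1} = \frac{1}{2}\ln 7 \approx 0.973$

A high $\alpha_1$ means this stump is trusted heavily: it was nearly perfect. ($\alpha$ is 0 at 50% error and grows without bound as the error approaches 0.)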
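The error and learner weight can be checked with a short sketch. It assumes the standard AdaBoost formula α = ½ ln((1 − ε)/ε) and that sample 3 is the only misclassified sample, as in the walkthrough; the function name is a hypothetical helper.

```python
import math


def learner_weight(epsilon):
    """AdaBoost learner weight: alpha = 0.5 * ln((1 - eps) / eps)."""
    return 0.5 * math.log((1.0 - epsilon) / epsilon)


n_samples = 8
weights = [1.0 / n_samples] * n_samples
wrong = {3}  # sample the first stump misclassifies, per the walkthrough

# Weighted error is the total weight of the misclassified samples.
epsilon_1 = sum(w for i, w in enumerate(weights) if i in wrong)
alpha_1 = learner_weight(epsilon_1)
print(f"epsilon_1 = {epsilon_1:.3f}, alpha_1 = {alpha_1:.3f}")

# How alpha reacts to the error rate.
for eps in (0.01, 0.125, 0.25, 0.5):
    print(f"eps = {eps:<5} -> alpha = {learner_weight(eps):.3f}")
```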

Step 3: Update Sample Weights

Upweight the misclassified sample (sample 3) so the next model is forced to get it right:

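  • Misclassified (sample 3): $w_3 \leftarrow 0.125 \times e^{\alpha_1} = 0.125 \times e^{0.973} \approx 0.331$
  • Correct (samples 0–2, 4–7): $w_i \leftarrow 0.125 \times e^{-\alpha_1} = 0.125 \times e^{-0.973} \approx 0.047$

Normalize (sum $\approx 0.661$):

| Sample | income | y | New weight (normalized) |
|---|---|---|---|
| 0 | 25k | 1 | 0.071 |
| 1 | 32k | 1 | 0.071 |
| 2 | 45k | 1 | 0.071 |
| 3 | 60k | 0 | 0.501 |
| 4 | 70k | 0 | 0.071 |
| 5 | 80k | 0 | 0.071 |
| 6 | 90k | 0 | 0.071 |
| 7 | 110k | 0 | 0.071 |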

Sample 3 now carries about half of the total weight (0.501 after rounding; the unrounded AdaBoost update gives exactly 0.5). Any stump that misclassifies sample 3 will have at least 50% weighted error, which is no better than random guessing.
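The reweighting can be reproduced without intermediate rounding. This sketch assumes the standard AdaBoost update (multiply by e^α when wrong, e^−α when right, then normalize) with sample 3 as the only error.

```python
import math

n_samples = 8
weights = [1.0 / n_samples] * n_samples
is_correct = [i != 3 for i in range(n_samples)]  # only sample 3 is wrong

alpha_1 = 0.5 * math.log(7.0)  # learner weight for a weighted error of 0.125

# Upweight mistakes, downweight correct samples, then renormalize to sum to 1.
unnormalized = [
    w * math.exp(-alpha_1 if ok else alpha_1)
    for w, ok in zip(weights, is_correct)
]
total = sum(unnormalized)
new_weights = [u / total for u in unnormalized]

for i, w in enumerate(new_weights):
    print(f"sample {i}: {w:.4f}")
print(f"normalizer: {total:.4f}")
```

Carried at full precision, the misclassified sample ends up with exactly half of the weight and each correct sample with 1/14; the 0.501 and 0.071 in the table come from rounding along the way.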

Step 4: Train Weak Learner 2

The new stump must classify sample 3 correctly, because getting it wrong already costs half of the total weight. A split that does so puts sample 3 on the no_default side, for example credit_score ≤ 665 → default, credit_score > 665 → no_default (sample 3 has credit=680). In general, the second learner is willing to misclassify a few low-weight samples in order to get the high-weight sample right.
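To make the trade-off concrete, the sketch below scores three candidate stumps under the updated weights (1/14 for each correct sample, 1/2 for sample 3). All three thresholds are illustrative assumptions chosen by hand, not the output of a trained model.

```python
incomes = [25, 32, 45, 60, 70, 80, 90, 110]
credit = [580, 610, 650, 680, 710, 730, 750, 780]
labels = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 = default

# Weights after the first boosting round (unrounded).
weights = [1.0 / 14.0] * 8
weights[3] = 0.5


def weighted_error(preds):
    """Total weight of the samples whose prediction differs from the label."""
    return sum(w for p, y, w in zip(preds, labels, weights) if p != y)


# Hypothetical candidates; each predicts default at or below its threshold.
candidates = {
    "income <= 65": [1 if x <= 65 else 0 for x in incomes],   # misses sample 3
    "income <= 40": [1 if x <= 40 else 0 for x in incomes],   # misses sample 2
    "credit <= 665": [1 if c <= 665 else 0 for c in credit],  # misses nothing
}
errors = {name: weighted_error(p) for name, p in candidates.items()}
for name, err in errors.items():
    print(f"{name}: weighted error = {err:.3f}")
```

Missing sample 3 costs 0.5, while missing a low-weight sample costs about 0.071, so the reweighting steers the next learner toward the earlier mistake. On this small, separable dataset a stump can also avoid errors entirely; real data rarely allows that.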

Step 5: Final Prediction

Each stump contributes proportionally to its weight $\alpha_t$. The final model is a weighted combination of all stumps: a "strong learner" built from many "weak learners."
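A minimal sketch of the weighted vote, assuming the usual AdaBoost convention where each stump votes +1 (default) or −1 (no_default) and the ensemble takes the sign of the α-weighted sum. The two learner weights are illustrative values, and the example shows a case where the stumps disagree.

```python
def weighted_vote(alphas, votes):
    """Return (label, score) where score = sum(alpha_t * vote_t).

    votes are +1 (default) or -1 (no_default); label is the sign of the score,
    with a score of exactly 0 falling to -1 (an arbitrary tie rule).
    """
    score = sum(a * v for a, v in zip(alphas, votes))
    label = 1 if score > 0 else -1
    return label, score


alphas = [0.973, 0.641]  # illustrative learner weights for two stumps

# Stump 1 says default, stump 2 says no_default: the heavier stump wins.
label, score = weighted_vote(alphas, [+1, -1])
print(f"score = {score:.3f}, label = {label}")

# Swap the votes and the outcome flips.
flipped_label, flipped_score = weighted_vote(alphas, [-1, +1])
print(f"score = {flipped_score:.3f}, label = {flipped_label}")
```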

[Figure: Boosting: Sequential Error Correction. Data (equal weights) → Stump 1 (misses sample 3) → weight update (sample 3 upweighted) → Stump 2 (fixes sample 3) → weighted sum → ŷ.]

Bagging vs Boosting — When Each Wins

| Aspect | Bagging | Boosting |
|---|---|---|
| Model combination | Parallel (independent) | Sequential (dependent) |
| Error targeted | High VARIANCE | High BIAS |
| Base learner | Strong (deep tree) | Weak (stump) |
| Effect on bias | No change | Reduces bias |
| Effect on variance | Reduces variance | May increase variance |
| Risk of overfitting | Low | Higher (later stages overfit noise) |
| Sensitive to outliers | No (averaging dilutes) | Yes (outliers get high weight) |
| Examples | Random Forest | AdaBoost, Gradient Boosting, XGBoost |

Ensemble Vocabulary

| Term | Definition |
|---|---|
| Ensemble | Combining multiple models for better performance |
| Bagging | Bootstrap + AGGregation of parallel models |
| Boosting | Sequential correction of residuals or sample weights |
| Bootstrap | Sample n items with replacement from n-item dataset |
| OOB | ~36.8% of samples not in each bootstrap; free validation |
| Weak learner | Model with accuracy just above chance (depth-1 stump) |
| Strong learner | High-accuracy combination of many weak learners |

Test Your Understanding

  1. The 63.2% unique sample rate for bootstrap is an asymptotic result: as $n \to \infty$, the probability that a specific sample is never drawn is $(1 - 1/n)^n \to 1/e \approx 0.368$. For n=8 (our dataset), the exact probability that sample 0 is OOB is $(1 - 1/8)^8 = (7/8)^8$. Compute this. Is it close to 36.8%?

  2. In Step 3 of Boosting, we said sample 3 gets weight 0.501 after normalization. The 7 correct samples each get weight 0.071. Verify: $0.501 + 7 \times 0.071 = \,?$ Does it sum to 1? If there's a small discrepancy, where does it come from?

  3. Bagging reduces variance but not bias. If the base learner is a depth-1 stump (already high bias), does bagging 100 stumps reduce the bias? Why or why not — and which method would you use instead?

  4. Boosting can overfit: if you add a 100th stump that perfectly classifies all remaining noise, later stumps upweight noise samples and start memorizing them. What hyperparameter controls this in practice? What happens to training accuracy vs test accuracy as this hyperparameter increases?

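  5. The ensemble formula is $H(x) = \text{sign}\left(\sum_t \alpha_t h_t(x)\right)$, with each stump voting $h_t(x) \in \{-1, +1\}$ (+1 = default) in the usual AdaBoost convention. Stump 1 has $\alpha_1 = 0.973$, stump 2 has $\alpha_2 = 0.641$. For a new sample where both stumps agree (both predict default=1), what is $\sum_t \alpha_t h_t(x)$? What value would that sum need to exceed for sign to predict default?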
