Bagging and Boosting: Ensemble Intuition

Jun 26, 2026•9 min read•By Mohammed Vasim

Machine Learning, AI, Data Science

No single model is perfect. A decision tree with the wrong depth misses the boundary. A linear model can't bend around nonlinear data. Ensemble methods don't solve these problems by building a better model — they build many imperfect models and combine them so their errors cancel.

Anchor dataset: 8-sample loan default dataset.

```python
import numpy as np
import pandas as pd

# 8 samples: [income_$k, credit_score]
X = np.array([
    [25, 580], [32, 610], [45, 650], [60, 680],
    [70, 710], [80, 730], [90, 750], [110, 780]
])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # 1=default
# Class boundary near income≈55k
```

Why Single Learners Fail

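A decision stump (depth=1 tree) makes a single split on one feature, here income: at or below the threshold → predict default, above it → predict no_default. A threshold near 55k would separate all 8 samples, but a stump fit to a slightly different sample can easily land a little higher, say at income ≤ 65k. That stump correctly classifies 7 of 8 samples. The one mistake: sample 3 (income=60k, y=0) sits just under the threshold and is predicted as default.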

Changing the training data slightly shifts the boundary and changes which samples are wrong. This sensitivity to small data changes is called high variance. Bagging directly addresses this.

A different failure mode: a model that's too simple to represent the true boundary — high bias. Boosting directly addresses this.

Bagging — Bootstrap AGGregating

Bagging creates diversity by training each model on a different random sample of the data. Each sample is drawn with replacement (bootstrap), so each model sees a slightly different view of the training set.

Step 1: Bootstrap Sampling

Draw n=8 samples WITH REPLACEMENT from the 8 training samples. On average a bootstrap contains about 63.2% of the distinct training samples (the large-n limit); the remaining draws are duplicates. The ~36.8% of samples not drawn are called out-of-bag (OOB) samples.

Bootstrap	Indices (with repeats)	Samples	OOB indices
B1	[0,0,1,2,2,4,5,6]	[25,580]×2, [32,610], [45,650]×2, [70,710], [80,730], [90,750]	3, 7
B2	[0,1,3,4,5,5,6,7]	[25,580], [32,610], [60,680], [70,710], [80,730]×2, [90,750], [110,780]	2
B3	[1,2,3,4,6,7,7,7]	[32,610], [45,650], [60,680], [70,710], [90,750], [110,780]×3	0, 5

Each bootstrap produces a training set where some samples appear 2–3 times and others are absent. The duplicated samples get more influence over the model trained on that bootstrap.
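The bookkeeping for one bootstrap can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used to produce the table: the seed is arbitrary, so the random draw will not reproduce B1–B3, and the last lines simply re-derive the OOB set for B1 from the indices listed above.

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# One bootstrap: draw n indices with replacement.
idx = rng.integers(0, n, size=n)

# counts[i] = how many times sample i was drawn; 0 means out-of-bag.
counts = np.bincount(idx, minlength=n)
oob = np.flatnonzero(counts == 0)

print("bootstrap indices:", np.sort(idx))
print("draw counts      :", counts)
print("OOB indices      :", oob)

# The same bookkeeping applied to B1 from the table above.
b1 = np.array([0, 0, 1, 2, 2, 4, 5, 6])
b1_counts = np.bincount(b1, minlength=n)
b1_oob = np.flatnonzero(b1_counts == 0)
print("B1 counts:", b1_counts, "B1 OOB:", b1_oob)
```

The draw counts act like integer sample weights: a sample drawn twice influences the fitted model twice as much as a sample drawn once.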

Step 2: Train One Model Per Bootstrap

Train a decision stump (depth=1) on each bootstrap. Bootstrap 1 has [25,580] twice and [45,650] twice, so these low-income defaulters count double when the stump scores candidate splits. Each tree learns a slightly different boundary because it saw a different data distribution.

Step 3: Aggregate Predictions

For x_new = [55k income, 670 credit score]:

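Bootstrap	Tree split	Prediction for x_new
B1	income ≤ 52.5	no_default (0)
B2	income ≤ 46.5	no_default (0)
B3	income ≤ 65	default (1)
Ensemble	majority vote	no_default (0), 2 of 3 trees

Each stump predicts default when income is at or below its split. At 55k, x_new sits above the B1 and B2 splits and below the B3 split, so two trees predict no_default and one predicts default. Majority vote → no_default. The ensemble overrides the outlier tree.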
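The voting step can be sketched as follows. This assumes each stump predicts default when income is at or below its threshold and reuses the three thresholds from the table; the dictionary and variable names are illustrative, not part of any library.

```python
import numpy as np

# Illustrative stumps: income thresholds copied from the table above.
# Each stump predicts default (1) when income <= threshold, else no_default (0).
thresholds = {"B1": 52.5, "B2": 46.5, "B3": 65.0}

x_new = np.array([55, 670])  # [income_$k, credit_score]

# One vote per tree, based on the income feature only.
votes = {name: int(x_new[0] <= t) for name, t in thresholds.items()}

# Majority vote: predict 1 only if more than half of the trees vote 1.
n_default = sum(votes.values())
ensemble = int(n_default > len(votes) / 2)

print("votes:", votes)
print("ensemble prediction:", ensemble)
```

With an odd number of trees a binary vote can never tie, which is one reason small demonstrations tend to use 3 or 5 trees.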

[Figure: bagging pipeline. Training Data (n=8) → Bootstrap 1 (OOB: {3,7}), Bootstrap 2 (OOB: {2}), Bootstrap 3 (OOB: {0,5}) → Tree 1, Tree 2, Tree 3 → Majority Vote]

OOB Error Estimation

Each sample is OOB for some trees but not others. Use those trees to estimate its class — no separate validation set needed:

Sample 3 (income=60k, y=0): OOB for Bootstrap 1 only → Tree 1 (income ≤ 52.5) predicted no_default (0) → correct
Sample 7 (income=110k, y=0): OOB for Bootstrap 1 only → Tree 1 predicted no_default (0) → correct

OOB accuracy from these 2 samples: 100%, but two samples scored by a single tree each say very little. The estimate becomes reliable as trees are added (typically 50–200), because every sample is then out-of-bag for roughly a third of them. The OOB error behaves much like a leave-one-out cross-validation error: a free validation score without a dedicated validation split.
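As a rough sketch, OOB scoring over all three bootstraps looks like this. It reuses the bootstrap indices and the illustrative income thresholds from the tables above, and it assumes each stump predicts default when income is at or below its threshold; none of the names come from a library.

```python
import numpy as np

X = np.array([[25, 580], [32, 610], [45, 650], [60, 680],
              [70, 710], [80, 730], [90, 750], [110, 780]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])

# (bootstrap indices, illustrative income threshold) for B1, B2, B3.
bootstraps = [
    ([0, 0, 1, 2, 2, 4, 5, 6], 52.5),
    ([0, 1, 3, 4, 5, 5, 6, 7], 46.5),
    ([1, 2, 3, 4, 6, 7, 7, 7], 65.0),
]

# For each sample, collect votes only from trees that never saw it.
oob_votes = {i: [] for i in range(len(y))}
for idx, threshold in bootstraps:
    for i in set(range(len(y))) - set(idx):
        oob_votes[i].append(int(X[i, 0] <= threshold))

# Score only the samples that were OOB for at least one tree.
scored = {i: int(np.mean(v) > 0.5) for i, v in oob_votes.items() if v}
oob_accuracy = np.mean([pred == y[i] for i, pred in scored.items()])

print("OOB predictions:", scored)
print("never OOB      :", [i for i, v in oob_votes.items() if not v])
print("OOB accuracy   :", oob_accuracy)
```

With only three trees some samples are never out-of-bag and cannot be scored at all, which is another reason the estimate needs many trees.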

Why Bagging Reduces Variance

A single deep decision tree has high variance — small changes to training data produce very different trees. Bootstrap samples create T different "versions" of the training data → T different trees. When these trees disagree (as Tree 3 did above), the majority vote suppresses the outlier. Mathematically:

$\mathrm{Var}\left(\frac{1}{T} \sum_{t=1}^{T} h_{t}(x)\right) = ρ \cdot σ^{2} + \frac{1 - ρ}{T} \cdot σ^{2}$

Where $ρ$ is the average pairwise correlation between trees, $σ^{2}$ is the per-tree variance, and $T$ is the number of trees. As $T \to \infty$ : $Var \to ρ σ^{2}$ . Trees with lower correlation give bigger variance reductions — this is why Random Forest adds feature subsampling on top of bagging.
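To see how the two terms behave, the formula can be evaluated directly. The correlation values and the per-tree variance below are hypothetical numbers chosen for illustration, not measurements from the loan dataset.

```python
def ensemble_variance(rho: float, sigma2: float, T: int) -> float:
    """Variance of the average of T predictors with pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) / T * sigma2


sigma2 = 1.0  # hypothetical per-tree variance
for rho in (0.9, 0.5, 0.1):  # hypothetical average correlations
    values = [ensemble_variance(rho, sigma2, T) for T in (1, 10, 100, 1000)]
    print(f"rho={rho}:", [round(v, 3) for v in values])
```

Each row starts at the single-tree variance for T=1 and flattens out at ρσ², so adding trees helps only down to the floor set by the correlation.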

Boosting — Sequential Error Correction

Boosting is architecturally opposite to bagging. Instead of parallel independent models, boosting trains models sequentially — each model focuses on the errors the previous models made.

Step 1: Initialize Weights

All 8 samples start with equal weight: $w_{i} = 1/ n = 1/8 = 0.125$ .

Step 2: Train Weak Learner 1

A decision stump (depth=1) on income with its threshold just above 60k, the same kind of stump we started with. It correctly classifies 7 of 8 samples. Sample 3 (income=60k, y=0) falls on the default side of the split, so it is predicted as default but is actually no_default.

Weighted error rate: $ε_{1} = w_{3} = 0.125$ (only sample 3 is wrong, weight = 0.125).

Learner weight: $α_{1} = \frac{1}{2} \ln\left(\frac{1 - ε_{1}}{ε_{1}}\right) = \frac{1}{2} \ln\left(\frac{0.875}{0.125}\right) = \frac{1}{2} \ln(7) \approx 0.973$

A large $α_{1}$ means this stump is trusted heavily because it was nearly perfect. $α$ has no upper bound: it grows as $ε \to 0$ and drops to 0 at $ε = 0.5$ (random guessing).
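The two quantities above take only a few lines to compute. In this sketch the list of misclassified indices is written by hand to match the walkthrough, and `learner_weight` is a helper name introduced here for illustration.

```python
import math


def learner_weight(eps: float) -> float:
    """AdaBoost learner weight for a weighted error rate eps (0 < eps < 1)."""
    return 0.5 * math.log((1 - eps) / eps)


n = 8
weights = [1 / n] * n
misclassified = [3]  # stump 1 gets only sample 3 wrong

# Weighted error: total weight of the misclassified samples.
eps = sum(weights[i] for i in misclassified)
alpha = learner_weight(eps)
print(f"eps = {eps:.3f}, alpha = {alpha:.3f}")

# How alpha reacts to the error rate.
for e in (0.01, 0.125, 0.25, 0.5):
    print(f"eps = {e:<5} -> alpha = {learner_weight(e):.3f}")
```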

Step 3: Update Sample Weights

Upweight the misclassified sample (sample 3) so the next model is forced to get it right:

Misclassified (sample 3): $w_{3} \leftarrow 0.125 \times e^{0.973} = 0.125 \times 2.645 = 0.331$
Correct (samples 0–2, 4–7): $w_{i} \leftarrow 0.125 \times e^{- 0.973} = 0.125 \times 0.378 = 0.047$

Normalize (sum = $7 \times 0.047 + 1 \times 0.331 = 0.329 + 0.331 = 0.660$ ):

Sample	income	y	New weight (normalized)
0	25k	1	$0.047/0.660 = 0.071$
1	32k	1	0.071
2	45k	1	0.071
3	60k	0	$0.331/0.660 = 0.501$
4	70k	0	0.071
5	80k	0	0.071
6	90k	0	0.071
7	110k	0	0.071

Sample 3 now carries about half (50.1%) of the total weight. Any stump that still misclassifies sample 3 will have at least 50% weighted error, which is no better than random guessing.
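The update and normalization can be reproduced with NumPy. This sketch hard-codes stump 1's predictions to match the walkthrough (wrong only on sample 3) and keeps full floating-point precision, so its output may differ in the third decimal from the hand-rounded table.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # stump 1: wrong only on sample 3
wrong = y_pred != y_true

w = np.full(8, 1 / 8)              # Step 1: equal weights
eps = w[wrong].sum()               # weighted error of stump 1
alpha = 0.5 * np.log((1 - eps) / eps)

# Multiply by e^{+alpha} where the stump was wrong, e^{-alpha} where it was right.
w_new = w * np.exp(np.where(wrong, alpha, -alpha))
w_new /= w_new.sum()               # renormalize so the weights sum to 1

print("alpha      :", round(float(alpha), 3))
print("new weights:", np.round(w_new, 3))
```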

Step 4: Train Weak Learner 2

The new stump must classify sample 3 correctly: any split that leaves sample 3 (income=60k, credit=680) on the no_default side is a candidate, and the learner picks the one with the lowest weighted error. Because each of the other samples now weighs only 0.071, the stump can afford to misclassify a few of them; three such mistakes cost 0.213 in weighted error, far less than the 0.501 for missing sample 3.

Step 5: Final Prediction

$f (x) = α_{1} h_{1} (x) + α_{2} h_{2} (x) + \dots + α_{T} h_{T} (x)$

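$\hat{y} = \operatorname{sign}(f(x))$

Here each $h_{t}(x)$ outputs +1 for default and −1 for no_default (the 1/0 labels recoded), so the sign of the sum is a weighted majority vote. Each stump contributes proportionally to its weight $α_{t}$. The final model is a weighted combination of all stumps: a "strong learner" built from many "weak learners."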
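A small sketch of the weighted vote follows. It assumes stump outputs are encoded as +1 (default) and −1 (no_default), which the sign rule requires. It uses the two learner weights quoted in this post; the input where the stumps disagree is a hypothetical example, and `boosted_predict` is an illustrative helper name.

```python
import numpy as np

alphas = np.array([0.973, 0.641])  # learner weights quoted in this post


def boosted_predict(stump_outputs):
    """stump_outputs: one vote per stump, +1 for default and -1 for no_default."""
    f = float(np.dot(alphas, stump_outputs))
    return f, int(np.sign(f))


# Hypothetical case: stump 1 says default, stump 2 says no_default.
f, y_hat = boosted_predict(np.array([+1, -1]))
print(f"f(x) = {f:.3f}, prediction = {y_hat:+d}")
```

When the stumps disagree, the one with the larger α decides the outcome, which is the practical meaning of "trusted heavily."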

[Figure: boosting pipeline. Data (equal weights, w=0.125) → Stump 1 (α₁=0.973, misses sample 3) → Update (w₃=0.501, sample 3 upweighted) → Stump 2 (α₂=0.641, fixes sample 3) → Weighted Sum → ŷ]

Bagging vs Boosting — When Each Wins

Aspect	Bagging	Boosting
Model combination	Parallel (independent)	Sequential (dependent)
Error targeted	High VARIANCE	High BIAS
Base learner	Strong (deep tree)	Weak (stump)
Effect on bias	No change	Reduces bias
Effect on variance	Reduces variance	May increase variance
Risk of overfitting	Low	Higher (later stages overfit noise)
Sensitive to outliers	No (averaging dilutes)	Yes (outliers get high weight)
Examples	Random Forest	AdaBoost, Gradient Boosting, XGBoost

Ensemble Vocabulary

Term	Definition
Ensemble	Combining multiple models for better performance
Bagging	Bootstrap + AGGregation of parallel models
Boosting	Sequential correction of residuals or sample weights
Bootstrap	Sample n items with replacement from n-item dataset
OOB	~36.8% of samples not in each bootstrap — free validation
Weak learner	Model with accuracy just above chance (depth-1 stump)
Strong learner	High-accuracy combination of many weak learners

Test Your Understanding

The 63.2% unique sample rate for bootstrap is an asymptotic result: as $n \to \infty$ , the probability that a specific sample is NOT drawn at least once is $(1 - 1/ n)^{n} \to 1/ e \approx 0.368$ . For n=8 (our dataset), the exact probability that sample 0 is OOB is $(7/8)^{8}$ . Compute this. Is it close to 36.8%?
In Step 3 of Boosting, we said sample 3 gets weight 0.501 after normalization. The 7 correct samples each get weight 0.071. Verify: $7 \times 0.071 + 1 \times 0.501 = ?$ . Does it sum to 1? If there's a small discrepancy, where does it come from?
Bagging reduces variance but not bias. If the base learner is a depth-1 stump (already high bias), does bagging 100 stumps reduce the bias? Why or why not — and which method would you use instead?
Boosting can overfit: noisy samples keep getting misclassified and upweighted, so later stumps start memorizing them. What hyperparameter controls this in practice? What happens to training accuracy vs test accuracy as this hyperparameter increases?
The ensemble formula is $f(x) = \sum_{t} α_{t} h_{t}(x)$. Stump 1 has $α_{1} = 0.973$, stump 2 has $α_{2} = 0.641$. For a new sample where both stumps agree (both predict default=1), what is $f(x)$? What would $f(x)$ need to exceed for $\operatorname{sign}(f(x))$ to predict default?