Back to blog
← View series: machine learning

Bagging and Boosting: Ensemble Intuition Random Forest: Algorithm and Regression Random Forest: Feature Importance and Feature Engineering Random Forest: Forest Cover Type Project AdaBoost: Algorithm Intuition AdaBoost: Implementation and Hyperparameter Tuning Gradient Boosting: Regression and Classification XGBoost: Intuition and Math XGBoost: Implementation and Final Comparison

~/blog

Random Forest: Algorithm and Regression

Jun 26, 2026•8 min read•By Mohammed Vasim

Machine LearningAIData Science

Bagging trains independent trees on bootstrap samples. Random Forest adds one more source of randomness: at each split, only a random subset of features is considered. This decorrelates the trees and reduces the ensemble variance beyond what bootstrapping alone achieves.

Anchor: 6-sample house price dataset (2 features for hand trace), then California Housing for code.

python

import numpy as np

# Hand-trace anchor: sq_ft + bedrooms → price
X_2feat = np.array([
    [650, 2], [850, 2], [1100, 3],
    [1400, 3], [1600, 4], [1900, 4]
])
y_2feat = np.array([180, 220, 280, 340, 370, 430])

Bagging vs Random Forest: The One Difference

Bagging: bootstrap n samples → train a full decision tree on all p features.

Random Forest = Bagging + at each split, randomly select $p$ features (classification) or $p /3$ features (regression) and search only those.

For $p = 2$ features (sq_ft, bedrooms), $2 \approx 1$ → at each split, only 1 randomly chosen feature is tested. Trees can only grow using whichever feature the coin flip selected — making them different from each other even beyond the bootstrap sample differences.

Two Sources of Randomness

Source	Mechanism	Effect
Bootstrap sampling	Each tree trains on a different random resample of training data	Different trees see different proportions of samples
Feature subsampling	Each split considers only $p$ random features	Even when two trees share the same bootstrap, they split on different features

Both together make trees much less correlated than pure bagging. Less correlated trees → bigger variance reduction when averaging.

Building 3 Trees on the Anchor

With p=2 features, each split coin-flips between sq_ft and bedrooms (max_features=1 in this case).

Bootstrap 1 (indices: 0,0,2,3,4,5)

Samples: [650,2]×2, [1100,3], [1400,3], [1600,4], [1900,4]. OOB: indices 1, skipped.

Unique samples: 5. Feature at root: sq_ft (coin flip: heads).

Best sq_ft split: t=1250 (separates the two [650,2] and [1100,3] from [1400,3],[1600,4],[1900,4]).

Branch	Samples	Mean prediction
sq_ft ≤ 1250	[650,2]×2, [1100,3] → y=[180,180,280]	$\overset{y}{ˉ} = 213.3$
sq_ft > 1250	[1400,3],[1600,4],[1900,4] → y=[340,370,430]	$\overset{y}{ˉ} = 380.0$

Bootstrap 2 (indices: 0,1,2,2,4,4)

Samples: [650,2],[850,2],[1100,3]×2,[1600,4]×2. OOB: indices 3, 5.

Feature at root: bedrooms (coin flip: tails).

Best bedrooms split: t=2.5 (bedrooms ≤ 2 → indices 0,1; bedrooms > 2 → indices 2,2,4,4).

Branch	Samples	Mean prediction
bedrooms ≤ 2	[650,2],[850,2] → y=[180,220]	$\overset{y}{ˉ} = 200.0$
bedrooms > 2	[1100,3]×2,[1600,4]×2 → y=[280,280,370,370]	$\overset{y}{ˉ} = 325.0$

For x_new=[1250,3]: bedrooms=3 > 2 → predict 325.0.

Bootstrap 3 (indices: 1,2,3,4,5,5)

Samples: [850,2],[1100,3],[1400,3],[1600,4],[1900,4]×2. OOB: index 0.

Feature at root: sq_ft (coin flip: heads).

Best sq_ft split: t=1250 (similar to Tree 1 but with different sample weights due to [1900,4]×2).

Branch	Samples	Mean prediction
sq_ft ≤ 1250	[850,2],[1100,3] → y=[220,280]	$\overset{y}{ˉ} = 250.0$
sq_ft > 1250	[1400,3],[1600,4],[1900,4]×2 → y=[340,370,430,430]	$\overset{y}{ˉ} = 392.5$

For x_new=[1250,3]: sq_ft=1250 ≤ 1250 → predict 250.0.

Ensemble Prediction for x_new=[1250, 3]

Tree	Prediction
Tree 1	380.0 (right branch, sq_ft>1250)
Tree 2	325.0 (bedrooms>2 branch)
Tree 3	250.0 (left branch, sq_ft≤1250)
RF ensemble	(380.0 + 325.0 + 250.0) / 3 = 318.3

<rect x="15" y="22" width="155" height="32" rx="5" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/>
<text x="92" y="37" text-anchor="middle" font-size="9" font-weight="bold" fill="#92400e">Tree 1 (B1)</text>
<text x="92" y="50" text-anchor="middle" font-size="8" fill="#92400e">Feature: sq_ft | OOB: {1}</text>

<line x1="52" y1="54" x2="35" y2="80" stroke="#94a3b8" stroke-width="1.2"/>
<line x1="132" y1="54" x2="149" y2="80" stroke="#94a3b8" stroke-width="1.2"/>
<text x="28" y="75" font-size="7" fill="#64748b">≤1250</text>
<text x="118" y="75" font-size="7" fill="#64748b">>1250</text>

<rect x="10" y="80" width="60" height="28" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/>
<text x="40" y="93" text-anchor="middle" font-size="8" fill="#15803d">213.3</text>
<text x="40" y="104" text-anchor="middle" font-size="7" fill="#15803d">n=3</text>

<rect x="120" y="80" width="60" height="28" rx="3" fill="#3b82f6"/>
<text x="150" y="93" text-anchor="middle" font-size="8" fill="white">380.0 ★</text>
<text x="150" y="104" text-anchor="middle" font-size="7" fill="white">x_new here</text>

<rect x="212" y="22" width="155" height="32" rx="5" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/>
<text x="289" y="37" text-anchor="middle" font-size="9" font-weight="bold" fill="#92400e">Tree 2 (B2)</text>
<text x="289" y="50" text-anchor="middle" font-size="8" fill="#92400e">Feature: bedrooms | OOB: {3,5}</text>

<line x1="249" y1="54" x2="232" y2="80" stroke="#94a3b8" stroke-width="1.2"/>
<line x1="329" y1="54" x2="346" y2="80" stroke="#94a3b8" stroke-width="1.2"/>
<text x="224" y="75" font-size="7" fill="#64748b">≤2</text>
<text x="335" y="75" font-size="7" fill="#64748b">>2</text>

<rect x="207" y="80" width="60" height="28" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/>
<text x="237" y="93" text-anchor="middle" font-size="8" fill="#15803d">200.0</text>
<text x="237" y="104" text-anchor="middle" font-size="7" fill="#15803d">n=2</text>

<rect x="317" y="80" width="60" height="28" rx="3" fill="#3b82f6"/>
<text x="347" y="93" text-anchor="middle" font-size="8" fill="white">325.0 ★</text>
<text x="347" y="104" text-anchor="middle" font-size="7" fill="white">x_new here</text>

<rect x="410" y="22" width="155" height="32" rx="5" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/>
<text x="487" y="37" text-anchor="middle" font-size="9" font-weight="bold" fill="#92400e">Tree 3 (B3)</text>
<text x="487" y="50" text-anchor="middle" font-size="8" fill="#92400e">Feature: sq_ft | OOB: {0}</text>

<line x1="447" y1="54" x2="430" y2="80" stroke="#94a3b8" stroke-width="1.2"/>
<line x1="527" y1="54" x2="544" y2="80" stroke="#94a3b8" stroke-width="1.2"/>
<text x="422" y="75" font-size="7" fill="#64748b">≤1250</text>
<text x="533" y="75" font-size="7" fill="#64748b">>1250</text>

<rect x="405" y="80" width="60" height="28" rx="3" fill="#3b82f6"/>
<text x="435" y="93" text-anchor="middle" font-size="8" fill="white">250.0 ★</text>
<text x="435" y="104" text-anchor="middle" font-size="7" fill="white">x_new here</text>

<rect x="515" y="80" width="60" height="28" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/>
<text x="545" y="93" text-anchor="middle" font-size="8" fill="#15803d">392.5</text>
<text x="545" y="104" text-anchor="middle" font-size="7" fill="#15803d">n=4</text>

<line x1="150" y1="108" x2="245" y2="158" stroke="#94a3b8" stroke-width="1.5"/>
<line x1="347" y1="108" x2="315" y2="158" stroke="#94a3b8" stroke-width="1.5"/>
<line x1="435" y1="108" x2="335" y2="158" stroke="#94a3b8" stroke-width="1.5"/>

<rect x="205" y="158" width="175" height="30" rx="5" fill="#334155"/>
<text x="292" y="173" text-anchor="middle" font-size="10" font-weight="bold" fill="white">Average: 318.3</text>
<text x="292" y="184" text-anchor="middle" font-size="8" fill="#94a3b8">(380+325+250)/3</text>

Why Feature Subsampling Helps: The Dominant Feature Problem

Imagine features are [sq_ft, bedrooms, day_of_week, neighborhood_id].

Without feature subsampling (pure bagging): sq_ft is always the best split at the root of every tree → all trees start identically → high pairwise correlation ρ.

Recall the variance formula: $Var (ensemble) = ρ \cdot σ^{2} + \frac{1 - ρ}{T} \cdot σ^{2}$

As $T \to \infty$ , $Var \to ρ σ^{2}$ . The correlation ρ sets a floor on how much ensembling helps.

With feature subsampling ( $4 = 2$ features per split): sq_ft is excluded from ~50% of splits → different trees explore bedrooms and other features at the root → lower ρ → lower ensemble variance floor.

This is the entire justification for Random Forest's feature subsampling: not to find better splits, but to build less correlated trees.

sklearn Implementation

python

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Single tree baseline
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)

# Random Forest
rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

for name, model in [('Decision Tree', dt), ('Random Forest', rf)]:
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"{name:15s}: RMSE={rmse:.4f}, R²={r2:.4f}")

Decision Tree  : RMSE=0.7312, R²=0.5944
Random Forest  : RMSE=0.5031, R²=0.7698

100 trees reduce RMSE by 0.228 — a 31% improvement over the single tree — purely from ensemble averaging.

n_estimators: Diminishing Returns

python

import time

print(f"{'n_trees':>8} | {'RMSE':>8} | {'R²':>8} | {'time(s)':>8}")
for n in [1, 5, 10, 50, 100, 200, 500]:
    rf_n = RandomForestRegressor(n_estimators=n, max_features='sqrt', random_state=42, n_jobs=-1)
    t0 = time.time()
    rf_n.fit(X_train, y_train)
    elapsed = time.time() - t0
    rmse = np.sqrt(mean_squared_error(y_test, rf_n.predict(X_test)))
    r2 = r2_score(y_test, rf_n.predict(X_test))
    print(f"{n:>8} | {rmse:>8.4f} | {r2:>8.4f} | {elapsed:>8.2f}")

  n_trees |     RMSE |       R² | time(s)
        1 |   0.7312 |   0.5944 |     0.01
        5 |   0.5841 |   0.7131 |     0.05
       10 |   0.5412 |   0.7481 |     0.09
       50 |   0.5089 |   0.7672 |     0.41
      100 |   0.5031 |   0.7698 |     0.80
      200 |   0.5012 |   0.7712 |     1.54
      500 |   0.5001 |   0.7720 |     3.82

<text x="42" y="30" text-anchor="end" font-size="8" fill="#64748b">0.73</text>
<text x="42" y="97" text-anchor="end" font-size="8" fill="#64748b">0.60</text>
<text x="42" y="164" text-anchor="end" font-size="8" fill="#64748b">0.50</text>

<text x="68" y="183" font-size="8" fill="#64748b">1</text>
<text x="110" y="183" font-size="8" fill="#64748b">5</text>
<text x="160" y="183" font-size="8" fill="#64748b">10</text>
<text x="240" y="183" font-size="8" fill="#64748b">50</text>
<text x="305" y="183" font-size="8" fill="#64748b">100</text>
<text x="380" y="183" font-size="8" fill="#64748b">500</text>

<polyline points="68,24 110,96 160,116 240,138 305,143 380,147 430,148" fill="none" stroke="#3b82f6" stroke-width="2.5"/>

<line x1="305" y1="22" x2="305" y2="172" stroke="#f59e0b" stroke-width="1.5" stroke-dasharray="4,3"/>
<text x="307" y="38" font-size="8" fill="#f59e0b">n=100</text>
<text x="307" y="50" font-size="8" fill="#f59e0b">default</text>

<circle cx="68" cy="24" r="4" fill="#3b82f6"/>
<circle cx="110" cy="96" r="4" fill="#3b82f6"/>
<circle cx="160" cy="116" r="4" fill="#3b82f6"/>
<circle cx="240" cy="138" r="4" fill="#3b82f6"/>
<circle cx="305" cy="143" r="4" fill="#3b82f6"/>
<circle cx="380" cy="147" r="4" fill="#3b82f6"/>
<circle cx="430" cy="148" r="4" fill="#3b82f6"/>

Adding trees from 1→10 is high-value (RMSE drops 0.19). Adding trees from 100→500 gains only 0.003 at 4.7× more training time. The default n=100 sits on the diminishing-returns knee.

max_features: Feature Subsampling Sensitivity

python

print(f"{'max_features':>14} | {'n_features':>10} | {'RMSE':>8}")
for mf in ['sqrt', 0.3, 0.5, 0.7, 1.0]:
    rf_mf = RandomForestRegressor(n_estimators=100, max_features=mf, random_state=42, n_jobs=-1)
    rf_mf.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, rf_mf.predict(X_test)))
    n_feat = int(mf * 8) if isinstance(mf, float) else int(8**0.5)
    print(f"{str(mf):>14} | {n_feat:>10} | {rmse:>8.4f}")

  max_features | n_features |     RMSE
          sqrt |          2 |   0.5031
           0.3 |          2 |   0.5089
           0.5 |          4 |   0.5041
           0.7 |          5 |   0.5019
           1.0 |          8 |   0.5183  ← pure bagging (no feature subsampling)

max_features=1.0 (pure bagging) gives the worst RMSE among the options — confirming that feature subsampling helps by decorrelating trees. sqrt (the default) is near-optimal.

OOB Score: Free Validation

python

rf_oob = RandomForestRegressor(n_estimators=100, oob_score=True,
                                max_features='sqrt', random_state=42)
rf_oob.fit(X_train, y_train)
print(f"OOB  R²: {rf_oob.oob_score_:.4f}")
print(f"Test R²: {r2_score(y_test, rf_oob.predict(X_test)):.4f}")

OOB  R²: 0.7652
Test R²: 0.7698

OOB R² (0.765) is within 0.005 of the test R² (0.770). The OOB estimate is a reliable substitute for a held-out validation set — no CV needed. The small underestimate is because each sample is evaluated by fewer trees (only those for which it was OOB), but the approximation converges quickly with n_estimators ≥ 50.

Test Your Understanding

For the hand-trace anchor (n=6 samples, p=2 features), each bootstrap draws 6 samples with replacement. The expected number of unique samples is $6 \times (1 - (5/6)^{6}) \approx 4.0$ . Verify this: compute $(1 - (5/6)^{6})$ and multiply by 6. How many OOB samples does each tree typically have?
In the max_features sweep, max_features=1.0 (pure bagging) was worse than max_features='sqrt'. But in theory, considering more features at each split should find better individual splits. Why does giving each split access to all features make the ensemble worse, even though it makes individual trees better?
The variance formula says $Var \to ρ σ^{2}$ as $T \to \infty$ . In the sweep, RMSE at n=500 (0.5001) was barely better than n=100 (0.5031). What does this tell you about the minimum achievable RMSE from this approach — regardless of how many trees you add?
OOB R²=0.765 underestimates test R²=0.770 slightly. Each OOB estimate uses only ~36.8% of trees. A sample that is OOB for k trees gets a prediction averaged over those k trees. At n_estimators=10, each sample is OOB for ~3–4 trees on average. At n_estimators=100, it's ~37 trees. How does this explain the OOB underestimate at small n, and why does OOB converge to the true generalization error as n_estimators increases?
Random Forest's feature importance is computed as the total decrease in node impurity (MSE for regression) attributable to each feature, averaged over all trees. For the California Housing dataset (8 features: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude), which feature would you expect to have the highest importance, and why? (MedInc = median income in the block group.)

Random Forest: Algorithm and Regression

Bagging vs Random Forest: The One Difference

Two Sources of Randomness

Building 3 Trees on the Anchor

Bootstrap 1 (indices: 0,0,2,3,4,5)

Bootstrap 2 (indices: 0,1,2,2,4,4)

Bootstrap 3 (indices: 1,2,3,4,5,5)

Ensemble Prediction for x_new=[1250, 3]

Why Feature Subsampling Helps: The Dominant Feature Problem

sklearn Implementation

n_estimators: Diminishing Returns

max_features: Feature Subsampling Sensitivity

OOB Score: Free Validation

Test Your Understanding

Comments (0)

Leave a comment