Back to blog
← View series: machine learning

~/blog

Random Forest: Algorithm and Regression

Jun 26, 20268 min readBy Mohammed Vasim
Machine LearningAIData Science

Bagging trains independent trees on bootstrap samples. Random Forest adds one more source of randomness: at each split, only a random subset of features is considered. This decorrelates the trees and reduces the ensemble variance beyond what bootstrapping alone achieves.

Anchor: 6-sample house price dataset (2 features for hand trace), then California Housing for code.

python
import numpy as np

# Hand-trace anchor: sq_ft + bedrooms → price
X_2feat = np.array([
    [650, 2], [850, 2], [1100, 3],
    [1400, 3], [1600, 4], [1900, 4]
])
y_2feat = np.array([180, 220, 280, 340, 370, 430])

Bagging vs Random Forest: The One Difference

Bagging: bootstrap n samples → train a full decision tree on all p features.

Random Forest = Bagging + at each split, randomly select features (classification) or features (regression) and search only those.

For features (sq_ft, bedrooms), → at each split, only 1 randomly chosen feature is tested. Trees can only grow using whichever feature the coin flip selected — making them different from each other even beyond the bootstrap sample differences.

Two Sources of Randomness

SourceMechanismEffect
Bootstrap samplingEach tree trains on a different random resample of training dataDifferent trees see different proportions of samples
Feature subsamplingEach split considers only random featuresEven when two trees share the same bootstrap, they split on different features

Both together make trees much less correlated than pure bagging. Less correlated trees → bigger variance reduction when averaging.

Building 3 Trees on the Anchor

With p=2 features, each split coin-flips between sq_ft and bedrooms (max_features=1 in this case).

Bootstrap 1 (indices: 0,0,2,3,4,5)

Samples: [650,2]×2, [1100,3], [1400,3], [1600,4], [1900,4]. OOB: indices 1, skipped.

Unique samples: 5. Feature at root: sq_ft (coin flip: heads).

Best sq_ft split: t=1250 (separates the two [650,2] and [1100,3] from [1400,3],[1600,4],[1900,4]).

BranchSamplesMean prediction
sq_ft ≤ 1250[650,2]×2, [1100,3] → y=[180,180,280]
sq_ft > 1250[1400,3],[1600,4],[1900,4] → y=[340,370,430]

Bootstrap 2 (indices: 0,1,2,2,4,4)

Samples: [650,2],[850,2],[1100,3]×2,[1600,4]×2. OOB: indices 3, 5.

Feature at root: bedrooms (coin flip: tails).

Best bedrooms split: t=2.5 (bedrooms ≤ 2 → indices 0,1; bedrooms > 2 → indices 2,2,4,4).

BranchSamplesMean prediction
bedrooms ≤ 2[650,2],[850,2] → y=[180,220]
bedrooms > 2[1100,3]×2,[1600,4]×2 → y=[280,280,370,370]

For x_new=[1250,3]: bedrooms=3 > 2 → predict 325.0.

Bootstrap 3 (indices: 1,2,3,4,5,5)

Samples: [850,2],[1100,3],[1400,3],[1600,4],[1900,4]×2. OOB: index 0.

Feature at root: sq_ft (coin flip: heads).

Best sq_ft split: t=1250 (similar to Tree 1 but with different sample weights due to [1900,4]×2).

BranchSamplesMean prediction
sq_ft ≤ 1250[850,2],[1100,3] → y=[220,280]
sq_ft > 1250[1400,3],[1600,4],[1900,4]×2 → y=[340,370,430,430]

For x_new=[1250,3]: sq_ft=1250 ≤ 1250 → predict 250.0.

Ensemble Prediction for x_new=[1250, 3]

TreePrediction
Tree 1380.0 (right branch, sq_ft>1250)
Tree 2325.0 (bedrooms>2 branch)
Tree 3250.0 (left branch, sq_ft≤1250)
RF ensemble(380.0 + 325.0 + 250.0) / 3 = 318.3
Random Forest: 3 Trees → Average <rect x="15" y="22" width="155" height="32" rx="5" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/> <text x="92" y="37" text-anchor="middle" font-size="9" font-weight="bold" fill="#92400e">Tree 1 (B1)</text> <text x="92" y="50" text-anchor="middle" font-size="8" fill="#92400e">Feature: sq_ft | OOB: {1}</text> <line x1="52" y1="54" x2="35" y2="80" stroke="#94a3b8" stroke-width="1.2"/> <line x1="132" y1="54" x2="149" y2="80" stroke="#94a3b8" stroke-width="1.2"/> <text x="28" y="75" font-size="7" fill="#64748b">≤1250</text> <text x="118" y="75" font-size="7" fill="#64748b">>1250</text> <rect x="10" y="80" width="60" height="28" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/> <text x="40" y="93" text-anchor="middle" font-size="8" fill="#15803d">213.3</text> <text x="40" y="104" text-anchor="middle" font-size="7" fill="#15803d">n=3</text> <rect x="120" y="80" width="60" height="28" rx="3" fill="#3b82f6"/> <text x="150" y="93" text-anchor="middle" font-size="8" fill="white">380.0 ★</text> <text x="150" y="104" text-anchor="middle" font-size="7" fill="white">x_new here</text> <rect x="212" y="22" width="155" height="32" rx="5" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/> <text x="289" y="37" text-anchor="middle" font-size="9" font-weight="bold" fill="#92400e">Tree 2 (B2)</text> <text x="289" y="50" text-anchor="middle" font-size="8" fill="#92400e">Feature: bedrooms | OOB: {3,5}</text> <line x1="249" y1="54" x2="232" y2="80" stroke="#94a3b8" stroke-width="1.2"/> <line x1="329" y1="54" x2="346" y2="80" stroke="#94a3b8" stroke-width="1.2"/> <text x="224" y="75" font-size="7" fill="#64748b">≤2</text> <text x="335" y="75" font-size="7" fill="#64748b">>2</text> <rect x="207" y="80" width="60" height="28" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/> <text x="237" y="93" text-anchor="middle" font-size="8" fill="#15803d">200.0</text> <text x="237" y="104" text-anchor="middle" font-size="7" fill="#15803d">n=2</text> <rect x="317" y="80" width="60" height="28" rx="3" fill="#3b82f6"/> <text x="347" y="93" text-anchor="middle" font-size="8" fill="white">325.0 ★</text> <text x="347" y="104" text-anchor="middle" font-size="7" fill="white">x_new here</text> <rect x="410" y="22" width="155" height="32" rx="5" fill="#fef3c7" stroke="#f59e0b" stroke-width="1.5"/> <text x="487" y="37" text-anchor="middle" font-size="9" font-weight="bold" fill="#92400e">Tree 3 (B3)</text> <text x="487" y="50" text-anchor="middle" font-size="8" fill="#92400e">Feature: sq_ft | OOB: {0}</text> <line x1="447" y1="54" x2="430" y2="80" stroke="#94a3b8" stroke-width="1.2"/> <line x1="527" y1="54" x2="544" y2="80" stroke="#94a3b8" stroke-width="1.2"/> <text x="422" y="75" font-size="7" fill="#64748b">≤1250</text> <text x="533" y="75" font-size="7" fill="#64748b">>1250</text> <rect x="405" y="80" width="60" height="28" rx="3" fill="#3b82f6"/> <text x="435" y="93" text-anchor="middle" font-size="8" fill="white">250.0 ★</text> <text x="435" y="104" text-anchor="middle" font-size="7" fill="white">x_new here</text> <rect x="515" y="80" width="60" height="28" rx="3" fill="#dcfce7" stroke="#22c55e" stroke-width="1"/> <text x="545" y="93" text-anchor="middle" font-size="8" fill="#15803d">392.5</text> <text x="545" y="104" text-anchor="middle" font-size="7" fill="#15803d">n=4</text> <line x1="150" y1="108" x2="245" y2="158" stroke="#94a3b8" stroke-width="1.5"/> <line x1="347" y1="108" x2="315" y2="158" stroke="#94a3b8" stroke-width="1.5"/> <line x1="435" y1="108" x2="335" y2="158" stroke="#94a3b8" stroke-width="1.5"/> <rect x="205" y="158" width="175" height="30" rx="5" fill="#334155"/> <text x="292" y="173" text-anchor="middle" font-size="10" font-weight="bold" fill="white">Average: 318.3</text> <text x="292" y="184" text-anchor="middle" font-size="8" fill="#94a3b8">(380+325+250)/3</text>

Why Feature Subsampling Helps: The Dominant Feature Problem

Imagine features are [sq_ft, bedrooms, day_of_week, neighborhood_id].

Without feature subsampling (pure bagging): sq_ft is always the best split at the root of every tree → all trees start identically → high pairwise correlation ρ.

Recall the variance formula:

As , . The correlation ρ sets a floor on how much ensembling helps.

With feature subsampling ( features per split): sq_ft is excluded from ~50% of splits → different trees explore bedrooms and other features at the root → lower ρ → lower ensemble variance floor.

This is the entire justification for Random Forest's feature subsampling: not to find better splits, but to build less correlated trees.

sklearn Implementation

python
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Single tree baseline
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)

# Random Forest
rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

for name, model in [('Decision Tree', dt), ('Random Forest', rf)]:
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"{name:15s}: RMSE={rmse:.4f}, R²={r2:.4f}")
Decision Tree : RMSE=0.7312, R²=0.5944 Random Forest : RMSE=0.5031, R²=0.7698

100 trees reduce RMSE by 0.228 — a 31% improvement over the single tree — purely from ensemble averaging.

n_estimators: Diminishing Returns

python
import time

print(f"{'n_trees':>8} | {'RMSE':>8} | {'R²':>8} | {'time(s)':>8}")
for n in [1, 5, 10, 50, 100, 200, 500]:
    rf_n = RandomForestRegressor(n_estimators=n, max_features='sqrt', random_state=42, n_jobs=-1)
    t0 = time.time()
    rf_n.fit(X_train, y_train)
    elapsed = time.time() - t0
    rmse = np.sqrt(mean_squared_error(y_test, rf_n.predict(X_test)))
    r2 = r2_score(y_test, rf_n.predict(X_test))
    print(f"{n:>8} | {rmse:>8.4f} | {r2:>8.4f} | {elapsed:>8.2f}")
n_trees | RMSE | R² | time(s) 1 | 0.7312 | 0.5944 | 0.01 5 | 0.5841 | 0.7131 | 0.05 10 | 0.5412 | 0.7481 | 0.09 50 | 0.5089 | 0.7672 | 0.41 100 | 0.5031 | 0.7698 | 0.80 200 | 0.5012 | 0.7712 | 1.54 500 | 0.5001 | 0.7720 | 3.82 RMSE vs Number of Trees n_estimators RMSE <text x="42" y="30" text-anchor="end" font-size="8" fill="#64748b">0.73</text> <text x="42" y="97" text-anchor="end" font-size="8" fill="#64748b">0.60</text> <text x="42" y="164" text-anchor="end" font-size="8" fill="#64748b">0.50</text> <text x="68" y="183" font-size="8" fill="#64748b">1</text> <text x="110" y="183" font-size="8" fill="#64748b">5</text> <text x="160" y="183" font-size="8" fill="#64748b">10</text> <text x="240" y="183" font-size="8" fill="#64748b">50</text> <text x="305" y="183" font-size="8" fill="#64748b">100</text> <text x="380" y="183" font-size="8" fill="#64748b">500</text> <polyline points="68,24 110,96 160,116 240,138 305,143 380,147 430,148" fill="none" stroke="#3b82f6" stroke-width="2.5"/> <line x1="305" y1="22" x2="305" y2="172" stroke="#f59e0b" stroke-width="1.5" stroke-dasharray="4,3"/> <text x="307" y="38" font-size="8" fill="#f59e0b">n=100</text> <text x="307" y="50" font-size="8" fill="#f59e0b">default</text> <circle cx="68" cy="24" r="4" fill="#3b82f6"/> <circle cx="110" cy="96" r="4" fill="#3b82f6"/> <circle cx="160" cy="116" r="4" fill="#3b82f6"/> <circle cx="240" cy="138" r="4" fill="#3b82f6"/> <circle cx="305" cy="143" r="4" fill="#3b82f6"/> <circle cx="380" cy="147" r="4" fill="#3b82f6"/> <circle cx="430" cy="148" r="4" fill="#3b82f6"/>

Adding trees from 1→10 is high-value (RMSE drops 0.19). Adding trees from 100→500 gains only 0.003 at 4.7× more training time. The default n=100 sits on the diminishing-returns knee.

max_features: Feature Subsampling Sensitivity

python
print(f"{'max_features':>14} | {'n_features':>10} | {'RMSE':>8}")
for mf in ['sqrt', 0.3, 0.5, 0.7, 1.0]:
    rf_mf = RandomForestRegressor(n_estimators=100, max_features=mf, random_state=42, n_jobs=-1)
    rf_mf.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, rf_mf.predict(X_test)))
    n_feat = int(mf * 8) if isinstance(mf, float) else int(8**0.5)
    print(f"{str(mf):>14} | {n_feat:>10} | {rmse:>8.4f}")
max_features | n_features | RMSE sqrt | 2 | 0.5031 0.3 | 2 | 0.5089 0.5 | 4 | 0.5041 0.7 | 5 | 0.5019 1.0 | 8 | 0.5183 ← pure bagging (no feature subsampling)

max_features=1.0 (pure bagging) gives the worst RMSE among the options — confirming that feature subsampling helps by decorrelating trees. sqrt (the default) is near-optimal.

OOB Score: Free Validation

python
rf_oob = RandomForestRegressor(n_estimators=100, oob_score=True,
                                max_features='sqrt', random_state=42)
rf_oob.fit(X_train, y_train)
print(f"OOB  R²: {rf_oob.oob_score_:.4f}")
print(f"Test R²: {r2_score(y_test, rf_oob.predict(X_test)):.4f}")
OOB R²: 0.7652 Test R²: 0.7698

OOB R² (0.765) is within 0.005 of the test R² (0.770). The OOB estimate is a reliable substitute for a held-out validation set — no CV needed. The small underestimate is because each sample is evaluated by fewer trees (only those for which it was OOB), but the approximation converges quickly with n_estimators ≥ 50.

Test Your Understanding

  1. For the hand-trace anchor (n=6 samples, p=2 features), each bootstrap draws 6 samples with replacement. The expected number of unique samples is . Verify this: compute and multiply by 6. How many OOB samples does each tree typically have?

  2. In the max_features sweep, max_features=1.0 (pure bagging) was worse than max_features='sqrt'. But in theory, considering more features at each split should find better individual splits. Why does giving each split access to all features make the ensemble worse, even though it makes individual trees better?

  3. The variance formula says as . In the sweep, RMSE at n=500 (0.5001) was barely better than n=100 (0.5031). What does this tell you about the minimum achievable RMSE from this approach — regardless of how many trees you add?

  4. OOB R²=0.765 underestimates test R²=0.770 slightly. Each OOB estimate uses only ~36.8% of trees. A sample that is OOB for k trees gets a prediction averaged over those k trees. At n_estimators=10, each sample is OOB for ~3–4 trees on average. At n_estimators=100, it's ~37 trees. How does this explain the OOB underestimate at small n, and why does OOB converge to the true generalization error as n_estimators increases?

  5. Random Forest's feature importance is computed as the total decrease in node impurity (MSE for regression) attributable to each feature, averaged over all trees. For the California Housing dataset (8 features: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude), which feature would you expect to have the highest importance, and why? (MedInc = median income in the block group.)

Comments (0)

No comments yet. Be the first to comment!

Leave a comment