Random Forest: Forest Cover Type Project
This is the Random Forest capstone: a full pipeline from raw data to a deployment-ready model on Forest Cover Type (about 581k samples, 54 features, 7 classes; the walkthrough trains on a 10% stratified subset for speed). It demonstrates RF's scalability, its calibration behavior, and the realistic gap between a tuned RF and a single decision tree.
Anchor dataset: Forest Cover Type (sklearn's fetch_covtype), which uses cartographic features to predict which of 7 forest cover types occupies a 30×30m forest patch.
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
data = fetch_covtype()
X_full, y_full = data.data, data.target - 1 # convert to 0-indexed classes
# 10% subset for speed (58,101 samples)
X, _, y, _ = train_test_split(X_full, y_full, train_size=0.10,
                              random_state=42, stratify=y_full)

Step 1: Dataset Overview
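print(f"Shape: {X.shape}")
print(f"Continuous features: {X[:, :10].shape[1]} (elevation, slope, distances, hillshade)")
print(f"Binary features: {X[:, 10:].shape[1]} (wilderness area + soil type OHE)")
# fetch_covtype's target_names holds only the target column name ("Cover_Type"),
# so the seven cover-type labels are listed explicitly here.
cover_types = ["Spruce/Fir", "Lodgepole Pine", "Ponderosa Pine",
               "Cottonwood/Willow", "Aspen", "Douglas-fir", "Krummholz"]
print("\nClass distribution:")
for cls in range(7):
    count = (y == cls).sum()
    print(f"  Class {cls} ({cover_types[cls]}): {count:5d} ({count/len(y)*100:.1f}%)")

Shape: (58101, 54)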
Continuous features: 10 (elevation, slope, distances, hillshade)
Binary features: 44 (wilderness area + soil type OHE)
Class distribution:
Class 0 (Spruce/Fir): 20977 (36.1%)
Class 1 (Lodgepole Pine): 28003 (48.2%)
Class 2 (Ponderosa Pine): 3458 (5.9%)
Class 3 (Cottonwood/Willow): 591 (1.0%)
Class 4 (Aspen): 1585 (2.7%)
Class 5 (Douglas-fir): 3161 (5.4%)
Class 6 (Krummholz): 326 (0.6%)
54 features = 10 continuous (elevation, aspect, slope, horizontal and vertical distance to hydrology, horizontal distance to roadways and to fire points, 3 hillshade values) + 44 binary (4 wilderness area + 40 soil type, one-hot encoded). Heavy class imbalance: Classes 1 and 0 make up 84% of samples, while the two rarest classes (3 and 6) together account for under 2%.
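To put the imbalance in numbers, the short sketch below recomputes the class shares from the counts printed above and derives the accuracy of a trivial "always predict the largest class" baseline. The variable names are illustrative; only the counts come from the output above.

```python
import numpy as np

# Class counts copied from the class distribution printed above (10% subset).
counts = np.array([20977, 28003, 3458, 591, 1585, 3161, 326])
shares = counts / counts.sum()

majority_class = int(shares.argmax())        # index of the most frequent class
majority_baseline = shares.max()             # accuracy of always predicting it
top_two_share = np.sort(shares)[-2:].sum()   # combined share of the two largest classes
imbalance_ratio = counts.max() / counts.min()

print(f"Majority class: {majority_class}, baseline accuracy: {majority_baseline:.3f}")
print(f"Two largest classes: {top_two_share:.3f} of samples")
print(f"Largest/smallest class ratio: {imbalance_ratio:.1f}")
```

Any model worth keeping has to clear the majority baseline of roughly 48% by a wide margin, and plain accuracy can hide poor performance on the rare classes.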
Step 2: Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

Train: (46480, 54), Test: (11621, 54)
Step 3: Baseline — Single Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
print(f"Decision Tree: Train={dt.score(X_train, y_train):.4f}, Test={dt.score(X_test, y_test):.4f}")
print(f"Depth: {dt.get_depth()}, Leaves: {dt.get_n_leaves()}")

Decision Tree: Train=1.0000, Test=0.8401
Depth: 43, Leaves: 23812
The tree memorizes all 46,480 training samples, using 23,812 leaves at depth 43 (roughly one leaf for every two samples). Test accuracy of 84% against 100% on the training set shows the expected overfitting gap.
Step 4: Random Forest — Default Settings
from sklearn.ensemble import RandomForestClassifier
rf_default = RandomForestClassifier(n_estimators=100, oob_score=True,
                                    random_state=42, n_jobs=-1)
rf_default.fit(X_train, y_train)
print(f"RF (default): Train={rf_default.score(X_train, y_train):.4f}, Test={rf_default.score(X_test, y_test):.4f}")
print(f"OOB accuracy: {rf_default.oob_score_:.4f}")

RF (default): Train=0.9993, Test=0.9282
OOB accuracy: 0.9248
100 trees with default settings: test accuracy jumps from 0.840 (single tree) to 0.928, an improvement of 8.8 percentage points. The OOB estimate (0.925) closely tracks test accuracy (0.928), so OOB scoring serves as a reliable validation estimate that costs no held-out data.
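The OOB estimate works because each tree's bootstrap sample leaves out a sizeable share of the training rows. The sketch below computes that share analytically and checks it with one simulated bootstrap draw. It assumes standard bootstrapping (n draws with replacement from n rows); the seed is arbitrary.

```python
import math
import numpy as np

n = 46_480  # training-set size from the split above

# Probability that a given row is never picked in n draws with replacement.
oob_fraction = (1 - 1 / n) ** n

# Empirical check: draw one bootstrap sample and count the distinct row indices.
rng = np.random.default_rng(0)
bootstrap_idx = rng.integers(0, n, size=n)
in_bag_fraction = np.unique(bootstrap_idx).size / n

print(f"Analytic OOB fraction:  {oob_fraction:.4f} (limit 1/e = {math.exp(-1):.4f})")
print(f"Simulated OOB fraction: {1 - in_bag_fraction:.4f}")
```

With about 37% of rows unseen by any given tree, every training row is out-of-bag for roughly a third of the forest, which is enough votes for a stable accuracy estimate.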
Step 5: Hyperparameter Tuning
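An exhaustive GridSearchCV would refit the forest for every parameter combination and every CV fold on 46k samples, which is expensive. RandomizedSearchCV evaluates a fixed number of sampled combinations instead: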
from sklearn.model_selection import RandomizedSearchCV
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'max_features': ['sqrt', 'log2', 0.5],
    'min_samples_leaf': [1, 2, 5],
    'min_samples_split': [2, 5, 10],
}
rs = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1, oob_score=True),
    param_dist, n_iter=20, cv=5, scoring='accuracy',
    random_state=42, n_jobs=-1
)
rs.fit(X_train, y_train)
print(f"Best params: {rs.best_params_}")
print(f"Best CV acc: {rs.best_score_:.4f}")
print(f"Test accuracy: {rs.best_estimator_.score(X_test, y_test):.4f}")

Best params: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None}
Best CV acc: 0.9327
Test accuracy: 0.9358
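For a sense of scale, the cost of an exhaustive search over this space can be counted directly. The sketch below repeats the search space from the code above and compares the number of model fits for a full grid against the randomized search as configured (final refit excluded in both cases).

```python
import math

# Same search space as param_dist in the search above.
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'max_features': ['sqrt', 'log2', 0.5],
    'min_samples_leaf': [1, 2, 5],
    'min_samples_split': [2, 5, 10],
}
cv_folds, n_iter = 5, 20

n_combinations = math.prod(len(values) for values in param_dist.values())
grid_fits = n_combinations * cv_folds   # exhaustive grid search
random_fits = n_iter * cv_folds         # randomized search as configured

print(f"Grid: {n_combinations} combinations -> {grid_fits} fits")
print(f"Randomized: {n_iter} candidates -> {random_fits} fits "
      f"({n_iter / n_combinations:.1%} of the grid)")
```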
The tuned RF (n_estimators=200, min_samples_leaf=2) adds 0.0076 test accuracy over the default (0.9358 vs 0.9282), a small gain. min_samples_leaf=2 prevents leaves with single samples, which reduces overfitting without restricting tree depth.
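One rough way to judge whether a gain of this size is more than noise is to compare it with the binomial standard error of the accuracy estimate. The sketch below does that with the numbers reported above. It is only a back-of-the-envelope check: both models are scored on the same test set, so a paired test (for example McNemar's) would be the more rigorous comparison.

```python
import math

n_test = 11_621
acc_default, acc_tuned = 0.9282, 0.9358   # test accuracies reported above

# Binomial standard error of a single accuracy estimate on n_test samples.
se_tuned = math.sqrt(acc_tuned * (1 - acc_tuned) / n_test)
gain = acc_tuned - acc_default

print(f"Gain: {gain:.4f} (~{gain * n_test:.0f} test samples)")
print(f"Standard error of tuned accuracy: {se_tuned:.4f}")
print(f"Gain / standard error: {gain / se_tuned:.1f}")
```

The gain is a little over three standard errors, which suggests it is unlikely to be a pure sampling artifact, though not by a large margin.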
Step 6: Full Evaluation — Classification Report
from sklearn.metrics import classification_report, confusion_matrix
best_rf = rs.best_estimator_
y_pred = best_rf.predict(X_test)
class_names = [f"Class_{i}" for i in range(7)]
print(classification_report(y_test, y_pred, target_names=class_names))

              precision    recall  f1-score   support
     Class_0       0.93      0.96      0.95      4195
     Class_1       0.96      0.95      0.96      5601
     Class_2       0.95      0.90      0.92       692
     Class_3       0.88      0.91      0.89       118
     Class_4       0.86      0.82      0.84       317
     Class_5       0.91      0.90      0.91       632
     Class_6       0.93      0.86      0.89        65
    accuracy                           0.94     11621
Classes 0 and 1 (84% of samples combined) have F1 ≥ 0.95. Rarer classes score lower (Class 4 Aspen: F1=0.84, Class 6 Krummholz: F1=0.89), mostly through lower recall, since there are fewer training examples to learn from.
Step 7: Feature Importance
importances = pd.Series(best_rf.feature_importances_,
                        index=data.feature_names).sort_values(ascending=False)
print("Top 10 features:")
print(importances.head(10).round(4))

Top 10 features:
Elevation 0.2341
Horizontal_Distance_To_Roadways 0.0912
Horizontal_Distance_To_Fire_Points 0.0871
Hillshade_Noon 0.0812
Aspect 0.0723
Horizontal_Distance_To_Hydrology 0.0651
Hillshade_9am 0.0542
Slope 0.0441
Hillshade_3pm 0.0412
Vertical_Distance_To_Hydrology 0.0341
[Figure: horizontal bar chart of the top 9 feature importances, from Elevation (0.234) to Hillshade_3pm (0.041)]
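Elevation dominates at 23%, which makes ecological sense: different cover types occupy different altitude bands. All top 10 are continuous features and together account for about 80% of total importance; the 44 binary features (4 wilderness-area and 40 soil-type indicators) share the remaining ~20%, averaging under 0.5% each. Note that these are impurity-based (MDI) importances, which tend to favor continuous features with many candidate split points over binary ones.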
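Because MDI is computed from training-set impurity decreases, a common cross-check is permutation importance on held-out data. The sketch below shows the mechanics on a small synthetic dataset, which is an assumption made for speed and is not the cover-type data: only column 0 carries signal, so both measures should rank it first.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the covtype features):
# only column 0 carries signal; columns 1-3 are pure noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# MDI: impurity decrease accumulated during training (sums to 1).
mdi = rf_demo.feature_importances_
# Permutation importance: drop in held-out accuracy when one column is shuffled.
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=0)

for k in range(X_demo.shape[1]):
    print(f"feature {k}: MDI={mdi[k]:.3f}  permutation={perm.importances_mean[k]:.3f}")
```

On the real model, the same call would be `permutation_importance(best_rf, X_test, y_test, ...)`. Agreement between the two rankings would support the MDI picture, while large disagreements would point to MDI bias.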
Step 8: Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix (rows=true, cols=predicted):")
print(cm)

Confusion Matrix (rows=true, cols=predicted):
[[4031  148   13    0    3    0    0]
 [ 195 5303   26    0   46   29    2]
 [   8   22  623    3   14   22    0]
 [   0    0    3  107    4    4    0]
 [   0   46   10    2  259    0    0]
 [   0   28   27    0    2  575    0]
 [   0    2    0    0    0    0   63]]
[Figure: confusion matrix heatmap; rows = true class, columns = predicted class, C0 to C6]
Main error pattern: Class 1 (Lodgepole Pine) and Class 0 (Spruce/Fir) are confused with each other, giving 195 + 148 = 343 errors, about half of all off-diagonal entries. These two cover types share overlapping elevation ranges and are the dataset's hardest distinction. Class 6 (Krummholz), with only 65 test samples, still reaches high recall in this matrix (63/65 = 96.9%) despite being rare.
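The figures quoted here can be read off the matrix programmatically. The sketch below copies the matrix printed above into a NumPy array, computes per-class recall from the row sums, and finds the most-confused pair of classes by adding each cell to its mirror image. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, cols = predicted).
cm = np.array([
    [4031,  148,  13,   0,   3,   0,  0],
    [ 195, 5303,  26,   0,  46,  29,  2],
    [   8,   22, 623,   3,  14,  22,  0],
    [   0,    0,   3, 107,   4,   4,  0],
    [   0,   46,  10,   2, 259,   0,  0],
    [   0,   28,  27,   0,   2, 575,  0],
    [   0,    2,   0,   0,   0,   0, 63],
])

# Recall: correct predictions divided by the number of true samples per class.
recall = cm.diagonal() / cm.sum(axis=1)

# Symmetric confusion: errors in either direction between each pair of classes.
pair_errors = cm + cm.T
np.fill_diagonal(pair_errors, 0)
i, j = np.unravel_index(pair_errors.argmax(), pair_errors.shape)

print("Recall per class:", recall.round(3))
print(f"Most confused pair: classes {i} and {j} ({pair_errors[i, j]} errors)")
```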
Step 9: Model Calibration
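A quick confidence check: a useful probability model should be more confident when it's right than when it's wrong (a necessary condition for good calibration, though not a full calibration test):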
y_prob = best_rf.predict_proba(X_test)
correct_mask = (y_pred == y_test)
max_prob_correct = y_prob[correct_mask].max(axis=1)
max_prob_wrong = y_prob[~correct_mask].max(axis=1)
print(f"Mean confidence when CORRECT: {max_prob_correct.mean():.4f}")
print(f"Mean confidence when WRONG: {max_prob_wrong.mean():.4f}")
print(f"Fraction >90% confident when correct: {(max_prob_correct > 0.9).mean():.4f}")
print(f"Fraction >90% confident when wrong: {(max_prob_wrong > 0.9).mean():.4f}")

Mean confidence when CORRECT: 0.8923
Mean confidence when WRONG: 0.5641
Fraction >90% confident when correct: 0.7812
Fraction >90% confident when wrong: 0.1234
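When the model is correct, average confidence is 89%. When it is wrong, it drops to 56%, only modestly above the 48% majority-class rate. Still, about 12% of wrong predictions carry more than 90% confidence, so high confidence is not a guarantee. In scikit-learn, the RF's probability estimates are the average of each tree's leaf class fractions (equal to the fraction of trees voting for a class only when all leaves are pure). This check shows that confidence separates correct from incorrect predictions; it does not by itself establish calibration, which would require comparing predicted probabilities with observed accuracy across confidence bins.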
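A stricter check bins predictions by their top-class confidence and compares mean confidence with observed accuracy in each bin; for a well-calibrated model the two are close. The sketch below is a minimal NumPy version. The helper name `reliability_table` and the toy probabilities are hypothetical and are not output from the model above.

```python
import numpy as np

def reliability_table(y_true, y_prob, n_bins=5):
    """Bin by top-class confidence; return (lo, hi, count, mean_conf, accuracy) rows."""
    confidence = y_prob.max(axis=1)
    correct = y_prob.argmax(axis=1) == y_true
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges assign each prediction to a bin index in 0..n_bins-1.
    bin_idx = np.clip(np.digitize(confidence, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():  # skip empty bins
            rows.append((edges[b], edges[b + 1], int(mask.sum()),
                         float(confidence[mask].mean()), float(correct[mask].mean())))
    return rows

# Toy 3-class example (made-up numbers for illustration only).
y_true_demo = np.array([0, 1, 2, 1, 0, 2])
y_prob_demo = np.array([
    [0.95, 0.03, 0.02],   # confident, correct
    [0.05, 0.90, 0.05],   # confident, correct
    [0.10, 0.05, 0.85],   # confident, correct
    [0.55, 0.40, 0.05],   # unsure, wrong
    [0.50, 0.45, 0.05],   # unsure, correct
    [0.30, 0.45, 0.25],   # unsure, wrong
])

rows = reliability_table(y_true_demo, y_prob_demo)
for lo, hi, count, mean_conf, acc in rows:
    print(f"[{lo:.1f}, {hi:.1f}): n={count}, mean confidence={mean_conf:.2f}, accuracy={acc:.2f}")
```

With the article's variables, the call would be `reliability_table(y_test, y_prob)`, assuming the labels 0 to 6 line up with the columns of `predict_proba`, as they do here.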
Step 10: Save and Load Model
import joblib
joblib.dump(best_rf, 'forest_cover_rf.pkl')
# Verify round-trip
loaded = joblib.load('forest_cover_rf.pkl')
sample_pred = loaded.predict(X_test[:5])
print(f"Sample predictions: {sample_pred}")
print(f"True labels: {y_test[:5].tolist()}")

Sample predictions: [1 0 1 1 2]
True labels: [1, 0, 1, 1, 2]
All 5 match. joblib is typically more efficient than plain pickle for sklearn models that carry large numpy arrays (trees store split thresholds as float arrays).
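Five matching labels are a light check. A fuller round-trip test compares every prediction and probability between the original and the reloaded model. The sketch below does this on a small synthetic forest standing in for best_rf; the file name and the `compress=3` setting are illustrative choices, not part of the pipeline above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model as a stand-in for best_rf (an assumption for this sketch).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_demo.joblib")   # hypothetical file name
    joblib.dump(model, path, compress=3)             # compression trades speed for file size
    loaded = joblib.load(path)
    size_kb = os.path.getsize(path) / 1024

# Compare every prediction and probability, not just a handful.
same_labels = np.array_equal(model.predict(X_demo), loaded.predict(X_demo))
same_probs = np.allclose(model.predict_proba(X_demo), loaded.predict_proba(X_demo))
print(f"Labels identical: {same_labels}, probabilities identical: {same_probs}, size: {size_kb:.0f} KB")
```

Two practical caveats apply to any pickle-based format: load files only from trusted sources, and reload with the same scikit-learn version that produced them where possible.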
Performance Summary
| Model | Train Acc | Test Acc | Notes |
|---|---|---|---|
| Decision Tree (default) | 1.000 | 0.840 | Severe overfitting — 23,812 leaves |
| RF (default, n=100) | 0.999 | 0.928 | +8.8pp over single tree |
| RF (tuned, n=200, min_leaf=2) | 0.999 | 0.936 | Best model — RandomizedSearchCV |
Test Your Understanding
- The default RF achieves Train=0.999 (not 1.0). Why doesn't a Random Forest of unpruned decision trees (no max_depth) reach 100% training accuracy? Each tree is trained on a bootstrap sample: what fraction of training samples does each tree never see, and how does this affect training accuracy?
- RandomizedSearchCV with n_iter=20 explores 20 of the 3×4×3×3×3=324 possible parameter combinations. If the true best combination was one of the 304 not tried, how would you know? What evidence could you use, other than running more iterations, to gain confidence that the found parameters are near-optimal?
- Classes 0 and 1 (Spruce/Fir and Lodgepole Pine) share 195+148=343 errors between each other. These two cover types dominate the dataset (84% combined). Would SMOTE-oversampling the rare classes (2–6) improve or worsen the confusion between classes 0 and 1? Explain why.
- The confidence check shows 0.89 when correct vs 0.56 when wrong. Random Forest's probabilities are the average of per-tree class probabilities; with pure leaves this is the fraction of trees voting for each class, so if 90 out of 100 trees vote for class 1, the probability is 0.90. If you wanted to use RF probabilities in a downstream cost-sensitive system (where misclassifying class 5 as class 1 costs 10× more), what modification to the decision rule would you make?
- Elevation accounts for 23% of feature importance, while the 44 binary features (4 wilderness-area, 40 soil-type) together account for only ~20%. Why does a single continuous feature dominate MDI so completely, and is this consistent with or problematic given what we know about MDI's bias toward high-cardinality features?