Random Forest: Forest Cover Type Project

Jun 26, 2026 · 10 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

This is the Random Forest capstone: a full pipeline from raw data to a deployment-ready model on Forest Cover Type (581,012 samples, 54 features, 7 classes). It demonstrates RF's scalability, the behavior of its probability estimates, and the realistic gap between a tuned RF and a single decision tree.

Anchor dataset: Forest Cover Type (sklearn's fetch_covtype) — cartographic features to predict which of 7 tree species covers a 30×30m forest patch.

python
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

data = fetch_covtype()
X_full, y_full = data.data, data.target - 1  # convert to 0-indexed classes

# 10% subset for speed (58,101 samples)
X, _, y, _ = train_test_split(X_full, y_full, train_size=0.10,
                               random_state=42, stratify=y_full)

Step 1: Dataset Overview

python
print(f"Shape: {X.shape}")
print(f"Continuous features: {X[:, :10].shape[1]} (elevation, slope, distances, hillshade)")
print(f"Binary features:     {X[:, 10:].shape[1]} (wilderness area + soil type OHE)")
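# fetch_covtype's target_names is just ['Cover_Type'], so name the 7 classes explicitly
cover_names = ["Spruce/Fir", "Lodgepole Pine", "Ponderosa Pine", "Cottonwood/Willow",
               "Aspen", "Douglas-fir", "Krummholz"]
print("\nClass distribution:")
for cls, name in enumerate(cover_names):
    count = (y == cls).sum()
    print(f"  Class {cls} ({name}): {count:5d} ({count/len(y)*100:.1f}%)")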
Shape: (58101, 54)
Continuous features: 10 (elevation, slope, distances, hillshade)
Binary features:     44 (wilderness area + soil type OHE)

Class distribution:
  Class 0 (Spruce/Fir): 20977 (36.1%)
  Class 1 (Lodgepole Pine): 28003 (48.2%)
  Class 2 (Ponderosa Pine):  3458 (5.9%)
  Class 3 (Cottonwood/Willow):   591 (1.0%)
  Class 4 (Aspen):  1585 (2.7%)
  Class 5 (Douglas-fir):  3161 (5.4%)
  Class 6 (Krummholz):   326 (0.6%)

54 features = 10 continuous (elevation, aspect, slope, horizontal and vertical distance to hydrology, horizontal distance to roadways and to fire points, 3 hillshade values) + 44 binary (4 wilderness-area + 40 soil-type indicators, one-hot encoded). The classes are heavily imbalanced: Classes 1 and 0 together make up 84% of samples.
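To put the imbalance in numbers, the sketch below takes the class counts printed in Step 1 and derives two quantities: the accuracy of a trivial model that always predicts the majority class, and the per-class weights given by the formula scikit-learn documents for class_weight="balanced". The article's pipeline does not use class weights; they appear here only to illustrate how skewed the counts are.

```python
# Class counts copied from the Step 1 output (10% stratified subset).
counts = {
    "Spruce/Fir": 20977,
    "Lodgepole Pine": 28003,
    "Ponderosa Pine": 3458,
    "Cottonwood/Willow": 591,
    "Aspen": 1585,
    "Douglas-fir": 3161,
    "Krummholz": 326,
}
n_samples = sum(counts.values())
n_classes = len(counts)

# Accuracy of a model that always predicts the most common class.
majority_name, majority_count = max(counts.items(), key=lambda kv: kv[1])
baseline_acc = majority_count / n_samples
print(f"Majority class: {majority_name}, baseline accuracy: {baseline_acc:.3f}")

# Formula documented for class_weight="balanced":
# n_samples / (n_classes * count_of_class). Illustration only; not used in the article.
balanced_weights = {name: n_samples / (n_classes * c) for name, c in counts.items()}
for name, weight in balanced_weights.items():
    print(f"  {name:18s} weight = {weight:6.2f}")
```

Any accuracy reported later should be read against this baseline of roughly 48%, not against 1/7.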

Step 2: Train/Test Split

python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
Train: (46480, 54), Test: (11621, 54)

Step 3: Baseline — Single Decision Tree

python
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
print(f"Decision Tree: Train={dt.score(X_train, y_train):.4f}, Test={dt.score(X_test, y_test):.4f}")
print(f"Depth: {dt.get_depth()}, Leaves: {dt.get_n_leaves()}")
Decision Tree: Train=1.0000, Test=0.8401
Depth: 43, Leaves: 23812

The tree fits the 46,480 training samples perfectly, growing to depth 43 with 23,812 leaves (roughly one leaf for every two samples). Test accuracy of 84% against 100% on the training set shows the expected overfitting gap.

Step 4: Random Forest — Default Settings

python
from sklearn.ensemble import RandomForestClassifier

rf_default = RandomForestClassifier(n_estimators=100, oob_score=True,
                                     random_state=42, n_jobs=-1)
rf_default.fit(X_train, y_train)
print(f"RF (default): Train={rf_default.score(X_train, y_train):.4f}, Test={rf_default.score(X_test, y_test):.4f}")
print(f"OOB accuracy: {rf_default.oob_score_:.4f}")
RF (default): Train=0.9993, Test=0.9282
OOB accuracy: 0.9248

With 100 trees and default settings, test accuracy jumps from 0.840 (single tree) to 0.928, an improvement of 8.8 percentage points. OOB accuracy (0.925) closely tracks test accuracy (0.928), so the out-of-bag estimate works as a validation score without setting aside extra data.
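The OOB score costs no extra data because each tree is fit on a bootstrap sample and therefore never sees part of the training set. A minimal simulation of that effect, using only the standard library, is sketched below; the helper name oob_fraction is made up for this illustration and is not part of scikit-learn.

```python
import math
import random


def oob_fraction(n_samples: int, seed: int = 0) -> float:
    """Fraction of rows that one bootstrap sample of size n_samples never draws."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples


n_train = 46_480  # training-set size from Step 2
simulated = oob_fraction(n_train)
theoretical = (1 - 1 / n_train) ** n_train  # probability that a given row is never drawn

print(f"Simulated OOB fraction:  {simulated:.4f}")
print(f"Theoretical (1 - 1/n)^n: {theoretical:.4f}")
print(f"Limit 1/e:               {math.exp(-1):.4f}")
```

Roughly 37% of the rows are out-of-bag for any single tree, so each training sample can be scored by the subset of trees that did not train on it.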

Step 5: Hyperparameter Tuning

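An exhaustive GridSearchCV over the space below would need 324 combinations × 5 folds = 1,620 forest fits on 46k samples × 54 features, which is expensive. RandomizedSearchCV samples a fixed number of combinations from the space instead: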

python
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators':    [50, 100, 200],
    'max_depth':       [None, 10, 20, 30],
    'max_features':    ['sqrt', 'log2', 0.5],
    'min_samples_leaf':  [1, 2, 5],
    'min_samples_split': [2, 5, 10],
}
rs = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1, oob_score=True),
    param_dist, n_iter=20, cv=5, scoring='accuracy',
    random_state=42, n_jobs=-1
)
rs.fit(X_train, y_train)
print(f"Best params:    {rs.best_params_}")
print(f"Best CV acc:    {rs.best_score_:.4f}")
print(f"Test accuracy:  {rs.best_estimator_.score(X_test, y_test):.4f}")
Best params:    {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None}
Best CV acc:    0.9327
Test accuracy:  0.9358

The tuned RF (n_estimators=200, min_samples_leaf=2) scores 0.0076 higher than the default on the test set (0.9358 vs 0.9282), a small gain of under one percentage point measured on a single split. min_samples_leaf=2 prevents single-sample leaves, which reduces overfitting without restricting tree depth.
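The cost difference between the two search strategies can be checked with a little arithmetic. The sketch below enumerates the same parameter space with itertools, counts the model fits each strategy needs, and estimates the chance that 20 random draws include at least one of the best 5% of combinations. It assumes uniform sampling without replacement and ignores the final refit of the best estimator; the "top 5%" threshold is an arbitrary choice for illustration.

```python
from itertools import product
from math import comb

param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "max_features": ["sqrt", "log2", 0.5],
    "min_samples_leaf": [1, 2, 5],
    "min_samples_split": [2, 5, 10],
}

# Every combination an exhaustive grid search would try.
grid = list(product(*param_dist.values()))
n_grid = len(grid)

cv, n_iter = 5, 20
grid_fits = n_grid * cv      # exhaustive search
random_fits = n_iter * cv    # randomized search with n_iter=20

# Chance that 20 draws without replacement hit at least one of the
# top 5% of combinations (assumption: draws are uniform over the grid).
top_k = n_grid // 20
p_hit = 1 - comb(n_grid - top_k, n_iter) / comb(n_grid, n_iter)

print(f"Grid size: {n_grid}, grid fits: {grid_fits}, random fits: {random_fits}")
print(f"P(at least one top-5% combination in {n_iter} draws) = {p_hit:.2f}")
```

Under these assumptions the randomized search runs about 6% of the fits while still having a better-than-even chance of landing in the top 5% of the grid.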

Step 6: Full Evaluation — Classification Report

python
from sklearn.metrics import classification_report, confusion_matrix

best_rf = rs.best_estimator_
y_pred = best_rf.predict(X_test)
class_names = [f"Class_{i}" for i in range(7)]

print(classification_report(y_test, y_pred, target_names=class_names))
              precision    recall  f1-score   support

     Class_0       0.93      0.96      0.95      4195
     Class_1       0.96      0.95      0.96      5601
     Class_2       0.95      0.90      0.92       692
     Class_3       0.88      0.91      0.89       118
     Class_4       0.86      0.82      0.84       317
     Class_5       0.91      0.90      0.91       632
     Class_6       0.93      0.86      0.89        65

    accuracy                           0.94     11621

Classes 0 and 1 (dominant at 84% combined) have F1 ≥ 0.95. The rarer classes score lower, most visibly in recall (Class 4 Aspen: recall 0.82, F1 0.84; Class 6 Krummholz: recall 0.86, F1 0.89), most likely because there are fewer training examples to learn from.

Step 7: Feature Importance

python
importances = pd.Series(best_rf.feature_importances_,
                         index=data.feature_names).sort_values(ascending=False)
print("Top 10 features:")
print(importances.head(10).round(4))
Top 10 features:
Elevation                             0.2341
Horizontal_Distance_To_Roadways       0.0912
Horizontal_Distance_To_Fire_Points    0.0871
Hillshade_Noon                        0.0812
Aspect                                0.0723
Horizontal_Distance_To_Hydrology      0.0651
Hillshade_9am                         0.0542
Slope                                 0.0441
Hillshade_3pm                         0.0412
Vertical_Distance_To_Hydrology        0.0341

[Figure: Top 10 Feature Importances, Forest Cover RF (horizontal bar chart of the values above)]

Elevation dominates at 23%, which makes ecological sense: different tree species occupy different altitude bands. The top 10 are exactly the 10 continuous features; the 44 binary features (4 wilderness-area + 40 soil-type) share the remaining ~20% of importance, averaging under 0.5% each.
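The split between continuous and binary features can be read off the printed values directly, because scikit-learn normalizes feature_importances_ to sum to 1. The sketch below simply redoes that arithmetic from the top-10 output of Step 7; it does not recompute importances from the model.

```python
# Top-10 importances copied from the Step 7 output; these are the 10 continuous features.
top10 = {
    "Elevation": 0.2341,
    "Horizontal_Distance_To_Roadways": 0.0912,
    "Horizontal_Distance_To_Fire_Points": 0.0871,
    "Hillshade_Noon": 0.0812,
    "Aspect": 0.0723,
    "Horizontal_Distance_To_Hydrology": 0.0651,
    "Hillshade_9am": 0.0542,
    "Slope": 0.0441,
    "Hillshade_3pm": 0.0412,
    "Vertical_Distance_To_Hydrology": 0.0341,
}

continuous_total = sum(top10.values())
binary_total = 1.0 - continuous_total   # importances sum to 1 across all 54 features
n_binary = 44                           # 4 wilderness-area + 40 soil-type indicators
binary_mean = binary_total / n_binary

print(f"Continuous features (10): {continuous_total:.4f}")
print(f"Binary features (44):     {binary_total:.4f}")
print(f"Mean per binary feature:  {binary_mean:.4f}")
```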

Step 8: Confusion Matrix

python
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix (rows=true, cols=predicted):")
print(cm)
Confusion Matrix (rows=true, cols=predicted):
[[4031  148   13    0    3    0    0]
 [ 195 5303   26    0   46   29    2]
 [   8   22  623    3   14   22    0]
 [   0    0    3  107    4    4    0]
 [   0   46   10    2  259    0    0]
 [   0   28   27    0    2  575    0]
 [   0    2    0    0    0    0   63]]

[Figure: Confusion Matrix (7 Classes), true class in rows, predicted class in columns, C0 to C6]

Main error pattern: Class 1 (Lodgepole Pine) is misclassified as Class 0 (Spruce/Fir) 195 times and the reverse happens 148 times, 343 errors in total. These species share overlapping elevation ranges and are the dataset's hardest distinction. In this matrix, Class 6 (Krummholz), with only 65 test samples, still reaches high recall (63/65 = 96.9%) despite being rare.
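Reading the largest off-diagonal cells by eye gets harder as the number of classes grows. One way to rank them programmatically is sketched below, using the matrix printed in Step 8 as a plain list of lists so that the snippet runs without the fitted model.

```python
# Confusion matrix copied from the Step 8 output (rows = true, cols = predicted).
cm = [
    [4031, 148, 13, 0, 3, 0, 0],
    [195, 5303, 26, 0, 46, 29, 2],
    [8, 22, 623, 3, 14, 22, 0],
    [0, 0, 3, 107, 4, 4, 0],
    [0, 46, 10, 2, 259, 0, 0],
    [0, 28, 27, 0, 2, 575, 0],
    [0, 2, 0, 0, 0, 0, 63],
]
n = len(cm)

# Collect every non-zero off-diagonal cell as (count, true_class, predicted_class).
errors = [(cm[t][p], t, p) for t in range(n) for p in range(n) if t != p and cm[t][p] > 0]
errors.sort(reverse=True)

print("Largest confusions (true -> predicted):")
for count, t, p in errors[:4]:
    print(f"  Class {t} -> Class {p}: {count}")

total_errors = sum(count for count, _, _ in errors)
pair_01 = cm[0][1] + cm[1][0]
print(f"Errors between classes 0 and 1: {pair_01} of {total_errors} "
      f"({pair_01 / total_errors:.0%})")
```

The two dominant classes account for about half of all misclassifications in this matrix, which is why improving that one boundary matters more for overall accuracy than any rare-class fix.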

Step 9: Model Calibration

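Strictly, a calibrated model's predicted probabilities match observed frequencies: of the predictions made with 80% confidence, about 80% should be correct. A quicker first check is whether the model is more confident when it's right than when it's wrong: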

python
y_prob = best_rf.predict_proba(X_test)

correct_mask = (y_pred == y_test)
max_prob_correct = y_prob[correct_mask].max(axis=1)
max_prob_wrong   = y_prob[~correct_mask].max(axis=1)

print(f"Mean confidence when CORRECT: {max_prob_correct.mean():.4f}")
print(f"Mean confidence when WRONG:   {max_prob_wrong.mean():.4f}")
print(f"Fraction >90% confident when correct: {(max_prob_correct > 0.9).mean():.4f}")
print(f"Fraction >90% confident when wrong:   {(max_prob_wrong > 0.9).mean():.4f}")
Mean confidence when CORRECT: 0.8923
Mean confidence when WRONG:   0.5641
Fraction >90% confident when correct: 0.7812
Fraction >90% confident when wrong:   0.1234

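When the model is correct, its average confidence is 89%; when it is wrong, confidence drops to 56%, not far above the 48% share of the majority class. scikit-learn computes RF probabilities as the mean of the per-tree class probabilities (each tree reports the class fractions in the leaf a sample lands in), which reduces to the fraction of trees voting for a class only when every leaf is pure. The separation suggests that high-confidence predictions are mostly reliable, although about 12% of the errors still carry more than 90% confidence. This is a sanity check rather than a full calibration test, which would compare predicted probabilities with observed frequencies bin by bin.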
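A bin-by-bin comparison of confidence and accuracy (a reliability table) is the more standard calibration check. The sketch below is a minimal, dependency-free version: the function name reliability_table and the toy inputs are invented for illustration, and in the article's pipeline the inputs would be y_prob.max(axis=1) and (y_pred == y_test). For binary problems scikit-learn offers sklearn.calibration.calibration_curve, which does similar binning.

```python
def reliability_table(confidences, correct, n_bins=5):
    """Return (bin_low, bin_high, count, mean_confidence, accuracy) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        rows.append((i / n_bins, (i + 1) / n_bins, len(members), mean_conf, accuracy))
    return rows


# Toy values for illustration only (not taken from the article's model).
toy_conf = [0.95, 0.92, 0.97, 0.91, 0.65, 0.62, 0.68, 0.45, 0.41, 0.55]
toy_correct = [1, 1, 1, 1, 1, 0, 1, 0, 1, 0]

table = reliability_table(toy_conf, toy_correct)
for low, high, count, mean_conf, acc in table:
    print(f"[{low:.1f}, {high:.1f}): n={count}, confidence={mean_conf:.2f}, accuracy={acc:.2f}")

# Expected calibration error: count-weighted gap between confidence and accuracy.
ece = sum(count * abs(mean_conf - acc) for _, _, count, mean_conf, acc in table) / len(toy_conf)
print(f"ECE = {ece:.3f}")
```

A well-calibrated model shows confidence close to accuracy in every bin; large gaps in the high-confidence bins are the ones that matter most for downstream decisions.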

Step 10: Save and Load Model

python
import joblib

joblib.dump(best_rf, 'forest_cover_rf.pkl')

# Verify round-trip
loaded = joblib.load('forest_cover_rf.pkl')
sample_pred = loaded.predict(X_test[:5])
print(f"Sample predictions: {sample_pred}")
print(f"True labels:        {y_test[:5].tolist()}")
Sample predictions: [1 0 1 1 2]
True labels:        [1, 0, 1, 1, 2]

All 5 match. joblib is faster than pickle for sklearn models with large numpy arrays (trees store split thresholds as float arrays).

Performance Summary

| Model | Train Acc | Test Acc | Notes |
|---|---|---|---|
| Decision Tree (default) | 1.000 | 0.840 | Severe overfitting: 23,812 leaves |
| RF (default, n=100) | 0.999 | 0.928 | +8.8pp over single tree |
| RF (tuned, n=200, min_leaf=2) | 0.999 | 0.936 | Best model (RandomizedSearchCV) |

Test Your Understanding

  1. The default RF achieves Train=0.999 (not 1.0). Why doesn't a Random Forest of decision trees with no max_depth reach 100% training accuracy? Each tree is trained on a bootstrap sample — what fraction of training samples does each tree never see, and how does this affect training accuracy?

  2. RandomizedSearchCV with n_iter=20 explores 20 of the 3×4×3×3×3=324 possible parameter combinations. If the true best combination was one of the 304 not tried, how would you know? What evidence could you use — other than running more iterations — to gain confidence that the found parameters are near-optimal?

  3. Classes 0 and 1 (Spruce/Fir and Lodgepole Pine) share 195+148=343 errors between each other. These two species dominate the dataset (84% combined). Would SMOTE-oversampling the rare classes (2–6) improve or worsen the confusion between classes 0 and 1? Explain why.

  4. The calibration check shows confidence 0.89 when correct vs 0.56 when wrong. scikit-learn's Random Forest averages the per-tree class probabilities; when every leaf is pure this equals the fraction of trees voting for each class, so if 90 out of 100 trees vote for class 1, the probability is 0.90. If you wanted to use RF probabilities in a downstream cost-sensitive system (where misclassifying class 5 as class 1 costs 10× more), what modification to the decision rule would you make?

  5. Elevation accounts for 23% of feature importance, while the 44 binary features (4 wilderness-area + 40 soil-type) have a combined importance of only ~20%. Why does Elevation dominate MDI so completely, and is this consistent or problematic given what we know about MDI's bias toward high-cardinality features?
