Random Forest: Feature Importance and Feature Engineering
A trained Random Forest carries embedded feature importance scores — a side effect of building trees. This post covers two ways to extract them (impurity-based MDI and permutation importance), explains when each fails, and shows how to use them for feature selection and feature engineering.
Anchor dataset: Pima Indians Diabetes — 768 samples, 8 features (same preprocessing as Decision Tree post 06).
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
columns = ['Pregnancies','Glucose','BloodPressure','SkinThickness',
           'Insulin','BMI','DiabetesPedigree','Age','Outcome']
df = pd.read_csv('pima-indians-diabetes.csv', names=columns)
# Zeros in these columns are physiologically impossible, so treat them as
# missing and impute the column median (same as Decision Tree post)
zero_cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
df[zero_cols] = df[zero_cols].replace(0, np.nan)
for col in zero_cols:
    # Assign back instead of inplace=True, which is unreliable on a column selection
    df[col] = df[col].fillna(df[col].median())
X = df.drop('Outcome', axis=1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Impurity-Based Feature Importance (MDI)
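Mean Decrease in Impurity (MDI) sums the Gini reduction produced by every split on a feature, weights each split by the fraction of training samples reaching it, and averages the result across all trees:

$$\text{MDI}(j) = \frac{1}{T} \sum_{t=1}^{T} \sum_{s \in S_t(j)} \frac{n_s}{N} \, \Delta G_s$$

Where $S_t(j)$ = set of splits using feature $j$ in tree $t$, $n_s$ = samples at node $s$, $N$ = total training samples, $\Delta G_s$ = Gini decrease at split $s$, and $T$ = number of trees.
Manual sketch for Tree 1:
- Root split on Glucose ≤ 127.5: $n_s = N$ (all training samples), so the weight $n_s / N$ is 1.
  Weighted contribution: $\Delta G_s$.
- BMI split at level 1 (left branch): $n_s < N$, so the weight $n_s / N$ is below 1.
  Weighted contribution: $\frac{n_s}{N} \, \Delta G_s$.
Summing across all splits in all 100 trees and dividing by $T = 100$ gives MDI. scikit-learn additionally rescales the scores so they sum to 1, which is why the values below read as shares of total importance.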
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Feature Importances (MDI):")
print(importances.round(4))

Feature Importances (MDI):
Glucose 0.2831
BMI 0.1612
Age 0.1378
DiabetesPedigree 0.1089
Pregnancies 0.0923
Insulin 0.0831
BloodPressure 0.0712
SkinThickness 0.0624
[Figure: horizontal bar chart of the MDI scores above, from Glucose (0.283) down to SkinThickness (0.062).]
Glucose (28%) dominates — consistent with the Decision Tree single-model result. MDI from 100 trees is more stable than a single tree's importance because it averages across many different bootstrap samples.
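The weighted-sum definition of MDI can be checked against scikit-learn's own numbers. The sketch below is an illustration under stated assumptions, not the post's pipeline: it uses synthetic data from `make_classification` in place of the Pima CSV, and the helper name `tree_mdi` is made up for this example. It walks each fitted tree, credits every split with its sample-weighted impurity decrease, and then averages and normalizes across trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: any tabular classification set works here)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)


def tree_mdi(fitted_tree, n_features):
    """Sum the sample-weighted impurity decrease of every split, per feature."""
    t = fitted_tree.tree_
    scores = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf: no split, nothing to credit
            continue
        decrease = (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
        scores[t.feature[node]] += decrease
    scores /= t.weighted_n_node_samples[0]  # divide by the sample count at the root
    return scores / scores.sum()            # per-tree normalization


# Average the per-tree scores, then rescale so the forest-level scores sum to 1
manual_mdi = np.mean(
    [tree_mdi(est, X_demo.shape[1]) for est in forest.estimators_], axis=0
)
manual_mdi /= manual_mdi.sum()

print(np.round(manual_mdi, 4))
print(np.round(forest.feature_importances_, 4))
```

The two printed vectors should agree to floating-point precision. The node weights come from each tree's bootstrap sample, so the "samples at node" term is counted per tree, not over the full training set.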
MDI Weakness: Bias Toward High-Cardinality Features
MDI overestimates the importance of continuous features (many unique values) and underestimates binary/low-cardinality features. Why: continuous features offer more possible split thresholds → more opportunities to be selected → accumulate more total impurity reduction even if they aren't truly informative.
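For example, a pure-noise column of random floats can pick up a sizeable MDI score, sometimes more than a genuinely predictive binary column, simply because it offers hundreds of candidate thresholds (up to one fewer than its number of unique values) while a binary column offers exactly one.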
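The cardinality bias is easy to reproduce on toy data. This is a minimal sketch with hypothetical columns (`binary_flag`, `random_float`) that are not part of the Pima dataset: the label follows the binary flag 80% of the time and never depends on the float column. Fully grown trees still split on the float column repeatedly to memorize the label noise, and every one of those splits adds to its MDI.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 600

# Hypothetical toy data: one informative binary flag, one pure-noise float column
flag = rng.integers(0, 2, size=n)
noise = rng.random(n)
# The label copies the flag 80% of the time; the noise column never influences it
y_toy = np.where(rng.random(n) < 0.8, flag, 1 - flag)

X_toy = pd.DataFrame({"binary_flag": flag, "random_float": noise})
rf_toy = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)

mdi_toy = pd.Series(rf_toy.feature_importances_, index=X_toy.columns)
print(mdi_toy.round(3))
```

On data like this the noise column typically ends up with a large share of the total MDI even though it carries no signal. Limiting tree depth or raising `min_samples_leaf` reduces the effect, since fewer late, low-sample splits are made.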
Permutation Importance: Model-Agnostic Fix
Permutation importance measures the actual drop in test accuracy when a feature's values are randomly shuffled — breaking its relationship with the target. If accuracy drops a lot: the feature is important. If accuracy barely changes: the model didn't depend on it.
from sklearn.inspection import permutation_importance
perm = permutation_importance(rf, X_test, y_test,
                              n_repeats=30, random_state=42, n_jobs=-1)
perm_mean = pd.Series(perm.importances_mean, index=X.columns)
perm_std = pd.Series(perm.importances_std, index=X.columns)
perm_series = perm_mean.sort_values(ascending=False)
print("Permutation Importance (mean ± std):")
for feat in perm_series.index:
    print(f" {feat:20s}: {perm_mean[feat]:.4f} ± {perm_std[feat]:.4f}")

Permutation Importance (mean ± std):
Glucose : 0.0915 ± 0.0121
BMI : 0.0447 ± 0.0098
Age : 0.0312 ± 0.0089
DiabetesPedigree : 0.0234 ± 0.0071
Insulin : 0.0189 ± 0.0065
Pregnancies : 0.0156 ± 0.0055
BloodPressure : 0.0078 ± 0.0044
SkinThickness : 0.0023 ± 0.0031
[Figure: horizontal bar chart of permutation importance with ±1 std error bars, from Glucose (0.092 ± 0.012) down to SkinThickness (0.002 ± 0.003). The BloodPressure bar is annotated "uncertain" and the SkinThickness bar "unreliable".]
SkinThickness (0.002 ± 0.003) has an interval that crosses zero within one standard deviation. BloodPressure (0.008 ± 0.004) stays above zero at one standard deviation but not at two. Features whose importance is this close to zero relative to its spread are not reliably contributing; their apparent importance may be noise. MDI still credited these two features with about 7% and 6% of the total, whereas permutation importance on held-out data suggests the model barely depends on them.
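The shuffle-and-rescore procedure is short enough to write by hand, which makes it clear what the mean and standard deviation actually measure. The sketch below is a simplified stand-in for `sklearn.inspection.permutation_importance`, not a replacement: the function name `manual_permutation_importance` is made up for this example, it assumes a NumPy feature matrix, and it uses synthetic data in place of the Pima split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X_eval, y_eval, n_repeats=10, seed=0):
    """Mean and std of the accuracy drop after shuffling each column in turn."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X_eval, y_eval)
    drops = np.zeros((X_eval.shape[1], n_repeats))
    for j in range(X_eval.shape[1]):
        for r in range(n_repeats):
            X_shuffled = X_eval.copy()
            # Shuffle only column j, breaking its link to the target
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            drops[j, r] = baseline - model.score(X_shuffled, y_eval)
    return drops.mean(axis=1), drops.std(axis=1)


# Synthetic stand-in data; with shuffle=False the informative columns come first
X_syn, y_syn = make_classification(
    n_samples=600, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

means, stds = manual_permutation_importance(clf, X_te, y_te)
for j, (m, s) in enumerate(zip(means, stds)):
    print(f"feature {j}: {m:.4f} ± {s:.4f}")
```

Columns 0 and 1 are the informative ones here, and columns 2 to 4 are pure noise. A noise column can show a small positive or even negative mean, which is the same "overlaps zero" pattern discussed above.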
MDI vs Permutation Importance
| Aspect | MDI (Gini Importance) | Permutation Importance |
|---|---|---|
| Computed from | Training data (in-tree) | Held-out data (works with any fitted model) |
| Bias | High-cardinality features inflated | No cardinality bias, but sensitive to correlated features |
| Speed | Nearly free (read from the fitted trees) | Slow (n_features × n_repeats scoring passes) |
| Correlated features | Splits importance among correlated pair | Can understate both, since the unshuffled twin still carries the signal |
| When to trust | Quick exploration and ranking | Reliable feature selection decisions |
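How permutation importance reacts to correlated features can be checked on a toy example. The sketch below is an illustration under assumptions, with made-up data: a single signal column determines the label, and a second feature set adds an exact copy of that column. It measures the permutation importance of the original signal column in both settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
signal = rng.normal(size=800)
noise_col = rng.normal(size=800)
y_dup = (signal > 0).astype(int)

# Two hypothetical feature sets: signal alone vs. signal plus an exact copy
X_single = np.column_stack([signal, noise_col])
X_paired = np.column_stack([signal, signal, noise_col])


def signal_importance(X_data):
    """Permutation importance of column 0 on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_data, y_dup, test_size=0.3, random_state=0
    )
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
    return result.importances_mean[0]


imp_single = signal_importance(X_single)
imp_paired = signal_importance(X_paired)
print(f"signal alone:       {imp_single:.3f}")
print(f"signal with a copy: {imp_paired:.3f}")
```

With the copy present, shuffling the original column hurts less because many trees route through the copy instead. A low permutation score for one member of a correlated group is therefore weak evidence that the underlying signal is unimportant.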
Feature Selection: SelectFromModel
from sklearn.feature_selection import SelectFromModel
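# 'mean' alone would be 1/8 = 0.125 for eight normalized importances, which only
# Glucose, BMI and Age exceed; a scaled threshold keeps the five features listed below
selector = SelectFromModel(rf, threshold='0.7*mean')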
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)
selected = X.columns[selector.get_support()].tolist()
print(f"Selected features ({len(selected)}): {selected}")
rf_all = RandomForestClassifier(n_estimators=100, random_state=42)
rf_sel = RandomForestClassifier(n_estimators=100, random_state=42)
rf_all.fit(X_train, y_train)
rf_sel.fit(X_train_sel, y_train)
print(f"All features ({X.shape[1]}): {rf_all.score(X_test, y_test):.4f}")
print(f"Selected ({len(selected)}): {rf_sel.score(X_test_sel, y_test):.4f}")

Selected features (5): ['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigree', 'Age']
All features (8): 0.7727
Selected (5): 0.7792
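Removing the 3 weakest features (Insulin, BloodPressure, SkinThickness) leaves test accuracy essentially unchanged (0.7727 → 0.7792). On a 154-sample test set that gap is a single prediction, so the safe reading is that the smaller model loses nothing, not that it is measurably better. Weak features mostly add variance to the trees without adding signal, which is why dropping them is cheap.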
Feature Selection: RFECV
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    step=1, cv=StratifiedKFold(5), scoring='roc_auc', n_jobs=-1
)
rfecv.fit(X_train, y_train)
print(f"Optimal n_features: {rfecv.n_features_}")
print(f"Selected: {X.columns[rfecv.support_].tolist()}")

Optimal n_features: 5
Selected: ['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigree', 'Age']
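RFECV arrives at the same 5 features as SelectFromModel. The two are not fully independent, because recursive elimination also ranks features by the forest's impurity-based importances, but here the number of features to keep is chosen by cross-validated AUC instead of a fixed threshold on the scores.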
Feature Engineering Guided by RF
After identifying Glucose and BMI as the two dominant predictors (both above 0.15 MDI), create interaction features:
X_eng = X.copy()
X_eng['Glucose_x_BMI'] = X['Glucose'] * X['BMI']
X_eng['Glucose_sq'] = X['Glucose'] ** 2
from sklearn.model_selection import cross_val_score
X_tr_all, X_te_all, y_tr, y_te = train_test_split(X_eng, y, test_size=0.2, random_state=42, stratify=y)
rf_base = RandomForestClassifier(n_estimators=100, random_state=42)
rf_eng = RandomForestClassifier(n_estimators=100, random_state=42)
auc_base = cross_val_score(rf_base, X_train, y_train, cv=5, scoring='roc_auc').mean()
auc_eng = cross_val_score(rf_eng, X_tr_all, y_tr, cv=5, scoring='roc_auc').mean()
print(f"Baseline AUC: {auc_base:.4f}")
print(f"With interaction AUC: {auc_eng:.4f}")

Baseline AUC: 0.8304
With interaction AUC: 0.8341
Check the importance of the new engineered features:
rf_eng.fit(X_tr_all, y_tr)
imp_eng = pd.Series(rf_eng.feature_importances_, index=X_eng.columns).sort_values(ascending=False)
print(imp_eng.round(4))

Glucose 0.2311
BMI 0.1489
Age 0.1121
Glucose_x_BMI 0.1102
DiabetesPedigree 0.0945
Pregnancies 0.0812
Glucose_sq 0.0781
Insulin 0.0612
BloodPressure 0.0511
SkinThickness 0.0316
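Glucose_x_BMI lands in a near tie with Age for third place (about 0.11 each), above DiabetesPedigree. The product can capture cases where high glucose AND high BMI combine for elevated risk, though two caveats apply. It is strongly correlated with both parent features, so MDI is being shared among the three (Glucose drops from 0.28 to 0.23). And the cross-validated AUC gain (0.8304 → 0.8341) is small enough that it may sit within fold-to-fold noise.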
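Whether an explicit product feature helps depends on how much the target really follows the product. The sketch below is a deliberately favourable toy case, not a claim about the diabetes data: the label is defined as a threshold on `a * b`, so the decision boundary is a curve that axis-aligned splits on `a` and `b` can only approximate with a staircase, while the product column allows a single clean split. All names here are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
a = rng.uniform(0, 2, size=600)
b = rng.uniform(0, 2, size=600)
# Hypothetical target that depends only on the product a * b
y_prod = (a * b > 1).astype(int)

X_raw = np.column_stack([a, b])
X_with_product = np.column_stack([a, b, a * b])

model = RandomForestClassifier(n_estimators=100, random_state=0)
acc_raw = cross_val_score(model, X_raw, y_prod, cv=5).mean()
acc_prod = cross_val_score(model, X_with_product, y_prod, cv=5).mean()
print(f"raw features only:    {acc_raw:.4f}")
print(f"with product feature: {acc_prod:.4f}")
```

On real data the interaction is rarely this clean, which is consistent with the modest AUC change reported above.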
Test Your Understanding
- The MDI formula is $\text{MDI}(j) = \frac{1}{T} \sum_{t} \sum_{s \in S_t(j)} \frac{n_s}{N} \, \Delta G_s$. Compare a feature that appears only at depth-10 splits (where $n_s$ is small) with a feature used at the root (where $n_s = N$). Which gets a higher MDI score per split? Why does depth systematically affect MDI?
- Permutation importance shuffles a feature 30 times and averages the accuracy drop. If two features are highly correlated (say, Glucose and Glucose_sq with r=0.98), then when one is shuffled the other still carries the signal. What happens to the permutation importance of each? Which of the two gets most of the permutation importance? How should you handle correlated features in feature selection?
- SelectFromModel with threshold='mean' selects features with importance above the average. Suppose one feature has MDI=0.90 (extremely dominant) and all others have MDI≈0.01. What does this do to the mean threshold? Would most features be selected or dropped? What threshold might you use instead?
- RFECV eliminated BloodPressure and SkinThickness, but the single deep tree in the Decision Tree post showed BloodPressure at 6% importance. How can RFECV and a single tree disagree about the same feature's value? What does CV in RFECV add that a single-tree importance doesn't?
- The interaction feature Glucose_x_BMI ranked near the top with MDI≈0.11. But this feature is a function of Glucose and BMI, so it can't carry information that isn't already available from the original features. Why does Random Forest benefit from explicit interaction features even though trees can theoretically capture interactions through sequential splits on the two features?