Random Forest: Feature Importance and Feature Engineering

Jun 26, 2026 · 7 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

A trained Random Forest carries embedded feature importance scores — a side effect of building trees. This post covers two ways to extract them (impurity-based MDI and permutation importance), explains when each fails, and shows how to use them for feature selection and feature engineering.

Anchor dataset: Pima Indians Diabetes — 768 samples, 8 features (same preprocessing as Decision Tree post 06).

python
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

columns = ['Pregnancies','Glucose','BloodPressure','SkinThickness',
           'Insulin','BMI','DiabetesPedigree','Age','Outcome']
df = pd.read_csv('pima-indians-diabetes.csv', names=columns)

# Zero imputation (same as Decision Tree post)
zero_cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
df[zero_cols] = df[zero_cols].replace(0, np.nan)
for col in zero_cols:
    # Assign back: in-place fillna on a column selection is unreliable in recent pandas
    df[col] = df[col].fillna(df[col].median())

X = df.drop('Outcome', axis=1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Impurity-Based Feature Importance (MDI)

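Mean Decrease in Impurity (MDI) accumulates the total Gini reduction each feature causes, weighted by the number of samples reaching each split, averaged across all trees:

$$\text{MDI}(f) = \frac{1}{T} \sum_{t=1}^{T} \sum_{s \in S_t(f)} \frac{n_s}{N} \, \Delta\text{Gini}(s)$$

Where $S_t(f)$ = set of splits using feature $f$ in tree $t$, $n_s$ = samples at node $s$, $N$ = total training samples, and $T$ = number of trees.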

Manual sketch for Tree 1:

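  • Root split on Glucose ≤ 127.5: $n_s = N$ (all training samples), so the weight $n_s / N$ is 1.
    Weighted contribution: $1 \cdot \Delta\text{Gini}(s)$, the full Gini reduction of the split.
  • BMI split at level 1 (left branch): $n_s < N$, because only the samples sent left by the root reach this node.
    Weighted contribution: $\frac{n_s}{N} \cdot \Delta\text{Gini}(s)$, a fraction of that split's Gini reduction.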

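Summing across all splits in all 100 trees and dividing by T = 100 gives the raw MDI. scikit-learn additionally normalizes the scores so that they sum to 1, which is why the values below can be read as shares of the total.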
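The per-split weighting described above can be checked against scikit-learn directly. The sketch below is a minimal illustration on synthetic data: `make_classification` stands in for the Pima CSV, and `tree_mdi` is a hypothetical helper name, not a library function. It walks each fitted tree's `tree_` arrays, accumulates the sample-weighted impurity decrease per feature, and compares the averaged result with `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def tree_mdi(estimator, n_features):
    """Normalized MDI for one fitted tree (hypothetical helper for illustration)."""
    t = estimator.tree_
    n = t.weighted_n_node_samples
    scores = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, no contribution
            continue
        # (n_s / N) * Gini decrease, written with node sample counts
        decrease = (n[node] * t.impurity[node]
                    - n[left] * t.impurity[left]
                    - n[right] * t.impurity[right]) / n[0]
        scores[t.feature[node]] += decrease
    total = scores.sum()
    return scores / total if total > 0 else scores

# Synthetic stand-in data (an assumption; not the Pima dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the per-tree scores, then normalize so they sum to 1
manual = np.mean([tree_mdi(est, X_demo.shape[1]) for est in rf_demo.estimators_], axis=0)
manual = manual / manual.sum()

print("manual :", manual.round(4))
print("sklearn:", rf_demo.feature_importances_.round(4))
```

The node sample counts come from each tree's bootstrap sample, so $n_s$ and $N$ here are bootstrap-weighted counts rather than raw row counts.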

python
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Feature Importances (MDI):")
print(importances.round(4))

Feature Importances (MDI):
Glucose             0.2831
BMI                 0.1612
Age                 0.1378
DiabetesPedigree    0.1089
Pregnancies         0.0923
Insulin             0.0831
BloodPressure       0.0712
SkinThickness       0.0624

[Figure: horizontal bar chart "Feature Importance (MDI), Random Forest", plotting the values above]

Glucose (28%) dominates — consistent with the Decision Tree single-model result. MDI from 100 trees is more stable than a single tree's importance because it averages across many different bootstrap samples.
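One way to see that stability is to look at how much the individual trees disagree with each other. This is a minimal sketch on synthetic data (`make_classification` is a stand-in for the Pima features, and the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the Pima dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=4, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# One row of importances per tree: shape (n_trees, n_features)
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

print("forest MDI (mean over trees):", forest.feature_importances_.round(3))
print("std across trees            :", per_tree.std(axis=0).round(3))
```

A large spread relative to the mean suggests that any single tree's ranking would be unreliable for that feature, even when the forest average looks settled.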

MDI Weakness: Bias Toward High-Cardinality Features

MDI overestimates the importance of continuous features (many unique values) and underestimates binary/low-cardinality features. Why: continuous features offer more possible split thresholds → more opportunities to be selected → accumulate more total impurity reduction even if they aren't truly informative.

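For example, a random float column (pure noise) can pick up a noticeable MDI score, and can even outrank a genuinely predictive binary column, because it offers hundreds of candidate thresholds (up to one between each pair of adjacent unique values) while a binary column offers exactly one. MDI is also computed on the training data, so splits that merely fit noise still earn credit.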
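The effect is easy to reproduce. The sketch below builds a toy problem with one predictive binary column and one pure-noise float column. All data here is synthetic and the column names are made up for illustration. The exact scores depend on the seed and the amount of label noise, so no specific output is claimed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 600

binary_signal = rng.integers(0, 2, size=n)   # genuinely predictive, 2 unique values
float_noise = rng.normal(size=n)             # pure noise, ~600 unique values
flipped = rng.random(n) < 0.2                # 20% label noise
y_demo = np.where(flipped, 1 - binary_signal, binary_signal)

X_demo = np.column_stack([binary_signal, float_noise])
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

mdi = dict(zip(["binary_signal", "float_noise"], forest.feature_importances_))
for name, score in mdi.items():
    print(f"{name:14s} MDI = {score:.3f}")
```

The noise column receives a non-zero share because fully grown trees keep splitting on it to fit the flipped labels in the training set.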

Permutation Importance: Model-Agnostic Fix

Permutation importance measures the actual drop in test accuracy when a feature's values are randomly shuffled — breaking its relationship with the target. If accuracy drops a lot: the feature is important. If accuracy barely changes: the model didn't depend on it.
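The procedure is simple enough to write by hand. The following is a minimal sketch under stated assumptions: the data is synthetic, `manual_permutation_importance` is a hypothetical helper name, and a constant column is appended to show that a feature the model cannot use scores exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean and std of the accuracy drop when each column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column only, breaking its link to the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, r] = baseline - model.score(X_perm, y)
    return drops.mean(axis=1), drops.std(axis=1)

# Synthetic stand-in data plus one constant (useless) column
X_demo, y_demo = make_classification(n_samples=400, n_features=5,
                                     n_informative=3, random_state=0)
X_demo = np.column_stack([X_demo, np.ones(len(X_demo))])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=0, stratify=y_demo)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
mean_drop, std_drop = manual_permutation_importance(model, X_te, y_te)

for j, (m, s) in enumerate(zip(mean_drop, std_drop)):
    print(f"feature {j}: {m:.4f} ± {s:.4f}")
```

scikit-learn's `permutation_importance` follows the same idea and adds parallelism and support for arbitrary scorers.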

python
from sklearn.inspection import permutation_importance

perm = permutation_importance(rf, X_test, y_test,
                               n_repeats=30, random_state=42, n_jobs=-1)
perm_mean = pd.Series(perm.importances_mean, index=X.columns)
perm_std  = pd.Series(perm.importances_std, index=X.columns)

perm_series = perm_mean.sort_values(ascending=False)
print("Permutation Importance (mean ± std):")
for feat in perm_series.index:
    print(f"  {feat:20s}: {perm_mean[feat]:.4f} ± {perm_std[feat]:.4f}")

Permutation Importance (mean ± std):
  Glucose             : 0.0915 ± 0.0121
  BMI                 : 0.0447 ± 0.0098
  Age                 : 0.0312 ± 0.0089
  DiabetesPedigree    : 0.0234 ± 0.0071
  Insulin             : 0.0189 ± 0.0065
  Pregnancies         : 0.0156 ± 0.0055
  BloodPressure       : 0.0078 ± 0.0044
  SkinThickness       : 0.0023 ± 0.0031

[Figure: horizontal bar chart with error bars, "Permutation Importance (mean ± std, 30 repeats)"; BloodPressure is annotated "uncertain" and SkinThickness "unreliable"]

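SkinThickness (0.002 ± 0.003) has an interval that includes zero even at one standard deviation, and BloodPressure (0.008 ± 0.004) reaches zero within two standard deviations. Features whose importance sits within about two standard deviations of zero are not reliably contributing: their apparent importance may be noise. MDI also ranked these two last, but still credited each with 6-7% of the total; permutation importance on held-out data shows that this share is hard to distinguish from zero.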
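A common rule of thumb is to keep only features whose mean importance stays above zero after subtracting two standard deviations. The sketch below applies that rule to the numbers printed above, typed in by hand so the block runs on its own; the factor of 2 is a convention, not a hard rule.

```python
import pandas as pd

features = ["Glucose", "BMI", "Age", "DiabetesPedigree",
            "Insulin", "Pregnancies", "BloodPressure", "SkinThickness"]

# Values copied from the permutation importance output above
perm_mean_demo = pd.Series([0.0915, 0.0447, 0.0312, 0.0234,
                            0.0189, 0.0156, 0.0078, 0.0023], index=features)
perm_std_demo = pd.Series([0.0121, 0.0098, 0.0089, 0.0071,
                           0.0065, 0.0055, 0.0044, 0.0031], index=features)

# Lower edge of a rough two-standard-deviation band
lower_bound = perm_mean_demo - 2 * perm_std_demo

reliable = lower_bound[lower_bound > 0].index.tolist()
unreliable = lower_bound[lower_bound <= 0].index.tolist()

print("reliable  :", reliable)
print("unreliable:", unreliable)
```

With these numbers the rule flags BloodPressure and SkinThickness and keeps the other six features.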

MDI vs Permutation Importance

| Aspect | MDI (Gini Importance) | Permutation Importance |
|---|---|---|
| Computed from | Training data (in-tree) | Any labeled data, here the test set (model-agnostic) |
| Bias | High-cardinality features inflated | No cardinality bias |
| Speed | Instant (computed during fit) | Slow (n_features × n_repeats scoring passes) |
| Correlated features | Splits importance among the correlated pair | Can understate both, since the model still gets the signal from the unshuffled partner |
| When to trust | Quick exploration and ranking | Reliable feature selection decisions |

Feature Selection: SelectFromModel

python
from sklearn.feature_selection import SelectFromModel

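# 'mean' alone puts the cutoff at 1/8 = 0.125 (importances sum to 1), which only
# Glucose, BMI and Age clear; 0.7 * mean = 0.0875 keeps the five features shown below
selector = SelectFromModel(rf, threshold='0.7*mean')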
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel  = selector.transform(X_test)

selected = X.columns[selector.get_support()].tolist()
print(f"Selected features ({len(selected)}): {selected}")

rf_all = RandomForestClassifier(n_estimators=100, random_state=42)
rf_sel = RandomForestClassifier(n_estimators=100, random_state=42)
rf_all.fit(X_train, y_train)
rf_sel.fit(X_train_sel, y_train)

print(f"All features ({X.shape[1]}):  {rf_all.score(X_test, y_test):.4f}")
print(f"Selected ({len(selected)}):           {rf_sel.score(X_test_sel, y_test):.4f}")

Selected features (5): ['Glucose', 'BMI', 'Age', 'DiabetesPedigree', 'Pregnancies']
All features (8):  0.7727
Selected (5):           0.7792

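Removing the 3 weakest features (Insulin, BloodPressure, SkinThickness) leaves test accuracy essentially unchanged (0.7727 → 0.7792). On 154 test samples that is a difference of one prediction (119 vs 120 correct), so the safe reading is that the dropped features were not contributing, not that removing them produced a real gain. Noisy features add variance to the trees without adding signal, so the smaller feature set is at least as good here and simpler to maintain.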
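It helps to inspect the cutoff that a threshold string actually resolves to. Because forest importances sum to 1, `threshold='mean'` with 8 features amounts to a fixed cutoff of 1/8 = 0.125. The sketch below shows how to read the resolved value from a fitted selector; it uses synthetic data as a stand-in, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data with 8 features (an assumption; not the Pima dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=3, random_state=0)

selector_demo = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold='mean',
)
selector_demo.fit(X_demo, y_demo)

# Importances of the forest that the selector fitted internally
imp = selector_demo.estimator_.feature_importances_

print("importances:", imp.round(3))
print("cutoff     :", round(float(selector_demo.threshold_), 4))
print("kept cols  :", np.flatnonzero(selector_demo.get_support()))
```

Scaled strings such as `'0.5*mean'` or `'1.25*median'` are also accepted when the plain mean turns out to be too strict or too loose.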

Feature Selection: RFECV

python
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    step=1, cv=StratifiedKFold(5), scoring='roc_auc', n_jobs=-1
)
rfecv.fit(X_train, y_train)
print(f"Optimal n_features: {rfecv.n_features_}")
print(f"Selected: {X.columns[rfecv.support_].tolist()}")

Optimal n_features: 5
Selected: ['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigree', 'Age']

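RFECV arrives at the same 5 features as SelectFromModel, this time chosen by cross-validated AUC rather than a threshold on importance scores. Both methods rank features using the forest's MDI scores, so the agreement is reassuring but not fully independent evidence.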

Feature Engineering Guided by RF

After identifying Glucose and BMI as the two dominant predictors (both above 0.15 MDI), create interaction features:

python
X_eng = X.copy()
X_eng['Glucose_x_BMI'] = X['Glucose'] * X['BMI']
X_eng['Glucose_sq']    = X['Glucose'] ** 2

from sklearn.model_selection import cross_val_score

X_tr_all, X_te_all, y_tr, y_te = train_test_split(X_eng, y, test_size=0.2, random_state=42, stratify=y)

rf_base = RandomForestClassifier(n_estimators=100, random_state=42)
rf_eng  = RandomForestClassifier(n_estimators=100, random_state=42)

auc_base = cross_val_score(rf_base, X_train, y_train, cv=5, scoring='roc_auc').mean()
auc_eng  = cross_val_score(rf_eng,  X_tr_all, y_tr,   cv=5, scoring='roc_auc').mean()

print(f"Baseline AUC:          {auc_base:.4f}")
print(f"With interaction AUC:  {auc_eng:.4f}")
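
Baseline AUC:          0.8304
With interaction AUC:  0.8341

The gain (+0.0037 AUC) is small, so compare it against the fold-to-fold spread of the CV scores before treating it as a real improvement. Next, check the importance of the new engineered features: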

python
rf_eng.fit(X_tr_all, y_tr)
imp_eng = pd.Series(rf_eng.feature_importances_, index=X_eng.columns).sort_values(ascending=False)
print(imp_eng.round(4))
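
Glucose             0.2311
BMI                 0.1489
Age                 0.1121
Glucose_x_BMI       0.1102
DiabetesPedigree    0.0945
Pregnancies         0.0812
Glucose_sq          0.0781
Insulin             0.0612
BloodPressure       0.0511
SkinThickness       0.0316

Glucose_x_BMI scores 0.11, effectively tied with Age (0.1102 vs 0.1121) for third place and above DiabetesPedigree. The interaction captures cases where high glucose AND high BMI combine for elevated risk, which the forest would otherwise have to approximate with sequential splits on the two features. Glucose_sq is a different case: squaring a positive feature preserves its ordering, so it offers the trees the same partitions as Glucose itself and likely just takes over part of Glucose's importance (0.2831 → 0.2311 after adding both engineered features).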
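A quick check on Glucose_sq from the snippet above: a tree split depends only on the ordering of a feature's values, and squaring a strictly positive feature keeps that ordering. The sketch below uses synthetic integer "glucose" values (not the Pima data; all names are illustrative) and fits one shallow tree on the raw values and one on their squares.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic, strictly positive integer feature and a noisy threshold-style label
glucose = rng.integers(50, 200, size=500).astype(float)
y_demo = (glucose + rng.normal(0, 25, size=500) > 125).astype(int)

X_raw = glucose.reshape(-1, 1)
X_sq = (glucose ** 2).reshape(-1, 1)

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_raw, y_demo)
tree_sq = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_sq, y_demo)

# Thresholds differ in value, but they cut the training samples in the same places
same = np.array_equal(tree_raw.predict(X_raw), tree_sq.predict(X_sq))
print("root threshold (raw)    :", tree_raw.tree_.threshold[0])
print("root threshold (squared):", tree_sq.tree_.threshold[0])
print("identical training predictions:", same)
```

A product of two different features, such as Glucose_x_BMI, does change the ordering relative to either input, which is why it can give the trees splits they did not have before.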

Test Your Understanding

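  1. MDI formula is: $\text{MDI}(f) = \frac{1}{T} \sum_{t=1}^{T} \sum_{s \in S_t(f)} \frac{n_s}{N} \, \Delta\text{Gini}(s)$. Compare a feature that appears only at depth-10 splits (where $n_s$ is small) with a feature used at the root (where $n_s = N$). Which gets a higher MDI score per split? Why does depth systematically affect MDI?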

  2. Permutation importance shuffles a feature 30 times and averages the accuracy drop. Suppose two features are highly correlated (say, Glucose and Glucose_sq with r=0.98). When one is shuffled, the other still carries the signal. What does that do to the permutation importance of each? Does either of the two end up with most of the importance? How should you handle correlated features in feature selection?

  3. SelectFromModel with threshold='mean' selects features with importance above the average. If one feature has MDI=0.90 (extremely dominant), all others have MDI≈0.01. What does this do to the mean threshold? Would most features be selected or dropped? What threshold might you use instead?

  4. RFECV eliminated BloodPressure and SkinThickness. But the single deep decision tree in the earlier post showed BloodPressure at 6% importance. How can RFECV and a single tree disagree about the same feature's value? What does CV in RFECV add that a single-tree importance doesn't?

  5. The interaction feature Glucose_x_BMI ranked near the top with MDI=0.11. But this feature is a function of Glucose and BMI, so it can't carry information that isn't already available from the original features. Why does Random Forest benefit from explicit interaction features even though trees can theoretically capture interactions through sequential splits on the two features?
