
Random Forest: Feature Importance and Feature Engineering

Jun 26, 2026•7 min read•By Mohammed Vasim

Machine Learning, AI, Data Science

A trained Random Forest carries embedded feature importance scores — a side effect of building trees. This post covers two ways to extract them (impurity-based MDI and permutation importance), explains when each fails, and shows how to use them for feature selection and feature engineering.

Anchor dataset: Pima Indians Diabetes — 768 samples, 8 features (same preprocessing as Decision Tree post 06).

python

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

columns = ['Pregnancies','Glucose','BloodPressure','SkinThickness',
           'Insulin','BMI','DiabetesPedigree','Age','Outcome']
df = pd.read_csv('pima-indians-diabetes.csv', names=columns)

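# Zeros in these columns are physiologically impossible, so treat them as
# missing and impute with the column median (same as Decision Tree post)
zero_cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
df[zero_cols] = df[zero_cols].replace(0, np.nan)
# Assign back instead of chained fillna(..., inplace=True), which newer pandas rejects
df[zero_cols] = df[zero_cols].fillna(df[zero_cols].median())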

X = df.drop('Outcome', axis=1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Impurity-Based Feature Importance (MDI)

Mean Decrease in Impurity (MDI) accumulates the total Gini reduction each feature causes, weighted by the number of samples reaching each split, averaged across all trees:

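$\mathrm{importance}(j) = \frac{1}{T} \sum_{t=1}^{T} \sum_{s \in S_t(j)} \frac{n_s}{n} \cdot \Delta\mathrm{Gini}(s)$

Where $S_t(j)$ = set of splits using feature $j$ in tree $t$, $n_s$ = samples at node $s$, $n$ = total training samples, and $\Delta\mathrm{Gini}(s)$ = Gini impurity of node $s$ minus the sample-weighted Gini impurity of its two children.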

Manual sketch for Tree 1:

Root split on Glucose ≤ 127.5: $n_s = 614$ (all training samples), $\Delta\mathrm{Gini} = 0.15$.
Weighted contribution: $614/614 \times 0.15 = 0.150$.
BMI split at level 1 (left branch): $n_s = 380$, $\Delta\mathrm{Gini} = 0.08$.
Weighted contribution: $380/614 \times 0.08 \approx 0.0495$.

Summing these weighted contributions over every split in all 100 trees and dividing by T=100 gives the raw MDI. scikit-learn then rescales the scores so they sum to 1, which is why the values below can be read as shares.
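The accumulation described above can be reproduced from a fitted forest's internals. The sketch below is a minimal illustration on synthetic data (the `make_classification` dataset and the names `tree_mdi`, `X_demo`, `rf_demo` are assumptions for the demo, not part of the Pima pipeline). It walks each tree's `tree_` arrays, sums the weighted impurity decrease per feature, and compares the result with `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def tree_mdi(fitted_tree, n_features):
    """Normalized impurity-decrease importances for one fitted sklearn tree."""
    t = fitted_tree.tree_
    w = t.weighted_n_node_samples            # n_s for every node
    imp = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                       # leaf node: no split, no contribution
            continue
        # n_s * Gini(node) - n_left * Gini(left) - n_right * Gini(right)
        imp[t.feature[node]] += (w[node] * t.impurity[node]
                                 - w[left] * t.impurity[left]
                                 - w[right] * t.impurity[right])
    imp /= w[0]                              # divide by n (samples at the root)
    return imp / imp.sum()                   # rescale this tree's scores to sum to 1

# Synthetic stand-in data (assumption: any classification dataset works here)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the per-tree scores, then rescale once more at the forest level
manual = np.mean([tree_mdi(est, X_demo.shape[1]) for est in rf_demo.estimators_], axis=0)
manual /= manual.sum()

print(np.round(manual, 4))
print("matches feature_importances_:", np.allclose(manual, rf_demo.feature_importances_))
```

Note that each tree's $n$ is its own bootstrap sample, which scikit-learn encodes through sample weights; that is why the sketch reads `weighted_n_node_samples` instead of raw counts.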

python

rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Feature Importances (MDI):")
print(importances.round(4))

Feature Importances (MDI):
Glucose            0.2831
BMI                0.1612
Age                0.1378
DiabetesPedigree   0.1089
Pregnancies        0.0923
Insulin            0.0831
BloodPressure      0.0712
SkinThickness      0.0624

[Figure: horizontal bar chart of MDI feature importances, from Glucose (0.283) down to SkinThickness (0.062)]

Glucose (28%) dominates — consistent with the Decision Tree single-model result. MDI from 100 trees is more stable than a single tree's importance because it averages across many different bootstrap samples.

MDI Weakness: Bias Toward High-Cardinality Features

MDI tends to overestimate the importance of continuous features (many unique values) and underestimate binary or low-cardinality features. Continuous features offer many more candidate split thresholds, so a tree has more chances to find a split that lowers impurity on the training data, even by chance. Because MDI is computed on the training data, those chance splits are credited as if they were signal.

For example, a column of random floats (pure noise) can collect a sizeable MDI share, and may even outrank a genuinely predictive binary column, simply because it offers hundreds of candidate thresholds where the binary column offers one.
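A small experiment makes the bias concrete. The sketch below uses toy data invented for this illustration (the columns `signal_binary` and `noise_float` and the 80% label agreement are assumptions, not Pima features). Fully grown trees can split on the binary column only once per path, then keep splitting on the noise column to purify their leaves, so the noise column is expected to pick up a large MDI share.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 600
signal_binary = rng.integers(0, 2, size=n)   # predictive, only 2 unique values
noise_float = rng.normal(size=n)             # pure noise, about 600 unique values

# The label agrees with the binary feature 80% of the time
flip = rng.random(n) < 0.2
y_toy = np.where(flip, 1 - signal_binary, signal_binary)

X_toy = pd.DataFrame({"signal_binary": signal_binary, "noise_float": noise_float})
rf_toy = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)

# MDI is computed on training data, so overfit splits on noise count as "importance"
mdi_toy = pd.Series(rf_toy.feature_importances_, index=X_toy.columns)
print(mdi_toy.round(3))
```

Limiting tree depth (for example with `max_depth` or `min_samples_leaf`) usually shrinks the noise column's share, since most of its splits happen deep in the tree.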

Permutation Importance: Model-Agnostic Fix

Permutation importance measures the drop in test accuracy when a feature's values are randomly shuffled, which breaks the feature's relationship with the target while leaving its distribution intact. A large drop means the model relied on the feature; a drop near zero means it did not.
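The procedure is short enough to write by hand. The sketch below is a simplified stand-in for `sklearn.inspection.permutation_importance`, run on synthetic data where only the first two columns carry signal (the helper name `manual_permutation_importance` and the dataset are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean and std of the accuracy drop after shuffling each column in turn."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # shuffle one column only
            drops[j, r] = baseline - model.score(X_perm, y)
    return drops.mean(axis=1), drops.std(axis=1)

# shuffle=False keeps the 2 informative columns first, followed by 3 noise columns
X_p, y_p = make_classification(n_samples=500, n_features=5, n_informative=2,
                               n_redundant=0, class_sep=2.0, shuffle=False,
                               random_state=1)
X_p_tr, X_p_te, y_p_tr, y_p_te = train_test_split(X_p, y_p, test_size=0.3,
                                                  random_state=1, stratify=y_p)
model_p = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_p_tr, y_p_tr)

means, stds = manual_permutation_importance(model_p, X_p_te, y_p_te)
for j, (m, s) in enumerate(zip(means, stds)):
    print(f"feature {j}: {m:.4f} ± {s:.4f}")
```

The model is never refit: each score comes from predicting on a shuffled copy of the held-out data, which is why the cost grows with the number of features times `n_repeats`.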

python

from sklearn.inspection import permutation_importance

perm = permutation_importance(rf, X_test, y_test,
                               n_repeats=30, random_state=42, n_jobs=-1)
perm_mean = pd.Series(perm.importances_mean, index=X.columns)
perm_std  = pd.Series(perm.importances_std, index=X.columns)

perm_series = perm_mean.sort_values(ascending=False)
print("Permutation Importance (mean ± std):")
for feat in perm_series.index:
    print(f"  {feat:20s}: {perm_mean[feat]:.4f} ± {perm_std[feat]:.4f}")

Permutation Importance (mean ± std):
  Glucose             : 0.0915 ± 0.0121
  BMI                 : 0.0447 ± 0.0098
  Age                 : 0.0312 ± 0.0089
  DiabetesPedigree    : 0.0234 ± 0.0071
  Insulin             : 0.0189 ± 0.0065
  Pregnancies         : 0.0156 ± 0.0055
  BloodPressure       : 0.0078 ± 0.0044
  SkinThickness       : 0.0023 ± 0.0031

[Figure: horizontal bar chart of permutation importances with ±1 std error bars; BloodPressure is annotated "uncertain" and SkinThickness "unreliable"]

SkinThickness (0.002 ± 0.003) is within one standard deviation of zero, and BloodPressure (0.008 ± 0.004) is within two. Features whose importance interval reaches zero are not reliably contributing, and their apparent importance may be noise. MDI still gave these two features 7% and 6% shares; permutation importance on held-out data is the more trustworthy signal that the model barely depends on them.
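The "interval reaches zero" check is easy to automate. The sketch below hard-codes the means and standard deviations printed above so it runs on its own (in the real pipeline they would come from `perm_mean` and `perm_std`), and flags features at both one and two standard deviations. The two-standard-deviation cutoff is a common rule of thumb, not a formal test.

```python
import pandas as pd

# Values copied from the permutation importance output above
perm_mean_post = pd.Series({
    "Glucose": 0.0915, "BMI": 0.0447, "Age": 0.0312, "DiabetesPedigree": 0.0234,
    "Insulin": 0.0189, "Pregnancies": 0.0156, "BloodPressure": 0.0078,
    "SkinThickness": 0.0023,
})
perm_std_post = pd.Series({
    "Glucose": 0.0121, "BMI": 0.0098, "Age": 0.0089, "DiabetesPedigree": 0.0071,
    "Insulin": 0.0065, "Pregnancies": 0.0055, "BloodPressure": 0.0044,
    "SkinThickness": 0.0031,
})

report = pd.DataFrame({
    "mean": perm_mean_post,
    "std": perm_std_post,
    "lower_1sd": perm_mean_post - perm_std_post,       # mean - 1 std
    "lower_2sd": perm_mean_post - 2 * perm_std_post,   # mean - 2 std
})

# A feature is flagged when the lower bound of its interval reaches zero
unreliable_1sd = report.index[report["lower_1sd"] <= 0].tolist()
unreliable_2sd = report.index[report["lower_2sd"] <= 0].tolist()

print(report.round(4))
print("Within 1 std of zero:", unreliable_1sd)
print("Within 2 std of zero:", unreliable_2sd)
```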

MDI vs Permutation Importance

Aspect	MDI (Gini importance)	Permutation importance
Computed from	Training data, inside the trees	Any labeled data, usually a held-out set; model-agnostic
Bias	Inflates high-cardinality features	No cardinality bias, but deflated when features are correlated
Speed	Instant (by-product of fit)	Slow (n_features × n_repeats scoring passes)
Correlated features	Splits importance among the correlated features	Each correlated feature looks less important, because the model can still read the signal from the others
When to trust	Quick exploration and ranking	Feature selection decisions, after checking for correlated features

Feature Selection: SelectFromModel

python

from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(rf, threshold='mean')
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel  = selector.transform(X_test)

selected = X.columns[selector.get_support()].tolist()
print(f"Selected features ({len(selected)}): {selected}")

rf_all = RandomForestClassifier(n_estimators=100, random_state=42)
rf_sel = RandomForestClassifier(n_estimators=100, random_state=42)
rf_all.fit(X_train, y_train)
rf_sel.fit(X_train_sel, y_train)

print(f"All features ({X.shape[1]}):  {rf_all.score(X_test, y_test):.4f}")
print(f"Selected ({len(selected)}):           {rf_sel.score(X_test_sel, y_test):.4f}")

Selected features (5): ['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigree', 'Age']
All features (8):  0.7727
Selected (5):      0.7792

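Dropping the 3 weakest features (Insulin, BloodPressure, SkinThickness) leaves test accuracy essentially unchanged (0.7727 → 0.7792). On a 154-sample test set that gap is a single additional correct prediction, so the safe reading is that the smaller model is no worse, not that it is better. Noisy features add variance to the trees without adding signal, so removing them simplifies the model at no measurable cost.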
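It helps to see what `threshold='mean'` resolves to. Because the forest's importances sum to 1, the mean over 8 features is always 1/8 = 0.125, and `SelectFromModel` keeps features at or above that value. The sketch below applies the rule to the MDI values printed in the first section, hard-coded here as illustrative inputs. With those exact values only three features clear the bar, so the five-feature selection shown above presumably reflects slightly different importances or a lower threshold; it is worth printing `selector.threshold_` when reproducing the run.

```python
import pandas as pd

# MDI values copied from the first importance listing in this post
mdi_post = pd.Series({
    "Glucose": 0.2831, "BMI": 0.1612, "Age": 0.1378, "DiabetesPedigree": 0.1089,
    "Pregnancies": 0.0923, "Insulin": 0.0831, "BloodPressure": 0.0712,
    "SkinThickness": 0.0624,
})

threshold = mdi_post.mean()                               # what threshold='mean' computes
kept = mdi_post[mdi_post >= threshold].index.tolist()     # features at or above the mean

print(f"threshold = {threshold:.4f}")
print("kept:", kept)
```

`SelectFromModel` also accepts scaled strings such as `'0.75*mean'` or a plain float, which gives finer control when the mean turns out to be too strict.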

Feature Selection: RFECV

python

from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    step=1, cv=StratifiedKFold(5), scoring='roc_auc', n_jobs=-1
)
rfecv.fit(X_train, y_train)
print(f"Optimal n_features: {rfecv.n_features_}")
print(f"Selected: {X.columns[rfecv.support_].tolist()}")

Optimal n_features: 5
Selected: ['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigree', 'Age']

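RFECV arrives at the same 5 features as SelectFromModel, this time choosing the subset size by cross-validated ROC AUC instead of a threshold on importance scores. The agreement is not fully independent evidence, because RFECV also ranks features by the forest's feature_importances_ at each elimination step.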

Feature Engineering Guided by RF

After identifying Glucose and BMI as the two dominant predictors (both above 0.15 MDI), create an interaction term and a squared term from them:

python

from sklearn.model_selection import cross_val_score

# Engineered features built from the two strongest predictors
X_eng = X.copy()
X_eng['Glucose_x_BMI'] = X['Glucose'] * X['BMI']
X_eng['Glucose_sq']    = X['Glucose'] ** 2

# Same random_state and stratification as before, so the rows match X_train / X_test
X_tr_all, X_te_all, y_tr, y_te = train_test_split(X_eng, y, test_size=0.2, random_state=42, stratify=y)

rf_base = RandomForestClassifier(n_estimators=100, random_state=42)
rf_eng  = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold CV on the training rows only; the test split stays untouched
auc_base = cross_val_score(rf_base, X_train, y_train, cv=5, scoring='roc_auc').mean()
auc_eng  = cross_val_score(rf_eng,  X_tr_all, y_tr,   cv=5, scoring='roc_auc').mean()

print(f"Baseline AUC:          {auc_base:.4f}")
print(f"With interaction AUC:  {auc_eng:.4f}")

Baseline AUC:          0.8304
With interaction AUC:  0.8341

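The AUC gain (0.8304 → 0.8341) is small and may sit within fold-to-fold variation, so check how much the forest actually uses the new engineered features: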

python

rf_eng.fit(X_tr_all, y_tr)
imp_eng = pd.Series(rf_eng.feature_importances_, index=X_eng.columns).sort_values(ascending=False)
print(imp_eng.round(4))

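Glucose            0.2311
BMI                0.1489
Age                0.1121
Glucose_x_BMI      0.1102
DiabetesPedigree   0.0945
Pregnancies        0.0812
Glucose_sq         0.0781
Insulin            0.0612
BloodPressure      0.0511
SkinThickness      0.0316

Glucose_x_BMI (0.110) lands level with Age (0.112) in the 3rd to 4th position and above DiabetesPedigree. The interaction may help the forest isolate cases where high glucose and high BMI combine for elevated risk, which otherwise takes sequential splits on the two features. Read the MDI shift with care, though: Glucose falls from 0.283 to 0.231 because the three Glucose-derived columns are correlated and now share the credit.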
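One caveat about Glucose_sq: tree splits depend only on the ordering of a feature's values, and squaring a positive feature preserves that ordering. A squared term therefore cannot offer a tree any partition the raw feature does not already offer, unlike the product of two different features. The sketch below illustrates this on made-up data (the `glucose_like` column and its noisy label are assumptions for the demo): a tree fit on x and a tree fit on x² produce identical training predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical positive-valued feature in a glucose-like range
glucose_like = rng.integers(60, 200, size=400).astype(float)
# Noisy label that loosely depends on the feature
y_mono = (glucose_like + rng.normal(0, 25, size=400) > 130).astype(int)

X_raw = glucose_like.reshape(-1, 1)
X_sq = X_raw ** 2                      # monotone transform on positive values

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_raw, y_mono)
tree_sq = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_sq, y_mono)

# Both trees partition the training rows the same way
same_partition = np.array_equal(tree_raw.predict(X_raw), tree_sq.predict(X_sq))
print("identical training predictions:", same_partition)
```

In a forest, such a column mostly acts as a duplicate of the original: it changes how often Glucose-like information is sampled at each split and dilutes the MDI of the raw feature.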

Test Your Understanding

The MDI formula is $\mathrm{importance}(j) = \frac{1}{T} \sum_{t} \sum_{s \in S_t(j)} \frac{n_s}{n} \Delta\mathrm{Gini}(s)$. Compare a feature that appears only in depth-10 splits (where $n_s$ is small) with a feature used at the root (where $n_s = n$). Which gets the higher MDI contribution per split? Why does depth systematically affect MDI?
Permutation importance shuffles a feature 30 times and averages the accuracy drop. Suppose two features are highly correlated (say, Glucose and Glucose_sq with r=0.98). When one is shuffled, the other still carries the signal. What happens to the permutation importance of each? Does either of the two get most of it? How should you handle correlated features in feature selection?
SelectFromModel with threshold='mean' keeps features whose importance is at or above the average. Suppose one feature has MDI=0.90 (extremely dominant) and all others have MDI≈0.01. What does this do to the mean threshold? Would most features be selected or dropped? What threshold might you use instead?
RFECV eliminated Insulin, BloodPressure and SkinThickness. But the single deep decision tree from the earlier post showed BloodPressure at 6% importance. How can RFECV and a single tree disagree about the same feature's value? What does CV in RFECV add that a single-tree importance doesn't?
The interaction feature Glucose_x_BMI ranked near the top with MDI≈0.11. But this feature is a function of Glucose and BMI, so it can't carry information that isn't already available from the original features. Why can Random Forest benefit from explicit interaction features even though trees can theoretically capture interactions through sequential splits on the two features?
