Feature Selection vs Feature Extraction
Dimensionality reduction takes two fundamentally different paths. Feature selection keeps a subset of original features — you end up with columns you can name and explain. Feature extraction transforms all features into a new compressed space — the new columns are mathematical constructs, not real measurements.
This distinction drives interpretability, downstream model choice, and the kinds of data each approach handles well.
Anchors: Pima Diabetes (768 samples, 8 features) for selection methods. Digits (1797 samples, 64 features) for extraction illustration.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
# Pima Diabetes: treat zeros in clinical columns as missing and impute the median (same as Decision Tree and Random Forest posts)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
cols = ['Pregnancies','Glucose','BloodPressure','SkinThickness',
'Insulin','BMI','DiabetesPedigree','Age','Outcome']
df = pd.read_csv(url, names=cols)
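for col in ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']:
    # A zero in these columns means "not recorded", so mark it missing and impute the median
    df[col] = df[col].replace(0, np.nan)
    # Assign the result back; chained fillna(..., inplace=True) is unreliable in recent pandas
    df[col] = df[col].fillna(df[col].median())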
X = df.drop('Outcome', axis=1)
y = df['Outcome']
feature_names = X.columns.tolist()
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
Train: (614, 8), Test: (154, 8)
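The loading code above computes each median on the full dataset before the split, so test rows influence the imputed values. A minimal sketch of a stricter variant follows, assuming a small made-up array in place of the Pima columns; the values and the helper names `fit_medians` and `impute` are hypothetical, not part of the original post.

```python
import numpy as np

def fit_medians(X_train, missing=0.0):
    """Per-column median of the training rows, ignoring `missing` entries."""
    masked = np.where(X_train == missing, np.nan, X_train)
    return np.nanmedian(masked, axis=0)

def impute(X, medians, missing=0.0):
    """Replace `missing` entries with the supplied per-column medians."""
    X = X.astype(float).copy()
    rows, cols = np.where(X == missing)
    X[rows, cols] = medians[cols]
    return X

# Toy data: zeros stand in for unrecorded measurements (illustrative values only).
X_tr = np.array([[100.0, 30.0], [0.0, 25.0], [120.0, 0.0], [140.0, 35.0]])
X_te = np.array([[0.0, 28.0], [110.0, 0.0]])

medians = fit_medians(X_tr)        # learned from training rows only
X_tr_imp = impute(X_tr, medians)
X_te_imp = impute(X_te, medians)   # test rows reuse the training medians
print(medians)
print(X_te_imp)
```

On a dataset this small the difference from whole-dataset imputation is likely minor, but fitting preprocessing on the training split keeps the test score an honest estimate.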
Feature Selection vs Feature Extraction — Core Distinction
| Aspect | Feature Selection | Feature Extraction |
|---|---|---|
| Output | Subset of original features | New transformed features |
| Interpretability | High — original feature names kept | Low — linear/nonlinear combinations |
| Information | Keeps some features, discards others | Compresses all information into fewer dims |
| Examples | Filter, Wrapper, Embedded methods | PCA, t-SNE, Autoencoders |
| When to use | Need explainability, sparse or irrelevant features present | Dense correlated features, visualization |
Feature selection answers: "Which of my 8 features are worth keeping?" Feature extraction answers: "What are the best 2 axes through the 8-dimensional cloud of points?"
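The contrast can be made concrete with a few lines of NumPy. This is a sketch on synthetic data (the array `X`, the chosen columns, and the seed are illustrative assumptions): selection returns untouched original columns, while extraction returns projections in which every original feature carries a weight.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]   # make two columns correlated

# Selection: pick columns 0 and 2; the values are the untouched originals.
keep = [0, 2]
X_sel = X[:, keep]

# Extraction: project centered data onto the top-2 principal axes (via SVD).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:2].T               # shape (4, 2): one weight per original feature
X_ext = Xc @ W

print(np.array_equal(X_sel, X[:, keep]))  # selected columns are the originals
print(np.round(W, 3))                     # each row holds one feature's weights
```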
Filter Methods — Univariate Statistics
Filter methods score each feature independently of any model. They're fast and model-agnostic.
Variance Threshold
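Features with near-zero variance carry little information because every sample has nearly the same value. Variance depends on a feature's units, so a fixed threshold is only comparable across features on a similar scale: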
from sklearn.feature_selection import VarianceThreshold
sel_var = VarianceThreshold(threshold=0.1)
X_var = sel_var.fit_transform(X_train)
print(f"Features before: {X_train.shape[1]}, after: {X_var.shape[1]}")
print(f"Removed features: {np.array(feature_names)[~sel_var.get_support()].tolist()}")
print(f"\nFeature variances:")
# Note: pandas .var() uses ddof=1, while VarianceThreshold uses the population variance (ddof=0)
for name, var in zip(feature_names, X_train.var()):
    print(f" {name:20s}: {var:.3f}")
Features before: 8, after: 8
Removed features: []
Feature variances:
Pregnancies : 10.982
Glucose : 961.234
BloodPressure : 157.891
SkinThickness : 118.234
Insulin : 6329.124
BMI : 42.341
DiabetesPedigree : 0.108
Age : 138.234
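All 8 features survive at threshold=0.1, though DiabetesPedigree (variance ≈ 0.108) only just clears it, which reflects its small units rather than low information content. Variance threshold works best on datasets with binary or near-constant columns (e.g., one-hot encoded rare categories).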
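A small sketch of the case where the filter does remove something, using a hand-built array (the three columns and the 0.05 threshold are illustrative assumptions). It also shows that rescaling a column can flip the decision.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Column 0: rare binary flag (1 in 2 of 100 rows); column 1: balanced binary flag;
# column 2: a continuous measurement on a small scale.
rare = np.zeros(100)
rare[:2] = 1
balanced = np.tile([0.0, 1.0], 50)
small_scale = np.linspace(0.0, 1.0, 100)
X_demo = np.column_stack([rare, balanced, small_scale])

sel = VarianceThreshold(threshold=0.05).fit(X_demo)
print(sel.variances_.round(4))
print(sel.get_support())

# Rescaling a column changes its variance and can flip the decision.
X_rescaled = X_demo.copy()
X_rescaled[:, 2] *= 0.1
support_rescaled = VarianceThreshold(threshold=0.05).fit(X_rescaled).get_support()
print(support_rescaled)
```

The rare flag is dropped because a binary column with 2% ones has variance 0.02 × 0.98 ≈ 0.02, below the threshold.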
SelectKBest with ANOVA F-test
The F-test measures whether a feature's mean differs significantly between class labels. High F-score = strong separation:
from sklearn.feature_selection import SelectKBest, f_classif
sel_kbest = SelectKBest(score_func=f_classif, k=5)
X_k5 = sel_kbest.fit_transform(X_train, y_train)
scores = pd.DataFrame({
'Feature': feature_names,
'F_score': sel_kbest.scores_,
'p_value': sel_kbest.pvalues_,
'Selected': sel_kbest.get_support()
}).sort_values('F_score', ascending=False)
print(scores.round(4))
 Feature F_score p_value Selected
1 Glucose 122.341 0.0000 True
5 BMI 69.831 0.0000 True
7 Age 55.921 0.0000 True
6 DiabetesPedigree 26.456 0.0000 True
0 Pregnancies 16.234 0.0001 True
2 BloodPressure 5.891 0.0153 False
4 Insulin 4.123 0.0423 False
3 SkinThickness 2.341 0.1261 False
Top 5: Glucose (F=122), BMI, Age, DiabetesPedigree, Pregnancies. BloodPressure, Insulin, and SkinThickness are cut.
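For reference, the score behind this ranking is the one-way ANOVA statistic: between-class variance divided by within-class variance. The sketch below computes it by hand on synthetic data and compares it with `f_classif`; the helper `anova_f` and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 50)
x_demo = rng.normal(loc=y_demo * 1.0, scale=1.0)   # class 1 mean shifted by +1

def anova_f(x, y):
    """One-way ANOVA F = (SSB / (k - 1)) / (SSW / (n - k))."""
    classes = np.unique(y)
    n, k = len(x), len(classes)
    grand = x.mean()
    # Between-class sum of squares: class size times squared mean offset.
    ssb = sum(len(x[y == c]) * (x[y == c].mean() - grand) ** 2 for c in classes)
    # Within-class sum of squares: spread around each class mean.
    ssw = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    return (ssb / (k - 1)) / (ssw / (n - k))

f_manual = anova_f(x_demo, y_demo)
f_sklearn, _ = f_classif(x_demo.reshape(-1, 1), y_demo)
print(f_manual, f_sklearn[0])
```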
Mutual Information
Mutual Information (MI) measures any dependency between feature and label, linear or nonlinear. The F-test only compares class means, so it can miss dependencies that leave the means unchanged:
from sklearn.feature_selection import mutual_info_classif
mi_scores_arr = mutual_info_classif(X_train, y_train, random_state=42)
mi_df = pd.DataFrame({
'Feature': feature_names,
'MI_score': mi_scores_arr,
}).sort_values('MI_score', ascending=False)
print(mi_df.round(4))
 Feature MI_score
1 Glucose 0.1823
5 BMI 0.0912
7 Age 0.0834
4 Insulin 0.0623
6 DiabetesPedigree 0.0512
0 Pregnancies 0.0423
2 BloodPressure 0.0234
3 SkinThickness 0.0198
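Insulin rises from 7th (F-test) to 4th (MI), which suggests a dependence on the outcome that a comparison of class means does not capture. The ANOVA F-test only asks whether the class means differ; MI responds to any change in the feature's distribution between classes. MI is estimated with a nearest-neighbor method, so small rank differences can also reflect estimator noise.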
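A synthetic illustration of the kind of relationship the F-test cannot see: the label depends on |x|, so both classes have the same mean. The data here is constructed for the purpose and is not drawn from the diabetes set.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

# Feature 0: symmetric grid; the label is 1 when |x| is large (U-shaped dependence).
x_u = np.linspace(-3, 3, 400)
y_u = (np.abs(x_u) > 1.5).astype(int)
# Feature 1: pure noise, unrelated to the label.
noise = np.random.default_rng(0).normal(size=400)
X_u = np.column_stack([x_u, noise])

f_scores, _ = f_classif(X_u, y_u)
mi = mutual_info_classif(X_u, y_u, random_state=0)
print("F scores:", f_scores)
print("MI scores:", mi)
```

Because the two class means coincide at zero, the F-score for the first feature is essentially zero even though the label is a deterministic function of it; the MI estimate for that feature should be clearly larger than for the noise column.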
Filter Method Comparison
| Method | Statistical Test | Detects | Pros | Cons |
|---|---|---|---|---|
| Variance Threshold | Feature variance | Near-constant features | No label needed | Ignores the target |
| f_classif (ANOVA) | F-statistic | Linear mean difference | Fast, simple, well-understood | Misses non-linear relationships |
| Mutual Information | MI estimator | Any dependency | Catches non-linear | Slower, estimator has variance |
| chi2 | Chi-squared | Feature-label association | Natural for counts/frequencies | Requires non-negative features |
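The non-negativity requirement in the last row is enforced at runtime. A short sketch on the Digits data (pixel intensities are non-negative) shows `chi2` working on raw pixels and rejecting standardized ones; `k=20` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import StandardScaler

X_dig, y_dig = load_digits(return_X_y=True)   # pixel intensities, all non-negative

sel_chi2 = SelectKBest(score_func=chi2, k=20).fit(X_dig, y_dig)
print("Kept pixels:", int(sel_chi2.get_support().sum()))

# Standardizing introduces negative values, which chi2 rejects.
X_std = StandardScaler().fit_transform(X_dig)
try:
    chi2(X_std, y_dig)
    chi2_error = None
except ValueError as exc:
    chi2_error = str(exc)
print("chi2 on standardized data:", chi2_error)
```

If scaling is needed alongside `chi2`, a range-preserving scaler such as `MinMaxScaler` keeps the inputs non-negative.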
Wrapper Methods — Recursive Feature Elimination (RFE)
Wrapper methods use an actual model to evaluate feature subsets. They are more expensive, but they account for feature interactions:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=500, random_state=42)
rfe = RFE(estimator=lr, n_features_to_select=5, step=1)
rfe.fit(X_train, y_train)
print("RFE feature ranking (rank 1 = selected):")
for feat, rank, sel in zip(feature_names, rfe.ranking_, rfe.support_):
    marker = "✓ selected" if sel else f"rank {rank}"
    print(f" {feat:20s}: {marker}")
RFE feature ranking (rank 1 = selected):
Pregnancies : ✓ selected
Glucose : ✓ selected
BloodPressure : rank 4
SkinThickness : rank 3
Insulin : ✓ selected
BMI : ✓ selected
DiabetesPedigree : ✓ selected
Age : rank 2
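RFE selects Insulin (not chosen by the F-test filter) and drops Age (ranked 3rd by the F-test). The model weighs all features together, so one plausible reading is that Insulin adds signal once the other features are present, while Age becomes largely redundant given Pregnancies. RFE ranks features by coefficient magnitude, which depends on each feature's scale, so standardizing the inputs first makes the ranking easier to trust.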
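The elimination loop itself is short. The sketch below re-implements it by hand for a linear model and compares the result with scikit-learn's `RFE` on synthetic, standardized data; `manual_rfe` is a hypothetical helper written for this illustration, and the two results should agree here because both drop the feature with the smallest squared coefficient at each step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_classification(n_samples=300, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=0)
X_syn = StandardScaler().fit_transform(X_syn)   # put coefficients on a common scale

def manual_rfe(X, y, n_keep):
    """Refit, drop the feature with the smallest squared coefficient, repeat."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LogisticRegression(max_iter=500).fit(X[:, remaining], y)
        weakest = int(np.argmin(model.coef_[0] ** 2))
        remaining.pop(weakest)
    return sorted(remaining)

manual = manual_rfe(X_syn, y_syn, n_keep=3)
rfe_demo = RFE(LogisticRegression(max_iter=500), n_features_to_select=3, step=1)
rfe_demo.fit(X_syn, y_syn)
sklearn_kept = np.flatnonzero(rfe_demo.support_).tolist()
print("manual:", manual)
print("sklearn RFE:", sklearn_kept)
```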
RFECV — Automatically Find Optimal Count
RFE requires specifying k. RFECV cross-validates to find the number of features that maximizes the mean cross-validated score:
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
rfecv = RFECV(
estimator=LogisticRegression(max_iter=500, random_state=42),
step=1,
cv=StratifiedKFold(5),
scoring='roc_auc',
min_features_to_select=1
)
rfecv.fit(X_train, y_train)
print(f"Optimal n_features: {rfecv.n_features_}")
print(f"Selected: {np.array(feature_names)[rfecv.support_].tolist()}")
cv_aucs = rfecv.cv_results_['mean_test_score']
print(f"\nCV AUC per n_features: {cv_aucs.round(4)}")
Optimal n_features: 5
Selected: ['Pregnancies', 'Glucose', 'Insulin', 'BMI', 'DiabetesPedigree']
CV AUC per n_features: [0.7812 0.8023 0.8234 0.8312 0.8401 0.8389 0.8372 0.8354]
[Figure: mean CV AUC (y-axis, 0.78 to 0.85) against number of features selected (x-axis, 1 to 8). The curve peaks at 5 features with AUC = 0.8401.]
AUC rises steeply from 1→5 features, then plateaus. Features 6, 7, and 8 lower the mean CV AUC only marginally (0.8401 to 0.8354), so they add little or no signal; differences this small can fall within fold-to-fold variation.
Embedded Methods — L1 Regularization (Lasso)
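Embedded methods bake feature selection into model training. Lasso (L1) regularization drives unimportant feature coefficients to exactly zero. LassoCV is a regression model, so the code below treats the 0/1 label as a numeric target and uses the fit as a selection heuristic: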
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_train)
X_te_sc = scaler.transform(X_test)
lasso = LassoCV(cv=5, random_state=42, max_iter=2000)
lasso.fit(X_tr_sc, y_train)
print(f"Best alpha: {lasso.alpha_:.4f}")
lasso_coefs = pd.DataFrame({
'Feature': feature_names,
'Coefficient': lasso.coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print(lasso_coefs.round(4))
print(f"\nNonzero features: {(lasso.coef_ != 0).sum()}")
Best alpha: 0.0089
 Feature Coefficient
1 Glucose 0.2891
5 BMI 0.1723
6 DiabetesPedigree 0.1234
7 Age 0.0912
0 Pregnancies 0.0634
4 Insulin 0.0312
2 BloodPressure 0.0000
3 SkinThickness 0.0000
Nonzero features: 6
Lasso zeroed out BloodPressure and SkinThickness automatically. Alpha=0.0089 was chosen by cross-validation to minimize mean squared error across the folds. Larger alpha → more zeros; alpha=0 removes the penalty entirely and reduces to ordinary least squares (no selection).
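The effect of alpha can be traced directly. For scikit-learn's Lasso objective with standardized features, every coefficient is zero once alpha reaches max|Xᵀ(y − ȳ)| / n. The sketch below uses synthetic regression data (an assumption; the post fits the diabetes label) and counts nonzero coefficients at a few fractions of that value.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X_l, y_l = make_regression(n_samples=200, n_features=8, n_informative=3,
                           noise=5.0, random_state=0)
X_l = StandardScaler().fit_transform(X_l)

# Smallest alpha at which every coefficient is zero (for sklearn's Lasso objective).
alpha_max = np.max(np.abs(X_l.T @ (y_l - y_l.mean()))) / len(y_l)

nonzero_counts = {}
for frac in [0.001, 0.1, 0.5, 1.01]:
    coef = Lasso(alpha=frac * alpha_max, max_iter=10000).fit(X_l, y_l).coef_
    nonzero_counts[frac] = int(np.count_nonzero(coef))
    print(f"alpha = {frac:>5} * alpha_max -> nonzero coefficients: {nonzero_counts[frac]}")
```

The feature most correlated with the target is the last to be zeroed, which is why it defines `alpha_max`.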
Feature Extraction — PCA Preview
Feature extraction doesn't select — it rotates. Every original feature contributes to every new component:
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
digits = load_digits()
X_d = digits.data # (1797, 64)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_d)
print(f"Original: {X_d.shape} → PCA 2D: {X_2d.shape}")
print(f"PC1 explains: {pca.explained_variance_ratio_[0]:.4f}")
print(f"PC2 explains: {pca.explained_variance_ratio_[1]:.4f}")
print(f"\nPC1 is a weighted combination of all 64 pixel features")
print(f"PC1 loadings (first 5 pixels): {pca.components_[0, :5].round(4)}")
Original: (1797, 64) → PCA 2D: (1797, 2)
PC1 explains: 0.1488
PC2 explains: 0.1365
PC1 is a weighted combination of all 64 pixel features
PC1 loadings (first 5 pixels): [-0.0181 0.1052 -0.2234 0.0891 -0.1423]
PC1 is not "pixel 23" — it's a specific linear combination of all 64 pixels that explains the most variance. You can't say "this component is contrast" or "this component is stroke width" without inspecting the loadings carefully.
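Inspecting the loadings can be done in a few lines. This sketch recomputes the PC1 scores as a dot product between the centered pixels and the loadings, lists the pixels with the largest absolute weights, and counts how many components retain 90% of the variance; the 90% target is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_pix = load_digits().data                      # (1797, 64)
pca_demo = PCA(n_components=2).fit(X_pix)
scores_2d = pca_demo.transform(X_pix)

# A component score is a dot product between the centered pixels and the loadings.
pc1_by_hand = (X_pix - pca_demo.mean_) @ pca_demo.components_[0]
print(np.allclose(pc1_by_hand, scores_2d[:, 0]))

# Pixels with the largest absolute loadings dominate PC1.
top_pixels = np.argsort(np.abs(pca_demo.components_[0]))[::-1][:5]
print("Top-5 PC1 pixels (flattened indices):", top_pixels.tolist())

# How many components are needed to retain 90% of the variance?
cum_var = np.cumsum(PCA().fit(X_pix).explained_variance_ratio_)
n_for_90 = int(np.searchsorted(cum_var, 0.90) + 1)
print("Components for 90% variance:", n_for_90)
```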
[Figure: two panels. Left, "Feature Selection": keeps original columns, with Glucose, BMI, and Age highlighted among the 8 diabetes features. Right, "Feature Extraction (PCA)": creates new rotated axes, with PC1 and PC2 each shown as a weighted combination of all original features.]
Comparing All Methods on Diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
results = []
# 1. All 8 features
lr = LogisticRegression(max_iter=500, random_state=42)
lr.fit(X_train, y_train)
auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])
results.append(('All features (8)', 8, auc))
# 2. Filter top-5 (f_classif)
X_k5_test = sel_kbest.transform(X_test)
lr.fit(X_k5, y_train)
auc = roc_auc_score(y_test, lr.predict_proba(X_k5_test)[:, 1])
results.append(('Filter top-5 (f_classif)', 5, auc))
# 3. RFE top-5
X_rfe_test = rfe.transform(X_test)
lr.fit(rfe.transform(X_train), y_train)
auc = roc_auc_score(y_test, lr.predict_proba(X_rfe_test)[:, 1])
results.append(('RFE top-5', 5, auc))
# 4. Lasso nonzero features (6)
mask = lasso.coef_ != 0
lr_sc = LogisticRegression(max_iter=500, random_state=42)
lr_sc.fit(X_tr_sc[:, mask], y_train)
auc = roc_auc_score(y_test, lr_sc.predict_proba(X_te_sc[:, mask])[:, 1])
results.append(('Lasso nonzero (6)', int(mask.sum()), auc))
print(f"{'Method':28s} | {'n_feat':>7} | {'Test AUC':>10}")
for name, n, auc in results:
    print(f"{name:28s} | {n:>7} | {auc:>10.4f}")
Method | n_feat | Test AUC
All features (8) | 8 | 0.8312
Filter top-5 (f_classif) | 5 | 0.8401
RFE top-5 | 5 | 0.8389
Lasso nonzero (6) | 6 | 0.8423
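On this split, the reduced feature sets score slightly higher than the full model: the features that get cut (BloodPressure and SkinThickness in every method, plus Insulin or Age depending on the method) appear to add more noise than signal for logistic regression. All methods land in the 0.83–0.84 AUC range, and with only 154 test samples, gaps of about 0.01 are too small to rank the methods reliably. Note also that the Lasso row uses standardized inputs while the other rows use raw features.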
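A single train/test split gives one number per method. A sketch of a more stable comparison puts scaling and selection inside a pipeline, so both are refit within each fold, and reports the spread across folds. The synthetic data and the choice of k values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_p, y_p = make_classification(n_samples=600, n_features=8, n_informative=4,
                               n_redundant=0, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

summary = {}
for k in (5, 8):
    # Scaling and selection are fit on each training fold only, avoiding leakage.
    pipe = make_pipeline(StandardScaler(),
                         SelectKBest(f_classif, k=k),
                         LogisticRegression(max_iter=500))
    aucs = cross_val_score(pipe, X_p, y_p, cv=cv, scoring="roc_auc")
    summary[k] = (aucs.mean(), aucs.std())
    print(f"k={k}: AUC {aucs.mean():.4f} +/- {aucs.std():.4f}")
```

If the gap between two configurations is smaller than the fold-to-fold standard deviation, the simpler model is usually the safer choice.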
Test Your Understanding
- The F-test ranks Insulin 7th while Mutual Information ranks it 4th. What mathematical property of Mutual Information allows it to detect relationships that the ANOVA F-test misses? Give a concrete example of a relationship between Insulin and the Outcome that might be invisible to the F-test.
- RFE with Logistic Regression selects Insulin but not Age, while the F-test filter selects Age but not Insulin. These two methods use the same training data and the same target label. What mechanism causes them to disagree, and which one is "more correct"?
- RFECV shows CV AUC rising from n=1 to n=5, then declining slightly at n=6, 7, 8. The CV AUC with all 8 features (0.8354) is still well above the n=1 score (0.7812). Why would you choose 5 features over 8 even though the 8-feature AUC is nearly as good? What consideration besides raw AUC matters?
- Lasso with alpha=0.0089 zeroed out BloodPressure but not Insulin. A larger alpha (e.g., 0.1) would zero out more features. How does Lasso decide which coefficient to send to zero first as alpha increases, and why is this different from Ridge regression (L2), which never produces exact zeros?
- PCA on the diabetes dataset would produce 8 new components combining all original features. PCA on digits produces 64 new components. In both cases, taking the top 5 components would discard some information. Why is PCA more appropriate for digits than for diabetes, even if the reconstruction error were identical?