Decision Tree: Diabetes Prediction Project
Theory is complete. This post runs a decision tree end-to-end on a real clinical dataset: Pima Indians Diabetes — predicting diabetes diagnosis from health measurements. It follows the full ML workflow: data quality, EDA, baseline, tuning, evaluation, and threshold optimization for clinical use.
Anchor dataset: Pima Indians Diabetes — 768 samples, 8 features, binary outcome (1=diabetic).
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import (classification_report, confusion_matrix,
roc_auc_score, accuracy_score)
from sklearn.dummy import DummyClassifier
columns = ['Pregnancies','Glucose','BloodPressure','SkinThickness',
'Insulin','BMI','DiabetesPedigree','Age','Outcome']
df = pd.read_csv('pima-indians-diabetes.csv', names=columns)
print(df.shape)
print(df['Outcome'].value_counts())
(768, 9)
Outcome
0 500
1 268
Class distribution: 500 non-diabetic (65.1%), 268 diabetic (34.9%). Mild imbalance — a baseline classifier that always predicts non-diabetic achieves 65.1% accuracy.
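The percentages above follow directly from the class counts. A minimal sketch of that arithmetic, using only the two counts printed by `value_counts()` (the variable names are illustrative):

```python
# Class counts taken from the value_counts() output above.
counts = {0: 500, 1: 268}
total = sum(counts.values())

# Share of each class, and the accuracy of always predicting the majority class.
shares = {label: n / total for label, n in counts.items()}
majority_label = max(counts, key=counts.get)
majority_accuracy = counts[majority_label] / total

print(f"non-diabetic share: {shares[0]:.3f}")
print(f"diabetic share:     {shares[1]:.3f}")
print(f"always-predict-{majority_label} accuracy: {majority_accuracy:.3f}")
```

Any model worth keeping has to clear this majority-class accuracy on held-out data.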
Step 1: Data Quality — The Zero Problem
Several features have biologically impossible zero values. Glucose = 0 would be fatal; BMI = 0 is impossible. These are missing values encoded as zeros.
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print("Zero counts in biological features:")
for col in zero_cols:
    n_zero = (df[col] == 0).sum()
    print(f" {col:20s}: {n_zero} zeros ({n_zero/len(df)*100:.1f}%)")
Zero counts in biological features:
Glucose : 5 zeros (0.7%)
BloodPressure : 35 zeros (4.6%)
SkinThickness :227 zeros (29.6%)
Insulin :374 zeros (48.7%)
BMI : 11 zeros (1.4%)
Insulin has 374 zeros (49% of rows) — nearly half the dataset has missing insulin measurements. Replace zeros with NaN and impute with the column median (robust to outliers that remain).
df[zero_cols] = df[zero_cols].replace(0, np.nan)
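for col in zero_cols:
    # Assign the result back: fillna(inplace=True) on a column selection
    # triggers warnings in recent pandas and may not modify df itself.
    df[col] = df[col].fillna(df[col].median())
print(f"After imputation: {df.isnull().sum().sum()} nulls")
After imputation: 0 nulls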
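The medians above are computed over all 768 rows, including rows that later land in the test set. One alternative, sketched here under the assumption that scikit-learn's `SimpleImputer` is acceptable for this step, is to learn the medians from the training rows only and reuse them on the test rows. The `toy` frame and its values are made up for illustration and are not the Pima data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for two of the zero-coded columns (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Glucose": rng.integers(70, 200, size=40).astype(float),
    "Insulin": rng.integers(15, 300, size=40).astype(float),
})
toy.loc[::4, "Insulin"] = 0.0                       # every 4th row: missing, coded as 0
toy["Insulin"] = toy["Insulin"].replace(0.0, np.nan)

train, test = train_test_split(toy, test_size=0.25, random_state=0)

# Fit the imputer on training rows only, then apply the same medians to test.
imputer = SimpleImputer(strategy="median")
train_imp = pd.DataFrame(imputer.fit_transform(train), columns=toy.columns, index=train.index)
test_imp = pd.DataFrame(imputer.transform(test), columns=toy.columns, index=test.index)

print("medians learned from train:", dict(zip(toy.columns, imputer.statistics_)))
print("nulls left:", int(train_imp.isna().sum().sum() + test_imp.isna().sum().sum()))
```

With only a handful of columns the difference is usually small, but fitting preprocessing on the training split keeps the test set untouched until evaluation.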
Step 2: EDA — Feature Correlation with Outcome
corr = df.corr()['Outcome'].drop('Outcome').sort_values(ascending=False)
print(corr.round(3))
Glucose 0.466
BMI 0.293
Age 0.238
Pregnancies 0.222
DiabetesPedigree 0.173
Insulin 0.130
SkinThickness 0.074
BloodPressure 0.065
Glucose has the strongest correlation (0.47) — expect it to be the root split. BMI and Age follow. BloodPressure has almost no linear relationship with diabetes outcome despite being a standard clinical risk factor.
Step 3: Train/Test Split
X = df.drop('Outcome', axis=1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
print(f"Train diabetic rate: {y_train.mean():.3f}, Test: {y_test.mean():.3f}")
Train: (614, 8), Test: (154, 8)
Train diabetic rate: 0.349, Test: 0.351
stratify=y preserves the 35% diabetic rate in both splits.
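To see what `stratify` changes, the sketch below splits a label vector with the same class counts as the dataset (500 zeros, 268 ones) with and without stratification. The features are placeholders; only the label proportions matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the dataset: 500 zeros, 268 ones.
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder features

_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(f"stratified:   train={y_tr_s.mean():.3f}, test={y_te_s.mean():.3f}")
print(f"unstratified: train={y_tr_r.mean():.3f}, test={y_te_r.mean():.3f}")
```

The stratified split keeps both rates within a rounding step of the overall 34.9%, while the unstratified rates depend on the random draw.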
Step 4: Baseline and Overfitting Gap
# Baseline: always predict majority class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(f"Baseline accuracy: {dummy.score(X_test, y_test):.4f}")
# Unpruned tree
dt_full = DecisionTreeClassifier(random_state=42)
dt_full.fit(X_train, y_train)
print(f"Unpruned — train: {dt_full.score(X_train, y_train):.4f}, test: {dt_full.score(X_test, y_test):.4f}")
print(f"Unpruned — depth: {dt_full.get_depth()}, leaves: {dt_full.get_n_leaves()}")
Baseline accuracy: 0.6494
Unpruned — train: 1.0000, test: 0.7208
Unpruned — depth: 16, leaves: 173
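The unpruned tree memorizes all 614 training samples: 173 leaves for 614 samples is about 3.5 samples per leaf. Test accuracy (72.1%) is only about 7 points above the baseline (64.9%), while training accuracy is 100%. That 28-point train/test gap shows the tree has overfit badly.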
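The same train/test gap can be watched as a function of depth. This is a sketch on synthetic data from `make_classification` (not the Pima dataset), so the exact numbers will differ from the ones above; the pattern to look for is training accuracy climbing to 1.0 while test accuracy stalls.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Pima dataset); flip_y adds label noise.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

gaps = {}
for depth in [2, 4, 8, None]:          # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc, test_acc = tree.score(X_tr, y_tr), tree.score(X_te, y_te)
    gaps[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, "
          f"test={test_acc:.3f}, gap={train_acc - test_acc:.3f}")
```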
Step 5: Hyperparameter Tuning
param_grid = {
'max_depth': [3, 4, 5, 6, None],
'min_samples_split': [2, 10, 20],
'min_samples_leaf': [1, 5, 10],
'criterion': ['gini', 'entropy'],
}
gs = GridSearchCV(
DecisionTreeClassifier(random_state=42),
param_grid, cv=10, scoring='roc_auc', n_jobs=-1
)
gs.fit(X_train, y_train)
print(f"Best params: {gs.best_params_}")
print(f"Best CV AUC: {gs.best_score_:.4f}")
Best params: {'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 5, 'min_samples_split': 10}
Best CV AUC: 0.8310
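It helps to know how much work this search does. The sketch below counts the candidate combinations in the same `param_grid` with scikit-learn's `ParameterGrid` and multiplies by the 10 folds used above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
param_grid = {
    'max_depth': [3, 4, 5, 6, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'criterion': ['gini', 'entropy'],
}

n_candidates = len(ParameterGrid(param_grid))   # 5 * 3 * 3 * 2 combinations
n_fits = n_candidates * 10                      # one fit per candidate per fold (cv=10)

print(f"candidates: {n_candidates}, cross-validation fits: {n_fits}")
```

With the default `refit=True`, `GridSearchCV` also trains the winning configuration once more on the full training set, which is the model returned as `best_estimator_`.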
Step 6: Final Evaluation
best_dt = gs.best_estimator_
y_pred = best_dt.predict(X_test)
y_prob = best_dt.predict_proba(X_test)[:, 1]
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print()
print(classification_report(y_test, y_pred, target_names=['Non-Diabetic', 'Diabetic']))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
Confusion Matrix:
[[88 12]
[22 32]]
precision recall f1-score support
Non-Diabetic 0.80 0.88 0.84 100
Diabetic 0.73 0.59 0.65 54
accuracy 0.78 154
AUC-ROC: 0.8309
Reading the confusion matrix:
- TP = 32: diabetics correctly identified
- TN = 88: non-diabetics correctly identified
- FP = 12: healthy patients flagged as diabetic (unnecessary follow-up)
- FN = 22: diabetics predicted as healthy — the clinical danger
FN = 22 means 22 diabetic patients receive a false all-clear. In a clinical screening tool, this is unacceptable. The default threshold of 0.5 treats a false negative and a false positive as equally costly; it is not chosen to maximize recall.
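The diabetic-class row of the classification report can be recomputed by hand from the four counts listed above. A minimal sketch of that arithmetic:

```python
# Counts read off the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 88, 12
fn, tp = 22, 32

precision = tp / (tp + fp)                         # of those flagged, how many are diabetic
recall = tp / (tp + fn)                            # of the diabetics, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Recall is the only one of these metrics with FN in its denominator, which is why it is the one to watch for a screening tool.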
Step 7: Feature Importance
importances = pd.Series(best_dt.feature_importances_, index=X.columns)
importances_sorted = importances.sort_values(ascending=False)
print(importances_sorted.round(4))
Glucose 0.3241
BMI 0.1892
Age 0.1543
DiabetesPedigree 0.1012
Pregnancies 0.0891
BloodPressure 0.0623
Insulin 0.0512
SkinThickness 0.0286
[Figure: horizontal bar chart of the feature importances printed above, ordered from Glucose (0.324) down to SkinThickness (0.029).]
Glucose (32%) and BMI (19%) together account for 51% of the total impurity reduction in the tree. This matches clinical knowledge: high blood glucose is the primary diabetes indicator, and BMI is a key risk factor. BloodPressure contributes only 6%, consistent with its weak correlation with the outcome (0.065): the tree found better splits elsewhere.
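For readers who want to see where these numbers come from, the sketch below recomputes impurity-based importances from the fitted tree's internal arrays and compares them with `feature_importances_`. It uses synthetic data from `make_classification`, not the Pima dataset, and the loop is an illustration of the calculation rather than scikit-learn's own code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)
t = tree.tree_

raw = np.zeros(X_demo.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:                      # leaf: no split, no contribution
        continue
    # Sample-weighted impurity decrease produced by this node's split.
    decrease = (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )
    raw[t.feature[node]] += decrease    # credit the feature used at this node

manual = raw / raw.sum()                # normalize so importances sum to 1
print(np.round(manual, 4))
print("matches feature_importances_:", np.allclose(manual, tree.feature_importances_))
```

Because each split is weighted by the number of samples reaching it, a single split near the root can outweigh several splits deep in the tree.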
Step 8: Tree Visualization
from sklearn.tree import export_text
print(export_text(best_dt, feature_names=list(X.columns), max_depth=3))
|--- Glucose <= 127.50
| |--- BMI <= 29.95
| | |--- Age <= 28.50
| | | |--- class: 0
| | |--- Age > 28.50
| | | |--- class: 0
| |--- BMI > 29.95
| | |--- DiabetesPedigree <= 0.53
| | | |--- class: 0
| | |--- DiabetesPedigree > 0.53
| | | |--- class: 1
|--- Glucose > 127.50
| |--- BMI <= 29.95
| | |--- Age <= 28.50
| | | |--- class: 0
| | |--- Age > 28.50
| | | |--- class: 1
| |--- BMI > 29.95
| | |--- ...
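Root split: Glucose ≤ 127.5. In the low-glucose branch (left), every leaf shown predicts non-diabetic except for patients with BMI > 29.95 and DiabetesPedigree > 0.53. In the high-glucose branch, the tree splits on BMI and then Age: patients with BMI ≤ 29.95 are predicted diabetic only when Age > 28.5, and the BMI > 29.95 subtree is truncated in this printout. The first 3 levels tell a clinically coherent story.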
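A threshold such as 127.5 comes from scanning the midpoints between adjacent sorted feature values and keeping the one with the largest impurity decrease. The sketch below does this for a single feature with Gini impurity. The helper names `gini` and `best_threshold` are hypothetical, and the ten glucose readings and outcomes are made up for illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 0/1 label array."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Return (threshold, impurity decrease) of the best single split."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    parent = gini(labels)
    best = (None, 0.0)
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue                      # no threshold fits between equal values
        thr = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if parent - child > best[1]:
            best = (thr, parent - child)
    return best

# Made-up glucose readings and outcomes, for illustration only.
glucose = np.array([85, 99, 110, 120, 125, 130, 140, 155, 170, 190], dtype=float)
outcome = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])

thr, gain = best_threshold(glucose, outcome)
print(f"best threshold: {thr}, impurity decrease: {gain:.3f}")
```

Since the threshold is a midpoint between two observed values, it depends on which rows happen to be in the training set.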
Step 9: Threshold Tuning for Clinical Use
At threshold 0.5: diabetic recall = 59%, FN = 22. For a screening tool, missing 22 diabetics is unacceptable. Lower the threshold to increase recall at the cost of precision:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)
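# Find the highest threshold whose diabetic recall is still >= 0.75.
# recalls is non-increasing as the threshold rises and recalls[0] is 1.0,
# so np.argmax on the mask would always return index 0 (the lowest threshold).
target_recall = 0.75
idx = np.where(recalls[:-1] >= target_recall)[0][-1]
best_t = thresholds[idx]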
print(f"Threshold for recall ≥ 0.75: {best_t:.4f}")
y_pred_adj = (y_prob >= best_t).astype(int)
print(classification_report(y_test, y_pred_adj, target_names=['Non-Diabetic', 'Diabetic']))
Threshold for recall ≥ 0.75: 0.3500
precision recall f1-score support
Non-Diabetic 0.88 0.79 0.83 100
Diabetic 0.65 0.78 0.71 54
accuracy 0.78 154
At threshold 0.35, diabetic recall improves from 59% to 78% and FN drops from 22 to 12, while FP increases from 12 to 21 and precision decreases from 73% to 65%. Overall accuracy stays at about 78%: the threshold mainly changes which errors the model makes (34 in total before, 33 after), not how many.
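The threshold above is chosen on the same test set that is used to report the final numbers. One possible variation, sketched here as an assumption rather than as the method used in this post, picks the threshold from out-of-fold probabilities on the training set via `cross_val_predict` and then applies it once to the test set. The data is synthetic and the tree settings are only borrowed from the tuned parameters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced stand-in data (not the Pima dataset).
X_demo, y_demo = make_classification(
    n_samples=600, weights=[0.65, 0.35], flip_y=0.05, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

clf = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)

# Out-of-fold probabilities: each training row is scored by a model that never saw it.
oof_prob = cross_val_predict(clf, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_tr, oof_prob)
target_recall = 0.75
# Recall falls as the threshold rises, so take the last index that still meets the target.
idx = np.where(recalls[:-1] >= target_recall)[0][-1]
chosen_t = thresholds[idx]

clf.fit(X_tr, y_tr)
test_pred = (clf.predict_proba(X_te)[:, 1] >= chosen_t).astype(int)
print(f"threshold chosen on training folds: {chosen_t:.3f}")
print(f"test recall at that threshold: {recall_score(y_te, test_pred):.3f}")
```

The test recall may land above or below the target, and that difference is an honest estimate of how the chosen threshold behaves on unseen patients.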
Project Summary
| Step | Action | Key Finding |
|---|---|---|
| EDA | Check zero values | 374 Insulin zeros (48.7%) → imputed with median |
| EDA | Correlation | Glucose (0.47) strongest predictor |
| Baseline | DummyClassifier | 64.9% — minimum acceptable bar |
| Unpruned | DecisionTreeClassifier | Train=1.0, Test=0.72 — severe overfit |
| Tuning | GridSearchCV (AUC) | Best: Gini, depth=4, min_leaf=5, min_split=10 |
| Evaluation | Confusion matrix | FN=22 — clinical danger, recall=59% |
| Features | Gini importance | Glucose (32%), BMI (19%) — top 2 |
| Threshold | PR curve at recall≥0.75 | t=0.35, FN drops to 12, recall=78% |
Test Your Understanding
- Insulin has 374 zeros (49% of rows), all imputed with the median, so after imputation nearly half of the Insulin values are identical. How does this affect the Gini importance of Insulin? Does imputing with the median increase or decrease its apparent predictive value?
- The root splits at Glucose ≤ 127.5. This threshold is found during training by maximizing information gain (IG). If you re-ran the model with a different `random_state` in `train_test_split`, would the root threshold change? What factors determine the stability of the root threshold?
- At threshold 0.5: Precision=0.73, Recall=0.59 for diabetics. At threshold 0.35: Precision=0.65, Recall=0.78. Compute the F₂ score (which weights recall 4× as much as precision) for both thresholds: $F_2 = \frac{5 \cdot P \cdot R}{4P + R}$. Which threshold is better by F₂?
- Feature importance is computed as the total Gini reduction attributable to each feature, normalized to sum to 1. A feature that appears in multiple splits (at different depths) accumulates importance. If Glucose appears only at the root but BMI appears at 5 nodes deeper in the tree, can BMI have higher importance than Glucose? Under what conditions?
- The tuned tree achieves AUC=0.83. A logistic regression on the same data typically achieves AUC≈0.84. Decision trees are often weaker than logistic regression on tabular data. Why might you still choose the decision tree for a clinical application despite lower AUC?