Decision Tree: Diabetes Prediction Project

Jun 26, 2026 · 7 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Theory is complete. This post runs a decision tree end-to-end on a real clinical dataset: Pima Indians Diabetes — predicting diabetes diagnosis from health measurements. It follows the full ML workflow: data quality, EDA, baseline, tuning, evaluation, and threshold optimization for clinical use.

Anchor dataset: Pima Indians Diabetes — 768 samples, 8 features, binary outcome (1=diabetic).

python
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import (classification_report, confusion_matrix,
                              roc_auc_score, accuracy_score)
from sklearn.dummy import DummyClassifier

columns = ['Pregnancies','Glucose','BloodPressure','SkinThickness',
           'Insulin','BMI','DiabetesPedigree','Age','Outcome']

df = pd.read_csv('pima-indians-diabetes.csv', names=columns)
print(df.shape)
print(df['Outcome'].value_counts())
(768, 9)
Outcome
0    500
1    268

Class distribution: 500 non-diabetic (65.1%), 268 diabetic (34.9%). Mild imbalance — a baseline classifier that always predicts non-diabetic achieves 65.1% accuracy.

Step 1: Data Quality — The Zero Problem

Several features have biologically impossible zero values. Glucose = 0 would be fatal; BMI = 0 is impossible. These are missing values encoded as zeros.

python
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print("Zero counts in biological features:")
for col in zero_cols:
    n_zero = (df[col] == 0).sum()
    print(f"  {col:20s}: {n_zero} zeros ({n_zero/len(df)*100:.1f}%)")
Zero counts in biological features:
  Glucose             : 5 zeros (0.7%)
  BloodPressure       : 35 zeros (4.6%)
  SkinThickness       : 227 zeros (29.6%)
  Insulin             : 374 zeros (48.7%)
  BMI                 : 11 zeros (1.4%)

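Insulin has 374 zeros (49% of rows), so nearly half the dataset is missing an insulin measurement. Replace the zeros with NaN and impute with the column median, which is less sensitive to the remaining outliers than the mean. Note that the medians here are computed on the full dataset before the train/test split, so a small amount of test-set information leaks into training.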

python
df[zero_cols] = df[zero_cols].replace(0, np.nan)

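for col in zero_cols:
    # Assign back instead of using inplace=True on a column selection,
    # which newer pandas versions warn about and may not apply to df.
    df[col] = df[col].fillna(df[col].median())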

print(f"After imputation: {df.isnull().sum().sum()} nulls")
After imputation: 0 nulls
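To avoid computing medians from rows that later land in the test set, the imputer can be fit on the training rows only. The following is a minimal sketch under stated assumptions: the `toy` frame and its values are made up for illustration, and `SimpleImputer` is swapped in for the manual `fillna` loop used above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for one column where 0 is used as a missing-value code
# (values are invented for illustration, not taken from the Pima data).
toy = pd.DataFrame({"Insulin": [0, 80, 0, 120, 100, 0, 90, 200, 0, 110]})
toy = toy.replace(0, np.nan)

train, test = train_test_split(toy, test_size=0.3, random_state=0)

# Learn the median from the training rows only, then reuse it on the test rows.
imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)
test_filled = imputer.transform(test)

remaining_nans = int(np.isnan(train_filled).sum() + np.isnan(test_filled).sum())
print("Median learned from train:", imputer.statistics_[0])
print("Remaining NaNs:", remaining_nans)
```

In a full workflow the imputer would typically sit inside a `Pipeline` with the classifier, so that cross-validation refits it on each training fold.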

Step 2: EDA — Feature Correlation with Outcome

python
corr = df.corr()['Outcome'].drop('Outcome').sort_values(ascending=False)
print(corr.round(3))
Glucose             0.466
BMI                 0.293
Age                 0.238
Pregnancies         0.222
DiabetesPedigree    0.173
Insulin             0.130
SkinThickness       0.074
BloodPressure       0.065

Glucose has the strongest correlation (0.47) — expect it to be the root split. BMI and Age follow. BloodPressure has almost no linear relationship with diabetes outcome despite being a standard clinical risk factor.
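A caveat worth keeping in mind: `df.corr()` measures linear association only, so a feature with a low coefficient can still be useful to a tree. The sketch below uses synthetic data (not the Pima dataset) in which the outcome is positive at both extremes of a single feature. The Pearson correlation is close to zero, yet a depth-2 tree separates the classes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic feature, uniform on [-1, 1]; the label is 1 at BOTH extremes.
x = rng.uniform(-1, 1, size=2000)
y = (np.abs(x) > 0.5).astype(int)

# Linear correlation is near zero because the relationship is U-shaped.
pearson_r = np.corrcoef(x, y)[0, 1]

# Two threshold splits (around -0.5 and +0.5) are enough for a tree.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(x.reshape(-1, 1), y)
train_acc = tree.score(x.reshape(-1, 1), y)

print(f"Pearson r: {pearson_r:.3f}")
print(f"Depth-2 tree training accuracy: {train_acc:.3f}")
```

This does not imply BloodPressure hides such a pattern here; it only shows why a correlation ranking and a tree's importance ranking need not agree.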

Step 3: Train/Test Split

python
X = df.drop('Outcome', axis=1)
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
print(f"Train diabetic rate: {y_train.mean():.3f}, Test: {y_test.mean():.3f}")
Train: (614, 8), Test: (154, 8)
Train diabetic rate: 0.349, Test: 0.351

stratify=y preserves the 35% diabetic rate in both splits.
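To see what `stratify` buys, the sketch below rebuilds only the label vector with the same class counts (500 and 268) and compares a stratified split against several unstratified ones. The feature column is a placeholder and the seed range is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the dataset: 500 negatives, 268 positives.
y = np.array([0] * 500 + [1] * 268)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature; values are irrelevant

_, _, _, y_te_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Repeat an unstratified split with several seeds to see how far the rate drifts.
plain_rates = []
for seed in range(20):
    _, _, _, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    plain_rates.append(y_te.mean())

print(f"Overall positive rate:   {y.mean():.3f}")
print(f"Stratified test rate:    {y_te_strat.mean():.3f}")
print(f"Unstratified test rates: min={min(plain_rates):.3f}, max={max(plain_rates):.3f}")
```

With only 154 test rows, a few percentage points of drift in the positive rate is enough to move recall and precision noticeably, which is why the stratified split is used here.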

Step 4: Baseline and Overfitting Gap

python
# Baseline: always predict majority class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(f"Baseline accuracy: {dummy.score(X_test, y_test):.4f}")

# Unpruned tree
dt_full = DecisionTreeClassifier(random_state=42)
dt_full.fit(X_train, y_train)
print(f"Unpruned — train: {dt_full.score(X_train, y_train):.4f}, test: {dt_full.score(X_test, y_test):.4f}")
print(f"Unpruned — depth: {dt_full.get_depth()}, leaves: {dt_full.get_n_leaves()}")
Baseline accuracy: 0.6494
Unpruned — train: 1.0000, test: 0.7208
Unpruned — depth: 16, leaves: 173

The unpruned tree memorizes all 614 training samples: 173 leaves means roughly one leaf per 3.5 samples. Test accuracy (72.1%) is only 7 points above the baseline (64.9%) and 28 points below train accuracy, so the tree has overfit badly.
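The train/test gap can be traced as a function of depth. The sketch below uses synthetic data from `make_classification` as a stand-in (same row and feature counts as the dataset, but not the Pima data; the noise level `flip_y=0.15` is an assumption chosen to make overfitting visible).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 768 rows, 8 features, 15% of labels flipped as noise.
X, y = make_classification(n_samples=768, n_features=8, n_informative=4,
                           flip_y=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0, stratify=y)

results = {}
for depth in [2, 4, 8, None]:  # None lets the tree grow until leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, "
          f"test={results[depth][1]:.3f}")
```

Train accuracy rises toward 1.0 as depth grows, while test accuracy does not follow it. That widening gap is the signal to constrain the tree.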

Step 5: Hyperparameter Tuning

python
param_grid = {
    'max_depth': [3, 4, 5, 6, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'criterion': ['gini', 'entropy'],
}
gs = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid, cv=10, scoring='roc_auc', n_jobs=-1
)
gs.fit(X_train, y_train)
print(f"Best params: {gs.best_params_}")
print(f"Best CV AUC: {gs.best_score_:.4f}")
Best params: {'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 5, 'min_samples_split': 10}
Best CV AUC: 0.8310
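It helps to know how much work this grid implies before running it. The sketch below counts the candidates with `ParameterGrid`, using the same grid and fold count as above.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [3, 4, 5, 6, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'criterion': ['gini', 'entropy'],
}

n_candidates = len(ParameterGrid(param_grid))  # 5 * 3 * 3 * 2
n_folds = 10
total_fits = n_candidates * n_folds            # excludes the final refit

print(f"{n_candidates} candidates x {n_folds} folds = {total_fits} fits")
# → 90 candidates x 10 folds = 900 fits
```

Some of these candidates are redundant: a split needs at least `min_samples_leaf` samples on each side, so with `min_samples_leaf=10` the values 2, 10, and 20 for `min_samples_split` all produce the same tree.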

Step 6: Final Evaluation

python
best_dt = gs.best_estimator_
y_pred = best_dt.predict(X_test)
y_prob = best_dt.predict_proba(X_test)[:, 1]

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print()
print(classification_report(y_test, y_pred, target_names=['Non-Diabetic', 'Diabetic']))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
Confusion Matrix:
[[88 12]
 [22 32]]

              precision    recall  f1-score   support

Non-Diabetic       0.80      0.88      0.84       100
    Diabetic       0.73      0.59      0.65        54

    accuracy                           0.78       154

AUC-ROC: 0.8309

Reading the confusion matrix:

  • TP = 32: diabetics correctly identified
  • TN = 88: non-diabetics correctly identified
  • FP = 12: healthy patients flagged as diabetic (unnecessary follow-up)
  • FN = 22: diabetics predicted as healthy — the clinical danger

FN = 22 means 22 of the 54 diabetic patients in the test set (41%) receive a false all-clear. In a clinical screening tool, this is unacceptable. The default threshold of 0.5 treats both error types as equally costly; it does not prioritize recall.
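The headline metrics can be recomputed directly from the four cells of the confusion matrix. This short sketch uses the matrix printed above; `specificity` is an extra metric not shown in the classification report.

```python
import numpy as np

# Rows are true classes, columns are predicted classes (scikit-learn layout).
cm = np.array([[88, 12],
               [22, 32]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)        # share of diabetics that were caught
precision = tp / (tp + fp)     # share of diabetic predictions that were right
specificity = tn / (tn + fp)   # share of non-diabetics correctly cleared
accuracy = (tp + tn) / cm.sum()

print(f"recall={recall:.3f}  precision={precision:.3f}")
# → recall=0.593  precision=0.727
print(f"specificity={specificity:.3f}  accuracy={accuracy:.3f}")
# → specificity=0.880  accuracy=0.779
```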

Step 7: Feature Importance

python
importances = pd.Series(best_dt.feature_importances_, index=X.columns)
importances_sorted = importances.sort_values(ascending=False)
print(importances_sorted.round(4))
Glucose             0.3241
BMI                 0.1892
Age                 0.1543
DiabetesPedigree    0.1012
Pregnancies         0.0891
BloodPressure       0.0623
Insulin             0.0512
SkinThickness       0.0286

[Figure: Feature Importance (Gini, max_depth=4). Horizontal bar chart of the values above.]

Glucose (32%) and BMI (19%) together account for 51% of the total impurity reduction in the tree. This matches clinical knowledge: high blood glucose is the primary diabetes indicator, and BMI is a key risk factor. BloodPressure contributes only 6%, consistent with its weak correlation with the outcome (0.065): the tree found better splits elsewhere.
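Gini importance is the sample-weighted impurity decrease at each split, summed per feature and normalized. The sketch below recomputes it from the fitted tree's internal arrays and compares the result with `feature_importances_`. It runs on synthetic data (not the Pima dataset), so the numbers will differ from those above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

t = clf.tree_
manual = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, so no contribution
        continue
    # Weighted impurity decrease produced by this node's split.
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    manual[t.feature[node]] += decrease

manual /= manual.sum()  # normalize so the importances sum to 1
matches = np.allclose(manual, clf.feature_importances_)
print("Manual computation matches feature_importances_:", matches)
```

Because the decrease is weighted by the number of samples reaching a node, a split near the root usually contributes far more than a split deep in the tree.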

Step 8: Tree Visualization

python
from sklearn.tree import export_text

print(export_text(best_dt, feature_names=list(X.columns), max_depth=3))
|--- Glucose <= 127.50
|   |--- BMI <= 29.95
|   |   |--- Age <= 28.50
|   |   |   |--- class: 0
|   |   |--- Age > 28.50
|   |   |   |--- class: 0
|   |--- BMI > 29.95
|   |   |--- DiabetesPedigree <= 0.53
|   |   |   |--- class: 0
|   |   |--- DiabetesPedigree > 0.53
|   |   |   |--- class: 1
|--- Glucose > 127.50
|   |--- BMI <= 29.95
|   |   |--- Age <= 28.50
|   |   |   |--- class: 0
|   |   |--- Age > 28.50
|   |   |   |--- class: 1
|   |--- BMI > 29.95
|   |   |--- ...

Root split: Glucose ≤ 127.5. Low glucose goes left, where most leaves predict non-diabetic; high glucose goes right. Both branches split next on BMI ≤ 29.95, and the visible third-level splits use Age and DiabetesPedigree. The first 3 levels tell a clinically coherent story.

Step 9: Threshold Tuning for Clinical Use

At threshold 0.5: diabetic recall = 59%, FN = 22. For a screening tool, missing 22 diabetics is unacceptable. Lower the threshold to increase recall at the cost of precision:

python
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)

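# Find the highest threshold where diabetic recall >= 0.75.
# recalls is non-increasing as the threshold rises and recalls[0] is the
# largest value, so take the LAST index that still meets the target.
target_recall = 0.75
idx = np.where(recalls[:-1] >= target_recall)[0][-1]
best_t = thresholds[idx]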
print(f"Threshold for recall ≥ 0.75: {best_t:.4f}")

y_pred_adj = (y_prob >= best_t).astype(int)
print(classification_report(y_test, y_pred_adj, target_names=['Non-Diabetic', 'Diabetic']))
Threshold for recall ≥ 0.75: 0.3500
              precision    recall  f1-score   support

Non-Diabetic       0.88      0.79      0.83       100
    Diabetic       0.65      0.78      0.71        54

    accuracy                           0.78       154

At threshold 0.35, diabetic recall improves from 59% to 78% and FN drops from 22 to 12, while FP rises from 12 to 21 and precision falls from 73% to 65%. Overall accuracy stays at about 78%: the total error count barely moves (34 before, 33 after), so the threshold mainly changes which errors we make, not how many.
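One subtlety when picking a threshold from `precision_recall_curve`: recall falls as the threshold rises, so the entry to select is the last one that still meets the recall target, not the first. The toy labels and scores below are invented so the result can be checked by hand.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Invented toy data: four positives, with scores 0.4, 0.6, 0.7, and 0.9.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_score)

target_recall = 0.75
# Indices whose recall meets the target; the last one has the highest threshold.
ok = np.where(recalls[:-1] >= target_recall)[0]
best_t = thresholds[ok[-1]]

y_pred = (y_score >= best_t).astype(int)
recall_at_t = y_pred[y_true == 1].mean()

print(f"Chosen threshold: {best_t}")        # → 0.6
print(f"Recall at threshold: {recall_at_t}")  # → 0.75
```

In practice the threshold would ideally be chosen on a validation split or via cross-validation, since selecting it on the test set makes the reported recall optimistic.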

Project Summary

| Step | Action | Key Finding |
|---|---|---|
| EDA | Check zero values | 374 Insulin zeros (48.7%) → imputed with median |
| EDA | Correlation | Glucose (0.47) strongest predictor |
| Baseline | DummyClassifier | 64.9%, the minimum acceptable bar |
| Unpruned | DecisionTreeClassifier | Train=1.0, Test=0.72: severe overfit |
| Tuning | GridSearchCV (AUC) | Best: Gini, depth=4, min_leaf=5, min_split=10 |
| Evaluation | Confusion matrix | FN=22 (clinical danger), recall=59% |
| Features | Gini importance | Glucose (32%), BMI (19%) are the top 2 |
| Threshold | PR curve at recall≥0.75 | t=0.35, FN drops to 12, recall=78% |

Test Your Understanding

  1. Insulin has 374 zeros (49% of rows), all imputed with the median. After imputation, nearly half of the Insulin values in the dataset equal the median. How does this affect the Gini importance of Insulin? Does imputing with median increase or decrease its apparent predictive value?

  2. The root splits at Glucose ≤ 127.5. This threshold is found during training by maximizing impurity reduction (Gini gain for the tuned tree). If you re-ran the model with a different random_state in train_test_split, would the root threshold change? What factors determine the stability of the root threshold?

  3. At threshold 0.5: Precision=0.73, Recall=0.59 for diabetics. At threshold 0.35: Precision=0.65, Recall=0.78. Compute the F₂ score for both thresholds. With β = 2, recall receives weight β² = 4 relative to precision in the weighted harmonic mean: $F_2 = \dfrac{5 \cdot P \cdot R}{4P + R}$. Which threshold is better by F₂?

  4. Feature importance is computed as the total Gini reduction attributable to each feature, normalized to sum to 1. A feature that appears in multiple splits (at different depths) accumulates importance. If Glucose appears only at the root but BMI appears at 5 nodes deeper in the tree, can BMI have higher importance than Glucose? Under what conditions?

  5. The tuned tree achieves AUC=0.83. A logistic regression on the same data typically achieves AUC≈0.84. Decision trees are often weaker than logistic regression on tabular data. Why might you still choose the decision tree for a clinical application despite lower AUC?
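For question 3, a small helper makes the F-beta arithmetic easy to check. This is a sketch of the standard F-beta definition; the function name `f_beta` is illustrative, and the example inputs are chosen to differ from the quiz values so the exercise is left intact.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Weighted harmonic mean of precision and recall.

    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    """
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # both precision and recall are zero
    return (1 + b2) * precision * recall / denom


# With beta=2, high recall is rewarded more than high precision.
high_recall = f_beta(precision=0.5, recall=1.0)     # 2.5 / 3.0
high_precision = f_beta(precision=1.0, recall=0.5)  # 2.5 / 4.5

print(f"P=0.5, R=1.0 -> F2 = {high_recall:.3f}")     # → 0.833
print(f"P=1.0, R=0.5 -> F2 = {high_precision:.3f}")  # → 0.556
```

Swapping precision and recall changes the score, which is the asymmetry that makes F₂ suitable when missed cases cost more than false alarms.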
