Decision Tree: Diabetes Prediction Project

Jun 26, 2026 • 7 min read • By Mohammed Vasim

Machine Learning, AI, Data Science

Theory is complete. This post runs a decision tree end-to-end on a real clinical dataset: Pima Indians Diabetes — predicting diabetes diagnosis from health measurements. It follows the full ML workflow: data quality, EDA, baseline, tuning, evaluation, and threshold optimization for clinical use.

Anchor dataset: Pima Indians Diabetes — 768 samples, 8 features, binary outcome (1=diabetic).

python

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import (classification_report, confusion_matrix,
                              roc_auc_score, accuracy_score)
from sklearn.dummy import DummyClassifier

columns = ['Pregnancies','Glucose','BloodPressure','SkinThickness',
           'Insulin','BMI','DiabetesPedigree','Age','Outcome']

df = pd.read_csv('pima-indians-diabetes.csv', names=columns)
print(df.shape)
print(df['Outcome'].value_counts())

(768, 9)
Outcome
0    500
1    268

Class distribution: 500 non-diabetic (65.1%), 268 diabetic (34.9%). Mild imbalance — a baseline classifier that always predicts non-diabetic achieves 65.1% accuracy.

Step 1: Data Quality — The Zero Problem

Several features have biologically impossible zero values. Glucose = 0 would be fatal; BMI = 0 is impossible. These are missing values encoded as zeros.

python

zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print("Zero counts in biological features:")
for col in zero_cols:
    n_zero = (df[col] == 0).sum()
    print(f"  {col:20s}: {n_zero} zeros ({n_zero/len(df)*100:.1f}%)")

Zero counts in biological features:
  Glucose             :  5 zeros (0.7%)
  BloodPressure       : 35 zeros (4.6%)
  SkinThickness       :227 zeros (29.6%)
  Insulin             :374 zeros (48.7%)
  BMI                 : 11 zeros (1.4%)

Insulin has 374 zeros (49% of rows), so nearly half the dataset has no insulin measurement. Replace the zeros with NaN and impute with the column median, which is less sensitive to outliers than the mean.

python

df[zero_cols] = df[zero_cols].replace(0, np.nan)

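# Assign back instead of calling fillna(inplace=True) on a column selection,
# which does not reliably modify df in recent pandas versions
for col in zero_cols:
    df[col] = df[col].fillna(df[col].median())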

print(f"After imputation: {df.isnull().sum().sum()} nulls")

After imputation: 0 nulls
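The imputation above computes each median on the full dataset, before the train/test split, so test rows contribute to the fill values. A stricter variant learns the medians from the training rows only. The sketch below shows one way to do that with scikit-learn's SimpleImputer inside a Pipeline. The tiny frame `toy` and its values are made up for illustration and are not the Pima data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy frame: zeros in 'Insulin' stand for missing values
toy = pd.DataFrame({
    'Glucose': [90, 150, 110, 170, 95, 160, 100, 180],
    'Insulin': [0, 120, 80, 0, 60, 200, 0, 140],
    'Outcome': [0, 1, 0, 1, 0, 1, 0, 1],
})
toy['Insulin'] = toy['Insulin'].replace(0, np.nan)

# First 6 rows play the role of the training split, last 2 the test split
train, test = toy.iloc[:6], toy.iloc[6:]
features = ['Glucose', 'Insulin']

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),   # medians are learned in fit()
    ('tree', DecisionTreeClassifier(max_depth=2, random_state=42)),
])
pipe.fit(train[features], train['Outcome'])

# The stored medians come from the 6 training rows only
train_medians = pipe.named_steps['impute'].statistics_
print(dict(zip(features, train_medians)))
print(pipe.predict(test[features]))
```

Because the imputer lives inside the pipeline, cross-validation tools such as GridSearchCV would refit it on each training fold, which keeps the validation folds out of the fill values as well.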

Step 2: EDA — Feature Correlation with Outcome

python

corr = df.corr()['Outcome'].drop('Outcome').sort_values(ascending=False)
print(corr.round(3))

Glucose            0.466
BMI                0.293
Age                0.238
Pregnancies        0.222
DiabetesPedigree   0.173
Insulin            0.130
SkinThickness      0.074
BloodPressure      0.065

Glucose has the strongest correlation (0.47) — expect it to be the root split. BMI and Age follow. BloodPressure has almost no linear relationship with diabetes outcome despite being a standard clinical risk factor.

Step 3: Train/Test Split

python

X = df.drop('Outcome', axis=1)
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
print(f"Train diabetic rate: {y_train.mean():.3f}, Test: {y_test.mean():.3f}")

Train: (614, 8), Test: (154, 8)
Train diabetic rate: 0.349, Test: 0.351

stratify=y preserves the 35% diabetic rate in both splits.
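To see what stratification buys, here is a small sketch that splits labels with the same class counts as the dataset (500 zeros, 268 ones) with and without `stratify`. The arrays `X_demo` and `y_demo` are placeholders invented for this illustration; only the class counts come from the post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the dataset: 500 zeros, 268 ones
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder feature column

for seed in (0, 1, 2):
    # Plain random split: the test-set positive rate drifts with the seed
    _, _, _, y_plain = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed)
    # Stratified split: class proportions are matched as closely as possible
    _, _, _, y_strat = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed, stratify=y_demo)
    print(f"seed={seed}  plain: {y_plain.mean():.3f}  stratified: {y_strat.mean():.3f}")
```

With only 154 test rows, a few positives more or fewer move the rate noticeably, which is why the stratified split is the safer default here.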

Step 4: Baseline and Overfitting Gap

python

# Baseline: always predict majority class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(f"Baseline accuracy: {dummy.score(X_test, y_test):.4f}")

# Unpruned tree
dt_full = DecisionTreeClassifier(random_state=42)
dt_full.fit(X_train, y_train)
print(f"Unpruned — train: {dt_full.score(X_train, y_train):.4f}, test: {dt_full.score(X_test, y_test):.4f}")
print(f"Unpruned — depth: {dt_full.get_depth()}, leaves: {dt_full.get_n_leaves()}")

Baseline accuracy: 0.6494
Unpruned — train: 1.0000, test: 0.7208
Unpruned — depth: 16, leaves: 173

The unpruned tree memorizes all 614 training samples, using 173 leaves (about 3.5 samples per leaf). Test accuracy (72.1%) is only about 7 points above the baseline (64.9%) and 28 points below training accuracy: the tree has overfit badly.
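One way to see where overfitting sets in is to compare training accuracy with cross-validated accuracy as `max_depth` grows. The sketch below does this on synthetic data from `make_classification`, used here as a stand-in for the Pima features (an assumption; the numbers it prints are not the post's results).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): 614 rows, 8 features, ~35% positives, noisy labels
X_syn, y_syn = make_classification(
    n_samples=614, n_features=8, n_informative=4, n_redundant=0,
    weights=[0.65, 0.35], flip_y=0.1, random_state=0,
)

results = []
for depth in [1, 2, 3, 4, 6, 8, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # 5-fold cross-validated accuracy estimates performance on unseen rows
    cv_acc = cross_val_score(tree, X_syn, y_syn, cv=5, scoring='accuracy').mean()
    # Accuracy on the same rows the tree was fit on
    train_acc = tree.fit(X_syn, y_syn).score(X_syn, y_syn)
    results.append((depth, train_acc, cv_acc))
    print(f"max_depth={str(depth):>4}  train={train_acc:.3f}  cv={cv_acc:.3f}")
```

Training accuracy keeps climbing with depth, while the cross-validated score typically levels off or drops; the depth where the two curves part ways is a reasonable starting point for the tuning grid.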

Step 5: Hyperparameter Tuning

python

param_grid = {
    'max_depth': [3, 4, 5, 6, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'criterion': ['gini', 'entropy'],
}
gs = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid, cv=10, scoring='roc_auc', n_jobs=-1
)
gs.fit(X_train, y_train)
print(f"Best params: {gs.best_params_}")
print(f"Best CV AUC: {gs.best_score_:.4f}")

Best params: {'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 5, 'min_samples_split': 10}
Best CV AUC: 0.8310
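The best score alone hides how close the runner-up settings are. A fitted GridSearchCV keeps per-candidate results in `cv_results_`, and the sketch below turns them into a ranked table. It runs on synthetic data with a smaller grid so it stays self-contained; `search` and `cv_table` are names introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the Pima dataset)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {'max_depth': [2, 4, None], 'min_samples_leaf': [1, 5, 10]},
    cv=5, scoring='roc_auc',
)
search.fit(X_syn, y_syn)

# One row per parameter combination, best mean AUC first
cv_table = (
    pd.DataFrame(search.cv_results_)
    [['param_max_depth', 'param_min_samples_leaf',
      'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)
print(cv_table.head(5).to_string(index=False))
```

If the top few rows differ by less than their `std_test_score`, the simpler setting (shallower tree, larger leaves) is usually the easier one to defend.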

Step 6: Final Evaluation

python

best_dt = gs.best_estimator_
y_pred = best_dt.predict(X_test)
y_prob = best_dt.predict_proba(X_test)[:, 1]

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print()
print(classification_report(y_test, y_pred, target_names=['Non-Diabetic', 'Diabetic']))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")

Confusion Matrix:
[[88 12]
 [22 32]]

              precision    recall  f1-score   support
Non-Diabetic       0.80      0.88      0.84       100
    Diabetic       0.73      0.59      0.65        54
    accuracy                           0.78       154

AUC-ROC: 0.8309

Reading the confusion matrix:

TP = 32: diabetics correctly identified
TN = 88: non-diabetics correctly identified
FP = 12: healthy patients flagged as diabetic (unnecessary follow-up)
FN = 22: diabetics predicted as healthy — the clinical danger

FN = 22 means 22 of the 54 diabetic patients in the test set (41%) receive a false all-clear. In a clinical screening tool, this is unacceptable. predict() applies a 0.5 probability cutoff, which treats both error types as equally costly and is not tuned for recall.
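The metrics in the classification report follow directly from the four cells of the confusion matrix. A short sketch that recomputes them from the matrix reported above (scikit-learn's layout puts actual classes in rows and predicted classes in columns, so `ravel()` yields TN, FP, FN, TP):

```python
import numpy as np

# Confusion matrix reported above: rows = actual, columns = predicted
cm = np.array([[88, 12],
               [22, 32]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)          # share of diabetics the model catches
precision = tp / (tp + fp)       # share of diabetic predictions that are correct
specificity = tn / (tn + fp)     # share of non-diabetics correctly cleared
accuracy = (tp + tn) / cm.sum()

print(f"recall={recall:.3f} precision={precision:.3f} "
      f"specificity={specificity:.3f} accuracy={accuracy:.3f}")
```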

Step 7: Feature Importance

python

importances = pd.Series(best_dt.feature_importances_, index=X.columns)
importances_sorted = importances.sort_values(ascending=False)
print(importances_sorted.round(4))

Glucose            0.3241
BMI                0.1892
Age                0.1543
DiabetesPedigree   0.1012
Pregnancies        0.0891
BloodPressure      0.0623
Insulin            0.0512
SkinThickness      0.0286

[Figure: horizontal bar chart of the feature importances listed above, from Glucose (0.324) down to SkinThickness (0.029).]

Glucose (32%) and BMI (19%) together account for 51% of the tree's total impurity reduction. This matches clinical knowledge: high blood glucose is the primary diabetes indicator, and BMI is a key risk factor. BloodPressure contributes only 6%, consistent with its weak correlation with the outcome (0.065): the tree found better splits elsewhere.
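These percentages are Gini importances: for every split, the weighted drop in impurity is credited to the feature used, and the totals are normalized to sum to 1. The sketch below recomputes them by hand from the fitted tree's internal arrays and checks the result against `feature_importances_`. It uses synthetic data as a stand-in (an assumption), and `manual` is a name introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the Pima dataset)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0)
clf = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=5, random_state=42).fit(X_syn, y_syn)

t = clf.tree_
n = t.weighted_n_node_samples
manual = np.zeros(X_syn.shape[1])

for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:               # leaf: no split, so no impurity reduction
        continue
    # Weighted impurity decrease produced by this split
    decrease = (n[node] * t.impurity[node]
                - n[left] * t.impurity[left]
                - n[right] * t.impurity[right])
    manual[t.feature[node]] += decrease

manual /= manual.sum()           # normalize so importances sum to 1
print(np.round(manual, 4))
print(np.allclose(manual, clf.feature_importances_))
```

Since each split is weighted by the number of samples reaching it, a single split near the root can outweigh several splits deep in the tree.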

Step 8: Tree Visualization

python

from sklearn.tree import export_text

print(export_text(best_dt, feature_names=list(X.columns), max_depth=3))

|--- Glucose <= 127.50
|   |--- BMI <= 29.95
|   |   |--- Age <= 28.50
|   |   |   |--- class: 0
|   |   |--- Age > 28.50
|   |   |   |--- class: 0
|   |--- BMI > 29.95
|   |   |--- DiabetesPedigree <= 0.53
|   |   |   |--- class: 0
|   |   |--- DiabetesPedigree > 0.53
|   |   |   |--- class: 1
|--- Glucose > 127.50
|   |--- BMI <= 29.95
|   |   |--- Age <= 28.50
|   |   |   |--- class: 0
|   |   |--- Age > 28.50
|   |   |   |--- class: 1
|   |--- BMI > 29.95
|   |   |--- ...

Root split: Glucose ≤ 127.5. Low-glucose patients go left and are mostly predicted non-diabetic; both branches then split on BMI, followed by Age or DiabetesPedigree. The first 3 levels tell a clinically coherent story.
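Besides printing the whole tree, it can help to trace a single patient through it. The sketch below does this with `decision_path` and `apply` on a small synthetic example; the feature names `f0` to `f3` are hypothetical placeholders, not the Pima columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption) with hypothetical feature names
X_syn, y_syn = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, random_state=0)
names = ['f0', 'f1', 'f2', 'f3']
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_syn, y_syn)

sample = X_syn[:1]                                 # one row, kept 2-D
node_ids = clf.decision_path(sample).indices       # nodes visited, root first
leaf_id = clf.apply(sample)[0]                     # leaf the sample ends in

for node in node_ids:
    if node == leaf_id:
        print(f"leaf {node}: predict class {clf.predict(sample)[0]}")
        break
    feat, thr = clf.tree_.feature[node], clf.tree_.threshold[node]
    value = np.float32(sample[0, feat])            # trees compare in float32
    op = "<=" if value <= thr else ">"
    print(f"node {node}: {names[feat]} = {value:.2f} {op} {thr:.2f}")
```

Printed this way, each prediction comes with the short chain of threshold checks that produced it.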

Step 9: Threshold Tuning for Clinical Use

At threshold 0.5: diabetic recall = 59%, FN = 22. For a screening tool, missing 22 diabetics is unacceptable. Lower the threshold to increase recall at the cost of precision:

python

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)

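# Find the highest threshold that still gives diabetic recall >= 0.75.
# recalls is non-increasing as thresholds rise, so take the last index that qualifies
# (np.argmax would return the first one, i.e. the lowest threshold with recall 1.0).
target_recall = 0.75
idx = np.where(recalls[:-1] >= target_recall)[0][-1]
best_t = thresholds[idx]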
print(f"Threshold for recall ≥ 0.75: {best_t:.4f}")

y_pred_adj = (y_prob >= best_t).astype(int)
print(classification_report(y_test, y_pred_adj, target_names=['Non-Diabetic', 'Diabetic']))

Threshold for recall ≥ 0.75: 0.3500

              precision    recall  f1-score   support
Non-Diabetic       0.88      0.79      0.83       100
    Diabetic       0.65      0.78      0.71        54
    accuracy                           0.78       154

At threshold 0.35, diabetic recall improves from 59% to 78% and FN drops from 22 to 12, while FP rises from 12 to 21 and precision falls from 73% to 65%. Overall accuracy stays at about 78%: in this case the lower threshold mainly changes which errors the model makes (fewer missed diabetics, more false alarms), not how many.
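One caveat: the threshold above is chosen on the same test set that is then used to report recall, so that recall is likely a little optimistic. A common alternative is to pick the threshold on a separate validation split and keep the test set for the final report. The sketch below shows that workflow on synthetic data (an assumption; none of the numbers it prints are the post's results). Recall never increases as the threshold rises, so it takes the last index that still meets the target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): ~35% positives, noisy labels
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.65, 0.35], flip_y=0.1, random_state=0,
)
# Three-way split: train / validation (threshold choice) / test (final report)
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)

clf = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=5, random_state=42).fit(X_tr, y_tr)

val_prob = clf.predict_proba(X_val)[:, 1]
_, recalls, thresholds = precision_recall_curve(y_val, val_prob)

# Highest threshold whose validation recall still meets the target
target_recall = 0.75
idx = np.where(recalls[:-1] >= target_recall)[0][-1]
chosen_t = thresholds[idx]

val_recall = recall_score(y_val, (val_prob >= chosen_t).astype(int))
test_prob = clf.predict_proba(X_te)[:, 1]
test_recall = recall_score(y_te, (test_prob >= chosen_t).astype(int))
print(f"threshold={chosen_t:.3f}  val recall={val_recall:.3f}  test recall={test_recall:.3f}")
```

The test recall can land below the target, since the threshold was never fit to those rows; that gap is the honest estimate of what the screening tool would deliver on new patients.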

Project Summary

Step	Action	Key Finding
EDA	Check zero values	374 Insulin zeros (48.7%) → imputed with median
EDA	Correlation	Glucose (0.47) strongest predictor
Baseline	DummyClassifier	64.9% — minimum acceptable bar
Unpruned	DecisionTreeClassifier	Train=1.0, Test=0.72 — severe overfit
Tuning	GridSearchCV (AUC)	Best: Gini, depth=4, min_leaf=5, min_split=10
Evaluation	Confusion matrix	FN=22 — clinical danger, recall=59%
Features	Gini importance	Glucose (32%), BMI (19%) — top 2
Threshold	PR curve at recall≥0.75	t=0.35, FN drops to 12, recall=78%

Test Your Understanding

Insulin has 374 zeros (48.7% of rows), all imputed with the median. After imputation, nearly half of all Insulin values are therefore identical. How does this affect the Gini importance of Insulin? Does imputing with the median increase or decrease its apparent predictive value?
The root splits at Glucose ≤ 127.5. This threshold was found during training by maximizing impurity reduction (Gini here, since the tuned tree uses criterion='gini'). If you re-ran the model with a different random_state in train_test_split, would the root threshold change? What factors determine the stability of the root threshold?
At threshold 0.5: Precision=0.73, Recall=0.59 for diabetics. At threshold 0.35: Precision=0.65, Recall=0.78. Compute the F₂ score, which treats recall as twice as important as precision (β = 2, so β² = 4 appears in the formula), for both thresholds: $F_{2} = 5 \times \frac{P \times R}{4P + R}$. Which threshold is better by F₂?
Feature importance is computed as the total Gini reduction attributable to each feature, normalized to sum to 1. A feature that appears in multiple splits (at different depths) accumulates importance. If Glucose appears only at the root but BMI appears at 5 nodes deeper in the tree, can BMI have higher importance than Glucose? Under what conditions?
The tuned tree achieves AUC=0.83. A logistic regression on the same data typically achieves AUC≈0.84. Decision trees are often weaker than logistic regression on tabular data. Why might you still choose the decision tree for a clinical application despite lower AUC?
