What is Statistics Types of Statistics Population and Sample Central Tendency Measure of Dispersion Bessel's Correction Standard Deviation Variables Random Variables Percentiles and Quartiles Covariance and Correlation Percentiles and Quartiles Histograms Skewness and Kurtosis Exploratory Data Analysis

~/blog

Exploratory Data Analysis

Jun 21, 2026•11 min read•By Mohammed Vasim

StatisticsMathData Science

Statistical tests answer specific questions. EDA finds the questions worth asking.

John Tukey put it plainly: "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question." EDA is the process of forming hypotheses before you test them. Skip it and you risk fitting a model on data with hidden class imbalance, undetected outliers, near-constant features, or extreme scale mismatches — the model will train, but the answers will be wrong.

EDA has three phases:

Understand the shape of the data — structure, types, missingness
Summarize each variable — univariate distributions
Investigate relationships — bivariate and multivariate structure

Anchors used throughout:

python

import pandas as pd

data = pd.DataFrame({
    'fold':       [1, 2, 3, 4, 5, 6],
    'accuracy':   [0.82, 0.79, 0.91, 0.85, 0.78, 0.88],
    'precision':  [0.80, 0.76, 0.89, 0.83, 0.75, 0.86],
    'train_size': [4200, 4200, 4200, 4200, 4200, 4200]
})

latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]

The tabular data DataFrame represents six cross-validation evaluation runs. latency_ms shows a right-skewed continuous variable for distribution examples.

Phase 1 — Understand the Shape of the Data

These five operations are the first thing to run on any new dataset — in this order:

python

print(data.shape)
print(data.dtypes)
print(data.isnull().sum())
print(data.duplicated().sum())
print(data.describe())

(6, 4)

fold          int64
accuracy    float64
precision   float64
train_size    int64
dtype: object

fold          0
accuracy      0
precision     0
train_size    0
dtype: int64

0

           fold  accuracy  precision   train_size
count  6.000000  6.000000   6.000000          6.0
mean   3.500000  0.838333   0.815000       4200.0
std    1.870829  0.051299   0.054122          0.0
min    1.000000  0.780000   0.750000       4200.0
25%    2.250000  0.797500   0.770000       4200.0
50%    3.500000  0.835000   0.815000       4200.0
75%    4.750000  0.872500   0.852500       4200.0
max    6.000000  0.910000   0.890000       4200.0

What to look for in df.describe():

Signal	What It Tells You
`count` < n_rows	Missing values — cross-check with `isnull().sum()`
`min` or `max` far from mean	Potential outliers
`std` ≈ 0	Near-constant column (low information, can cause numerical issues in linear models)
`mean` far from `50%`	Skewed distribution — use IQR instead of SD for summaries

For the anchor: train_size has std=0 — all folds used the same dataset size. This column adds no information and would cause a zero-variance error in many ML preprocessing steps. Flag it for removal.

Missing Value Patterns

Not all missingness is the same. The type determines the correct response:

Type	Description	Detection	Response
MCAR (Missing Completely At Random)	No pattern — missing randomly	`df.isnull().corr()` shows no correlations	Drop rows or impute mean/median
MAR (Missing At Random)	Missingness depends on another observed variable	Missingness in col A correlates with col B	Conditional imputation (e.g., KNN or model-based)
MNAR (Missing Not At Random)	Missingness depends on the missing value itself	Cannot detect from data alone	Domain knowledge required — dropping rows creates systematic bias

Example of MNAR: users with very low income are less likely to report income. Dropping those rows underestimates poverty rates.

python

# Detect which columns have missingness correlated with other columns
# (MCAR vs MAR diagnostic)
import numpy as np
missing_indicators = data.isnull().astype(int)
if missing_indicators.sum().sum() > 0:
    print(missing_indicators.corr())
else:
    print("No missing values in dataset")

No missing values in dataset

Phase 2 — Univariate Distributions

Inspect each variable individually before looking at relationships. Shape, skewness, and outliers are univariate properties.

Continuous Variables

python

import numpy as np
from scipy import stats

for col in data.select_dtypes(include='number').columns:
    vals = data[col].dropna()
    skew = stats.skew(vals)
    kurt = stats.kurtosis(vals)
    print(f"{col:>12}: mean={vals.mean():.3f}, median={vals.median():.3f}, "
          f"std={vals.std():.3f}, skew={skew:.3f}, kurt={kurt:.3f}")

        fold: mean=3.500, median=3.500, std=1.871, skew=0.000, kurt=-1.200
    accuracy: mean=0.838, median=0.835, std=0.051, skew=0.280, kurt=-1.483
   precision: mean=0.815, median=0.815, std=0.054, skew=0.152, kurt=-1.482
  train_size: mean=4200.000, median=4200.000, std=0.000, skew=0.000, kurt=0.000

accuracy skewness = 0.28 — approximately symmetric (|γ₁| < 0.5). train_size std = 0 — constant, uninformative.

For comparison, latency_ms has strong right skew:

python

latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]
print(f"latency_ms: mean={np.mean(latency_ms):.1f}, median={np.median(latency_ms):.1f}, "
      f"skew={stats.skew(latency_ms):.3f}")

latency_ms: mean=261.4, median=151.5, skew=2.560

latency_ms skewness = 2.56 — highly right-skewed. Mean (261ms) is pulled far right of median (151ms) by the 890ms spike. Consider log transform before modeling.

The SVG below shows histogram (shape) and boxplot (outlier detection) side-by-side for both distributions:

accuracy (symmetric)

2 1 2 1 0.78 0.85 0.91 No outliers (within Tukey fences)

latency_ms (right-skewed)

6 1 1 120–200 300–400 890ms 890 Outlier: 890ms spike

Categorical Variables

Use bar charts, not pie charts — proportions are harder to compare in a pie. For the anchor, fold is ordered discrete (not truly categorical here), but the pattern applies:

python

# For a genuinely categorical column:
# print(df['category_col'].value_counts())
# Flag if any category has < 5% of total
n = len(data)
for col in data.select_dtypes(include='object').columns:
    counts = data[col].value_counts()
    low_freq = counts[counts / n < 0.05]
    if len(low_freq) > 0:
        print(f"Low-frequency categories in '{col}': {low_freq.index.tolist()}")

Phase 3 — Bivariate and Multivariate Relationships

Correlation Matrix

python

# Compute correlation matrix (Pearson r by default)
corr = data[['accuracy', 'precision', 'fold']].corr()
print(corr.round(3))

           accuracy  precision   fold
accuracy      1.000      0.998  0.219
precision     0.998      1.000  0.169
fold          0.219      0.169  1.000

What to flag:

|r| > 0.9 → multicollinearity concern for linear models: accuracy and precision have r = 0.998. Including both as features in a linear regression introduces near-singular behavior — one should be dropped or PCA applied.
|r| < 0.05 → potentially uninformative feature for a linear model (though nonlinear models may still find it useful).

SVG below shows the 3×3 correlation heatmap:

accuracy precision fold

1.000 0.998 0.219 0.998 1.000 0.169 0.219 0.169 1.000

accuracy vs precision: r=0.998 → multicollinearity risk

Scatter Plot

python

# Scatter: accuracy vs precision
import matplotlib.pyplot as plt
plt.scatter(data['accuracy'], data['precision'])
plt.xlabel('accuracy'); plt.ylabel('precision')
plt.title('accuracy vs precision (r=0.998)')

The scatter would show near-perfect alignment — every point lies almost exactly on a line. In a feature matrix, this degree of collinearity means one of these columns adds almost no additional information to a linear model.

Pair Plot

python

import seaborn as sns
g = sns.pairplot(data[['accuracy', 'precision', 'fold']], diag_kind='kde')

The pair plot diagonal shows the marginal distribution of each variable (KDE curve). The off-diagonal cells show all pairwise scatter plots. If you color by a categorical target (hue='target'), class separation becomes visible without running a single model — useful for spotting whether the classes are linearly separable.

Group-wise Comparisons

When a grouping variable exists, groupby shows whether group means differ — setting up a future t-test or ANOVA:

python

# Hypothetical: did early folds (1-3) vs late folds (4-6) differ in accuracy?
data['phase'] = data['fold'].apply(lambda x: 'early' if x <= 3 else 'late')
print(data.groupby('phase')[['accuracy', 'precision']].agg(['mean', 'std', 'count']))

       accuracy              precision            
           mean       std count      mean       std count
phase                                                    
early  0.840000  0.062450     3  0.816667  0.065064     3
late   0.836667  0.045092     3  0.813333  0.045735     3

The means are nearly identical — no obvious phase effect. This tells you the hypothesis "later folds have higher accuracy due to better data ordering" is not supported here. You would still run a t-test to confirm, but EDA just saved you from chasing a non-existent effect.

Outlier Investigation

EDA detects outliers. It does not automatically remove them. Three detection strategies:

Strategy 1 — Z-score method:

python

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
mean_acc = sum(accuracy) / len(accuracy)
std_acc  = (sum((x - mean_acc)**2 for x in accuracy) / (len(accuracy)-1)) ** 0.5
z_scores = [(x - mean_acc) / std_acc for x in accuracy]

for fold, (acc, z) in enumerate(zip(accuracy, z_scores), 1):
    flag = " ← outlier" if abs(z) > 3 else ""
    print(f"Fold {fold}: acc={acc:.2f}, z={z:+.3f}{flag}")

Fold 1: acc=0.82, z=-0.358
Fold 2: acc=0.79, z=-0.944
Fold 3: acc=0.91, z=+1.401
Fold 4: acc=0.85, z=+0.228
Fold 5: acc=0.78, z=-1.140
Fold 6: acc=0.88, z=+0.815

No fold exceeds |z| = 3. All values are within 1.5 SDs of the mean.

Strategy 2 — IQR method (robust):

python

import numpy as np

latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]
q1 = np.percentile(latency_ms, 25)
q3 = np.percentile(latency_ms, 75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

print(f"Q1={q1:.1f}, Q3={q3:.1f}, IQR={iqr:.1f}")
print(f"Fences: [{lower:.1f}, {upper:.1f}]")
outliers = [x for x in latency_ms if x < lower or x > upper]
print(f"Outliers: {outliers}")

Q1=143.5, Q3=196.5, IQR=53.0
Fences: [64.0, 276.0]
Outliers: [312, 890]

Both 312ms and 890ms are flagged by IQR. The z-score method on the same data would not flag them as strongly because the outliers themselves inflate the standard deviation, shrinking their z-scores.

The investigation rule — never drop without understanding:

Finding	Action
Data error (mislabeled sample, sensor malfunction, wrong units)	Correct the error or remove the row
Rare genuine event (infrastructure failure, unusual request)	Keep it, document it, consider robust methods
Influential point (changes model significantly when included/excluded)	Analyze with and without; report both results

For latency_ms = 890: Is this a real request that took 890ms (perhaps a cold-start or cache miss) or a logging error? Without that answer, removing it is unjustified.

EDA Checklist

Before modeling or hypothesis testing, confirm every item below:

Shape: n_rows, n_cols, dtypes verified — constant columns (std=0) identified
Missing values: counts checked, pattern type assessed (MCAR / MAR / MNAR)
Duplicates: checked and resolved
Univariate — numeric: histogram and boxplot inspected, skewness computed, log transform flagged if |γ₁| > 1
Univariate — categorical: value counts checked, low-frequency categories (< 5%) flagged
Bivariate: correlation matrix computed, |r| > 0.9 (collinearity) and |r| < 0.05 (uninformative) both flagged
Outliers: detected via both IQR and z-score, each outlier investigated before any decision to remove
Scale: whether features need standardization before modeling noted (especially for distance-based algorithms)

EDA draws on every descriptive statistics technique covered so far: mean and SD (Phase 1 summaries), skewness and kurtosis (Phase 2 shape checks), percentiles and IQR (boxplots and outlier fences), histograms (univariate distributions), correlation (Phase 3 relationships). Once EDA confirms the data is clean and you understand its structure, the next step is inferential statistics — forming specific hypotheses and testing them with t-tests, chi-square tests, ANOVA, or regression.

When This Breaks Down

EDA is not a substitute for domain knowledge. df.describe() can tell you the mean is far from the median but not whether that is a genuine data property or a labeling error. Correlation matrices find linear associations but miss everything else — always plot the scatter. Pair plots scale poorly beyond 10 variables; for wide datasets, start with a variance threshold to drop near-constant columns, then use hierarchical clustering of the correlation matrix to find groups of correlated features. Finally: EDA is inherently subjective. Two analysts may draw different hypotheses from the same plots. Document what you found, what you assumed, and what you left for the statistical tests to decide.

Test Your Understanding

You load a dataset and find that df.describe() shows mean=35.0 and 50%=18.0 for a numeric column. What does this suggest about the distribution's shape? What would you plot first to confirm, and what transform would you apply before modeling?
You detect 3 rows with accuracy = 0.00 in a model evaluation dataset. List the three possible explanations (data error, genuine result, influential point) and describe how you would determine which one applies.
A correlation matrix shows r = 0.97 between feature_A and feature_B. You plan to train a logistic regression. What specific problem does this create, and what two remedies would you consider?
You compute the IQR-based outlier fences for inference latency and flag 8 out of 200 requests. Your colleague argues all 8 should be removed because they "contaminate the distribution." What would you ask about each flagged request before deciding?
Why does the z-score method underperform compared to IQR when detecting outliers in a heavily right-skewed distribution? Give a concrete example where z-score misses an outlier that IQR catches.

Exploratory Data Analysis

Phase 1 — Understand the Shape of the Data

Missing Value Patterns

Phase 2 — Univariate Distributions

Continuous Variables

Categorical Variables

Phase 3 — Bivariate and Multivariate Relationships

Correlation Matrix

Scatter Plot

Pair Plot

Group-wise Comparisons

Outlier Investigation

EDA Checklist

When This Breaks Down

Test Your Understanding

Comments (0)

Leave a comment

Exploratory Data Analysis

Phase 1 — Understand the Shape of the Data

Missing Value Patterns

Phase 2 — Univariate Distributions

Continuous Variables

Categorical Variables

Phase 3 — Bivariate and Multivariate Relationships

Correlation Matrix

Scatter Plot

Pair Plot

Group-wise Comparisons

Outlier Investigation

EDA Checklist

Related Concepts

When This Breaks Down

Test Your Understanding

Comments (0)

Leave a comment