Exploratory Data Analysis of Red Wine Quality

Jun 1, 2026 • 5 min read • By Mohammed Vasim

Machine Learning, AI, Data Science

EDA on the Red Wine Quality Dataset

The Red Wine Quality dataset captures physicochemical measurements of Portuguese Vinho Verde wines — acidity levels, sulfur dioxide, residual sugar, pH, alcohol — alongside sensory quality scores from 3 to 8. Before any modeling, exploratory data analysis reveals the dataset's structure, cleanliness, and feature relationships. This walkthrough covers descriptive statistics, missing value inspection, duplicate detection, and correlation analysis.

python

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('winequality-red.csv')
df.head()
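One thing worth checking at load time is the delimiter: the copy of this file distributed by the UCI repository is semicolon-delimited, while some mirrors re-save it with commas. The sketch below uses a two-row inline sample (values and column subset are illustrative, not read from the real file) to show how a misread delimiter looks. Which variant you have on disk is an assumption to verify.

```python
import io

import pandas as pd

# Two-row inline sample in the semicolon-delimited layout (illustrative values).
sample = io.StringIO(
    "fixed acidity;volatile acidity;alcohol;quality\n"
    "7.4;0.7;9.4;5\n"
    "7.8;0.88;9.8;5\n"
)

wrong = pd.read_csv(sample)           # default sep="," fuses everything into one column
sample.seek(0)                        # rewind the buffer before reading again
right = pd.read_csv(sample, sep=";")  # explicit separator recovers the columns

print(wrong.shape)   # a single column is the telltale sign of a misread delimiter
print(right.shape)
print(right.dtypes)
```

If `df.shape` reports one column after loading, passing `sep=';'` to `read_csv` is the likely fix.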

All features are numeric — 11 physicochemical inputs and one quality score. The info() method confirms the column types and memory footprint, while describe() surfaces central tendency, spread, and range for each feature.

python

df.info()

Descriptive statistics — count, mean, standard deviation, min, quartiles, max — give a first sense of the data's range and distribution. Fixed acidity spans 4.6 to 15.9, alcohol ranges from 8.4% to 14.9%, and quality scores average 5.6.

python

df.describe()

Out[5]:

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
count	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000
mean	8.319637	0.527821	0.270976	2.538806	0.087467	15.874922	46.467792	0.996747	3.311113	0.658149	10.422983	5.636023
std	1.741096	0.179060	0.194801	1.409928	0.047065	10.460157	32.895324	0.001887	0.154386	0.169507	1.065668	0.807569
min	4.600000	0.120000	0.000000	0.900000	0.012000	1.000000	6.000000	0.990070	2.740000	0.330000	8.400000	3.000000
25%	7.100000	0.390000	0.090000	1.900000	0.070000	7.000000	22.000000	0.995600	3.210000	0.550000	9.500000	5.000000
50%	7.900000	0.520000	0.260000	2.200000	0.079000	14.000000	38.000000	0.996750	3.310000	0.620000	10.200000	6.000000
75%	9.200000	0.640000	0.420000	2.600000	0.090000	21.000000	62.000000	0.997835	3.400000	0.730000	11.100000	6.000000
max	15.900000	1.580000	1.000000	15.500000	0.611000	72.000000	289.000000	1.003690	4.010000	2.000000	14.900000	8.000000
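The quartiles in this table are enough to flag likely outliers by hand. As a rough illustration, the snippet below applies Tukey's 1.5 × IQR rule (a common convention, not something the original analysis uses) to the residual sugar column, with the numbers copied from the describe() output.

```python
# Values copied from the describe() output for "residual sugar".
q1, median, q3 = 1.9, 2.2, 2.6
mean, maximum = 2.538806, 15.5

iqr = q3 - q1                  # interquartile range
upper_fence = q3 + 1.5 * iqr   # Tukey's rule of thumb for high outliers

print(f"IQR: {iqr:.2f}")
print(f"Upper fence: {upper_fence:.2f}")
print(f"Max exceeds fence: {maximum > upper_fence}")
print(f"Mean above median (hint of right skew): {mean > median}")
```

The fence lands at 3.65, far below the maximum of 15.5, so residual sugar is a column to inspect for outliers later.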

The dataset has 1599 rows and 12 columns. Quality scores span 3 through 8 — the extreme ends (3 and 8) will be sparse compared to the middle values.

python

print(df.shape)
print(df.columns.tolist())
print(df['quality'].unique())

A missing value check shows whether columns need imputation or dropping. This dataset is unusually clean — every column has zero nulls across all rows, which is rare in real-world data.

python

df.isnull().sum()

Out[10]:
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
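Because the real data has no gaps, the check returns only zeros. To show what the same call reports when values are missing, here is a sketch on a small made-up frame (the NaN entries are invented for illustration). `isnull().mean()` gives the share of missing values per column, which is often easier to act on than the raw count.

```python
import numpy as np
import pandas as pd

# Invented rows with deliberate gaps; not taken from the wine data.
toy = pd.DataFrame({
    "pH": [3.51, np.nan, 3.26, 3.16],
    "alcohol": [9.4, 9.8, np.nan, np.nan],
    "quality": [5, 5, 5, 6],
})

missing_count = toy.isnull().sum()   # number of NaN cells per column
missing_share = toy.isnull().mean()  # fraction of NaN cells per column

print(pd.DataFrame({"count": missing_count, "share": missing_share}))
```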

Duplicate rows inflate metrics without adding information. The dataset contains 240 exact duplicates — over 15% of the data. Dropping them leaves 1359 unique observations.

python

duplicates = df[df.duplicated()]
print(f"Duplicate rows: {len(duplicates)}")
df.drop_duplicates(inplace=True)
print(f"Shape after dedup: {df.shape}")
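One detail that affects the count: `duplicated()` defaults to `keep='first'`, so it flags only the repeat occurrences and leaves the first copy unflagged. The toy frame below (invented rows) contrasts this with `keep=False`, which flags every member of a duplicated group.

```python
import pandas as pd

# Invented rows: the (9.4, 5) pair appears three times.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4, 9.4, 10.0],
    "quality": [5, 5, 5, 5, 6],
})

repeats = toy.duplicated()               # keep="first": only later copies flagged
all_copies = toy.duplicated(keep=False)  # every row that has a twin is flagged

deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Repeat rows: {repeats.sum()}")
print(f"Rows in any duplicated group: {all_copies.sum()}")
print(f"Share removed: {repeats.mean():.1%}")
print(deduped)
```

By the same logic, the 240 figure counts the rows that get dropped, which is why 1599 - 240 = 1359 rows remain.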

The correlation matrix quantifies linear relationships between all feature pairs. Values range from -1 (strong negative) to +1 (strong positive). This is the primary analytical tool for understanding which physicochemical properties move together.

python

df.corr()

Out[15]:

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
fixed acidity	1.000000	-0.255124	0.667437	0.111025	0.085886	-0.140580	-0.103777	0.670195	-0.686685	0.190269	-0.061596	0.119024
volatile acidity	-0.255124	1.000000	-0.551248	-0.002449	0.055154	-0.020945	0.071701	0.023943	0.247111	-0.256948	-0.197812	-0.395214
citric acid	0.667437	-0.551248	1.000000	0.143892	0.210195	-0.048004	0.047358	0.357962	-0.550310	0.326062	0.105108	0.228057
residual sugar	0.111025	-0.002449	0.143892	1.000000	0.026656	0.160527	0.201038	0.324522	-0.083143	-0.011837	0.063281	0.013640
chlorides	0.085886	0.055154	0.210195	0.026656	1.000000	0.000749	0.045773	0.193592	-0.270893	0.394557	-0.223824	-0.130988
free sulfur dioxide	-0.140580	-0.020945	-0.048004	0.160527	0.000749	1.000000	0.667246	-0.018071	0.056631	0.054126	-0.080125	-0.050463
total sulfur dioxide	-0.103777	0.071701	0.047358	0.201038	0.045773	0.667246	1.000000	0.078141	-0.079257	0.035291	-0.217829	-0.177855
density	0.670195	0.023943	0.357962	0.324522	0.193592	-0.018071	0.078141	1.000000	-0.355617	0.146036	-0.504995	-0.184252
pH	-0.686685	0.247111	-0.550310	-0.083143	-0.270893	0.056631	-0.079257	-0.355617	1.000000	-0.214134	0.213418	-0.055245
sulphates	0.190269	-0.256948	0.326062	-0.011837	0.394557	0.054126	0.035291	0.146036	-0.214134	1.000000	0.091621	0.248835
alcohol	-0.061596	-0.197812	0.105108	0.063281	-0.223824	-0.080125	-0.217829	-0.504995	0.213418	0.091621	1.000000	0.480343
quality	0.119024	-0.395214	0.228057	0.013640	-0.130988	-0.050463	-0.177855	-0.184252	-0.055245	0.248835	0.480343	1.000000
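When the target is known, the full matrix is often more than needed; pulling out the `quality` column and sorting it gives a ranked list. The sketch below does this on a small invented frame, so the numbers it prints are not the wine dataset's.

```python
import pandas as pd

# Invented values chosen so the signs are easy to see; not the wine data.
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.5, 11.0, 12.5],
    "volatile acidity": [0.80, 0.70, 0.55, 0.40, 0.30],
    "residual sugar": [2.0, 2.6, 1.9, 2.4, 2.1],
    "quality": [4, 5, 5, 6, 7],
})

# Pearson correlation of each feature with the target, most positive first.
target_corr = (
    toy.corr()["quality"]
    .drop("quality")
    .sort_values(ascending=False)
)
print(target_corr)

# Ranking by absolute value moves strong negative features up the list.
by_strength = target_corr.abs().sort_values(ascending=False)
print(by_strength)
```

Sorting by absolute value matters here because a strongly negative feature is as informative for a linear model as a strongly positive one.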

A heatmap makes the correlation matrix easier to scan. With seaborn's default sequential palette, color tracks the signed value: the lightest cells mark the highest positive correlations and the darkest cells mark the most negative ones. Notable relationships include the strong negative pair between fixed acidity and pH (-0.69) and the moderate positive correlation between alcohol and quality (0.48), which makes alcohol a promising predictor.

python

plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True)
plt.show()
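A diverging palette centered at zero makes strong negative and strong positive pairs equally easy to spot. The variant below is a sketch on an invented frame: the `coolwarm` palette, the fixed -1 to 1 color range, and the upper-triangle mask are choices made for this example, not part of the original notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Invented values; substitute the deduplicated wine DataFrame in practice.
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.5, 11.0, 12.5],
    "volatile acidity": [0.80, 0.70, 0.55, 0.40, 0.30],
    "residual sugar": [2.0, 2.6, 1.9, 2.4, 2.1],
    "quality": [4, 5, 5, 6, 7],
})
corr = toy.corr()

# Hide the diagonal and upper triangle, which repeat the lower triangle.
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(6, 4))
sns.heatmap(
    corr,
    mask=mask,
    annot=True,
    fmt=".2f",        # two decimals keep the annotations readable
    cmap="coolwarm",  # diverging palette: blue negative, red positive
    vmin=-1, vmax=1,  # pin the color scale to the full correlation range
    center=0,
    ax=ax,
)
fig.tight_layout()
# In a script, finish with plt.show(); notebooks render the figure automatically.
```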


The quality column is the target. Counting how many wines fall into each score reveals the class balance — or more accurately, the imbalance. Most wines cluster around 5 and 6, while extreme scores are rare.

python

df['quality'].value_counts().sort_index().plot(kind='bar')
plt.xlabel("Wine Quality")
plt.ylabel("Count")
plt.show()
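The bar chart shows the imbalance visually; putting numbers on it helps when deciding whether resampling or class weights are needed. A minimal sketch on invented scores (not the wine data) using `value_counts(normalize=True)`:

```python
import pandas as pd

# Invented scores for illustration only.
quality = pd.Series([5, 6, 5, 6, 5, 7, 6, 5, 3, 8], name="quality")

counts = quality.value_counts().sort_index()
shares = quality.value_counts(normalize=True).sort_index()

# Ratio of the most common class to the rarest one.
imbalance_ratio = counts.max() / counts.min()

print(pd.DataFrame({"count": counts, "share": shares}))
print(f"Largest-to-smallest class ratio: {imbalance_ratio:.1f}")
```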

Plotting distributions for each feature helps spot skewness, outliers, and modes. A histogram with a KDE overlay shows both the frequency and the estimated density for each column.

python

for column in df.columns:
    sns.histplot(df[column], kde=True)
    plt.show()
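Skewness can also be summarized numerically, which is handy for ranking columns before looking at a dozen separate plots. The sketch below calls `DataFrame.skew()` on an invented frame; the column names mirror the dataset, but the values are made up.

```python
import pandas as pd

# Invented values: one column with a single large value, one symmetric column.
toy = pd.DataFrame({
    "residual sugar": [1.9, 2.0, 2.1, 2.2, 2.3, 2.6, 15.5],
    "pH": [3.1, 3.2, 3.3, 3.3, 3.3, 3.4, 3.5],
})

# Positive values indicate a right tail, values near zero indicate symmetry.
skewness = toy.skew().sort_values(ascending=False)
print(skewness)
```

Columns with large positive skew are the usual candidates for a log transform or outlier treatment.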

Alcohol, as one of the stronger correlates with quality, deserves a closer look. Its distribution peaks around 9.5–10%, with a long tail toward higher percentages that correspond to better-scored wines.

python

sns.histplot(df['alcohol'])
plt.show()

Multivariate plots reveal interactions between features that correlations alone miss. Pairplots show all pairwise relationships in one grid, box plots grouped by quality highlight how alcohol levels shift with score, and scatter plots with color encoding by quality expose patterns across three dimensions.

python

sns.pairplot(df)
plt.show()

Grouping alcohol by quality in a box plot confirms the positive trend: wines with higher quality scores tend to have higher alcohol content, with noticeably less overlap at the extremes.

python

sns.catplot(x='quality', y='alcohol', data=df, kind='box')
plt.show()
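The same comparison can be read off as a table by grouping on quality and summarizing alcohol. This is a sketch on invented rows, so the medians it prints are illustrative rather than the dataset's.

```python
import pandas as pd

# Invented rows for illustration; not the wine data.
toy = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6, 6, 7, 7, 7],
    "alcohol": [9.2, 9.5, 10.0, 9.8, 10.5, 11.0, 10.8, 11.5, 12.0],
})

# Per-score summary: the median mirrors the center line of each box.
summary = toy.groupby("quality")["alcohol"].agg(["median", "min", "max", "count"])
print(summary)
```

Including `count` alongside the median is a useful guard, since medians for the rarest scores rest on very few wines.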

A scatter plot of alcohol versus pH, colored by quality, shows how two features interact. Higher-scoring wines concentrate toward higher alcohol; any shift toward lower pH (more acidic) is much subtler, which is consistent with pH's weak correlation with quality (-0.06). Patterns like this only emerge when variables are viewed together rather than in isolation.

python

sns.scatterplot(x='alcohol', y='pH', hue='quality', data=df)
plt.show()

The dataset is clean — no missing values — but the 15% duplicate rate and the imbalanced quality distribution both demand attention before any modeling effort. Alcohol shows the strongest positive correlation with quality (0.48), while volatile acidity has a notable negative relationship (-0.40). These signals guide which features to emphasize in a predictive model.
