Exploratory Data Analysis of Red Wine Quality

Jun 1, 2026 · 5 min read · By Mohammed Vasim
Machine Learning, AI, Data Science

EDA on the Red Wine Quality Dataset

The Red Wine Quality dataset captures physicochemical measurements of Portuguese Vinho Verde wines — acidity levels, sulfur dioxide, residual sugar, pH, alcohol — alongside sensory quality scores from 3 to 8. Before any modeling, exploratory data analysis reveals the dataset's structure, cleanliness, and feature relationships. This walkthrough covers descriptive statistics, missing value inspection, duplicate detection, and correlation analysis.

python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('winequality-red.csv')
df.head()
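
The delimiter depends on where the file came from: the copy hosted by the UCI repository is semicolon-separated, while some mirrors ship a comma-separated version. Below is a minimal sketch of a loader that copes with both, assuming only those two layouts. `read_delimited` is a hypothetical helper, and the three-row file is toy data written to a temporary path so the snippet runs on its own.

```python
import os
import tempfile

import pandas as pd


def read_delimited(path):
    """Read a CSV, retrying with ';' if ',' leaves everything in one column."""
    frame = pd.read_csv(path)
    if frame.shape[1] == 1:
        frame = pd.read_csv(path, sep=';')
    return frame


# Toy file in the semicolon-delimited layout (illustrative values only).
sample = 'fixed acidity;alcohol;quality\n7.4;9.4;5\n7.8;9.8;5\n11.2;9.8;6\n'
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as handle:
    handle.write(sample)
    tmp_path = handle.name

toy = read_delimited(tmp_path)
os.remove(tmp_path)
print(toy.columns.tolist())
```

If the column list comes back as a single long string, the delimiter guess was wrong and `sep` needs to be set explicitly.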

All features are numeric — 11 physicochemical inputs and one quality score. The info() method confirms the column types and memory footprint, while describe() surfaces central tendency, spread, and range for each feature.

python
df.info()

Descriptive statistics — count, mean, standard deviation, min, quartiles, max — give a first sense of the data's range and distribution. Fixed acidity spans 4.6 to 15.9, alcohol ranges from 8.4% to 14.9%, and quality scores average 5.6.

python
df.describe()
Out[5]:
       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides  free sulfur dioxide
count    1599.000000       1599.000000  1599.000000     1599.000000  1599.000000          1599.000000
mean        8.319637          0.527821     0.270976        2.538806     0.087467            15.874922
std         1.741096          0.179060     0.194801        1.409928     0.047065            10.460157
min         4.600000          0.120000     0.000000        0.900000     0.012000             1.000000
25%         7.100000          0.390000     0.090000        1.900000     0.070000             7.000000
50%         7.900000          0.520000     0.260000        2.200000     0.079000            14.000000
75%         9.200000          0.640000     0.420000        2.600000     0.090000            21.000000
max        15.900000          1.580000     1.000000       15.500000     0.611000            72.000000

       total sulfur dioxide      density           pH    sulphates      alcohol      quality
count           1599.000000  1599.000000  1599.000000  1599.000000  1599.000000  1599.000000
mean              46.467792     0.996747     3.311113     0.658149    10.422983     5.636023
std               32.895324     0.001887     0.154386     0.169507     1.065668     0.807569
min                6.000000     0.990070     2.740000     0.330000     8.400000     3.000000
25%               22.000000     0.995600     3.210000     0.550000     9.500000     5.000000
50%               38.000000     0.996750     3.310000     0.620000    10.200000     6.000000
75%               62.000000     0.997835     3.400000     0.730000    11.100000     6.000000
max              289.000000     1.003690     4.010000     2.000000    14.900000     8.000000
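
The quartiles in this table already hint at outliers. As a rough sketch, the snippet below applies Tukey's rule of thumb (upper fence = Q3 + 1.5 × IQR) to four columns, using the 25%, 75%, and max values copied from the `describe()` output above. The choice of columns and the derived column names are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Quartiles and maxima copied from the describe() output above.
summary = pd.DataFrame(
    {
        "residual sugar": {"25%": 1.9, "75%": 2.6, "max": 15.5},
        "chlorides": {"25%": 0.07, "75%": 0.09, "max": 0.611},
        "total sulfur dioxide": {"25%": 22.0, "75%": 62.0, "max": 289.0},
        "alcohol": {"25%": 9.5, "75%": 11.1, "max": 14.9},
    }
).T  # transpose so each feature becomes a row

# Tukey's rule: values above Q3 + 1.5 * IQR are candidate outliers.
summary["iqr"] = summary["75%"] - summary["25%"]
summary["upper_fence"] = summary["75%"] + 1.5 * summary["iqr"]
summary["max_exceeds_fence"] = summary["max"] > summary["upper_fence"]
# How far the maximum sits beyond the fence, as a ratio.
summary["max_over_fence"] = summary["max"] / summary["upper_fence"]

print(summary.sort_values("max_over_fence", ascending=False))
```

All four maxima clear their fences, but chlorides and residual sugar do so by a wide margin, which suggests long right tails worth checking in the distribution plots.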

The dataset has 1599 rows and 12 columns. Quality scores span 3 through 8 — the extreme ends (3 and 8) will be sparse compared to the middle values.

python
print(df.shape)
print(df.columns.tolist())
print(df['quality'].unique())

A missing value check shows whether columns need imputation or dropping. This dataset is unusually clean — every column has zero nulls across all rows, which is rare in real-world data.

python
df.isnull().sum()
Out[10]:
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
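
Because the real data has no gaps, the check above cannot show what a positive result looks like. The sketch below runs the same `isnull()` calls on a toy frame with deliberately injected `NaN` values (all numbers are made up) and adds a percentage column, which is often easier to act on than raw counts.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately injected gaps (values are made up).
toy = pd.DataFrame(
    {
        "pH": [3.51, np.nan, 3.26, 3.16],
        "alcohol": [9.4, 9.8, np.nan, np.nan],
        "quality": [5, 5, 5, 6],
    }
)

# Count and share of missing values per column.
missing = pd.DataFrame(
    {
        "n_missing": toy.isnull().sum(),
        "pct_missing": toy.isnull().mean() * 100,
    }
)
print(missing)

# Rows with at least one gap, useful before choosing between dropping and imputing.
rows_with_gaps = toy[toy.isnull().any(axis=1)]
print(len(rows_with_gaps))
```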

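Duplicate rows inflate counts and can bias summary statistics without adding information. The dataset contains 240 exact duplicates, about 15% of the 1599 rows. Dropping them leaves 1359 unique observations, and because the drop happens in place, every statistic and plot from here on uses the deduplicated frame.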

python
duplicates = df[df.duplicated()]
print(f"Duplicate rows: {len(duplicates)}")
df.drop_duplicates(inplace=True)
print(f"Shape after dedup: {df.shape}")
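
How duplicates get counted depends on the `keep` argument of `duplicated()`. The toy frame below (made-up values) shows the difference between counting only the repeats and counting every row in a duplicated group, plus a non-mutating alternative to `inplace=True`. Whether identical rows are true repeats or distinct wines that happen to share measurements cannot be decided from the table alone, so treat the drop as an assumption.

```python
import pandas as pd

# Toy frame: rows 0, 2 and 4 are identical (values are made up).
toy = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 9.4, 10.5, 9.4],
        "pH": [3.51, 3.20, 3.51, 3.30, 3.51],
        "quality": [5, 5, 5, 6, 5],
    }
)

# keep='first' (the default) flags only the repeats, not the first occurrence.
n_repeats = int(toy.duplicated().sum())
# keep=False flags every row that belongs to a duplicated group.
n_in_groups = int(toy.duplicated(keep=False).sum())

# Non-mutating drop that also rebuilds a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_repeats, n_in_groups, deduped.shape)
```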

The correlation matrix quantifies linear relationships between all feature pairs. By default, df.corr() computes Pearson coefficients, which range from -1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive). This is the primary analytical tool for understanding which physicochemical properties move together.

python
df.corr()
Out[15]:
                      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide
fixed acidity              1.000000         -0.255124     0.667437        0.111025   0.085886            -0.140580
volatile acidity          -0.255124          1.000000    -0.551248       -0.002449   0.055154            -0.020945
citric acid                0.667437         -0.551248     1.000000        0.143892   0.210195            -0.048004
residual sugar             0.111025         -0.002449     0.143892        1.000000   0.026656             0.160527
chlorides                  0.085886          0.055154     0.210195        0.026656   1.000000             0.000749
free sulfur dioxide       -0.140580         -0.020945    -0.048004        0.160527   0.000749             1.000000
total sulfur dioxide      -0.103777          0.071701     0.047358        0.201038   0.045773             0.667246
density                    0.670195          0.023943     0.357962        0.324522   0.193592            -0.018071
pH                        -0.686685          0.247111    -0.550310       -0.083143  -0.270893             0.056631
sulphates                  0.190269         -0.256948     0.326062       -0.011837   0.394557             0.054126
alcohol                   -0.061596         -0.197812     0.105108        0.063281  -0.223824            -0.080125
quality                    0.119024         -0.395214     0.228057        0.013640  -0.130988            -0.050463

                      total sulfur dioxide    density         pH  sulphates    alcohol    quality
fixed acidity                    -0.103777   0.670195  -0.686685   0.190269  -0.061596   0.119024
volatile acidity                  0.071701   0.023943   0.247111  -0.256948  -0.197812  -0.395214
citric acid                       0.047358   0.357962  -0.550310   0.326062   0.105108   0.228057
residual sugar                    0.201038   0.324522  -0.083143  -0.011837   0.063281   0.013640
chlorides                         0.045773   0.193592  -0.270893   0.394557  -0.223824  -0.130988
free sulfur dioxide               0.667246  -0.018071   0.056631   0.054126  -0.080125  -0.050463
total sulfur dioxide              1.000000   0.078141  -0.079257   0.035291  -0.217829  -0.177855
density                           0.078141   1.000000  -0.355617   0.146036  -0.504995  -0.184252
pH                               -0.079257  -0.355617   1.000000  -0.214134   0.213418  -0.055245
sulphates                         0.035291   0.146036  -0.214134   1.000000   0.091621   0.248835
alcohol                          -0.217829  -0.504995   0.213418   0.091621   1.000000   0.480343
quality                          -0.177855  -0.184252  -0.055245   0.248835   0.480343   1.000000
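
For feature selection, the row that matters most is the last one. The sketch below ranks the features by the absolute size of their correlation with quality, using the values copied from the `quality` row of the matrix above. Sorting by absolute value is a design choice made here so that strong negative relationships are not buried at the bottom.

```python
import pandas as pd

# Correlations with quality, copied from the last row of the matrix above.
quality_corr = pd.Series(
    {
        "fixed acidity": 0.119024,
        "volatile acidity": -0.395214,
        "citric acid": 0.228057,
        "residual sugar": 0.013640,
        "chlorides": -0.130988,
        "free sulfur dioxide": -0.050463,
        "total sulfur dioxide": -0.177855,
        "density": -0.184252,
        "pH": -0.055245,
        "sulphates": 0.248835,
        "alcohol": 0.480343,
    }
)

# Order by strength (absolute value) while keeping the sign visible.
order = quality_corr.abs().sort_values(ascending=False).index
ranked = quality_corr.reindex(order)
print(ranked)
```

On a loaded frame, the equivalent starting point would be `df.corr()['quality'].drop('quality')`.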

A heatmap makes the correlation matrix easier to scan. With seaborn's default colormap, lighter cells mark higher (more positive) values and darker cells mark lower (more negative) ones, so color encodes sign as well as strength. Notable relationships include the strong negative pair between fixed acidity and pH (-0.69) and the moderate positive correlation between alcohol and quality (0.48), which makes alcohol a promising predictor.

python
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True)
plt.show()
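
A correlation matrix is symmetric, so half of the heatmap repeats the other half. One common way to declutter it is to mask the upper triangle. The sketch below builds such a mask with NumPy on a three-feature excerpt of the matrix (values rounded to two decimals from the table above) and uses pandas only, so it runs without a plotting backend.

```python
import numpy as np
import pandas as pd

# Three-feature excerpt of the correlation matrix, rounded to two decimals.
names = ["fixed acidity", "pH", "alcohol"]
corr = pd.DataFrame(
    [
        [1.00, -0.69, -0.06],
        [-0.69, 1.00, 0.21],
        [-0.06, 0.21, 1.00],
    ],
    index=names,
    columns=names,
)

# True on and above the diagonal: those cells carry no extra information.
mask = np.triu(np.ones(corr.shape, dtype=bool))

# Keep only the strictly lower triangle; masked cells become NaN.
lower = corr.where(~mask)
print(lower)
```

The same boolean array can be handed to `sns.heatmap` through its `mask` parameter, and pairing it with a diverging colormap such as `cmap='coolwarm'` with `vmin=-1, vmax=1` keeps zero at the neutral color.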

The quality column is the target. Counting how many wines fall into each score reveals the class balance — or more accurately, the imbalance. Most wines cluster around 5 and 6, while extreme scores are rare.

python
df['quality'].value_counts().sort_index().plot(kind='bar')
plt.xlabel("Wine Quality")
plt.ylabel("Count")
plt.show()
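
A bar chart shows the imbalance visually; putting numbers on it helps when deciding on stratified splits or class weights later. The sketch below uses a made-up series of scores that only mimics the shape described above (it is not the real class distribution) to compute per-class shares and a simple largest-to-smallest ratio.

```python
import pandas as pd

# Made-up scores that mimic a middle-heavy distribution, not the real counts.
quality = pd.Series([3, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 8], name="quality")

# Absolute counts and relative shares per score, in score order.
counts = quality.value_counts().sort_index()
shares = quality.value_counts(normalize=True).sort_index()

# Ratio of the largest class to the smallest one.
imbalance_ratio = counts.max() / counts.min()

print(shares.round(3))
print(imbalance_ratio)
```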

Plotting distributions for each feature helps spot skewness, outliers, and modes. A histogram with a KDE overlay shows both the frequency and the estimated density for each column.

python
for column in df.columns:
    sns.histplot(df[column], kde=True)
    plt.show()
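
Skewness can also be read numerically rather than judged by eye. As a small illustration on made-up data, the sketch below compares a symmetric column with a right-tailed one using `DataFrame.skew()` and the gap between mean and median; both column names are hypothetical.

```python
import pandas as pd

# Two made-up columns: one symmetric, one with a single large value on the right.
toy = pd.DataFrame(
    {
        "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "right_tailed": [1.0, 1.0, 2.0, 2.0, 3.0, 15.0],
    }
)

# Sample skewness: near 0 for symmetric data, positive for a right tail.
skewness = toy.skew()

# A mean well above the median is another sign of a right tail.
gap = toy.mean() - toy.median()

print(skewness)
print(gap)
```

Running `df.skew()` on the wine frame would give the same kind of summary for all twelve columns at once.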

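Alcohol, as one of the stronger correlates with quality, deserves a closer look. Its distribution peaks around 9.5–10% and has a long right tail toward higher percentages. The histogram by itself says nothing about quality, but the positive correlation suggests that tail is where better-scored wines are more likely to sit.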

python
sns.histplot(df['alcohol'])
plt.show()

Multivariate plots reveal interactions between features that correlations alone miss. Pairplots show all pairwise relationships in one grid, box plots grouped by quality highlight how alcohol levels shift with score, and scatter plots with color encoding by quality expose patterns across three dimensions.

python
sns.pairplot(df)
plt.show()

Grouping alcohol by quality in a box plot confirms the positive trend: wines with higher quality scores tend to have higher alcohol content, with noticeably less overlap at the extremes.

python
sns.catplot(x='quality', y='alcohol', data=df, kind='box')
plt.show()
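
The numbers behind a grouped box plot can be pulled out with a `groupby`. The sketch below uses made-up alcohol values for three quality levels to compute per-group medians and to check whether the median rises with each step in quality. The toy data is constructed so that it does, which should not be assumed for the real dataset.

```python
import pandas as pd

# Made-up alcohol readings for three quality levels.
toy = pd.DataFrame(
    {
        "quality": [5, 5, 5, 6, 6, 6, 7, 7, 7],
        "alcohol": [9.4, 9.6, 9.8, 10.2, 10.5, 10.9, 11.4, 11.8, 12.1],
    }
)

# Per-quality summary: the median is the line inside each box.
by_quality = toy.groupby("quality")["alcohol"].agg(["count", "median", "min", "max"])

# True if the median never decreases as quality goes up.
monotonic = bool(by_quality["median"].is_monotonic_increasing)

print(by_quality)
print(monotonic)
```

Including `count` in the summary is deliberate: a median computed from a handful of wines at the extreme scores deserves less trust than one from several hundred.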

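A scatter plot of alcohol versus pH, colored by quality, shows how two features interact. Higher-alcohol wines tend to score better, and among them the lower-pH (more acidic) ones appear to do slightly better still, a pattern that only shows up when the variables are viewed together rather than in isolation. The pH effect is faint (its correlation with quality is only -0.06), so treat it as a hypothesis to test rather than a finding.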

python
sns.scatterplot(x='alcohol', y='pH', hue='quality', data=df)
plt.show()
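
A visual impression from a colored scatter plot can be checked with a two-way table. The sketch below splits a made-up frame at the median of each feature and reports mean quality in the four resulting cells. The band labels and the toy values are assumptions for illustration and say nothing about what the real data contains.

```python
import pandas as pd

# Made-up observations; not taken from the wine dataset.
toy = pd.DataFrame(
    {
        "alcohol": [9.2, 9.5, 9.8, 10.0, 11.0, 11.5, 12.0, 12.5],
        "pH": [3.1, 3.5, 3.2, 3.6, 3.1, 3.5, 3.2, 3.6],
        "quality": [5, 5, 5, 5, 7, 6, 7, 6],
    }
)

# Split each feature at its median into "low" and "high" bands.
labels = {True: "high", False: "low"}
toy["alcohol_band"] = (toy["alcohol"] > toy["alcohol"].median()).map(labels)
toy["ph_band"] = (toy["pH"] > toy["pH"].median()).map(labels)

# Mean quality per cell: rows are alcohol bands, columns are pH bands.
mean_quality = toy.groupby(["alcohol_band", "ph_band"])["quality"].mean().unstack()

print(mean_quality)
```

If the cells in one row differ noticeably, the second feature adds information beyond the first. If they are nearly equal, the apparent interaction in the scatter plot may just be noise.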

The dataset is clean — no missing values — but the 15% duplicate rate and the imbalanced quality distribution both demand attention before any modeling effort. Alcohol shows the strongest positive correlation with quality (0.48), while volatile acidity has a notable negative relationship (-0.40). These signals guide which features to emphasize in a predictive model.
