Back to blog
← View series: machine learning

Handling Missing Values Handling Imbalanced Datasets SMOTE Handling Outliers Categorical Encoding Target Guided Ordinal Encoding Exploratory Data Analysis of Red Wine Quality Feature Engineering for Flight Price Prediction Cleaning and Transforming Google Play Store Data for Analysis

~/blog

Cleaning and Transforming Google Play Store Data for Analysis

Jun 1, 2026•9 min read•By Mohammed Vasim

Machine LearningAIData Science

The Google Play Store hosts over 2.5 million apps. That volume makes it a rich dataset for understanding what drives app popularity, but the raw CSV is far from analysis-ready. Numbers arrive as strings, sizes mix "M" and "k" suffixes, install counts include commas and plus signs, and one row has its columns shifted entirely out of alignment.

This walkthrough covers the full cleaning pipeline — parsing, converting, and validating each column — so the dataset is ready for exploratory analysis and modelling.

Data Overview

The dataset contains 10,841 rows and 13 columns covering app name, category, rating, number of reviews, size, install count, type (free or paid), price, content rating, genres, last-updated date, current version, and minimum Android version.

We need the usual data-wrangling tools plus visualisation libraries for the exploratory phase.

python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
%matplotlib inline

Load the dataset directly from the public URL and take a first look at the structure.

python

df = pd.read_csv('https://raw.githubusercontent.com/krishnaik06/playstore-Dataset/main/googleplaystore.csv')
df.head()

	App	Category	Rating	Reviews	Size	Installs	Type	Content Rating	Genres	Last Updated	Current Ver	Android Ver
0	Photo Editor & Candy Camera & Grid & ScrapBook	ART_AND_DESIGN	4.1	159	19M	10,000+	Free	Everyone	Art & Design	January 7, 2018	1.0.0	4.0.3 and up
1	Coloring book moana	ART_AND_DESIGN	3.9	967	14M	500,000+	Free	Everyone	Art & Design;Pretend Play	January 15, 2018	2.0.0	4.0.3 and up
2	U Launcher Lite – FREE Live Cool Themes, Hide ...	ART_AND_DESIGN	4.7	87510	8.7M	5,000,000+	Free	Everyone	Art & Design	August 1, 2018	1.2.4	4.0.3 and up
3	Sketch - Draw & Paint	ART_AND_DESIGN	4.5	215644	25M	50,000,000+	Free	Teen	Art & Design	June 8, 2018	Varies with device	4.2 and up
4	Pixel Draw - Number Art Coloring Book	ART_AND_DESIGN	4.3	967	2.8M	100,000+	Free	Everyone	Art & Design;Creativity	June 20, 2018	1.1	4.4 and up

Before any cleaning, check the dataset dimensions, column types, summary statistics, and missing-value counts. This gives a baseline for what needs fixing.

python

print(df.shape)
print()
df.info()
print()
print(df.describe())
print()
print(df.isnull().sum())

A Misaligned Row

Real-world data frequently arrives with structural defects. Row 10,472 has values shifted one column to the left: "1.9" appears under the Category column, "19.0" under Rating, "3.0M" under Reviews, and so on. Every value is one position off. This row cannot be salvaged by parsing — it has to be removed.

python

df[~df['Reviews'].str.isnumeric()]

	App	Category	Rating	Reviews	Size	Installs	Type	Price	Content Rating	Genres	Last Updated	Current Ver	Android Ver
10472	Life Made WI-Fi Touchscreen Photo Frame	1.9	19.0	3.0M	1,000+	Free	0	Everyone	NaN	February 11, 2018	1.0.19	4.0 and up	NaN

Drop the bad row on a copy of the data, verify no non-numeric review strings remain, then cast the column to int.

python

df_copy = df.copy()
df_copy = df_copy.drop(df_copy.index[10472])
# Verify the bad row is gone
assert len(df_copy[~df_copy['Reviews'].str.isnumeric()]) == 0
df_copy['Reviews'] = df_copy['Reviews'].astype(int)
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int64  
 4   Size            10840 non-null  object 
 5   Installs        10840 non-null  object 
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 1.2+ MB

Size: From Strings to Numbers

The Size column uses suffixes "M" (megabytes), "k" (kilobytes), and the literal string "Varies with device". To make it numerically usable, replace "M" with "000" (converting MB to KB), strip "k", mark "Varies with device" as NaN, and cast to float. Note that values originally ending in "M" and a fraction — "8.7M" — become 8.7 after the replacement, indicating that the raw data stores some sizes in MB and others in KB inconsistently.

python

df_copy['Size'] = df_copy['Size'].str.replace('M', '000')
df_copy['Size'] = df_copy['Size'].str.replace('k', '')
df_copy['Size'] = df_copy['Size'].replace('Varies with device', np.nan)
df_copy['Size'] = df_copy['Size'].astype(float)

Installs and Price

Both columns carry non-numeric characters: "+", ",", and "$". Strip them all in one pass, then cast to the appropriate numeric types.

python

chars_to_remove = ['+', ',', '$']
cols_to_clean = ['Installs', 'Price']
for char in chars_to_remove:
    for col in cols_to_clean:
        df_copy[col] = df_copy[col].str.replace(char, '')

df_copy['Installs'] = df_copy['Installs'].astype('int')
df_copy['Price'] = df_copy['Price'].astype('float')
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int64  
 4   Size            9145 non-null   float64
 5   Installs        10840 non-null  int64  
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  float64
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(3), int64(2), object(8)
memory usage: 1.2+ MB

Extracting Date Features

The "Last Updated" column stores dates as human-readable strings ("January 7, 2018"). Parse them into proper datetime objects with pd.to_datetime, then extract the day, month, and year components as separate integer columns.

python

df_copy['Last Updated'] = pd.to_datetime(df_copy['Last Updated'])
df_copy['Day'] = df_copy['Last Updated'].dt.day
df_copy['Month'] = df_copy['Last Updated'].dt.month
df_copy['Year'] = df_copy['Last Updated'].dt.year
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 0 to 10840
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   App             10840 non-null  object        
 1   Category        10840 non-null  object        
 2   Rating          9366 non-null   float64       
 3   Reviews         10840 non-null  int64         
 4   Size            9145 non-null   float64       
 5   Installs        10840 non-null  int64         
 6   Type            10839 non-null  object        
 7   Price           10840 non-null  float64       
 8   Content Rating  10840 non-null  object        
 9   Genres          10840 non-null  object        
 10  Last Updated    10840 non-null  datetime64[ns]
 11  Current Ver     10832 non-null  object        
 12  Android Ver     10838 non-null  object        
 13  Day             10840 non-null  int64         
 14  Month           10840 non-null  int64         
 15  Year            10840 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(5), object(7)
memory usage: 1.4+ MB

Preview the fully cleaned dataset before moving to exploration.

python

df_copy.head()

	App	Category	Rating	Reviews	Size	Installs	Type	Content Rating	Genres	Last Updated	Current Ver	Android Ver	Day	Month	Year
0	Photo Editor & Candy Camera & Grid & ScrapBook	ART_AND_DESIGN	4.1	159	19000.0	10000	Free	Everyone	Art & Design	2018-01-07	1.0.0	4.0.3 and up	7	1	2018
1	Coloring book moana	ART_AND_DESIGN	3.9	967	14000.0	500000	Free	Everyone	Art & Design;Pretend Play	2018-01-15	2.0.0	4.0.3 and up	15	1	2018
2	U Launcher Lite – FREE Live Cool Themes, Hide ...	ART_AND_DESIGN	4.7	87510	8.7	5000000	Free	Everyone	Art & Design	2018-08-01	1.2.4	4.0.3 and up	1	8	2018
3	Sketch - Draw & Paint	ART_AND_DESIGN	4.5	215644	25000.0	50000000	Free	Teen	Art & Design	2018-06-08	Varies with device	4.2 and up	8	6	2018
4	Pixel Draw - Number Art Coloring Book	ART_AND_DESIGN	4.3	967	2.8	100000	Free	Everyone	Art & Design;Creativity	2018-06-20	1.1	4.4 and up	20	6	2018

Duplicate Rows

The dataset contains 1,181 duplicate app entries. Keeping only the first occurrence per app name reduces the row count.

python

print('Duplicate apps:', df_copy[df_copy.duplicated('App')].shape[0])
df_copy = df_copy.drop_duplicates(subset=['App'], keep='first')
print('Rows after dedup:', df_copy.shape[0])

Numeric and Categorical Columns

Classifying columns by their data type clarifies which transformations remain and which visualisation approaches apply.

python

numeric_features = [feat for feat in df_copy.columns if df_copy[feat].dtype != 'O']
categorical_features = [feat for feat in df_copy.columns if df_copy[feat].dtype == 'O']

print(f'We have {len(numeric_features)} numerical features : {numeric_features}')
print(f'\nWe have {len(categorical_features)} categorical features : {categorical_features}')

We have 9 numerical features : ['Rating', 'Reviews', 'Size', 'Installs', 'Price', 'Last Updated', 'Day', 'Month', 'Year']

We have 7 categorical features : ['App', 'Category', 'Type', 'Content Rating', 'Genres', 'Current Ver', 'Android Ver']

Categorical Distributions

Checking the proportion of each category reveals imbalances: 92.6% of apps are free and 80% target "Everyone".

python

for col in ['Type', 'Content Rating']:
    print(df_copy[col].value_counts(normalize=True) * 100)
    print('---------------------------')

Free    92.610701
Paid     7.380074
0        0.009225
Name: Type, dtype: float64
---------------------------
Everyone           80.387454
Teen               11.143911
Mature 17+          4.603321
Everyone 10+        3.819188
Adults only 18+     0.027675
Unrated             0.018450
Name: Content Rating, dtype: float64
---------------------------

KDE plots for the numerical columns confirm the skew visible in the summary statistics.

python

plt.figure(figsize=(15, 15))
plt.suptitle('Univariate Analysis of Numerical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)

for i in range(len(numeric_features)):
    plt.subplot(5, 3, i+1)
    sns.kdeplot(x=df_copy[numeric_features[i]], shade=True, color='r')
    plt.xlabel(numeric_features[i])
    plt.tight_layout()

Rating and Year are left-skewed, while Reviews, Size, Installs, and Price are heavily right-skewed — typical for marketplace data where a small number of apps capture most of the traffic.

Most Popular App Categories

Family, Game, and Tools are the three largest categories by sheer number of apps.

python

category_counts = pd.DataFrame(df_copy['Category'].value_counts())
category_counts.rename(columns={'Category': 'Count'}, inplace=True)

plt.figure(figsize=(15, 12))
plt.subplot(1, 2, 1)
df_copy['Category'].value_counts().plot.pie(autopct='%1.1f', figsize=(15, 16))

plt.subplot(1, 2, 2)
sns.barplot(x=category_counts.index[:10], y='Count', data=category_counts[:10], palette='hls')
plt.title('Top 10 App Categories')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

Category With the Most Installs

By aggregating install counts per category, Game emerges as the clear leader with nearly 35 billion installs.

python

df_cat_installs = df_copy.groupby(['Category'])['Installs'].sum().sort_values(ascending=False).reset_index()
df_cat_installs.Installs = df_cat_installs.Installs / 1_000_000_000
df2 = df_cat_installs.head(10)

plt.figure(figsize=(14, 10))
sns.set_context("talk")
sns.set_style("darkgrid")

ax = sns.barplot(x='Installs', y='Category', data=df2)
ax.set_xlabel('No. of Installations in Billions')
ax.set_ylabel('')
ax.set_title("Most Popular Categories in Play Store", size=20)
plt.show()

Top 5 Most Installed Apps Per Category

Drilling into a few key categories — Game, Communication, Productivity, Social — shows which individual apps lead each segment.

python

dfa = df_copy.groupby(['Category', 'App'])['Installs'].sum().reset_index()
dfa = dfa.sort_values('Installs', ascending=False)
apps = ['GAME', 'COMMUNICATION', 'PRODUCTIVITY', 'SOCIAL']

sns.set_context("poster")
sns.set_style("darkgrid")

plt.figure(figsize=(40, 30))

for i, app in enumerate(apps):
    df2 = dfa[dfa.Category == app]
    df3 = df2.head(5)
    plt.subplot(4, 2, i+1)
    sns.barplot(data=df3, x='Installs', y='App')
    plt.xlabel('Installation in Millions')
    plt.ylabel('')
    plt.title(app, size=20)

plt.tight_layout()
plt.subplots_adjust(hspace=.3)
plt.show()

Five-Star Apps

Only 271 apps in the dataset hold a perfect 5.0 rating. The top-ranked five-star app by installs is "CS & IT Interview Questions" from the Family category.

python

rating = df_copy.groupby(['Category', 'Installs', 'App'])['Rating'].sum().sort_values(ascending=False).reset_index()
toprating_apps = rating[rating.Rating == 5.0]
print("Number of 5 rated apps", toprating_apps.shape[0])
toprating_apps.head(1)

Number of 5 rated apps 271

	Category	Installs	App	Rating
0	FAMILY	1000	CS & IT Interview Questions	5.0

After cleaning, the dataset dropped from 10,841 to 9,659 unique apps with properly typed numeric columns and extracted date features. The remaining missing values (Rating, Size, Type, Current Ver, Android Ver) are candidates for imputation or exclusion depending on the modelling approach. The EDA confirms what you would expect from a marketplace: a long tail of small apps and a small head of dominant players in Games and Communication.

Cleaning and Transforming Google Play Store Data for Analysis

Data Overview

A Misaligned Row

Size: From Strings to Numbers

Installs and Price

Extracting Date Features

Duplicate Rows

Numeric and Categorical Columns

Categorical Distributions

Most Popular App Categories

Category With the Most Installs

Top 5 Most Installed Apps Per Category

Five-Star Apps

Comments (0)

Leave a comment