← View series: machine learning
~/blog
Cleaning and Transforming Google Play Store Data for Analysis
The Google Play Store hosts over 2.5 million apps. That volume makes it a rich dataset for understanding what drives app popularity, but the raw CSV is far from analysis-ready. Numbers arrive as strings, sizes mix "M" and "k" suffixes, install counts include commas and plus signs, and one row has its columns shifted entirely out of alignment.
This walkthrough covers the full cleaning pipeline — parsing, converting, and validating each column — so the dataset is ready for exploratory analysis and modelling.
Data Overview
The dataset contains 10,841 rows and 13 columns covering app name, category, rating, number of reviews, size, install count, type (free or paid), price, content rating, genres, last-updated date, current version, and minimum Android version.
We need the usual data-wrangling tools plus visualisation libraries for the exploratory phase.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
%matplotlib inlineLoad the dataset directly from the public URL and take a first look at the structure.
df = pd.read_csv('https://raw.githubusercontent.com/krishnaik06/playstore-Dataset/main/googleplaystore.csv')
df.head()| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
Before any cleaning, check the dataset dimensions, column types, summary statistics, and missing-value counts. This gives a baseline for what needs fixing.
print(df.shape)
print()
df.info()
print()
print(df.describe())
print()
print(df.isnull().sum())A Misaligned Row
Real-world data frequently arrives with structural defects. Row 10,472 has values shifted one column to the left: "1.9" appears under the Category column, "19.0" under Rating, "3.0M" under Reviews, and so on. Every value is one position off. This row cannot be salvaged by parsing — it has to be removed.
df[~df['Reviews'].str.isnumeric()]| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10472 | Life Made WI-Fi Touchscreen Photo Frame | 1.9 | 19.0 | 3.0M | 1,000+ | Free | 0 | Everyone | NaN | February 11, 2018 | 1.0.19 | 4.0 and up | NaN |
Drop the bad row on a copy of the data, verify no non-numeric review strings remain, then cast the column to int.
df_copy = df.copy()
df_copy = df_copy.drop(df_copy.index[10472])
# Verify the bad row is gone
assert len(df_copy[~df_copy['Reviews'].str.isnumeric()]) == 0
df_copy['Reviews'] = df_copy['Reviews'].astype(int)
df_copy.info()<class 'pandas.core.frame.DataFrame'> Int64Index: 10840 entries, 0 to 10840 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 App 10840 non-null object 1 Category 10840 non-null object 2 Rating 9366 non-null float64 3 Reviews 10840 non-null int64 4 Size 10840 non-null object 5 Installs 10840 non-null object 6 Type 10839 non-null object 7 Price 10840 non-null object 8 Content Rating 10840 non-null object 9 Genres 10840 non-null object 10 Last Updated 10840 non-null object 11 Current Ver 10832 non-null object 12 Android Ver 10838 non-null object dtypes: float64(1), int64(1), object(11) memory usage: 1.2+ MB
Size: From Strings to Numbers
The Size column uses suffixes "M" (megabytes), "k" (kilobytes), and the literal string "Varies with device". To make it numerically usable, replace "M" with "000" (converting MB to KB), strip "k", mark "Varies with device" as NaN, and cast to float. Note that values originally ending in "M" and a fraction — "8.7M" — become 8.7 after the replacement, indicating that the raw data stores some sizes in MB and others in KB inconsistently.
df_copy['Size'] = df_copy['Size'].str.replace('M', '000')
df_copy['Size'] = df_copy['Size'].str.replace('k', '')
df_copy['Size'] = df_copy['Size'].replace('Varies with device', np.nan)
df_copy['Size'] = df_copy['Size'].astype(float)Installs and Price
Both columns carry non-numeric characters: "+", ",", and "$". Strip them all in one pass, then cast to the appropriate numeric types.
chars_to_remove = ['+', ',', '$']
cols_to_clean = ['Installs', 'Price']
for char in chars_to_remove:
for col in cols_to_clean:
df_copy[col] = df_copy[col].str.replace(char, '')
df_copy['Installs'] = df_copy['Installs'].astype('int')
df_copy['Price'] = df_copy['Price'].astype('float')
df_copy.info()<class 'pandas.core.frame.DataFrame'> Int64Index: 10840 entries, 0 to 10840 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 App 10840 non-null object 1 Category 10840 non-null object 2 Rating 9366 non-null float64 3 Reviews 10840 non-null int64 4 Size 9145 non-null float64 5 Installs 10840 non-null int64 6 Type 10839 non-null object 7 Price 10840 non-null float64 8 Content Rating 10840 non-null object 9 Genres 10840 non-null object 10 Last Updated 10840 non-null object 11 Current Ver 10832 non-null object 12 Android Ver 10838 non-null object dtypes: float64(3), int64(2), object(8) memory usage: 1.2+ MB
Extracting Date Features
The "Last Updated" column stores dates as human-readable strings ("January 7, 2018"). Parse them into proper datetime objects with pd.to_datetime, then extract the day, month, and year components as separate integer columns.
df_copy['Last Updated'] = pd.to_datetime(df_copy['Last Updated'])
df_copy['Day'] = df_copy['Last Updated'].dt.day
df_copy['Month'] = df_copy['Last Updated'].dt.month
df_copy['Year'] = df_copy['Last Updated'].dt.year
df_copy.info()<class 'pandas.core.frame.DataFrame'> Int64Index: 10840 entries, 0 to 10840 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 App 10840 non-null object 1 Category 10840 non-null object 2 Rating 9366 non-null float64 3 Reviews 10840 non-null int64 4 Size 9145 non-null float64 5 Installs 10840 non-null int64 6 Type 10839 non-null object 7 Price 10840 non-null float64 8 Content Rating 10840 non-null object 9 Genres 10840 non-null object 10 Last Updated 10840 non-null datetime64[ns] 11 Current Ver 10832 non-null object 12 Android Ver 10838 non-null object 13 Day 10840 non-null int64 14 Month 10840 non-null int64 15 Year 10840 non-null int64 dtypes: datetime64[ns](1), float64(3), int64(5), object(7) memory usage: 1.4+ MB
Preview the fully cleaned dataset before moving to exploration.
df_copy.head()| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Day | Month | Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19000.0 | 10000 | Free | 0.0 | Everyone | Art & Design | 2018-01-07 | 1.0.0 | 4.0.3 and up | 7 | 1 | 2018 |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14000.0 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | 2018-01-15 | 2.0.0 | 4.0.3 and up | 15 | 1 | 2018 |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7 | 5000000 | Free | 0.0 | Everyone | Art & Design | 2018-08-01 | 1.2.4 | 4.0.3 and up | 1 | 8 | 2018 |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25000.0 | 50000000 | Free | 0.0 | Teen | Art & Design | 2018-06-08 | Varies with device | 4.2 and up | 8 | 6 | 2018 |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8 | 100000 | Free | 0.0 | Everyone | Art & Design;Creativity | 2018-06-20 | 1.1 | 4.4 and up | 20 | 6 | 2018 |
Duplicate Rows
The dataset contains 1,181 duplicate app entries. Keeping only the first occurrence per app name reduces the row count.
print('Duplicate apps:', df_copy[df_copy.duplicated('App')].shape[0])
df_copy = df_copy.drop_duplicates(subset=['App'], keep='first')
print('Rows after dedup:', df_copy.shape[0])Numeric and Categorical Columns
Classifying columns by their data type clarifies which transformations remain and which visualisation approaches apply.
numeric_features = [feat for feat in df_copy.columns if df_copy[feat].dtype != 'O']
categorical_features = [feat for feat in df_copy.columns if df_copy[feat].dtype == 'O']
print(f'We have {len(numeric_features)} numerical features : {numeric_features}')
print(f'\nWe have {len(categorical_features)} categorical features : {categorical_features}')We have 9 numerical features : ['Rating', 'Reviews', 'Size', 'Installs', 'Price', 'Last Updated', 'Day', 'Month', 'Year']We have 7 categorical features : ['App', 'Category', 'Type', 'Content Rating', 'Genres', 'Current Ver', 'Android Ver']
Categorical Distributions
Checking the proportion of each category reveals imbalances: 92.6% of apps are free and 80% target "Everyone".
for col in ['Type', 'Content Rating']:
print(df_copy[col].value_counts(normalize=True) * 100)
print('---------------------------')Free 92.610701 Paid 7.380074 0 0.009225 Name: Type, dtype: float64 --------------------------- Everyone 80.387454 Teen 11.143911 Mature 17+ 4.603321 Everyone 10+ 3.819188 Adults only 18+ 0.027675 Unrated 0.018450 Name: Content Rating, dtype: float64 ---------------------------
KDE plots for the numerical columns confirm the skew visible in the summary statistics.
plt.figure(figsize=(15, 15))
plt.suptitle('Univariate Analysis of Numerical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
for i in range(len(numeric_features)):
plt.subplot(5, 3, i+1)
sns.kdeplot(x=df_copy[numeric_features[i]], shade=True, color='r')
plt.xlabel(numeric_features[i])
plt.tight_layout()Rating and Year are left-skewed, while Reviews, Size, Installs, and Price are heavily right-skewed — typical for marketplace data where a small number of apps capture most of the traffic.
Most Popular App Categories
Family, Game, and Tools are the three largest categories by sheer number of apps.
category_counts = pd.DataFrame(df_copy['Category'].value_counts())
category_counts.rename(columns={'Category': 'Count'}, inplace=True)
plt.figure(figsize=(15, 12))
plt.subplot(1, 2, 1)
df_copy['Category'].value_counts().plot.pie(autopct='%1.1f', figsize=(15, 16))
plt.subplot(1, 2, 2)
sns.barplot(x=category_counts.index[:10], y='Count', data=category_counts[:10], palette='hls')
plt.title('Top 10 App Categories')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()Category With the Most Installs
By aggregating install counts per category, Game emerges as the clear leader with nearly 35 billion installs.
df_cat_installs = df_copy.groupby(['Category'])['Installs'].sum().sort_values(ascending=False).reset_index()
df_cat_installs.Installs = df_cat_installs.Installs / 1_000_000_000
df2 = df_cat_installs.head(10)
plt.figure(figsize=(14, 10))
sns.set_context("talk")
sns.set_style("darkgrid")
ax = sns.barplot(x='Installs', y='Category', data=df2)
ax.set_xlabel('No. of Installations in Billions')
ax.set_ylabel('')
ax.set_title("Most Popular Categories in Play Store", size=20)
plt.show()Top 5 Most Installed Apps Per Category
Drilling into a few key categories — Game, Communication, Productivity, Social — shows which individual apps lead each segment.
dfa = df_copy.groupby(['Category', 'App'])['Installs'].sum().reset_index()
dfa = dfa.sort_values('Installs', ascending=False)
apps = ['GAME', 'COMMUNICATION', 'PRODUCTIVITY', 'SOCIAL']
sns.set_context("poster")
sns.set_style("darkgrid")
plt.figure(figsize=(40, 30))
for i, app in enumerate(apps):
df2 = dfa[dfa.Category == app]
df3 = df2.head(5)
plt.subplot(4, 2, i+1)
sns.barplot(data=df3, x='Installs', y='App')
plt.xlabel('Installation in Millions')
plt.ylabel('')
plt.title(app, size=20)
plt.tight_layout()
plt.subplots_adjust(hspace=.3)
plt.show()Five-Star Apps
Only 271 apps in the dataset hold a perfect 5.0 rating. The top-ranked five-star app by installs is "CS & IT Interview Questions" from the Family category.
rating = df_copy.groupby(['Category', 'Installs', 'App'])['Rating'].sum().sort_values(ascending=False).reset_index()
toprating_apps = rating[rating.Rating == 5.0]
print("Number of 5 rated apps", toprating_apps.shape[0])
toprating_apps.head(1)Number of 5 rated apps 271
| Category | Installs | App | Rating | |
|---|---|---|---|---|
| 0 | FAMILY | 1000 | CS & IT Interview Questions | 5.0 |
After cleaning, the dataset dropped from 10,841 to 9,659 unique apps with properly typed numeric columns and extracted date features. The remaining missing values (Rating, Size, Type, Current Ver, Android Ver) are candidates for imputation or exclusion depending on the modelling approach. The EDA confirms what you would expect from a marketplace: a long tail of small apps and a small head of dominant players in Games and Communication.