Back to blog
← View series: machine learning

~/blog

Cleaning and Transforming Google Play Store Data for Analysis

Jun 1, 20269 min readBy Mohammed Vasim
Machine LearningAIData Science

The Google Play Store hosts over 2.5 million apps. That volume makes it a rich dataset for understanding what drives app popularity, but the raw CSV is far from analysis-ready. Numbers arrive as strings, sizes mix "M" and "k" suffixes, install counts include commas and plus signs, and one row has its columns shifted entirely out of alignment.

This walkthrough covers the full cleaning pipeline — parsing, converting, and validating each column — so the dataset is ready for exploratory analysis and modelling.

Data Overview

The dataset contains 10,841 rows and 13 columns covering app name, category, rating, number of reviews, size, install count, type (free or paid), price, content rating, genres, last-updated date, current version, and minimum Android version.

We need the usual data-wrangling tools plus visualisation libraries for the exploratory phase.

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
%matplotlib inline

Load the dataset directly from the public URL and take a first look at the structure.

python
df = pd.read_csv('https://raw.githubusercontent.com/krishnaik06/playstore-Dataset/main/googleplaystore.csv')
df.head()
AppCategoryRatingReviewsSizeInstallsTypePriceContent RatingGenresLast UpdatedCurrent VerAndroid Ver
0Photo Editor & Candy Camera & Grid & ScrapBookART_AND_DESIGN4.115919M10,000+Free0EveryoneArt & DesignJanuary 7, 20181.0.04.0.3 and up
1Coloring book moanaART_AND_DESIGN3.996714M500,000+Free0EveryoneArt & Design;Pretend PlayJanuary 15, 20182.0.04.0.3 and up
2U Launcher Lite – FREE Live Cool Themes, Hide ...ART_AND_DESIGN4.7875108.7M5,000,000+Free0EveryoneArt & DesignAugust 1, 20181.2.44.0.3 and up
3Sketch - Draw & PaintART_AND_DESIGN4.521564425M50,000,000+Free0TeenArt & DesignJune 8, 2018Varies with device4.2 and up
4Pixel Draw - Number Art Coloring BookART_AND_DESIGN4.39672.8M100,000+Free0EveryoneArt & Design;CreativityJune 20, 20181.14.4 and up

Before any cleaning, check the dataset dimensions, column types, summary statistics, and missing-value counts. This gives a baseline for what needs fixing.

python
print(df.shape)
print()
df.info()
print()
print(df.describe())
print()
print(df.isnull().sum())

A Misaligned Row

Real-world data frequently arrives with structural defects. Row 10,472 has values shifted one column to the left: "1.9" appears under the Category column, "19.0" under Rating, "3.0M" under Reviews, and so on. Every value is one position off. This row cannot be salvaged by parsing — it has to be removed.

python
df[~df['Reviews'].str.isnumeric()]
AppCategoryRatingReviewsSizeInstallsTypePriceContent RatingGenresLast UpdatedCurrent VerAndroid Ver
10472Life Made WI-Fi Touchscreen Photo Frame1.919.03.0M1,000+Free0EveryoneNaNFebruary 11, 20181.0.194.0 and upNaN

Drop the bad row on a copy of the data, verify no non-numeric review strings remain, then cast the column to int.

python
df_copy = df.copy()
df_copy = df_copy.drop(df_copy.index[10472])
# Verify the bad row is gone
assert len(df_copy[~df_copy['Reviews'].str.isnumeric()]) == 0
df_copy['Reviews'] = df_copy['Reviews'].astype(int)
df_copy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int64  
 4   Size            10840 non-null  object 
 5   Installs        10840 non-null  object 
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 1.2+ MB

Size: From Strings to Numbers

The Size column uses suffixes "M" (megabytes), "k" (kilobytes), and the literal string "Varies with device". To make it numerically usable, replace "M" with "000" (converting MB to KB), strip "k", mark "Varies with device" as NaN, and cast to float. Note that values originally ending in "M" and a fraction — "8.7M" — become 8.7 after the replacement, indicating that the raw data stores some sizes in MB and others in KB inconsistently.

python
df_copy['Size'] = df_copy['Size'].str.replace('M', '000')
df_copy['Size'] = df_copy['Size'].str.replace('k', '')
df_copy['Size'] = df_copy['Size'].replace('Varies with device', np.nan)
df_copy['Size'] = df_copy['Size'].astype(float)

Installs and Price

Both columns carry non-numeric characters: "+", ",", and "$". Strip them all in one pass, then cast to the appropriate numeric types.

python
chars_to_remove = ['+', ',', '$']
cols_to_clean = ['Installs', 'Price']
for char in chars_to_remove:
    for col in cols_to_clean:
        df_copy[col] = df_copy[col].str.replace(char, '')

df_copy['Installs'] = df_copy['Installs'].astype('int')
df_copy['Price'] = df_copy['Price'].astype('float')
df_copy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int64  
 4   Size            9145 non-null   float64
 5   Installs        10840 non-null  int64  
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  float64
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(3), int64(2), object(8)
memory usage: 1.2+ MB

Extracting Date Features

The "Last Updated" column stores dates as human-readable strings ("January 7, 2018"). Parse them into proper datetime objects with pd.to_datetime, then extract the day, month, and year components as separate integer columns.

python
df_copy['Last Updated'] = pd.to_datetime(df_copy['Last Updated'])
df_copy['Day'] = df_copy['Last Updated'].dt.day
df_copy['Month'] = df_copy['Last Updated'].dt.month
df_copy['Year'] = df_copy['Last Updated'].dt.year
df_copy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 0 to 10840
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   App             10840 non-null  object        
 1   Category        10840 non-null  object        
 2   Rating          9366 non-null   float64       
 3   Reviews         10840 non-null  int64         
 4   Size            9145 non-null   float64       
 5   Installs        10840 non-null  int64         
 6   Type            10839 non-null  object        
 7   Price           10840 non-null  float64       
 8   Content Rating  10840 non-null  object        
 9   Genres          10840 non-null  object        
 10  Last Updated    10840 non-null  datetime64[ns]
 11  Current Ver     10832 non-null  object        
 12  Android Ver     10838 non-null  object        
 13  Day             10840 non-null  int64         
 14  Month           10840 non-null  int64         
 15  Year            10840 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(5), object(7)
memory usage: 1.4+ MB

Preview the fully cleaned dataset before moving to exploration.

python
df_copy.head()
AppCategoryRatingReviewsSizeInstallsTypePriceContent RatingGenresLast UpdatedCurrent VerAndroid VerDayMonthYear
0Photo Editor & Candy Camera & Grid & ScrapBookART_AND_DESIGN4.115919000.010000Free0.0EveryoneArt & Design2018-01-071.0.04.0.3 and up712018
1Coloring book moanaART_AND_DESIGN3.996714000.0500000Free0.0EveryoneArt & Design;Pretend Play2018-01-152.0.04.0.3 and up1512018
2U Launcher Lite – FREE Live Cool Themes, Hide ...ART_AND_DESIGN4.7875108.75000000Free0.0EveryoneArt & Design2018-08-011.2.44.0.3 and up182018
3Sketch - Draw & PaintART_AND_DESIGN4.521564425000.050000000Free0.0TeenArt & Design2018-06-08Varies with device4.2 and up862018
4Pixel Draw - Number Art Coloring BookART_AND_DESIGN4.39672.8100000Free0.0EveryoneArt & Design;Creativity2018-06-201.14.4 and up2062018

Duplicate Rows

The dataset contains 1,181 duplicate app entries. Keeping only the first occurrence per app name reduces the row count.

python
print('Duplicate apps:', df_copy[df_copy.duplicated('App')].shape[0])
df_copy = df_copy.drop_duplicates(subset=['App'], keep='first')
print('Rows after dedup:', df_copy.shape[0])

Numeric and Categorical Columns

Classifying columns by their data type clarifies which transformations remain and which visualisation approaches apply.

python
numeric_features = [feat for feat in df_copy.columns if df_copy[feat].dtype != 'O']
categorical_features = [feat for feat in df_copy.columns if df_copy[feat].dtype == 'O']

print(f'We have {len(numeric_features)} numerical features : {numeric_features}')
print(f'\nWe have {len(categorical_features)} categorical features : {categorical_features}')
We have 9 numerical features : ['Rating', 'Reviews', 'Size', 'Installs', 'Price', 'Last Updated', 'Day', 'Month', 'Year']

We have 7 categorical features : ['App', 'Category', 'Type', 'Content Rating', 'Genres', 'Current Ver', 'Android Ver']

Categorical Distributions

Checking the proportion of each category reveals imbalances: 92.6% of apps are free and 80% target "Everyone".

python
for col in ['Type', 'Content Rating']:
    print(df_copy[col].value_counts(normalize=True) * 100)
    print('---------------------------')
Free    92.610701
Paid     7.380074
0        0.009225
Name: Type, dtype: float64
---------------------------
Everyone           80.387454
Teen               11.143911
Mature 17+          4.603321
Everyone 10+        3.819188
Adults only 18+     0.027675
Unrated             0.018450
Name: Content Rating, dtype: float64
---------------------------

KDE plots for the numerical columns confirm the skew visible in the summary statistics.

python
plt.figure(figsize=(15, 15))
plt.suptitle('Univariate Analysis of Numerical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)

for i in range(len(numeric_features)):
    plt.subplot(5, 3, i+1)
    sns.kdeplot(x=df_copy[numeric_features[i]], shade=True, color='r')
    plt.xlabel(numeric_features[i])
    plt.tight_layout()

Rating and Year are left-skewed, while Reviews, Size, Installs, and Price are heavily right-skewed — typical for marketplace data where a small number of apps capture most of the traffic.

Family, Game, and Tools are the three largest categories by sheer number of apps.

python
category_counts = pd.DataFrame(df_copy['Category'].value_counts())
category_counts.rename(columns={'Category': 'Count'}, inplace=True)

plt.figure(figsize=(15, 12))
plt.subplot(1, 2, 1)
df_copy['Category'].value_counts().plot.pie(autopct='%1.1f', figsize=(15, 16))

plt.subplot(1, 2, 2)
sns.barplot(x=category_counts.index[:10], y='Count', data=category_counts[:10], palette='hls')
plt.title('Top 10 App Categories')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

Category With the Most Installs

By aggregating install counts per category, Game emerges as the clear leader with nearly 35 billion installs.

python
df_cat_installs = df_copy.groupby(['Category'])['Installs'].sum().sort_values(ascending=False).reset_index()
df_cat_installs.Installs = df_cat_installs.Installs / 1_000_000_000
df2 = df_cat_installs.head(10)

plt.figure(figsize=(14, 10))
sns.set_context("talk")
sns.set_style("darkgrid")

ax = sns.barplot(x='Installs', y='Category', data=df2)
ax.set_xlabel('No. of Installations in Billions')
ax.set_ylabel('')
ax.set_title("Most Popular Categories in Play Store", size=20)
plt.show()

Top 5 Most Installed Apps Per Category

Drilling into a few key categories — Game, Communication, Productivity, Social — shows which individual apps lead each segment.

python
dfa = df_copy.groupby(['Category', 'App'])['Installs'].sum().reset_index()
dfa = dfa.sort_values('Installs', ascending=False)
apps = ['GAME', 'COMMUNICATION', 'PRODUCTIVITY', 'SOCIAL']

sns.set_context("poster")
sns.set_style("darkgrid")

plt.figure(figsize=(40, 30))

for i, app in enumerate(apps):
    df2 = dfa[dfa.Category == app]
    df3 = df2.head(5)
    plt.subplot(4, 2, i+1)
    sns.barplot(data=df3, x='Installs', y='App')
    plt.xlabel('Installation in Millions')
    plt.ylabel('')
    plt.title(app, size=20)

plt.tight_layout()
plt.subplots_adjust(hspace=.3)
plt.show()

Five-Star Apps

Only 271 apps in the dataset hold a perfect 5.0 rating. The top-ranked five-star app by installs is "CS & IT Interview Questions" from the Family category.

python
rating = df_copy.groupby(['Category', 'Installs', 'App'])['Rating'].sum().sort_values(ascending=False).reset_index()
toprating_apps = rating[rating.Rating == 5.0]
print("Number of 5 rated apps", toprating_apps.shape[0])
toprating_apps.head(1)
Number of 5 rated apps 271
CategoryInstallsAppRating
0FAMILY1000CS & IT Interview Questions5.0

After cleaning, the dataset dropped from 10,841 to 9,659 unique apps with properly typed numeric columns and extracted date features. The remaining missing values (Rating, Size, Type, Current Ver, Android Ver) are candidates for imputation or exclusion depending on the modelling approach. The EDA confirms what you would expect from a marketplace: a long tail of small apps and a small head of dominant players in Games and Communication.

Comments (0)

No comments yet. Be the first to comment!

Leave a comment