Google Play Store Apps

Jan 28, 2024

Assalamu alaikum to everyone!

In this post, we’ll work with a dataset of apps from the Google Play Store. Our goal is to rank the apps in the store by several criteria, such as category, downloads and price.

The code is written in Python and run in a Jupyter notebook.

First of all, we import the necessary libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Let’s start by loading the dataset.

def load_dataset():
    df = pd.read_csv('https://drive.google.com/uc?id=1qVdE4PshjtnHt2NThURaM7ZZgw-9o1G7')
    return df
df = load_dataset()

We’ve loaded the dataset from CSV into a pandas DataFrame, and the data in it looks like this.

Now let’s get acquainted with the dataset.

def print_summarize_dataset(dataset):
    print('Dataset shape: ', dataset.shape, '\n\n')
    # info() prints its report itself and returns None, so call it on its own line.
    print('Dataset info:')
    dataset.info()
    print('\n\nNaN values count:\n', dataset.isna().sum())
print_summarize_dataset(df)

As the output above shows, we have 10841 rows and 13 columns. Some rows have missing values. In addition, several columns that hold numbers (such as Reviews, Size, Installs and Price) are stored as text and need to be converted to the correct dtype.
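To make the dtype problem concrete, here is a minimal sketch on a few made-up rows that mimic the raw format of these columns. The `sample` frame and its values are illustrative assumptions, not rows from the real dataset. It counts how many values in each column pandas cannot parse as numbers without cleaning.

```python
import pandas as pd

# Hypothetical rows that mimic the raw text format of the Play Store columns.
sample = pd.DataFrame({
    "App": ["Demo A", "Demo B", "Demo C"],
    "Reviews": ["159", "967", "87510"],
    "Size": ["19M", "201k", "Varies with device"],
    "Installs": ["10,000+", "500,000+", "5,000,000+"],
    "Price": ["0", "$4.99", "0"],
})

# Every column was read as text, so none of them has a numeric dtype yet.
text_columns = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]
print(text_columns)

# errors="coerce" turns unparseable values into NaN, which makes it easy
# to count how many values per column need cleaning before conversion.
unparseable = {
    col: int(pd.to_numeric(sample[col], errors="coerce").isna().sum())
    for col in ["Reviews", "Size", "Installs", "Price"]
}
print(unparseable)
```

Columns with a non-zero count contain unit suffixes, separators or currency symbols that have to be stripped first.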

Now let’s process our data.

def clean_dataset(dataset):
    # Drop duplicate apps, columns we don't need, and rows with missing values.
    dataset.drop_duplicates('App', inplace=True)
    dataset.drop(['Last Updated', 'Current Ver'], axis=1, inplace=True)
    dataset.dropna(axis=0, inplace=True)
    dataset["Reviews"] = dataset["Reviews"].astype(int)
    # Size: remember which rows are in kilobytes, strip the unit suffix,
    # then convert those rows to megabytes. "Varies with device" becomes NaN.
    is_kb = dataset["Size"].str.contains("k", regex=False)
    size = dataset["Size"].str.replace("Varies with device", "NaN", regex=False)
    size = size.str.replace("k", "", regex=False).str.replace("M", "", regex=False).astype(float)
    dataset["Size"] = size.where(~is_kb, size / 1000)
    # Installs and Price: strip "+", "," and "$" before converting to numbers.
    dataset["Installs"] = dataset["Installs"].str.replace("+", "", regex=False).str.replace(",", "", regex=False).astype(int)
    dataset["Price"] = dataset["Price"].str.replace("$", "", regex=False).astype(float)
    # Content Rating: merge the 10+ bucket into "Everyone" and strip "+" from the rest
    # (adjust the pattern if your copy spells the value "Everyone 10+").
    dataset["Content Rating"] = dataset["Content Rating"].str.replace("Everyone +10", "Everyone", regex=False).str.replace("+", "", regex=False)
    dataset.reset_index(drop=True, inplace=True)
    return dataset
cleaned_data = clean_dataset(df)

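This function takes a dataset and returns it cleaned. It drops duplicate apps and unneeded columns, discards rows with missing values, and converts the columns to the correct data types. Note that it modifies the DataFrame in place, so df and cleaned_data refer to the same cleaned object.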
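The Size conversion is the least obvious step, so here is a standalone sketch of the same idea on made-up values. The helper name `size_to_megabytes` is hypothetical and not part of the notebook. It assumes sizes end in either "k" or "M" and treats anything else as missing.

```python
import pandas as pd

def size_to_megabytes(size: pd.Series) -> pd.Series:
    """Convert strings like '19M' or '201k' to megabytes as floats."""
    # Remember which entries are in kilobytes before the suffix is removed.
    is_kb = size.str.endswith("k", na=False)
    # Strip a trailing unit letter; anything that is still not a number becomes NaN.
    numbers = pd.to_numeric(size.str.rstrip("kM"), errors="coerce")
    # Keep megabyte values as they are and divide kilobyte values by 1000.
    return numbers.where(~is_kb, numbers / 1000)

sizes = pd.Series(["19M", "201k", "Varies with device", "1.5M"])
converted = size_to_megabytes(sizes)
print(converted.tolist())
```

Recording the kilobyte mask before stripping the suffix keeps the unit information available after the column has become numeric.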

As you can see from the image above, we are left with 8194 rows and 11 columns. Now we can perform calculations with a clean dataset.

First of all, let’s calculate the correlations between the numeric columns.

def compute_correlations_matrix(dataset):
    return dataset.corr(numeric_only=True)
cor = compute_correlations_matrix(cleaned_data)

A table of numbers is fine for calculation, but the relationships are easier to analyze visually, so let’s draw the matrix as a heatmap.

def print_scatter_matrix(correlations):
    # Draw the correlation matrix as an annotated heatmap.
    sns.heatmap(correlations, annot=True, cmap='Spectral')
    plt.show()
print_scatter_matrix(cor)

As the figure clearly shows, the strongest correlation is between the Reviews and Installs columns. This makes sense: the more downloads an app has, the more reviews it tends to receive.
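The strongest pair can also be read off the matrix programmatically rather than by eye. The sketch below uses made-up numbers, and the helper name `strongest_pair` is hypothetical. It keeps only the upper triangle of the matrix so that each pair appears once and the diagonal of ones is ignored.

```python
import numpy as np
import pandas as pd

def strongest_pair(corr: pd.DataFrame):
    """Return the column pair with the largest absolute correlation."""
    # True above the diagonal only, so self-correlations and mirrored pairs are dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    best = pairs.abs().idxmax()
    return best, pairs[best]

# Made-up values with the same column names as the cleaned dataset.
toy = pd.DataFrame({
    "Rating": [4.1, 3.9, 4.7, 4.4, 4.3],
    "Reviews": [100, 2_000, 50_000, 900_000, 30],
    "Installs": [10_000, 100_000, 1_000_000, 50_000_000, 1_000],
    "Price": [0.0, 0.0, 4.99, 0.0, 0.99],
})
pair, value = strongest_pair(toy.corr(numeric_only=True))
print(pair, round(value, 2))
```

On the real data the same call would be `strongest_pair(cor)`, assuming `cor` is the correlation matrix computed earlier.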

It’s time to analyze

Now that the data is clean, I ask some specific questions and try to answer them analytically.

Questions:

1.1 Which category contains the most apps in the store?

1.2 What is the most expensive and most downloaded app in this category?

1.3 What are the most common genres in this category?

1.4 How many downloads do these genres have?

2. Which categories are downloaded the most?

3. Which categories are the most profitable?

4. Which categories contain the most expensive apps sold in the store?

Answers:

1.1 The category with the most apps is Family.

1.2 Minecraft is both the most downloaded and the most expensive app in the Family category.

1.3 The most common genres in the Family category are Role Playing and Education.

1.4 Among the most downloaded genres in the Family category, the Arcade & Action & Adventure genre leads with 10,110,000 downloads.

2. The most downloaded categories in the store are Game and Communication.

3. The most profitable category in the store is Finance.

4. The most expensive apps sold in the store are in the Lifestyle, Family and Finance categories.
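The code behind these answers is not shown in this post. As a rough sketch of how questions like these can be approached with pandas, the example below uses made-up rows with the same column names as the cleaned dataset. The revenue estimate (Price multiplied by Installs) is an assumption for illustration, since the dataset contains no sales figures.

```python
import pandas as pd

# Made-up rows with the same column names as the cleaned dataset.
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E", "F"],
    "Category": ["FAMILY", "FAMILY", "FAMILY", "GAME", "GAME", "FINANCE"],
    "Installs": [1_000, 5_000, 2_000, 100_000, 50_000, 500],
    "Price": [0.0, 6.99, 0.0, 0.0, 0.99, 399.99],
})

# 1.1: number of apps per category, largest first.
apps_per_category = apps["Category"].value_counts()

# 2: total installs per category, largest first.
installs_per_category = (
    apps.groupby("Category")["Installs"].sum().sort_values(ascending=False)
)

# 3: estimated revenue per category.
# Assumption: every install was a purchase at the listed price.
apps["Revenue"] = apps["Price"] * apps["Installs"]
revenue_per_category = (
    apps.groupby("Category")["Revenue"].sum().sort_values(ascending=False)
)

# 4: the highest-priced app in each category.
most_expensive = apps.loc[
    apps.groupby("Category")["Price"].idxmax(), ["Category", "App", "Price"]
]

print(apps_per_category.idxmax())
print(installs_per_category.index[0])
print(revenue_per_category.index[0])
print(most_expensive)
```

The same pattern extends to the Genres column for questions 1.3 and 1.4 by first filtering the rows to a single category.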

That completes the report we were asked for. I also explored some further questions out of curiosity. If you’re interested, the link to my GitHub repository is below, and it is open to everyone. If you like what I’ve done, don’t forget to click like. Good luck!

link: https://github.com/ASapayev/02-Data-Science-My-Mobapp-Studio-/tree/Abdukarim_Sapayev