Course Name: Data Science Lab (R4EC3012P)
Date: January-May 2023
Exploratory data analysis (EDA) is used by data scientists to analyse and investigate data sets, summarize their main characteristics, find patterns and anomalies (outliers), and form hypotheses based on their understanding of the data, often employing data visualization methods.
- About the Project
- Getting Started
- Theory and Approach
- Results and Outcomes
- Contributors
- Acknowledgements and Resources
We perform exploratory data analysis (EDA) on the Google Play Store data and report the results and outcomes.
The objective of this experiment is to deliver insights that help us understand customer demands better and thus help application developers popularize their products. In this project we examine the attributes in the data set that affect an application's popularity. We focus on answering questions such as:
- Which category has the greatest number of installations?
- How many free apps does the Play Store have?
- Which is the most common category of apps on Play Store?
- Which is the most expensive category?
- Which category has the highest number of reviews on Play Store?
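The questions above map naturally onto pandas aggregations. The following is a minimal sketch on a small made-up table, not the project's actual notebook code: the column names (`Category`, `Type`, `Installs`, `Price`, `Reviews`), the toy values, and the helper `summarize` are all assumptions for illustration, and "most expensive category" is interpreted here as the highest mean price.

```python
import pandas as pd

# Toy stand-in for the Play Store data. Column names and values are
# assumptions for illustration, not taken from the real data set.
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E", "F"],
    "Category": ["FAMILY", "FAMILY", "FAMILY",
                 "COMMUNICATION", "COMMUNICATION", "FINANCE"],
    "Type": ["Free", "Free", "Free", "Free", "Free", "Paid"],
    "Installs": [1000, 5000, 2000, 100000, 50000, 500],
    "Price": [0.0, 0.0, 0.0, 0.0, 0.0, 9.99],
    "Reviews": [10, 40, 20, 900, 300, 5],
})


def summarize(df):
    """Answer the five project questions for a frame with the assumed columns."""
    by_category = df.groupby("Category")
    return {
        # Category with the most rows (apps).
        "most_common_category": df["Category"].value_counts().idxmax(),
        # Fraction of apps whose Type is "Free".
        "free_app_share": (df["Type"] == "Free").mean(),
        # Category with the largest total number of installs.
        "most_installed_category": by_category["Installs"].sum().idxmax(),
        # Category with the highest mean price (one possible definition).
        "most_expensive_category": by_category["Price"].mean().idxmax(),
        # Category with the largest total number of reviews.
        "most_reviewed_category": by_category["Reviews"].sum().idxmax(),
    }


summary = summarize(apps)
for question, answer in summary.items():
    print(f"{question}: {answer}")
```

On the real data, columns such as installs and price would first need to be converted to numeric types if they are stored as text.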
- A working Python environment is required. You can refer here for the setup.
- Python libraries
- NumPy
pip install numpy
- Seaborn
pip install seaborn
- Pandas
pip install pandas
- Matplotlib
pip install matplotlib
For installation of pip you can refer here
Clone the repo
git clone https://github.com/Yash-Desh/Google-Playstore-EDA.git
Our data set contains a large number of null values in the rating column, so we drop those rows. Some other columns have a small number of null values, so we replace them with the mode of the respective column. The data set also contains duplicate rows for a single application; we drop these because they hold identical data. Finally, we drop rows whose rating is greater than 5, since Play Store ratings cannot exceed 5.
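The cleaning steps described above could be sketched as follows. This is an illustration under stated assumptions rather than the project's exact code: the column name `Rating`, the toy frame `raw`, and the helper `clean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with the problems described above: a null rating, a null in
# another column, a duplicated row, and a rating above 5.
# Column names and values are assumptions for illustration.
raw = pd.DataFrame({
    "App": ["A", "A", "B", "C", "D", "E"],
    "Rating": [4.5, 4.5, np.nan, 19.0, 3.9, 4.1],
    "Type": ["Free", "Free", "Free", "Free", None, "Paid"],
})


def clean(df, rating_col="Rating", max_rating=5.0):
    """Apply the four cleaning steps in the order the text lists them."""
    # 1. Drop rows whose rating is missing.
    out = df.dropna(subset=[rating_col])
    # 2. Fill the remaining nulls with the mode of their column.
    cols_with_nulls = out.columns[out.isna().any()]
    fill_values = {col: out[col].mode().iloc[0] for col in cols_with_nulls}
    out = out.fillna(fill_values)
    # 3. Drop fully identical rows.
    out = out.drop_duplicates()
    # 4. Drop rows whose rating exceeds the maximum allowed value.
    out = out[out[rating_col] <= max_rating]
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Filling with a dictionary of per-column modes returns a new frame, so the input `raw` is left unchanged.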
An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a different manner. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining. We find a point outlier in our dataset by filtering for ratings greater than 5:
df[df.Ratings>5]
Box plot and histogram with the outlier present:
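# Drop the outlier row found above (index label 10472)
df.drop([10472], inplace=True)
# Inspect the neighbouring rows to confirm the outlier is gone
df[10470:10475]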
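Dropping by a hard-coded index label works for this one data set, but the same step can also be written so that it does not depend on knowing the label in advance. A minimal sketch, assuming a numeric rating column; the helper `split_point_outliers` and the toy values are hypothetical:

```python
import pandas as pd


def split_point_outliers(df, column, upper=5.0):
    """Return (outliers, kept) where outliers are rows with df[column] > upper.

    Rows with a missing value in `column` compare as False and are kept.
    """
    mask = df[column] > upper
    return df[mask], df[~mask]


# Toy data; the index labels mimic the neighbourhood of the outlier row.
ratings = pd.DataFrame(
    {"App": ["W", "X", "Y", "Z"], "Ratings": [4.1, 19.0, 3.8, 5.0]},
    index=[10470, 10472, 10473, 10474],
)

outliers, kept = split_point_outliers(ratings, "Ratings")
print(outliers)  # the single row whose rating exceeds 5
print(kept)      # everything else
```

Returning the outliers alongside the kept rows makes it easy to inspect what was removed before discarding it.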
Box plot and histogram after the outlier is removed:
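Plots of this kind can be produced with Matplotlib. The sketch below is one possible way to draw a box plot and a histogram side by side; the helper `plot_rating_distribution`, the bin count, and the sample values are assumptions, and the non-interactive `Agg` backend is chosen only so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_rating_distribution(ratings, label):
    """Draw a box plot and a histogram of a rating series side by side."""
    values = ratings.dropna().to_numpy()
    fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
    ax_box.boxplot(values)
    ax_box.set_title(f"Box plot ({label})")
    ax_box.set_ylabel("Rating")
    ax_hist.hist(values, bins=20)
    ax_hist.set_title(f"Histogram ({label})")
    ax_hist.set_xlabel("Rating")
    ax_hist.set_ylabel("Number of apps")
    fig.tight_layout()
    return fig


# Made-up ratings containing one value above 5.
sample = pd.Series([4.1, 4.5, 3.8, 4.7, 19.0, 4.3, None])

fig_before = plot_rating_distribution(sample, "with outlier")
fig_after = plot_rating_distribution(sample[sample <= 5], "outlier removed")
fig_before.savefig("ratings_with_outlier.png")
fig_after.savefig("ratings_outlier_removed.png")
```

With the outlier present, the box plot axis stretches to accommodate the extreme value; after removal, the distribution of the remaining ratings becomes readable.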
Charts and graphs helped us uncover valuable insights from this complex data, revealing patterns and connections within user engagement metrics like app installations, ratings, and reviews across different categories. This approach highlighted popular app genres and showed how various factors impact app performance. Visualizing market trends and distribution of key variables guided decisions on app development, marketing, and user experience. Ultimately, these visuals provided a compelling way to communicate our findings, supporting the conclusions drawn from the analysis.
The following graphs depict the results of the visualization:
- Category VS Install:
- Category VS Pricing:
- Category VS Reviews:
Most of the apps are free, so developers should focus on creating free apps to build a large customer base. More apps should be developed in categories such as Events, Beauty, and Parenting, which have not been explored much but are still quite popular, with a large number of installations. To retain their customer base, developers should update their apps regularly and make their content available to everyone.
- Most common category of apps on the Play Store is: Family
- Percentage of free apps on the Play Store is: ~92%
- Category with the greatest number of app installs on the Play Store is: Communication
- Category with the greatest number of reviews is: Communication
- Category with the most expensive apps on the Play Store is: Finance
- Chirag Patil
- Yash Deshpande
- Atharva Bendre
- Shreyas Bhatlawande