The purpose of this project is to analyze Amazon reviews written by members of the paid Amazon Vine program, a service that allows manufacturers and publishers to receive reviews of their products, and to determine whether there is any bias between reviews from Vine members and reviews from non-Vine members.
Companies pay a fee to Amazon and may provide free products to Vine members, who are then required to publish a review. To determine whether Vine members are biased toward favorable reviews compared with non-members, we compare the percentage of 5-star ratings out of total ratings for each group. As part of this exercise, we were asked to choose one of 50 datasets to extract, transform, and load into a dataframe to complete our analysis. Throughout this analysis, we use:
- PySpark to extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into a Postgres database managed through pgAdmin.
- Google Colaboratory to import PySpark libraries and connect to Postgres in order to create SQL tables and export the results.
Out of the 50 datasets, we chose to analyze reviews in the "Grocery" category.
The URL for the dataset used for this analysis is https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Grocery_v1_00.tsv.gz
Our chosen dataset contains around 2.4 million reviews. To keep only the most helpful reviews, we filtered for those with at least 20 total votes and with helpful votes making up at least 50% of total votes.
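The filter described above can be sketched in plain Python. This is an illustration only, not the PySpark code used in the project: the field names `total_votes` and `helpful_votes` are assumed to match the dataset's columns, and the helper `is_helpful` and the sample rows are hypothetical.

```python
# Minimal sketch of the helpfulness filter (assumed field names, made-up rows).

def is_helpful(review, min_votes=20, min_ratio=0.5):
    """Return True if a review has enough votes and enough of them are helpful."""
    total = review["total_votes"]
    helpful = review["helpful_votes"]
    # Guard against division by zero before computing the helpful ratio.
    return total > 0 and total >= min_votes and helpful / total >= min_ratio

# Hypothetical sample rows for illustration only.
sample_reviews = [
    {"review_id": "A", "total_votes": 25, "helpful_votes": 20},  # kept
    {"review_id": "B", "total_votes": 10, "helpful_votes": 10},  # too few votes
    {"review_id": "C", "total_votes": 40, "helpful_votes": 10},  # ratio below 50%
    {"review_id": "D", "total_votes": 20, "helpful_votes": 10},  # kept (boundary)
]

kept = [r["review_id"] for r in sample_reviews if is_helpful(r)]
print(kept)
```

Both thresholds are inclusive, so a review with exactly 20 votes and exactly half of them helpful passes the filter.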
Our findings are as follows:
- There are 61 paid (Vine) reviews and 28,287 unpaid (non-Vine) reviews in total, as shown in the code below.
- There are 20 five-star paid (Vine) reviews and 15,689 five-star unpaid (non-Vine) reviews.
- 33 percent of Vine reviews were 5 stars, and 55 percent of non-Vine reviews were 5 stars.
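As a quick arithmetic check, the percentages follow directly from the counts listed above. The sketch below uses only those reported counts; the helper name `five_star_percent` is hypothetical and is not taken from the project code.

```python
# Recompute the five-star percentages from the reported counts.

def five_star_percent(five_star_count, total_count):
    """Percentage of reviews in a group that are five-star."""
    return 100.0 * five_star_count / total_count

vine_pct = five_star_percent(20, 61)            # paid (Vine) reviews
non_vine_pct = five_star_percent(15689, 28287)  # unpaid (non-Vine) reviews

print(round(vine_pct), round(non_vine_pct))  # prints: 33 55
```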
Our analysis shows no strong bias toward five-star reviews from paid Amazon Vine reviewers: 33 percent of Vine reviews were 5 stars, compared with 55 percent of non-Vine reviews. This conclusion could be examined further by looking at the distribution of all star levels across paid and unpaid reviews. For a more thorough analysis, the same analysis should also be conducted across a few different grocery product categories.