This program uses Naive Bayes with Laplace smoothing to predict the sentiment of movie reviews from the IMDb dataset.
If the Jupyter notebook (main.ipynb) is not rendering, visit this link OR open main.py in this repo
- clone or download this repository
```sh
git clone https://github.com/jadessechan/Sentiment-Analysis.git
```
- open main.ipynb or main.py
- run the code to see the final prediction for a movie review
In real-world scenarios class imbalance is very common, but for the purposes of this project I chose a balanced dataset:
it is split equally between positive and negative sentiment, with 25,000 reviews of each.
I used the plotly library to graph the class distribution:

```python
import plotly.express as px

# plot the number of reviews per sentiment ('sentiment' is the assumed label column)
px.histogram(df, x="sentiment")
```
- perform EDA (exploratory data analysis)
- plot and visualize the main attributes of the dataset using the pandas dataframe
- split the data for training and testing (90% delegated for training and 10% for testing)

The sentiments are distributed randomly throughout the dataset, thankfully (less sorting to do)!
```python
train = df.sample(frac=0.9)
# use the remaining rows for testing so the two sets do not overlap
test = df.drop(train.index)
```
- distinguish between positive and negative reviews
- pre-process the training data
- make 2 BOW (bag-of-words) models, one for each sentiment (Python's Counter is a handy class for quickly getting a dictionary of words as keys and their frequencies as values); see the sketch after the snippet below
```python
from collections import Counter

# count how many times each token appears in the (tokenized) text
Counter(text)
```
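As a rough sketch of this step (assuming the dataframe has 'review' and 'sentiment' columns and a hypothetical clean_and_tokenize helper, sketched in the pre-processing part below), the two BOW models could be built like this:

```python
from collections import Counter

# one word-frequency model per class
positive_counts = Counter()
negative_counts = Counter()

for _, row in train.iterrows():
    tokens = clean_and_tokenize(row["review"])  # hypothetical pre-processing helper
    if row["sentiment"] == "positive":
        positive_counts.update(tokens)
    else:
        negative_counts.update(tokens)
```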
- The functions to filter and clean the text are the same as the ones in my text prediction program, where they are used for regex parsing. But unlike in the text prediction project, I removed stopwords in order to ignore extraneous information:
```python
for words in tokens:
    if words not in stopwords:
        # lemmatize words
        output.append(wnl.lemmatize(words))
```
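Put together, a minimal pre-processing sketch (assuming NLTK's English stopword list and WordNetLemmatizer, which the snippet above refers to as stopwords and wnl, and a hypothetical clean_and_tokenize name) might look like:

```python
import re
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer

# requires nltk.download('stopwords') and nltk.download('wordnet') beforehand
stopwords = set(nltk_stopwords.words("english"))
wnl = WordNetLemmatizer()

def clean_and_tokenize(review):
    """Lowercase a review, keep letters only, drop stopwords, and lemmatize the rest."""
    review = re.sub(r"[^a-z\s]", " ", review.lower())
    tokens = review.split()
    output = []
    for words in tokens:
        if words not in stopwords:
            output.append(wnl.lemmatize(words))
    return output
```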
- compute the probability of each class occurring in the data
- predict on the testing set
- compute the error

This is where the magic happens, featuring the Naive Bayes algorithm and Laplace smoothing! 🪄
```python
def make_class_prediction(tokens, counts, class_prob, class_count):
    """ compute the classification score of each sentiment based on its probability in the training set """
    prediction = 1
    text_counts = Counter(tokens)
    for word in text_counts:
        # get 'word' freq in the reviews for a given class, add 1 to smooth the value
        # add-1 smoothing prevents multiplying the prediction by 0 (in case 'word' is not in the training set)
        prediction *= text_counts.get(word) * ((counts.get(word, 0) + 1) / (sum(counts.values()) + class_count))
    return prediction * class_prob
```
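To turn those per-class scores into a label, the class priors come from the training set and the larger score wins. A rough sketch (re-using the hypothetical positive_counts / negative_counts models and clean_and_tokenize helper from above, with an assumed 'sentiment' column) could look like:

```python
# class priors: fraction of training reviews in each class
positive_reviews = train[train["sentiment"] == "positive"]
negative_reviews = train[train["sentiment"] == "negative"]
prob_positive = len(positive_reviews) / len(train)
prob_negative = len(negative_reviews) / len(train)

def predict_sentiment(review):
    tokens = clean_and_tokenize(review)
    pos_score = make_class_prediction(tokens, positive_counts, prob_positive, len(positive_reviews))
    neg_score = make_class_prediction(tokens, negative_counts, prob_negative, len(negative_reviews))
    return "positive" if pos_score > neg_score else "negative"

predictions = [predict_sentiment(review) for review in test["review"]]
```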
And finally, I calculated the algorithm's accuracy by comparing two lists that held the algorithm's classification decision and the actual classification of each test review.
```python
wrong = 0
for i in range(len(predictions)):
    if predictions[i] != actual[i]:
        wrong += 1
percent_error = (wrong * 100) / len(predictions)  # error rate over the test set
```
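As a final check, a single unseen review can be pushed through the same pipeline (again using the hypothetical predict_sentiment helper sketched above) to see the final prediction for a movie review:

```python
sample_review = "This movie was an absolute delight from start to finish!"
print(predict_sentiment(sample_review))  # a glowing review like this should come out 'positive'
```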