This program uses Naive Bayes with Laplace smoothing to predict the sentiment of movie reviews from the IMDb dataset.
If the Jupyter notebook (main.ipynb) is not rendering, visit this link OR open main.py in this repo
- clone or download this repository
```sh
git clone https://github.com/jadessechan/Sentiment-Analysis.git
```
- open main.ipynb or main.py
- run the code to see the final prediction for a movie review
In real-world scenarios class imbalance is very common, but for the purposes of this project I chose a balanced dataset:
it is split equally between positive and negative sentiment, with 25,000 reviews of each.
I used the plotly library to graph the class distribution:

```python
import plotly.express as px

# plot the number of reviews per sentiment ('sentiment' is the assumed label column)
px.histogram(df, x="sentiment")
```
- perform EDA (exploratory data analysis)
- plot and visualize the main attributes of the dataset using the pandas dataframe
- split the data for training and testing (90% delegated for training and 10% for testing)

The sentiments are distributed randomly throughout the dataset, thankfully (less sorting to do)!
```python
train = df.sample(frac=0.9)
# use the remaining rows for testing so the two sets do not overlap
test = df.drop(train.index)
```
- distinguish between positive and negative reviews
- pre-process the training data
- make 2 BOW (bag-of-words) models, one for each sentiment (Python's Counter is a handy class for quickly getting a dictionary of words as keys and their frequencies as values); see the sketch after the snippet below
```python
from collections import Counter

# count how many times each token appears in the (tokenized) text
Counter(text)
```
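As a rough sketch of this step (assuming the dataframe has 'review' and 'sentiment' columns and a hypothetical clean_and_tokenize helper, sketched in the pre-processing part below), the two BOW models could be built like this:

```python
from collections import Counter

# one word-frequency model per class
positive_counts = Counter()
negative_counts = Counter()

for _, row in train.iterrows():
    tokens = clean_and_tokenize(row["review"])  # hypothetical pre-processing helper
    if row["sentiment"] == "positive":
        positive_counts.update(tokens)
    else:
        negative_counts.update(tokens)
```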
- The functions to filter and clean the text are the same as the ones in my text prediction program, where they are used for regex parsing. But unlike in the text prediction project, I removed stopwords in order to ignore extraneous information:
```python
for words in tokens:
    if words not in stopwords:
        # lemmatize words
        output.append(wnl.lemmatize(words))
```
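Put together, a minimal pre-processing sketch (assuming NLTK's English stopword list and WordNetLemmatizer, which the snippet above refers to as stopwords and wnl, and a hypothetical clean_and_tokenize name) might look like:

```python
import re
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer

# requires nltk.download('stopwords') and nltk.download('wordnet') beforehand
stopwords = set(nltk_stopwords.words("english"))
wnl = WordNetLemmatizer()

def clean_and_tokenize(review):
    """Lowercase a review, keep letters only, drop stopwords, and lemmatize the rest."""
    review = re.sub(r"[^a-z\s]", " ", review.lower())
    tokens = review.split()
    output = []
    for words in tokens:
        if words not in stopwords:
            output.append(wnl.lemmatize(words))
    return output
```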
- compute the probability of each class occurring in the data
- predict on the testing set
- compute the error

This is where the magic happens, featuring the Naive Bayes algorithm and Laplace smoothing! 🪄
```python
def make_class_prediction(tokens, counts, class_prob, class_count):
    """ compute the classification score of each sentiment based on its probability in the training set """
    prediction = 1
    text_counts = Counter(tokens)
    for word in text_counts:
        # get 'word' freq in the reviews for a given class, add 1 to smooth the value
        # add-1 smoothing prevents multiplying the prediction by 0 (in case 'word' is not in the training set)
        prediction *= text_counts.get(word) * ((counts.get(word, 0) + 1) / (sum(counts.values()) + class_count))
    return prediction * class_prob
```
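To turn those per-class scores into a label, the class priors come from the training set and the larger score wins. A rough sketch (re-using the hypothetical positive_counts / negative_counts models and clean_and_tokenize helper from above, with an assumed 'sentiment' column) could look like:

```python
# class priors: fraction of training reviews in each class
positive_reviews = train[train["sentiment"] == "positive"]
negative_reviews = train[train["sentiment"] == "negative"]
prob_positive = len(positive_reviews) / len(train)
prob_negative = len(negative_reviews) / len(train)

def predict_sentiment(review):
    tokens = clean_and_tokenize(review)
    pos_score = make_class_prediction(tokens, positive_counts, prob_positive, len(positive_reviews))
    neg_score = make_class_prediction(tokens, negative_counts, prob_negative, len(negative_reviews))
    return "positive" if pos_score > neg_score else "negative"

predictions = [predict_sentiment(review) for review in test["review"]]
```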
And finally, I calculated the algorithm's accuracy by comparing two lists that held the algorithm's classification decision and the actual classification of each test review.
```python
wrong = 0
for i in range(len(predictions)):
    if predictions[i] != actual[i]:
        wrong += 1
percent_error = (wrong * 100) / len(predictions)  # error rate over the test set
```
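As a final check, a single unseen review can be pushed through the same pipeline (again using the hypothetical predict_sentiment helper sketched above) to see the final prediction for a movie review:

```python
sample_review = "This movie was an absolute delight from start to finish!"
print(predict_sentiment(sample_review))  # a glowing review like this should come out 'positive'
```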