AIT726

HW1

In this assignment, you will build a naïve Bayes classifier and a logistic regression classifier for sentiment classification. Sentiment classification is defined here as a two-class problem: positive and negative. The data set consists of movie reviews. The zip archive for the data contains training and test sets, with one movie review per file. You will build each model using the training data and evaluate it on the test data. The training set and the test set each contain 25,000 reviews.
Models:
* Naive Bayes
* Logistic Regression
Vectors:
* Bag of Words - Frequency
* Bag of Words - Binary (is word present in document)
* Term Frequency Inverse Document Frequency
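
The three vector types listed above can be illustrated with a minimal sketch. The function names, the toy documents, and the exact TF-IDF weighting (raw count times `log(N / df)`) are assumptions made for illustration; the weighting used in HW1.py may differ.

```python
import math
from collections import Counter


def bow_frequency(tokens, vocab):
    """Bag of words (frequency): how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def bow_binary(tokens, vocab):
    """Bag of words (binary): 1 if the word is present in the document, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]


def tfidf_vectors(docs, vocab):
    """TF-IDF using raw term counts and idf = log(N / df) (one common variant)."""
    n_docs = len(docs)
    doc_sets = [set(d) for d in docs]
    idf = {}
    for w in vocab:
        df = sum(1 for s in doc_sets if w in s)  # number of documents containing w
        idf[w] = math.log(n_docs / df) if df else 0.0
    return [
        [count * idf[w] for count, w in zip(bow_frequency(d, vocab), vocab)]
        for d in docs
    ]


# Toy data (hypothetical, not taken from the review dataset)
docs = [["good", "good", "movie"], ["bad", "movie"]]
vocab = ["bad", "good", "movie"]

freq_vec = bow_frequency(docs[0], vocab)
binary_vec = bow_binary(docs[0], vocab)
tfidf_vecs = tfidf_vectors(docs, vocab)
```

In this variant, a word that appears in every document (here "movie") gets an IDF of zero, so it contributes nothing to the TF-IDF vector.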
Preprocessing:
* Stemmed
* Unstemmed
Command to run the file:
python HW1.py

i. main - runs all of the functions below
ii. get_trainandtest_vocabanddocs() - converts the dataset into tokens (stemmed and unstemmed), creates the combined "mega" training document, and extracts the vocabulary
iii. get_vectors() - creates the BOW and TF-IDF vectors for the train and test sets, both stemmed and unstemmed
iv. get_class_priors() - calculates the class prior probabilities used in the Naive Bayes predictions
v. get_perword_likelihood() - calculates the per-word likelihood dictionaries for each feature vector, used in the Naive Bayes prediction calculation
vi. predict_NB() - predicts the class of every test document for each of the feature vectors using Naive Bayes
vii. evaluate - returns the accuracy and confusion matrix for a set of predictions
viii. Logistic_Regression_L2_SGD - logistic regression model class used to train the model and make predictions on the test vectors
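
The Naive Bayes steps in the list above (class priors, per-word likelihoods, prediction, evaluation) can be sketched roughly as follows. This is an illustrative multinomial Naive Bayes with add-one smoothing, not the code in HW1.py: the function names, signatures, smoothing choice, and toy data are all assumptions.

```python
import math
from collections import Counter


def class_priors(labels):
    """P(class) estimated as the fraction of training documents in each class."""
    total = len(labels)
    return {c: n / total for c, n in Counter(labels).items()}


def per_word_likelihoods(docs, labels, vocab):
    """P(word | class) with add-one (Laplace) smoothing, one dict per class."""
    counts = {c: Counter() for c in set(labels)}
    for tokens, c in zip(docs, labels):
        counts[c].update(t for t in tokens if t in vocab)
    likelihoods = {}
    for c, word_counts in counts.items():
        denom = sum(word_counts.values()) + len(vocab)
        likelihoods[c] = {w: (word_counts[w] + 1) / denom for w in vocab}
    return likelihoods


def predict_nb(tokens, priors, likelihoods):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for t in tokens:
            if t in likelihoods[c]:  # words outside the vocabulary are ignored
                score += math.log(likelihoods[c][t])
        scores[c] = score
    return max(scores, key=scores.get)


def accuracy_and_confusion(y_true, y_pred):
    """Return accuracy and a Counter keyed by (true label, predicted label)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true), Counter(zip(y_true, y_pred))


# Toy training data (hypothetical)
train_docs = [["great", "fun"], ["boring", "bad"], ["great", "story"], ["bad", "plot"]]
train_labels = ["pos", "neg", "pos", "neg"]
vocab = {w for d in train_docs for w in d}

priors = class_priors(train_labels)
likelihoods = per_word_likelihoods(train_docs, train_labels, vocab)

test_docs = [["great", "fun"], ["bad", "boring"]]
test_labels = ["pos", "neg"]
predictions = [predict_nb(d, priors, likelihoods) for d in test_docs]
acc, confusion = accuracy_and_confusion(test_labels, predictions)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long reviews.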

Because of the size of the dataset and the number of tokens we are required to keep, many of the operations that create the vectors use a large amount of RAM.
This code was tested on a machine with 64 GB of DDR4 RAM. Variables are deleted as soon as they are no longer needed to save memory, and data structures needed later are saved and then loaded when required.
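
The save, delete, and reload pattern described above might look like the following sketch. The README does not say which serialization format HW1.py uses, so `pickle`, the helper names, and the file name here are assumptions.

```python
import gc
import os
import pickle
import tempfile


def save_object(obj, path):
    """Write an object to disk so the in-memory copy can be released."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_object(path):
    """Read a previously saved object back into memory."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def roundtrip_demo():
    """Save a small stand-in for a large vector set, free it, then reload it."""
    vectors = [[0, 2, 1], [1, 0, 1]]  # placeholder for a large feature matrix
    path = os.path.join(tempfile.mkdtemp(), "train_vectors.pkl")  # hypothetical file name
    save_object(vectors, path)
    del vectors   # drop the only reference to the in-memory copy
    gc.collect()  # optionally ask the collector to reclaim memory right away
    return load_object(path)


reloaded = roundtrip_demo()
```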

Results:

Model	Accuracy (%)
NB-NOSTEM-FREQ	80.8
NB-NOSTEM-BINARY	81.6
NB-NOSTEM-TFIDF	71.26
NB-STEM-FREQ	80.272
NB-STEM-BINARY	80.86
NB-STEM-TFIDF	68.968
LOGISTIC_FREQ_NOL2	82.488
LOGISTIC_TFIDF_NOL2	88.588
LOGISTIC_FREQ_STEM_NOL2	61.508
LOGISTIC_BINARY_STEM_NOL2	83.18
LOGISTIC_TFIDF_STEM_NOL2	88.144
LOGISTIC_FREQ_L2	55.316
LOGISTIC_BINARY_L2	78.18
LOGISTIC_TFIDF_L2	86.068
LOGISTIC_FREQ_STEM_L2	58.36
LOGISTIC_BINARY_STEM_L2	79.732
LOGISTIC_TFIDF_STEM_L2	85.592
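
The `_L2` and `_NOL2` suffixes in the table presumably indicate whether an L2 penalty was applied during training. As a rough illustration of what that toggle changes, here is a minimal logistic regression trained with stochastic gradient descent. It is not the `Logistic_Regression_L2_SGD` class itself: the function names, learning rate, epoch count, penalty strength, and toy data are assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logreg_sgd(X, y, lr=0.1, l2=0.0, epochs=50, seed=0):
    """Fit weights by SGD on log loss plus (l2 / 2) * ||w||^2; l2=0 disables the penalty."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order each epoch
        for i in order:
            z = b + sum(wj * xj for wj, xj in zip(w, X[i]))
            err = sigmoid(z) - y[i]  # derivative of the log loss with respect to z
            w = [wj - lr * (err * xj + l2 * wj) for wj, xj in zip(w, X[i])]
            b -= lr * err  # the bias term is left unregularized
    return w, b


def predict_logreg(X, w, b):
    """Label 1 when the linear score is non-negative, else 0."""
    return [1 if b + sum(wj * xj for wj, xj in zip(w, x)) >= 0 else 0 for x in X]


# Toy, linearly separable data (hypothetical): feature 0 signals positive, feature 1 negative
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [1, 1, 0, 0]

w_plain, b_plain = train_logreg_sgd(X, y, l2=0.0)
w_l2, b_l2 = train_logreg_sgd(X, y, l2=0.1)

norm_plain = math.sqrt(sum(v * v for v in w_plain))
norm_l2 = math.sqrt(sum(v * v for v in w_l2))
```

The penalty pulls the weights toward zero at every update, so the regularized model ends up with a smaller weight norm on the same data.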
