GitHub - a-sani/nlp-sentiment-analysis: Sentiment Analysis and Topic Classification of News Articles.

Data Scraping

Steps to Run

Install all requirements listed in requirements.txt

pip3 install -r requirements.txt
Ensure you're running Python 3:

python3 --version or which python3
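The same check can be made from inside Python, which is handy when it is unclear which interpreter a `python` alias points to. This is a minimal sketch; `is_python3` is a hypothetical helper name, not something from this repository.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the interpreter's major version is 3."""
    return version_info[0] == 3


if __name__ == "__main__":
    # Print the running interpreter's version and whether it is Python 3.
    status = "OK" if is_python3() else "not Python 3"
    print(sys.version.split()[0], status)
```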
Once all dependencies are installed, run the cleaning script, which merges the separate BBC text files into a single CSV file, if you have not already done so. The script expects a bbc-data folder containing 5 subfolders of text files, one per category, which we have included. Inside the data_preparation folder:

python3 clean_data.py
Or use python instead of python3, depending on how your aliases are set up.
This should create a file called bbc_articles.csv in the input folder.
These are the articles used for sentiment analysis and topic classification, since they are labelled.
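To illustrate the kind of merge this step describes, here is a minimal sketch that walks a folder of category subfolders and writes one CSV row per text file. It is not the code in clean_data.py: the function names, the `category` and `text` column names, and the whitespace normalisation are all assumptions made for the example.

```python
import csv
from pathlib import Path


def collect_articles(data_dir):
    """Yield (category, text) pairs, one per .txt file in data_dir/<category>/."""
    for category_dir in sorted(Path(data_dir).iterdir()):
        if not category_dir.is_dir():
            continue
        for txt_file in sorted(category_dir.glob("*.txt")):
            # errors="replace" keeps the merge going if a file is not valid UTF-8.
            text = txt_file.read_text(encoding="utf-8", errors="replace")
            # Collapse newlines and repeated spaces so each article fits one cell.
            yield category_dir.name, " ".join(text.split())


def write_csv(data_dir, csv_path):
    """Write all articles to a single CSV and return the number of rows written."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["category", "text"])  # hypothetical column names
        for category, text in collect_articles(data_dir):
            writer.writerow([category, text])
            count += 1
    return count
```

Under these assumptions, `write_csv("bbc-data", "input/bbc_articles.csv")` would produce a labelled file where the subfolder name serves as the category label.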
Optionally, run the news website scraper in the data_preparation directory. Warning: this will take 2+ hours.

python3 scrape_data.py
This should create a file called scraped_news.csv in the input folder.
To run sentiment analysis, open sentiment.py in the sentiment_analysis folder and make sure it is reading the BBC data CSV. There is commented code for switching between the BBC data and the scraped data. Running it with the BBC data will produce a graph; running it with the scraped data will not. It takes 3-5 minutes to run, depending on which data set you are using, and outputs a number of CSVs and graphs into the output/sentiment folder.

python3 sentiment.py
To see some statistics about the two datasets, run stats.py inside the data_preparation directory. It prints its results to the terminal.

python3 stats.py
To train the SVC model using the clean data obtained from the previous steps, run SVC.py inside the topic_classification directory. This will output two pickle files into the topic_classification folder. Training takes around 30 minutes, so if you don't want to wait, the model files can be downloaded from https://drive.google.com/drive/folders/1myTIUoPaSt1ujO9rfDI6IJIVPdYy5lt2?usp=sharing.

python3 svc.py
To train the two NMF models, run nmf.py inside the topic_classification directory. This will output two pickle files into the topic_classification folder. Training takes around 20 minutes, so if you don't want to wait, the model files can be downloaded from https://drive.google.com/drive/folders/1myTIUoPaSt1ujO9rfDI6IJIVPdYy5lt2?usp=sharing.

python3 nmf.py
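Because both training steps write pickle files and are slow, a common pattern is to reuse a saved model when one exists and train only when it is missing. The sketch below shows that pattern with the standard library. `load_or_train` and its arguments are hypothetical, and the actual pickle file names used by this project are not stated here, so none are assumed.

```python
import pickle
from pathlib import Path


def load_or_train(model_path, train_fn):
    """Return the pickled object at model_path, training and saving it if absent."""
    model_path = Path(model_path)
    if model_path.exists():
        # Reuse a previously trained or downloaded model file.
        with model_path.open("rb") as handle:
            return pickle.load(handle)
    # No saved model yet: run the (possibly slow) training function once.
    model = train_fn()
    with model_path.open("wb") as handle:
        pickle.dump(model, handle)
    return model
```

Note that unpickling can execute arbitrary code, so only load pickle files from a source you trust.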

To check how accurate the topic classification is, run get_accuracy.py inside the topic_classification directory. This will output several graphs into the output folder. The Predicted-vs-Category graph is the most interesting.

python3 get_accuracy.py
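As a rough illustration of what an accuracy check and a predicted-vs-category comparison involve, the sketch below computes both from two lists of labels. It does not reproduce get_accuracy.py; the function names and the sample labels are assumptions made for the example.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true category."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        return 0.0
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def predicted_vs_category(true_labels, predicted_labels):
    """Count how often each (true category, predicted category) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


if __name__ == "__main__":
    # Illustrative labels only, not results from this project.
    truth = ["sport", "sport", "tech", "business"]
    predicted = ["sport", "tech", "tech", "business"]
    print(accuracy(truth, predicted))
    for (category, guess), n in sorted(predicted_vs_category(truth, predicted).items()):
        print(category, guess, n)
```

Pairs where the two labels differ show which categories the classifier confuses with each other, which is the information a predicted-vs-category graph makes visible.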
Access the map-based user application we made, which allows users to filter news articles by topic and sentiment:

https://teletubbies-front-end.vercel.app/

As a footnote, helper.py is included because it contains functions we wrote for cleaning and processing text that are used in multiple places.

Folders and files in the repository:
.vscode
READMEs
data_preparation
front_end
geocoding
input
output
sentiment_analysis
topic_classification
.gitignore
README.md
project.ipynb
requirements.txt
