Steps to Run
Install all requirements listed in requirements.txt
pip3 install -r requirements.txt
Ensure you're running Python 3:
python3 --version
or
which python3
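The same check can also be made from inside Python. The snippet below is a minimal sketch; the helper name is_python3 is illustrative and is not part of this repository.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the interpreter's major version is 3."""
    return version_info[0] == 3


# Stop early with a readable message instead of failing later on syntax errors.
if not is_python3():
    raise SystemExit("Python 3 is required, found " + sys.version.split()[0])
print("Running Python", sys.version.split()[0])
```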
Once all dependencies are installed, run the cleaning script, if you have not done so already. It takes all the separate BBC text files and converts them into one CSV file. The script relies on a bbc-data folder containing 5 subfolders with text files for all the different categories, which we have included. Run it from inside the data_preparation folder with python3, or just python, depending on how you have your aliases set up.
This should create a CSV file in the input folder.
These are the articles that will be used for sentiment analysis and topic classification, since they are labelled.
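For readers who want to see what this conversion step amounts to, here is a minimal sketch. It assumes one subfolder per category with .txt files inside, as described above. The function name build_csv and the column names category, filename and text are assumptions for illustration, and the real cleaning script may do more (for example, text cleaning).

```python
import csv
from pathlib import Path


def build_csv(data_dir, out_path):
    """Write one CSV row per text file and return the number of articles written."""
    data_dir = Path(data_dir)
    rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["category", "filename", "text"])
        # Each subfolder name is used as the label for the files it contains.
        for category_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
            for text_file in sorted(category_dir.glob("*.txt")):
                # Replace undecodable bytes rather than aborting the whole run.
                text = text_file.read_text(encoding="utf-8", errors="replace")
                writer.writerow([category_dir.name, text_file.name, text.strip()])
                rows += 1
    return rows
```

Calling build_csv("bbc-data", "articles.csv") would write every article into a single labelled file; articles.csv is a placeholder name, not the name the project uses.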
Optionally, run the news website scraper in the data_preparation directory. **Warning: this will take 2+ hours.**
This should create a file in the input folder.
To run sentiment analysis, open the script in the sentiment_analysis folder and make sure it is reading the BBC data CSV. Running it with the BBC data will produce a graph; with the scraped data it will not. It takes 3-5 minutes to run, depending on which data set you are using, and outputs a number of CSVs and graphs into the output/sentiment folder. There is commented code to switch between reading the BBC data and the scraped data. Run the script with python3.
To see some statistics about the two datasets, run the statistics script inside the data_preparation directory with python3. It outputs information to the terminal.
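As a rough idea of the kind of summary such a step can print, the sketch below counts articles and average word length per category. It assumes a CSV with category and text columns; those column names and both function names are assumptions, and the project's own script may report different figures.

```python
import csv
from collections import Counter


def category_stats(csv_path, label_column="category", text_column="text"):
    """Return {label: (article_count, mean_words_per_article)} for a labelled CSV."""
    article_counts = Counter()
    word_counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            label = row[label_column]
            article_counts[label] += 1
            # Whitespace splitting is a crude but dependency-free word count.
            word_counts[label] += len(row[text_column].split())
    return {
        label: (count, word_counts[label] / count)
        for label, count in article_counts.items()
    }


def print_stats(stats):
    """Print one line per category, sorted by label."""
    for label in sorted(stats):
        count, mean_words = stats[label]
        print(f"{label}: {count} articles, {mean_words:.1f} words on average")
```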
To train the SVC model using the clean data obtained from the previous steps, run the training script inside the topic_classification directory. This will output two pickle files into the topic_classification folder. This takes around 30 minutes, so if you don't want to wait, the model files can be downloaded instead.
To train the two NMF models, run the training script inside the topic_classification directory. This will output two pickle files into the topic_classification folder. This takes around 20 minutes, so if you don't want to wait, the model files can be downloaded instead.
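Both training steps write pickle files, so a downloaded model and a freshly trained one can be used interchangeably. The sketch below shows one way to reuse a pickle when it exists and train only when it does not. The helper load_or_train and its train_fn argument are hypothetical and are not taken from this repository. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.

```python
import pickle
from pathlib import Path


def load_or_train(model_path, train_fn):
    """Return the pickled object at model_path, training and saving it first if absent."""
    model_path = Path(model_path)
    if model_path.exists():
        # A downloaded or previously trained model is reused as-is.
        with model_path.open("rb") as handle:
            return pickle.load(handle)
    model = train_fn()  # the slow step, skipped whenever the file is already there
    with model_path.open("wb") as handle:
        pickle.dump(model, handle)
    return model
```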
To check the performance of the topic classification and see how accurate it is, run the evaluation script inside the topic_classification directory with python3. This will output numerous graphs into the output folder. The Predicted-vs-Category graph is the most interesting.
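A predicted-versus-category comparison boils down to counting how often each true label is paired with each predicted label. The following is a minimal sketch of that tally and of overall accuracy; the function names are illustrative, and the project's script produces graphs rather than this raw table.

```python
from collections import Counter


def predicted_vs_category(true_labels, predicted_labels):
    """Count (true_label, predicted_label) pairs, i.e. a sparse confusion matrix."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))


def accuracy(true_labels, predicted_labels):
    """Fraction of articles whose predicted label matches the true label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)
```

Entries where the two labels differ show which categories the classifier confuses with each other.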
Access the map-based user application we made, which allows users to filter news articles by topic and sentiment.
As a footnote, a shared helper file is included because it contains functions we created for cleaning and processing text, which are used in multiple places.