ML Project 2: Unsupervised Topic Modeling of Video Game Related Articles

This repository contains the code for the ML4Science Project. In this project, we look at different ways of generating topics from a news corpus. The corpus contains Swiss articles related to video games from 1964 to 2018. Unsupervised models such as Latent Semantic Analysis and Latent Dirichlet Allocation are used to create topic representations.

1. Data

The dataset comes from the Impresso project. More details can be found here.

2. Libraries/How to Run

Our code requires many libraries, all of them available through conda or pip:

Conda conda intall <library-name>:
- numpy
- pandas
- seaborn
- matplotlib
- gensim
- notebook
- nltk
- vadersentiment
- vadersentiment-fr
- sklearn
- tensorflow
pip pip install <library-name>:
- bertopic
others:
- python -m spacy download fr_core_news_sm

All of the code can then be run as notebooks:

cd <path to the directory of this file>
conda activate # If conda not already active
jupyter notebook

Bertopic may cause problems in Jupyter, but can be run in Google Colab.

3. Repository Structure and Files

helpers folder

preprocessing.py: loading the data, cleaning the data, pipeline for NLP, creating a dictionary and corpus

models.py: LSA and LDA models, evaluation metrics

utilities.py: sentiment analysis helpers

models folder

pretrained LDA model
Brown clustering output

data folder

impresso3.csv: dataset

coherence folder

results from hyperparameter tuning that takes a significant amount of time to run

Notebooks

lda_tuning.ipynb : Latent Dirichlet Allocation training, tuning, exploration, plotting of topics over time

other_models.ipynb : Latent Semantic Analysis (gensim), Brown Clustering

sentiments.ipynb : Sentiment Analysis

Tf-idf.ipynb : Exploration of the data by projecting it in lower dimensions to identify topics and clusters (LSA)

Word2Vec.ipynb : Example of the usage of Word2Vec for word embedding

berttopicmodel.ipynb : Exploration using BERTopic

Additional files

Cluster_axis_1_4.html is the output of an interactive plot from the notebook which allows to see point with assigned journals and years from the tf-idf notebook.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ML Project 2: Unsupervised Topic Modeling of Video Game Related Articles

1. Data

2. Libraries/How to Run

3. Repository Structure and Files

helpers folder

models folder

data folder

coherence folder

Notebooks

Additional files

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
coherence		coherence
data		data
helpers		helpers
models		models
.gitignore		.gitignore
README.md		README.md
Tf-idf.ipynb		Tf-idf.ipynb
Word2Vec.ipynb		Word2Vec.ipynb
berttopicmodel.ipynb		berttopicmodel.ipynb
cluster_axis_1_4.html		cluster_axis_1_4.html
lda_tuning.ipynb		lda_tuning.ipynb
other_models.ipynb		other_models.ipynb
sentiments.ipynb		sentiments.ipynb

hugolan/Fork-ML-LSA-LDA

Folders and files

Latest commit

History

Repository files navigation

ML Project 2: Unsupervised Topic Modeling of Video Game Related Articles

1. Data

2. Libraries/How to Run

3. Repository Structure and Files

helpers folder

models folder

data folder

coherence folder

Notebooks

Additional files

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages