Implementation of Platform-Invariant Topic Modeling
This study enhances the performance of topic models by developing a platform-invariant contrastive learning algorithm and removing platform-specific jargon word sets, minimizing the influence unique to each platform.
For this study, data was collected directly from three platforms: X, Reddit, and YouTube. The three datasets can be found in the sub-directory named data. Each file is raw data collected from one of the three platforms (X, Reddit, YouTube) using the keyword “ChatGPT”. You can use this data when implementing PITopic.
PITopic.ipynb is the main notebook for training and evaluating the model. It implements the platform-invariant contrastive learning algorithm and removes the platform-specific jargon word sets.
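The exact platform-invariant objective is defined in the notebook and models/model.py. As a rough illustration of the general idea only (not the paper's exact loss), an InfoNCE-style contrastive loss that pairs embeddings of the same document under two platform-specific views might look like this; the function name, arguments, and temperature value are all hypothetical:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """Generic InfoNCE contrastive loss (illustrative sketch only).

    anchor, positive: (batch, dim) embeddings where anchor[i] and
    positive[i] form a positive pair, e.g. representations of the same
    document seen through two different platform-specific views.
    """
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.t() / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0))   # positive pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 32), torch.randn(8, 32))
```

Pulling matched representations together while pushing apart the rest is what encourages the learned topic space to ignore platform-specific signals.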
This research enhances the potential for robust social analysis across diverse platforms by contributing to more accurate and unbiased topic discovery.
To reproduce the results from the real-world data shown in the paper, follow the steps below:
Replace the paths for the three collected platform data CSV files (data/twitter_total_preprocessed.csv, data/reddit_total_preprocessed_cleaned.csv, data/youtube_preprocessed.csv) in the code with the paths where your data is stored.
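A minimal sketch of loading the three CSVs, assuming the default paths above; the DATA_PATHS dict, the platform column name, and the concatenation step are illustrative conventions, not the notebook's actual variable names:

```python
from pathlib import Path
import pandas as pd

# Paths to the three platform CSVs (adjust to where your data is stored).
DATA_PATHS = {
    "twitter": "data/twitter_total_preprocessed.csv",
    "reddit": "data/reddit_total_preprocessed_cleaned.csv",
    "youtube": "data/youtube_preprocessed.csv",
}

# Load each file and tag rows with their source platform so downstream
# code knows which platform each document came from.
frames = []
for platform, path in DATA_PATHS.items():
    if Path(path).exists():
        df = pd.read_csv(path)
        df["platform"] = platform
        frames.append(df)

corpus = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```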
1. Set the path of the model code in PITopic.ipynb. (The two files that load the data required for the PITopic models and define the structure of PITopic are “/models/data.py” and “/models/model.py”, respectively; their location can be set to “models” in PITopic.ipynb.)
2. Run PITopic.ipynb -- After completing the first step, open PITopic.ipynb in Jupyter and run all cells to see the topics extracted by the model.
3. To quantitatively analyze the topic words extracted by the model, you can measure topic coherence using the code located in “models/coherence.py”.
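The path setup in step 1 can be sketched as follows; this is one plausible way the notebook cell might look, and the commented imports only indicate that data.py and model.py become importable:

```python
import sys

# "models" is the directory holding data.py and model.py (step 1 above);
# appending it to sys.path makes them importable from the notebook.
sys.path.append("models")

# The notebook can then import the model code, e.g.:
# from data import ...   # data-loading utilities (models/data.py)
# from model import ...  # PITopic model structure (models/model.py)
```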
The measures we used are Mutual Information, Topic Diversity, and Topic Coherence (NPMI, UCI).
The average of NPMI and UCI was calculated for each platform separately, as well as across all platforms collectively. NPMI quantifies the frequency of word co-occurrence within a specific topic, while UCI assesses the frequency of topic co-occurrence across various documents; higher values indicate better topic coherence. PITopic demonstrates good performance across all measures of topic coherence (see the example result below).
PyTorch == 2.4.1
numpy == 1.26.4
pandas == 2.2.2
scikit-learn == 1.3.2
gensim == 4.2.0
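The pinned dependencies above can be installed with pip (note that the PyPI package name for PyTorch is torch):

```shell
pip install torch==2.4.1 numpy==1.26.4 pandas==2.2.2 \
    scikit-learn==1.3.2 gensim==4.2.0
```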
Should you have any questions or comments, please contact us at the following email addresses: [email protected], [email protected]