This is the official GitHub repository of team hate-alert, which ranked 1st in the shared task on Abusive and Threatening Language Detection in Urdu at FIRE 2021 (CICLing 2021 track @ FIRE 2021, co-hosted with ODS SoC 2021), part of the FIRE 2021 conference.
Social media often acts as a breeding ground for different forms of abusive content. For low-resource languages like Urdu, the situation is more complex due to the poor performance of multilingual or language-specific models and the lack of proper benchmark datasets. For the shared task HASOC - Abusive and Threatening Language Detection in Urdu at FIRE 2021, we present an exhaustive exploration of different machine learning models. Our models were trained separately for each task and secured the 1st position in both the abusive and the threat detection task in Urdu.
Our paper can be found here.
The shared task in this competition is divided into two parts: in Subtask A, participants detect abusive language in Urdu tweets, while Subtask B focuses on detecting threatening language in Urdu tweets. To download the data, go to the following link.
In this section, we discuss the different parts of the pipeline that we followed to detect offensive posts in this dataset.
As part of our initial experiments, we used several machine learning models to establish a baseline performance. We employed XGBoost and LGBM classifiers and trained them on pre-trained LASER embeddings of the Urdu tweets. The best results were obtained with the XGBoost classifier, with F1-scores of 0.760 and 0.247 on abusive and threat detection respectively.
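Below is a minimal sketch of this boosting baseline. It assumes the `laserembeddings` package for computing LASER sentence embeddings and illustrative XGBoost hyper-parameters; the `load_split` helper and the exact settings are hypothetical and may differ from the code in this repository.

```python
# Sketch of the boosting baseline: LASER embeddings + XGBoost.
# Assumes the `laserembeddings` package; hyper-parameters are illustrative only.
from laserembeddings import Laser
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

# Hypothetical helper returning lists of Urdu tweets and binary labels.
train_texts, train_labels = load_split("train")
test_texts, test_labels = load_split("test")

# Encode each tweet into a 1024-dimensional LASER sentence embedding.
laser = Laser()
X_train = laser.embed_sentences(train_texts, lang="ur")
X_test = laser.embed_sentences(test_texts, lang="ur")

# Train a gradient-boosted tree classifier on the embeddings.
clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(X_train, train_labels)

# Evaluate with the F1 score, as used on the leaderboard.
preds = clf.predict(X_test)
print("F1:", f1_score(test_labels, preds))
```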
We fine-tuned the state-of-the-art multilingual BERT (mBERT) model on the given datasets. The advantage of mBERT is that it is pre-trained in an unsupervised manner on a large multilingual corpus. In addition, we used another mBERT-based model that had previously been fine-tuned on an Arabic hate speech dataset, referred to as the 'Hate-speech-CNERG/dehatebert-mono-arabic' model.
The motivation for using a model fine-tuned on Arabic is that Urdu shares its script and much of its vocabulary with Arabic, so further fine-tuning this model on the Urdu dataset may yield better performance.
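A minimal fine-tuning sketch with the HuggingFace `Trainer` is shown below. The file names `train.csv`/`dev.csv`, the column names `text`/`label`, and the hyper-parameters are illustrative assumptions; only the model identifiers come from the models described above.

```python
# Sketch of transformer fine-tuning for one subtask (binary classification).
# File names, column names, and hyper-parameters are assumptions, not the exact submission setup.
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Either plain mBERT ("bert-base-multilingual-cased") or the Arabic-hate-speech model.
MODEL_NAME = "Hate-speech-CNERG/dehatebert-mono-arabic"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def tokenize(batch):
    # Tokenize the Urdu tweets; 128 tokens is an assumed maximum length.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# Hypothetical CSV splits with `text` and `label` columns.
train_ds = Dataset.from_pandas(pd.read_csv("train.csv")).map(tokenize, batched=True)
dev_ds = Dataset.from_pandas(pd.read_csv("dev.csv")).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="urdu-abuse-model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=dev_ds)
trainer.train()
```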
Results of the different models on the private test dataset are reported below in terms of F1 scores.
| Classifiers | Abusive | Threat |
|---|---|---|
| XGBoost | 0.7602 | 0.2471 |
| LGBM | 0.7666 | 0.2047 |
| mBERT | 0.8400 | 0.4696 |
| dehatebert-mono-arabic | 0.8806 | 0.5457 |
├── TransformerBasedModel/
├── README.md
└── LICENSE
Please consider citing this project in your publications if it helps your research.
@article{Das2021AbusiveAT,
  title={Abusive and Threatening Language Detection in Urdu using Boosting based and BERT based models: A Comparative Approach},
  author={Das, Mithun and Banerjee, Somnath and Saha, Punyajoy},
  journal={arXiv preprint arXiv:2111.14830},
  year={2021}
}
Additionally, we would like to extend a big thanks to the makers and maintainers of the excellent HuggingFace repository, without which most of our research would have been impossible.