This is an A-scored course project for Artificial Intelligence: Principles and Techniques, a course at IIIS, Tsinghua University taught by Prof. Chongjie Zhang. I will briefly introduce our work below.
There are two ways to implement AI: one is to translate the knowledge we have directly into an algorithm, and the other is to let a model with learning capability learn the knowledge on its own from a selected dataset. For natural language processing problems, the latter is the popular approach. However, to train a good model on a dataset, the selected dataset must be as close as possible to the actual problem, i.e., to data in the wild. As we see in the image below, machine learning models can be misled by imbalance in the dataset (and indeed humans can make similar mistakes).
Methods have been proposed to remove the misleading effects of writing style and of the number of times a sentence occurs on the model. Our work focuses on how the presence or absence of certain neutral words affects the model's predictions, and on how to remove this (sometimes misleading) effect.
Firstly, we propose some assumptions to model this problem. Suppose
Moreover, it is natural for the sampling intention to be completely determined by the sampling strategy; indeed, we assume that the annotator selected the data purposefully. Therefore, we have the following equation:
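One way to write this assumption down (the symbols here are illustrative: $I$ for the sampling intention, $S$ for the sampling strategy):

$$P(I \mid S) = \mathbb{1}\left[I = f(S)\right] \quad \text{for some deterministic } f,$$

i.e., $I$ carries no randomness beyond what is already in $S$.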
Also, it is natural that the label is independent of our sampling strategy (in our problem, the sampling strategy is completely determined by certain features of the sentence, namely the presence or absence of the selected neutral words).
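In symbols, again with illustrative names ($Y$ for the label, $S$ for the sampling strategy), this independence reads

$$P(Y \mid S) = P(Y), \quad \text{i.e., } Y \perp S.$$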
According to the assumptions we proposed, there are six variables in total. In the original distribution, we have variables
Further, we assume that the relationship between
See the poster (linked at the end of this article) for the specific weight-assignment method.
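Our exact weights are given in the poster; purely as an illustration of the general idea, a standard inverse-propensity reweighting scheme (an assumption for this sketch, not necessarily our assignment) looks like:

```python
import numpy as np

def inverse_propensity_weights(selected_prob, eps=1e-6):
    """Give each training example weight 1 / P(selected | features),
    so sentences the biased sampling favors are down-weighted and
    under-sampled ones are up-weighted.

    selected_prob: estimated probability that the sampling strategy
    picks each example, e.g. from the presence/absence of the chosen
    neutral words (how to estimate it is left abstract here).
    """
    w = 1.0 / np.clip(np.asarray(selected_prob, dtype=float), eps, 1.0)
    return w * (len(w) / w.sum())   # normalize to mean weight 1

# Any classifier that accepts per-sample weights can use the result, e.g.:
#   model.fit(X_train, y_train, sample_weight=inverse_propensity_weights(p_sel))
```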
The image above shows the F1 scores of the two models on the three datasets before and after debiasing. The debiased model clearly performs better, which indicates that it generalizes better. However, this alone is not enough to show that we have removed the model bias caused by the presence or absence of neutral words, so we ran a second experiment: we measured how the model's predictions change when the neutral words' word2vec vectors are replaced with unit vectors. Admittedly, this substitution perturbs the sentences' structure, but we argue that such an effect is negligible for sentiment analysis tasks.
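A minimal sketch of this measurement, assuming an averaged-embedding sentence representation and an sklearn-style classifier with `predict_proba` (both of these, and the names below, are illustrative assumptions rather than our exact pipeline):

```python
import numpy as np

def sentence_vector(tokens, emb):
    """Simple bag-of-embeddings sentence representation: the mean of
    the word2vec vectors of all in-vocabulary tokens."""
    return np.mean([emb[t] for t in tokens if t in emb], axis=0)

def prediction_shift(model, emb, tokens, neutral_words):
    """Absolute change in the predicted positive-class probability when
    every neutral word's embedding is replaced by a fixed unit vector."""
    dim = len(next(iter(emb.values())))
    unit = np.ones(dim) / np.sqrt(dim)   # a fixed unit-norm replacement
    patched = {w: (unit if w in neutral_words else v) for w, v in emb.items()}
    before = model.predict_proba(sentence_vector(tokens, emb)[None, :])[0, 1]
    after = model.predict_proba(sentence_vector(tokens, patched)[None, :])[0, 1]
    return abs(after - before)
```

Averaging this shift over a held-out set gives a single number for how sensitive a model is to the chosen neutral words, which is what the comparison below reports.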
As the figure above shows, the new model's predictions are somewhat less affected by our selected neutral words.
Download Poster (Please wait for a few seconds)
Download Technical Report (Please wait for a few seconds)