This project performs sentiment analysis of movie reviews using the Transformer and ULMFiT models.
You can browse the full report here.
To handle long-term dependencies and reduce computation, Google designed a new model for ML tasks, named the Transformer. For a detailed introduction to this model, please refer to my report.
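
The Transformer's core operation is scaled dot-product attention, which lets every token attend directly to every other token regardless of distance. The sketch below is a minimal pure-Python illustration of that formula, not the code in model.py; the function names are chosen here for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, on lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

A real implementation would run this as batched matrix multiplications in PyTorch and add multiple heads, masking, and dropout.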
Python 3.6.0 + PyTorch 1.0.1. (Other Python and PyTorch versions should also work, but some functions and libraries may differ slightly; the code was written with Python 3.6 and PyTorch 1.0.1.)
- Run train.py. You can also change the parameters, e.g. batch_size=64, learning_rate=0.001, epoches=50, etc. The results, the trained model, and the processed data are saved in a folder; you can set its path by changing the save_path parameter.
- The IMDB dataset and the GloVe word vectors will be downloaded to the "root" path.
- dataload.py processes the raw data (embedding, tokenization, etc.).
- model.py defines the Transformer network.
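
As a rough illustration of how the parameters listed above could be exposed, here is a minimal sketch using argparse. It is an assumption, not the actual interface of train.py: the parameters may instead be set by editing the script directly, and the default save path is simply taken from the "root/results" folder mentioned in this README.

```python
import argparse


def build_parser():
    """Hypothetical command-line interface; train.py may differ."""
    parser = argparse.ArgumentParser(
        description="Train a Transformer sentiment classifier (illustrative)"
    )
    # Defaults mirror the example values given in this README.
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--epoches", type=int, default=50)
    # Assumed default, based on the results folder named in this README.
    parser.add_argument("--save_path", type=str, default="root/results")
    return parser


# Parse an empty argument list to obtain the defaults.
args = build_parser().parse_args([])
```

With such a parser, a run with a smaller batch would pass `--batch_size 32` on the command line.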
The mean loss at each epoch (50 epochs in total) on the training data and the accuracy on the validation data (every 5 epochs) are saved in the "root/results" path. The best trained model is also saved.
Figure 1. Training loss and validation accuracy
After 50 epochs, the training loss is 0.161125 and the validation accuracy is 88.036%.
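
The schedule described above (mean loss recorded every epoch, validation every 5 epochs, best model kept) can be sketched as a small framework-free loop. The three callables are hypothetical placeholders for the real training, evaluation, and checkpointing code in train.py, which may be organized differently.

```python
def run_training(train_one_epoch, evaluate, save_best, epochs=50, eval_every=5):
    """Record the loss each epoch, validate periodically, keep the best model.

    train_one_epoch(epoch) -> mean training loss (float)
    evaluate(epoch)        -> validation accuracy (float)
    save_best(epoch)       -> persists the current model
    """
    history = {"loss": [], "accuracy": []}
    best_accuracy = float("-inf")
    for epoch in range(1, epochs + 1):
        history["loss"].append(train_one_epoch(epoch))
        if epoch % eval_every == 0:
            accuracy = evaluate(epoch)
            history["accuracy"].append((epoch, accuracy))
            # Only overwrite the checkpoint when validation improves.
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                save_best(epoch)
    return history, best_accuracy
```

In a PyTorch script, `save_best` would typically call `torch.save(model.state_dict(), path)`.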
ULMFiT introduces a pre-training and fine-tuning strategy for text classification tasks: we pre-train a general language model on the WikiText-103 dataset, fine-tune it on the IMDB dataset, and then train a sentiment classifier on top of it. The ULMFiT paper also presents several training tricks. For more detail about ULMFiT, please refer to my notebook.
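
One of the tricks from the ULMFiT paper (Howard and Ruder, 2018) is the slanted triangular learning rate, which rises linearly for a short warm-up and then decays linearly for the rest of training. The sketch below follows the formula and default values given in that paper; whether ULMFiT_slim.py uses this exact schedule is not stated here, and the function name is chosen for illustration.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t under the slanted triangular schedule.

    cut_frac: fraction of steps spent increasing the learning rate.
    ratio:    how much smaller the lowest rate is than lr_max.
    Assumes total_steps * cut_frac >= 1 so that the warm-up is non-empty.
    """
    cut = math.floor(total_steps * cut_frac)
    if t < cut:
        # Linear warm-up from lr_max / ratio to lr_max.
        p = t / cut
    else:
        # Linear decay back down to lr_max / ratio.
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio
```

With 1000 steps and the defaults, the rate peaks at step 100 and returns to `lr_max / ratio` at the final step.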
Python 3.6.0 + PyTorch 1.0.1 + fastai 1.0.51.
- Run ULMFiT_slim.py. You can set the path where the trained model and the IMDB dataset are saved. The processed data.csv (with randomly created train and test sets) is also saved there. You can also set batch_size, learning_rate, dropout, etc.
- Another implementation of this model is available here.
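
The random train/test split mentioned above can be illustrated with a short, seedable helper. This is a sketch under assumptions: the 20% test fraction and the function name are hypothetical, and ULMFiT_slim.py may split the data differently.

```python
import random


def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Fixing the seed makes the split repeatable, so results from different runs can be compared on the same test set.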
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
10 | 0.882431 | 0.765422 | 0.901345 | 4:21:53 |