The license was chosen based on the Kaggle rules. Winner License Type: Open Source - MIT.
## What to do to train a model

Models are trained by modifying `val_fold` and `config` at the top of `train_script.py` in the corresponding folder and then running:

```
# GPUN being a GPU number
python train_script.py GPUN &
```

The script assumes that there is a `checkpoints` directory in the same location.
The notes below should help with understanding the codebase and getting started with it.
```
feedback
├── kaggle_inference_notebooks       # inference notebooks of each model for Kaggle
│   ├── deberta
│   ├── longformer
│   ├── xlnet
│   └── ...                          # TODO: Add more models
│
├── models_training                  # training code for each model
│   ├── deberta
│   ├── longformer
│   │   ├── longformer               # original Longformer code
│   │   │
│   │   ├── submission               # code for the Longformer model submission
│   │   │   ├── codes                # modified Longformer & Huggingface code
│   │   │   ├── pretrained_checkpoints
│   │   │   ├── tokenizer
│   │   │   └── weights
│   │   │
│   │   ├── submission_large         # same as `submission` above
│   │   └── ...
│   ├── xlnet
│   ├── ...                          # TODO: Add more models
│   │
│   ├── oof                          # out-of-fold predictions
│   └── post processing
│
├── train.csv
└── check_and_split_data.ipynb
```
`check_and_split_data.ipynb` was used to make the splits.
- It is not deterministic due to RAPIDS UMAP, so the produced splits are also included in that folder.
- The RAPIDS UMAP code is mostly taken from the Kaggle notebook `cdeotte/rapids-umap-tfidf-kmeans-discovers-15-topics`.
`train.csv` is a slightly cleaner version of the public train file.
- It was made semi-manually after searching for entities where the symbol before the first letter of `discourse_text` was alphanumeric.
- It has several columns related to the gt label: the hosts-provided target is `discourse_text`, while what is scored is the overlap with `predictionstring`.
- Those columns are all noisy targets; `discourse_text` worked best in preliminary tests.
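As an illustration of that cleaning check, here is a minimal sketch (the helper name and the character-offset argument are assumptions; the actual cleaning was semi-manual):

```python
# Hedged sketch of the cleaning check described above: flag spans whose
# discourse_text starts right after an alphanumeric character in the essay,
# i.e. the span likely begins mid-word. The helper and the character-offset
# argument are assumptions for illustration.
def starts_mid_word(essay_text, start_idx):
    return start_idx > 0 and essay_text[start_idx - 1].isalnum()

print(starts_mid_word("A cat sat.", 2))  # False: preceded by a space
print(starts_mid_word("A cat sat.", 3))  # True: preceded by 'c'
```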
`data_rev1.csv`
- Made in a similar process while looking for starts/ends of the `discourse_text` splits in `train.csv`, for samples where `discourse_text` starts one word before a punctuation mark or ends one word after a punctuation mark.
- `data_rev1.csv` was made with a script in the `longformer` directory, and the new `train.csv` with the same process as for debertav3, except for the character replacement.
`Deberta`
- not deterministic, yet better results, faster training and faster submission as well

`Longformers`
- the training scripts in the `longformer` directory are deterministic, but slow

`xlnet`
- ...
- TODO: Add more models
- other models with relative positional encoding are the ERNIE series from Baidu
- Longformer, BigBird, and ETC are based on `roberta` checkpoints
- Training scripts are in `models_training`.
  - Includes some modified import code in the `./models_training/longformer/submission` folder.
  - Training data for `longformer` and for `debertav1` is made by the script in the longformer folder, as it was assumed that the tokenizers are identical.
  - Also, when making that particular data, the original `train.csv` was used.
`deberta` folder
- includes some modified import code
- has a notebook to make the data for debertav3

`longformer` folder
- `./models_training/longformer/submission/codes/new_transformers_branch/transformers` is from `mingboiz/transformer`

`xlnet` folder
- contains `check_labels.ipynb`, which is used to sanity-check the produced data files
- also has a notebook to prepare the training data
- Submission notebooks are in `code/kaggle_inference_notebooks`.
- Submission time:
  - `longformer`: 40 minutes for 1 fold
  - `debertav1`: 22 minutes for 1 fold
- Make sure `entities` start from an alphanumeric character
- class weights
- label smoothing
- global attention to the `sep`/`cls` token and to the `[.?!]` tokens for longformer
- SWA (a sliding-window version of it)
- reverse cross entropy
  - reverse cross entropy appears to have sped up convergence; it may allow reducing the number of epochs to 7 or fewer
- making sure that the tokenization of `xlnet` and `debertav3` preserves newlines, otherwise there is a severe drop in performance
- mixup: briefly tried, looks like the same results
- cleaning unicode artefacts in the data with ftfy and regex substitutions
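For reference, a minimal sketch of a standard reverse cross-entropy term (the formulation from Wang et al.'s symmetric cross entropy; not necessarily the exact variant used in the training scripts):

```python
# Hedged sketch: reverse cross entropy, RCE = -sum_k p_k * log(q_k), with a
# one-hot target q and log(0) clamped to a constant A (Wang et al. use -4).
# With a one-hot q this collapses to -A * (1 - p_target).
def reverse_cross_entropy(probs, target_idx, A=-4.0):
    return -A * (1.0 - probs[target_idx])

print(reverse_cross_entropy([0.7, 0.2, 0.1], 0))  # ~1.2
```

The loss shrinks as the probability mass on the target class grows, which is the mirror image of the usual cross-entropy direction.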
Model | Fold | Epochs | Training Time | Val | CV | LB | Special note
---|---|---|---|---|---|---|---
Xlnet | 5 | - | rtx3090 x 1, 19h | - | - | - |
Longformer | 5 | - | rtx3090 x 1, 19h30 | - | - | 0.670 | with entity-extraction bug
Debertav1 | 5 | - | rtx3090 x 1, 13h | - | - | 0.678 | with entity-extraction bug
Debertav1 | 5 | - | rtx3090 x 1, 13h | - | - | 0.681 | partially fixed entity extraction
Debertav1 | 5 | - | rtx3090 x 1, 13h | - | 0.69724 | 0.699 | fixed entity extraction + filtering based on a minimal number of words in the predicted entity and some confidence thresholds
Longformer + Debertav1 | 5 | - | - | - | 0.69945 | 0.700 | fixed entity extraction + filtering based on a minimal number of words in the predicted entity and some confidence thresholds
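The word-count/confidence filtering mentioned in the last two rows can be sketched as follows (the per-class minima and thresholds are made-up placeholders, not the tuned competition values):

```python
# Hedged illustration of the filtering in the last two table rows: drop
# predicted entities with too few words or too little confidence. The
# per-class minima and thresholds below are placeholders, not tuned values.
MIN_WORDS = {"Lead": 5, "Position": 3}
MIN_CONF = {"Lead": 0.60, "Position": 0.55}

def keep_entity(label, word_indices, confidence):
    return (len(word_indices) >= MIN_WORDS.get(label, 1)
            and confidence >= MIN_CONF.get(label, 0.5))

print(keep_entity("Lead", [10, 11, 12, 13, 14], 0.8))  # True
print(keep_entity("Lead", [10, 11], 0.8))              # False (too short)
```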
- The code used to find the thresholds was ad hoc and does not optimize the correct metric.
- The above models were validated using the bugged entity-extraction code, so the models may be suboptimal.
- Training of xlnet looks deterministic.
- RAM:
  - 4 xlnets training in parallel take 220 GB of RAM
  - 4 debertav1 barely fit in 256 GB
  - 4 debertav3 will likely not fit
- Wandb Logs
- Finish training `xlnet` and train a `debertav3`.
- Train one more transformer that adds predicted-probability-weighted embeddings of the predicted token types to the word embeddings, as a stacking model.
Q: Is the `../../data_rev1.csv` file used in `prepare_data_for_longformer_bkpv1.ipynb` (which makes the train data for longformer and debertav1) the same file as `train.csv`?

A: Almost the same; use `train.csv`.
The labels format used was:
- 0: outside
- 1: b-lead
- 2: i-lead
- 3: b-position
- 4: i-position, etc.
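Assuming the pattern above continues for the remaining categories, the id scheme can be sketched as:

```python
# Hedged sketch of the BIO-style ids listed above: category c (1-based)
# gets begin id 2*c - 1 and inside id 2*c, with 0 reserved for "outside".
# The category list is truncated for illustration.
CATEGORIES = ["lead", "position"]

def label_id(category, inside):
    c = CATEGORIES.index(category) + 1
    return 2 * c if inside else 2 * c - 1

print(label_id("lead", False), label_id("lead", True), label_id("position", False))  # 1 2 3
```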
When scanning the argmaxed predictions, a new entity is started when an odd prediction is encountered; the current entity is closed when the prediction is 0 or when `prediction != current category + 1`.

The bugged version only checked for an odd number; you can see it in the train scripts of `longformer` and `debertav1`, in the function `extract_entities`. The fixed version is in the train script of `xlnet`: it checks for an odd prediction, for 0, and for `prediction != current category + 1`.
E.g., if the prediction was `1 2 2 4 6 8 10 0 0 0 3 4 4 ...`:
- old code would extract entities: `1: [0 - 9, ...], 3: [10 - ...]`
- new code would extract entities: `1: [0 - 2, ...], 3: [10 - ...]`
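The two behaviours can be sketched as follows (hypothetical re-implementations of the logic described above, not the repo's exact `extract_entities`):

```python
# Hypothetical re-implementations of the entity extraction described above.
# Labels: 0 = outside, odd = beginning of a category, odd + 1 = inside it.

def extract_entities_fixed(preds):
    """New logic: close an entity on 0 or on prediction != current + 1."""
    entities, current, start = {}, None, None
    for i, p in enumerate(preds):
        if current is not None and p == current + 1:
            continue                                  # still inside the entity
        if current is not None:                       # anything else closes it
            entities.setdefault(current, []).append((start, i - 1))
            current = None
        if p % 2 == 1:                                # odd label opens an entity
            current, start = p, i
    if current is not None:
        entities.setdefault(current, []).append((start, len(preds) - 1))
    return entities

def extract_entities_bugged(preds):
    """Old logic: an entity only ends when another odd label begins."""
    entities, current, start = {}, None, None
    for i, p in enumerate(preds):
        if p % 2 == 1:
            if current is not None:
                entities.setdefault(current, []).append((start, i - 1))
            current, start = p, i
    if current is not None:
        entities.setdefault(current, []).append((start, len(preds) - 1))
    return entities

preds = [1, 2, 2, 4, 6, 8, 10, 0, 0, 0, 3, 4, 4]
print(extract_entities_fixed(preds))   # {1: [(0, 2)], 3: [(10, 12)]}
print(extract_entities_bugged(preds))  # {1: [(0, 9)], 3: [(10, 12)]}
```

This reproduces the spans in the example above: the bugged version runs entity 1 all the way to index 9, while the fixed one stops it at index 2.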
Q: Why is the performance similar or better when the newline (`\n`) is recognized in deberta compared to longformer?

A: In `longformer` the same tokenizer as in `roberta` is used; that one is also used for `debertav1`, and it preserves newlines. When using the `xlnet` tokenizer or the `debertav3` tokenizer, the newlines are gone.
Summary:
- `longformer`: `\n` token as newline
- `roberta`: `\n` token as newline
- `debertav1`: `\n` token as newline
- `xlnet`: `<eop>` token as a newline
- `debertav3`: `[MASK]` token as a newline
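A minimal sketch of this newline handling (the exact preprocessing in the training scripts may differ):

```python
# Hedged sketch: substitute "\n" with a token the tokenizer keeps, following
# the summary above. The exact preprocessing in the training scripts may differ.
NEWLINE_TOKEN = {
    "longformer": "\n",     # roberta-style BPE keeps the newline
    "debertav1": "\n",
    "xlnet": " <eop> ",     # sentencepiece drops "\n", so substitute a token
    "debertav3": " [MASK] ",
}

def preprocess(text, model):
    token = NEWLINE_TOKEN[model]
    return text if token == "\n" else text.replace("\n", token)

print(preprocess("Intro\nBody", "xlnet"))  # Intro <eop> Body
```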
Overall, `deberta` produces better results. All models are trained with `max_len` 2048. The submission with the 0.700 score includes the longformer model as well.
Note that `from tvm import te` is different from `import tvm as te`; the library namespace has changed. A few years ago a tvm variable was made with `tvm.var`; in the latest release it is `tvm.te.var`, but the current longformer library still uses `tvm.var`.
In the end, `tvm.var` turned out useless:
- The custom GPU kernel turned out useless: while it takes less GPU RAM for training, it is also slower and not deterministic.
- That file is needed to build and compile the custom GPU kernel.
So, to use `tvm.te.var`, the following change was made:

```python
# before
import tvm
b = tvm.var('b')

# after
from tvm import te
b = te.var('b')
```
- `./models_training/longformer/longformer/longformer/longformer.py#L187-L188`
- `./models_training/longformer/longformer/longformer/longformer.py#L263-L264`
Other changes to that code (some indexing modifications and attention-mask broadcasts) were made so that it works with `torch.use_deterministic_algorithms(True)`, to make training deterministic when using global attention. Also, there is a crucial semicolon on line 264.