I'm trying to reproduce the reported results for the eng-ukr language pair for M2M-100 on the FLORES-200 dataset, but the score I get is much lower (26.8 → 21.0).
My setup is: CTranslate2, this model, and HF's evaluate (the code is available here). The dataset is the same (FLORES-200, devtest).
My main suspects are:
Lower quality of the quantised M2M-100 model
Different text-generation settings (I'm using beam size 5)
Different BLEU scorer settings (n-gram order, tokenization, etc.)
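To make the first suspect concrete, here is a toy round-trip through symmetric int8 quantization showing the kind of rounding error quantised weights carry. This is only for intuition; CTranslate2's actual scheme is more sophisticated (e.g. per-channel scales), and the numbers here are made up:

```python
# Toy symmetric int8 quantization round-trip. Illustrates the rounding
# error introduced by quantizing weights; not CTranslate2's real scheme.
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.731, -0.052, 0.004, -1.244, 0.309]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max absolute error: {max_err:.6f}")  # bounded by scale / 2
```

In practice the quality drop from int8 quantization of translation models is usually well under a BLEU point, so this alone is unlikely to explain a 5.8-point gap.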
I've browsed the repos I found on the OPUS-MT leaderboard and other seemingly relevant repos in the Helsinki-NLP account. I also skimmed the main paper.
Could you please advise on the following things?
Where can I find the generation/evaluation settings/code for the leaderboard?
Is there a file with the individual metrics per sentence pair?
Anything else you might remember or find relevant.
Thanks in advance!
I used the native transformers library for decoding the test sets, with beam size 1 (if I remember correctly). BLEU scores are computed with sacrebleu and default settings. There are no individual scores per sentence pair.
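If it helps to rule out the scorer, here is a rough stdlib sketch of corpus-level BLEU with sacrebleu-like defaults (clipped 1- to 4-gram precisions combined geometrically, plus a brevity penalty). It uses plain whitespace tokenization instead of sacrebleu's 13a tokenizer and no smoothing, so treat it only as a sanity check, not a replacement for running sacrebleu itself:

```python
# Rough corpus-level BLEU sketch (sacrebleu-like defaults: max_n=4,
# brevity penalty). Simplifications: whitespace tokenization, no smoothing.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100.0 * bp * math.exp(log_avg)

print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))
```

Differences in tokenization and smoothing between BLEU implementations typically move the score by a point or two at most, so comparing your evaluate setup against plain sacrebleu with default flags should quickly confirm or eliminate this suspect.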