I'm trying to reproduce the reported results for the eng-ukr language pair for M2M-100 on the FLORES-200 dataset, but the score I get is much lower (26.8 → 21.0).
My setup is: CTranslate2, this model, and HF's evaluate (the code is available here). The dataset is the same (FLORES-200, devtest).
My main suspects are:
Lower quality of the quantised M2M-100 model
Different text-generation settings (I'm using beam size 5)
Different BLEU scorer settings (n-gram order, tokenization, etc.)
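To make the first suspect concrete, here is a toy round-trip through symmetric int8 quantization showing the kind of rounding error quantised weights carry. This is only for intuition; CTranslate2's actual scheme is more sophisticated (e.g. per-channel scales), and the numbers here are made up:

```python
# Toy symmetric int8 quantization round-trip. Illustrates the rounding
# error introduced by quantizing weights; not CTranslate2's real scheme.
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.731, -0.052, 0.004, -1.244, 0.309]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max absolute error: {max_err:.6f}")  # bounded by scale / 2
```

In practice the quality drop from int8 quantization of translation models is usually well under a BLEU point, so this alone is unlikely to explain a 5.8-point gap.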
I've browsed the repos I found on the OPUS-MT leaderboard and other seemingly relevant repos in the Helsinki-NLP account. I also skimmed the main paper.
Could you please advise on the following things?
Where can I find the generation/evaluation settings/code for the leaderboard?
Is there a file with the individual metrics per sentence pair?
Anything else you might remember or find relevant.
Thanks in advance!
I used the native transformers library for decoding the test sets, with beam size 1 (if I remember correctly). BLEU scores are computed with sacrebleu and default settings. There are no individual scores per sentence pair.
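If it helps to rule out the scorer, here is a rough stdlib sketch of corpus-level BLEU with sacrebleu-like defaults (clipped 1- to 4-gram precisions combined geometrically, plus a brevity penalty). It uses plain whitespace tokenization instead of sacrebleu's 13a tokenizer and no smoothing, so treat it only as a sanity check, not a replacement for running sacrebleu itself:

```python
# Rough corpus-level BLEU sketch (sacrebleu-like defaults: max_n=4,
# brevity penalty). Simplifications: whitespace tokenization, no smoothing.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100.0 * bp * math.exp(log_avg)

print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))
```

Differences in tokenization and smoothing between BLEU implementations typically move the score by a point or two at most, so comparing your evaluate setup against plain sacrebleu with default flags should quickly confirm or eliminate this suspect.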