
Overview: Leaderboard release #1867

Open · 4 of 7 tasks
x-tabdeveloping opened this issue Jan 24, 2025 · 26 comments
Labels: leaderboard (issues related to the leaderboard)

@x-tabdeveloping (Collaborator) commented Jan 24, 2025

Since we would like to release the leaderboard as soon as possible (especially since the paper got accepted for ICLR), I would love to open a discussion about what we consider to be the minimum requirements for publishing the new leaderboard.
I highly doubt that we will be able to fix all issues right away, but we should, in any case, focus on the few that are crucial for the new leaderboard to be in a releasable state.

Here are some of my criteria:

VITAL PROBLEMS:

Over the last couple of days I have tried implementing as many model metas as humanly possible, but this has been incredibly time-consuming. If you still see models missing that you think should definitely be there, feel free to comment here.

Nice to haves:

THIS IS JUST MY JUDGEMENT, PLEASE FEEL FREE TO ADD THINGS, I TOTALLY MIGHT BE MISSING SOMETHING

@Samoed @KennethEnevoldsen @Muennighoff @orionw @isaac-chung @imenelydiaker @tomaarsen

@x-tabdeveloping added the leaderboard label Jan 24, 2025
@isaac-chung (Collaborator)

@x-tabdeveloping thanks for suggesting these! I agree with the vital list here, and can help with 3 (docs) and/or 1 (agg).

Re: 4, what would that look like? Maybe a) disable the update cron and b) add a banner/message to the app to link to the new leaderboard?
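
For (b), a minimal sketch of what the banner could look like, assuming the old leaderboard is a Gradio app; the surrounding layout and the link target are placeholders, not the actual app code:

```python
import gradio as gr

with gr.Blocks() as demo:
    # Deprecation banner pointing users to the new leaderboard.
    gr.Markdown(
        "⚠️ This leaderboard is frozen and no longer updated. "
        "Please use the new MTEB leaderboard instead: <link to the new leaderboard>."
    )
    # ... the existing leaderboard tabs and tables would follow here ...

demo.launch()
```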

There's also a list of "must-have" and "nice-to-have" issues kept in this comment, and the only must-have left seems to be related to missing model results. It would be great if we could update the linked issue and establish which items within it are must-haves.

@Muennighoff (Contributor)

Great overview! Maybe an alternative to freezing is to focus on all the other issues first, and then, when everything else is done, do another round of syncing?

@Samoed (Collaborator) commented Jan 24, 2025

For jasper and voyage, only the test set of MSMARCO was evaluated, but in the leaderboard we recently changed it to the dev split (#1620).

@Samoed pinned this issue Jan 24, 2025
@x-tabdeveloping (Collaborator, Author)

Hmm strange, but why do they show up in the old leaderboard then? Shouldn't we strive for 100% feature parity?

@Samoed (Collaborator) commented Jan 24, 2025

I couldn’t find it initially, but it seems we do have their scores on the dev split. However, when loading with `res = mteb.load_results(models=["infgrad/jasper_en_vision_language_v1"], tasks=["MSMARCO"])`, I see the log message `MSMARCO: Missing splits {'dev'}`. Maybe it's only loading one revision of the results.
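
A minimal sketch for reproducing this check, reusing the call from the comment above (it assumes the results can be fetched locally; the exact structure of the returned object may vary between mteb versions):

```python
import mteb

# Load MSMARCO results for the jasper model; watch the logs for
# "MSMARCO: Missing splits {'dev'}" to see whether the dev split is picked up.
res = mteb.load_results(
    models=["infgrad/jasper_en_vision_language_v1"],
    tasks=["MSMARCO"],
)
print(res)
```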

@isaac-chung (Collaborator) commented Jan 25, 2025

Root cause

It seems the results file linked in the comment above is from the external revision. The results files of the other two jasper model revisions do not contain dev splits, and one of those must have been the file loaded to yield the error message above.

Proposed fix

Rerun MSMARCO on the latest model revision, or specify the external revision, which contains a dev-split result.
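
A rough sketch of the rerun option, assuming the standard mteb Python API and that run() accepts an eval_splits argument; the model name is a placeholder for whichever affected model (and revision) is being rerun:

```python
import mteb

# Placeholder model name; substitute the affected model and pin its revision.
model = mteb.get_model("my-org/affected-model")

tasks = mteb.get_tasks(tasks=["MSMARCO"])
evaluation = mteb.MTEB(tasks=tasks)

# Evaluate only the dev split of MSMARCO and write results to the usual folder.
evaluation.run(model, eval_splits=["dev"], output_folder="results")
```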

@x-tabdeveloping (Collaborator, Author)

We can overwrite Jasper's and Voyage's revision in the metadata to external; that would then get the highest precedence when loading results. I think this would be the most painless, though not optimal, solution. What do you think @isaac-chung @Samoed?

@x-tabdeveloping (Collaborator, Author)

Another option would be to delete the result files without the dev split from the results repo.

@isaac-chung (Collaborator)

@x-tabdeveloping the overwrite option is fine and I think we can go for it, but note that it'll only buy us some time: anyone who runs these models will produce result files under the 'external' revision, which is not desirable.

Let's open an issue so that we eventually rerun these models on MSMARCO with non-external revisions as well. How does that sound?

@x-tabdeveloping (Collaborator, Author) commented Jan 27, 2025

How about we just remove the newer results on the problematic tasks from the results folder? Then we can rerun in the future if need be, and if people run the models now, they will get the correct revision.
(Also note that I don't think we have an actionable implementation of Jasper in mteb yet, due to it being a multimodal model.)
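
A rough sketch of finding the affected files, assuming a local clone of the results repository and the newer result-file format in which per-split scores sit under a top-level "scores" key:

```python
import json
from pathlib import Path

results_repo = Path("results")  # local clone of the results repository

# Find MSMARCO result files that have no dev-split scores.
for path in results_repo.rglob("MSMARCO.json"):
    with path.open() as f:
        data = json.load(f)
    if "dev" not in data.get("scores", {}):
        print(f"missing dev split: {path}")
        # path.unlink()  # uncomment to actually delete the file
```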

@isaac-chung (Collaborator)

That sounds good.

@x-tabdeveloping (Collaborator, Author) commented Jan 28, 2025

I've found another pretty pressing issue that we need to fix before launching the leaderboard: #1886.
Many tasks in the MTEB(eng, classic) benchmark are missing large parts of their task metadata, including domains, which are vital to the leaderboard's filtering:

ArxivClusteringS2S.domains = None
AskUbuntuDupQuestions.domains = None
BIOSSES.domains = None
CQADupstackAndroidRetrieval.domains = None
CQADupstackEnglishRetrieval.domains = None
CQADupstackGamingRetrieval.domains = None
CQADupstackGisRetrieval.domains = None
CQADupstackMathematicaRetrieval.domains = None
CQADupstackPhysicsRetrieval.domains = None
CQADupstackStatsRetrieval.domains = None
CQADupstackTexRetrieval.domains = None
CQADupstackUnixRetrieval.domains = None
CQADupstackWebmastersRetrieval.domains = None
CQADupstackWordpressRetrieval.domains = None
ClimateFEVER.domains = None
FEVER.domains = None
FiQA2018.domains = None
NQ.domains = None
QuoraRetrieval.domains = None
RedditClustering.domains = None
RedditClusteringP2P.domains = None
STSBenchmark.domains = None
StackExchangeClustering.domains = None
StackExchangeClusteringP2P.domains = None
StackOverflowDupQuestions.domains = None
TwitterSemEval2015.domains = None
TwitterURLCorpus.domains = None
MSMARCO.domains = None
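
A small sketch of how such a list can be regenerated, assuming mteb.get_benchmark resolves the benchmark name and that tasks expose metadata.domains:

```python
import mteb

benchmark = mteb.get_benchmark("MTEB(eng, classic)")

# Print every task in the benchmark whose metadata has no domains set.
for task in benchmark.tasks:
    if not task.metadata.domains:
        print(f"{task.metadata.name}.domains = None")
```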

@imenelydiaker (Contributor)

@x-tabdeveloping I'll fill them out. I think I did some of them for the paper and forgot to add them to TaskMetadata, my bad 😅
On which branch should I push the changes? v2.0.0?

@x-tabdeveloping (Collaborator, Author)

I think main, @imenelydiaker! So that we can release the leaderboard.

@x-tabdeveloping (Collaborator, Author)

@isaac-chung @Samoed I might be able to fix the issue with the results in code; I will keep you updated.

@x-tabdeveloping (Collaborator, Author) commented Jan 28, 2025

Okay, so I have fixed the cases where the results are present but located in the external results folder.
On the other hand, for some models, like voyage-large-2-instruct, we are missing the dev split on MSMARCO completely.
How can it be present in the old leaderboard if we don't have the dev-split scores??

@Samoed (Collaborator) commented Jan 28, 2025

I believe this is a bug: the dev split for voyage-large-2-instruct is not found in the MSMARCO results repository. The loading code checks whether the specified split is present in the results dictionary (defaulting to test); since it could not find the dev split, it fell back to the test split, which is present in the dict.
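
A hypothetical illustration of the fallback described above (not the actual leaderboard code): if the requested split is missing from a task's results dictionary, the scores silently fall back to the default test split.

```python
def get_split_scores(task_results: dict, split: str = "test") -> dict:
    """Return scores for the requested split, falling back to "test".

    Hypothetical sketch of the buggy behaviour: asking for "dev" when only
    "test" exists silently returns the test-split scores instead of failing.
    """
    if split in task_results:
        return task_results[split]
    return task_results.get("test", {})
```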

@x-tabdeveloping (Collaborator, Author)

So, in conclusion, it is a bug with the old leaderboard, and the only way to go about fixing it is for us to run MSMARCO's dev split on these models. Is this a correct assessment?

@Samoed (Collaborator) commented Jan 28, 2025

Unfortunately yes

@wissam-sib (Contributor)

@x-tabdeveloping I can work on the banner, unless there is something more pressing I can help with (it seems like the other items are in the works).

@x-tabdeveloping (Collaborator, Author)

Sure thing @wissam-sib! By all means go ahead

@wissam-sib (Contributor)

> Sure thing @wissam-sib! By all means go ahead

Cool, I've started here: #1908

@isaac-chung (Collaborator)

Looks like we're down to the last vital issue before the release!

@x-tabdeveloping (Collaborator, Author)

@Muennighoff Can you help us out with it? Some models don't have MSMARCO results at all on the dev split, and we might need to run them.

@Muennighoff (Contributor)

Yes will try to run them this weekend! Amazing work on everything 🚀🚀🚀

@KennethEnevoldsen (Contributor) commented Feb 1, 2025

The leaderboard is getting really close to being ready. @x-tabdeveloping and I manually compared each leaderboard and found a few remaining issues. These generally stem from specification differences between benchmarks.py and the current v1 of the leaderboard. We are also missing a few results: some of these Niklas @Muennighoff is rerunning, while others belong to newer model releases (<1 month old), whose authors we have contacted to let them know about the changes. For the inconsistencies, we have asked the benchmark contacts (e.g., @imenelydiaker for French) to clarify which version is desired.

We are planning to do the release on Tuesday next week.

There are a few missing scores and inconsistencies:

  1. Russian: Some newer model releases
  2. MTEB(eng, classic): Run MSMARCO dev split on some models #1898
  3. French: French leaderboard inconsistencies #1919
  4. Polish: Polish leaderboard and benchmark does not match #1917
