
Overview: Leaderboard release #1867

Open · 4 of 7 tasks
x-tabdeveloping opened this issue Jan 24, 2025 · 26 comments
Labels: leaderboard (issues related to the leaderboard)

@x-tabdeveloping (Collaborator) commented Jan 24, 2025

Since we would like to release the leaderboard as soon as possible (especially since the paper got accepted for ICLR), I would love to open a discussion about what we consider to be the minimum requirements for publishing the new leaderboard.
I highly doubt that we will be able to fix all issues right away, but we should, in any case, focus on the few that are crucial for the new leaderboard to be in a releasable state.

Here are some of my criteria:

VITAL PROBLEMS:

Over the last couple of days I have tried implementing as many model metas as humanly possible, but this has been incredibly time-consuming. If you still see models missing that you think should definitely be there, feel free to comment here.

Nice to haves:

THIS IS JUST MY JUDGEMENT, PLEASE FEEL FREE TO ADD THINGS, I TOTALLY MIGHT BE MISSING SOMETHING

@Samoed @KennethEnevoldsen @Muennighoff @orionw @isaac-chung @imenelydiaker @tomaarsen

@x-tabdeveloping added the leaderboard label Jan 24, 2025
@isaac-chung (Collaborator)

@x-tabdeveloping thanks for suggesting these! I agree with the vital list here, and can help with 3 (docs) and/or 1 (agg).

Re: 4, what would that look like? Maybe a) disable the update cron and b) add a banner/message to the app to link to the new leaderboard?
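
For (b), a minimal sketch of what the banner could look like, assuming the old leaderboard is a Gradio app; the surrounding layout and the link target are placeholders, not the actual app code:

```python
import gradio as gr

with gr.Blocks() as demo:
    # Deprecation banner pointing users to the new leaderboard.
    gr.Markdown(
        "⚠️ This leaderboard is frozen and no longer updated. "
        "Please use the new MTEB leaderboard instead: <link to the new leaderboard>."
    )
    # ... the existing leaderboard tabs and tables would follow here ...

demo.launch()
```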

There's also a list of "must-have" and "nice-to-have" issues kept in this comment, and the only must-have left seems to be related to missing model results. It would be great if we could update the linked issue and establish which items within it are must-haves.

@Muennighoff (Contributor)

Great overview! Maybe an alternative to freezing is to focus on all the other issues first, and then, when everything else is done, do another round of syncing?

@Samoed (Collaborator) commented Jan 24, 2025

For jasper and voyage, only the test set of MSMARCO was evaluated, but in the leaderboard we recently changed it to the dev split (#1620).

@Samoed pinned this issue Jan 24, 2025
@x-tabdeveloping (Collaborator, Author)

Hmm strange, but why do they show up in the old leaderboard then? Shouldn't we strive for 100% feature parity?

@Samoed (Collaborator) commented Jan 24, 2025

I couldn’t find it initially, but it seems we do have their scores on the dev split. However, when loading with `res = mteb.load_results(models=["infgrad/jasper_en_vision_language_v1"], tasks=["MSMARCO"])`, I see the log message `MSMARCO: Missing splits {'dev'}`. Maybe it's only loading one revision of the results.
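
A minimal sketch for reproducing this check, reusing the call from the comment above (it assumes the results can be fetched locally; the exact structure of the returned object may vary between mteb versions):

```python
import mteb

# Load MSMARCO results for the jasper model; watch the logs for
# "MSMARCO: Missing splits {'dev'}" to see whether the dev split is picked up.
res = mteb.load_results(
    models=["infgrad/jasper_en_vision_language_v1"],
    tasks=["MSMARCO"],
)
print(res)
```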

@isaac-chung (Collaborator) commented Jan 25, 2025

Root cause

It seems the results file linked in the comment above is from the external revision. The results files of the other two jasper model revisions do not contain dev splits, and one of those must have been the file loaded to yield the error message above.

Proposed fix

Rerun MSMARCO on the latest model revision, or specify the external revision, which contains a dev-split result.
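
A rough sketch of the rerun option, assuming the standard mteb Python API and that run() accepts an eval_splits argument; the model name is a placeholder for whichever affected model (and revision) is being rerun:

```python
import mteb

# Placeholder model name; substitute the affected model and pin its revision.
model = mteb.get_model("my-org/affected-model")

tasks = mteb.get_tasks(tasks=["MSMARCO"])
evaluation = mteb.MTEB(tasks=tasks)

# Evaluate only the dev split of MSMARCO and write results to the usual folder.
evaluation.run(model, eval_splits=["dev"], output_folder="results")
```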

@x-tabdeveloping (Collaborator, Author)

We can overwrite Jasper's and Voyage's revision in the metadata to external; that would then get the highest precedence when loading results. I think this would be the most painless, though not optimal, solution. What do you think @isaac-chung @Samoed?

@x-tabdeveloping (Collaborator, Author)

Another option would be to delete the result files without the dev split from the results repo.

@isaac-chung (Collaborator)

@x-tabdeveloping the overwrite option is fine and I think we can go for it, but note that it'll only buy us some time: anyone who runs these models will produce result files under the 'external' revision, which is not desirable.

Let's open an issue so that we eventually rerun these models on MSMARCO with non-external revisions as well. How does that sound?

@x-tabdeveloping (Collaborator, Author) commented Jan 27, 2025

How about we just remove the newer results on the problematic tasks from the results folder? Then we can rerun in the future if need be, and if people run the models now, they will get the correct revision.
(Also note that I don't think we have an actionable implementation of Jasper in mteb yet, due to it being a multimodal model.)
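
A rough sketch of finding the affected files, assuming a local clone of the results repository and the newer result-file format in which per-split scores sit under a top-level "scores" key:

```python
import json
from pathlib import Path

results_repo = Path("results")  # local clone of the results repository

# Find MSMARCO result files that have no dev-split scores.
for path in results_repo.rglob("MSMARCO.json"):
    with path.open() as f:
        data = json.load(f)
    if "dev" not in data.get("scores", {}):
        print(f"missing dev split: {path}")
        # path.unlink()  # uncomment to actually delete the file
```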

@isaac-chung (Collaborator)

That sounds good.

@x-tabdeveloping (Collaborator, Author) commented Jan 28, 2025

I've found another pretty pressing issue that we need to fix before launching the leaderboard: #1886.
Many tasks in the MTEB(eng, classic) benchmark are missing large parts of their task metadata, including domains, which are vital to the leaderboard's filtering:

ArxivClusteringS2S.domains = None
AskUbuntuDupQuestions.domains = None
BIOSSES.domains = None
CQADupstackAndroidRetrieval.domains = None
CQADupstackEnglishRetrieval.domains = None
CQADupstackGamingRetrieval.domains = None
CQADupstackGisRetrieval.domains = None
CQADupstackMathematicaRetrieval.domains = None
CQADupstackPhysicsRetrieval.domains = None
CQADupstackStatsRetrieval.domains = None
CQADupstackTexRetrieval.domains = None
CQADupstackUnixRetrieval.domains = None
CQADupstackWebmastersRetrieval.domains = None
CQADupstackWordpressRetrieval.domains = None
ClimateFEVER.domains = None
FEVER.domains = None
FiQA2018.domains = None
NQ.domains = None
QuoraRetrieval.domains = None
RedditClustering.domains = None
RedditClusteringP2P.domains = None
STSBenchmark.domains = None
StackExchangeClustering.domains = None
StackExchangeClusteringP2P.domains = None
StackOverflowDupQuestions.domains = None
TwitterSemEval2015.domains = None
TwitterURLCorpus.domains = None
MSMARCO.domains = None
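
A small sketch of how such a list can be regenerated, assuming mteb.get_benchmark resolves the benchmark name and that tasks expose metadata.domains:

```python
import mteb

benchmark = mteb.get_benchmark("MTEB(eng, classic)")

# Print every task in the benchmark whose metadata has no domains set.
for task in benchmark.tasks:
    if not task.metadata.domains:
        print(f"{task.metadata.name}.domains = None")
```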

@imenelydiaker (Contributor)

@x-tabdeveloping I'll fill them out. I think I did some of them for the paper and forgot to add them to TaskMetadata, my bad 😅
On which branch should I push the changes? v2.0.0?

@x-tabdeveloping (Collaborator, Author)

I think main, @imenelydiaker! So that we can release the leaderboard.

@x-tabdeveloping (Collaborator, Author)

@isaac-chung @Samoed I might be able to fix the issue with the results in code; I will keep you updated.

@x-tabdeveloping (Collaborator, Author) commented Jan 28, 2025

Okay, so I have fixed the cases where the results are present but located in the external results folder.
On the other hand, for some models, like voyage-large-2-instruct, we are missing the dev split on MSMARCO completely.
How can it be present in the old leaderboard if we don't have the dev-split scores??

@Samoed (Collaborator) commented Jan 28, 2025

I believe this is a bug: the dev split for voyage-large-2-instruct is not found in the MSMARCO results repository. The loading code checks whether the specified split is present in the results dictionary (defaulting to test); since it could not find the dev split, it fell back to the test split, which is present in the dict.
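
A hypothetical illustration of the fallback described above (not the actual leaderboard code): if the requested split is missing from a task's results dictionary, the scores silently fall back to the default test split.

```python
def get_split_scores(task_results: dict, split: str = "test") -> dict:
    """Return scores for the requested split, falling back to "test".

    Hypothetical sketch of the buggy behaviour: asking for "dev" when only
    "test" exists silently returns the test-split scores instead of failing.
    """
    if split in task_results:
        return task_results[split]
    return task_results.get("test", {})
```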

@x-tabdeveloping (Collaborator, Author)

So, in conclusion, it is a bug with the old leaderboard, and the only way to go about fixing it is for us to run MSMARCO's dev split on these models. Is this a correct assessment?

@Samoed (Collaborator) commented Jan 28, 2025

Unfortunately yes

@wissam-sib (Contributor)

@x-tabdeveloping I can work on the banner, unless there is something more pressing I can help with (it seems like the other items are in the works).

@x-tabdeveloping (Collaborator, Author)

Sure thing @wissam-sib! By all means go ahead

@wissam-sib (Contributor)

> Sure thing @wissam-sib! By all means go ahead

Cool, I've started here: #1908

@isaac-chung (Collaborator)

Looks like we're down to the last vital issue before the release!

@x-tabdeveloping (Collaborator, Author)

@Muennighoff Can you help us out with it? Some models don't have MSMARCO results at all on the dev split, and we might need to run them.

@Muennighoff (Contributor)

Yes will try to run them this weekend! Amazing work on everything 🚀🚀🚀

@KennethEnevoldsen (Contributor) commented Feb 1, 2025

The leaderboard is getting really close to being ready. @x-tabdeveloping and I manually compared each leaderboard and found a few remaining issues. These generally stem from specification differences between benchmarks.py and the current v1 of the leaderboard. We are also missing a few results: some of these Niklas @Muennighoff is rerunning, while others belong to newer model releases (<1 month old), whose authors we have contacted to let them know about the changes. For the inconsistencies, we have asked the benchmark contacts (e.g., @imenelydiaker for French) to clarify which version is desired.

We are planning to do the release on Tuesday next week.

There are a few missing scores and inconsistencies:

  1. Russian: Some newer model releases
  2. MTEB(eng, classic): Run MSMARCO dev split on some models #1898
  3. French: French leaderboard inconsistencies #1919
  4. Polish: Polish leaderboard and benchmark does not match #1917
