Overview: Leaderboard release #1867
Comments
@x-tabdeveloping thanks for suggesting these! I agree with the vital list here, and can help with 3 (docs) and/or 1 (agg). Re: 4, what would that look like? Maybe a) disable the update cron and b) add a banner/message to the app to link to the new leaderboard? There's also a list of "must have's" + "nice to have's" issues kept in this comment, and the only must have left seems to be related to missing model results. It would be great if we can update the linked issue and establish what are the must haves within it. |
Great overview; Maybe an alternative to freezing is just focusing on all other issues first and then when everything else is done at the end, we could go do another round of syncing? |
Hmm strange, but why do they show up in the old leaderboard then? Shouldn't we strive for 100% feature parity? |
I couldn’t find it initially, but it seems we do have their scores on the |
Root cause: It seems like the results file linked in the comment above is from the revision
Proposed fix: Rerun MSMARCO on the latest model revision, or specify the external revision which contains a dev split result. |
We can overwrite Jasper's and Voyage's revision in the metadata to |
Or another one would be to delete the result files without the dev split from the results repo. |
@x-tabdeveloping the overwrite option is fine and I think we can go for it, but note that it'll only buy us some time: anyone who runs these models will produce result files under the 'external' revision, which is not desirable. Let's open an issue so that we would eventually rerun these models on MSMARCO with non-external revisions as well. How does that sound? |
How about we just remove the newer results on the problematic tasks from the results folder? Then we can rerun in the future if need be, and if people run the models now, they will get the correct revision. |
That sounds good. |
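To make the cleanup discussed above concrete, here is a minimal sketch of how one might spot result entries that lack a given split and drop them. Everything in it is an assumption for illustration: the record layout (`model`, `revision`, `task_name`, `scores` keyed by split), the helper names, and the model names are hypothetical and do not reflect the actual layout of the results repo.

```python
def results_missing_split(results, task_name, split):
    """Return (model, revision) pairs whose entry for `task_name` has no `split` scores.

    `results` is a list of dicts in a hypothetical layout:
    {"model": str, "revision": str, "task_name": str, "scores": {split: ...}}
    """
    return [
        (r["model"], r["revision"])
        for r in results
        if r["task_name"] == task_name and split not in r["scores"]
    ]


def drop_results_missing_split(results, task_name, split):
    """Return a new list without the entries for `task_name` that lack `split`."""
    return [
        r
        for r in results
        if not (r["task_name"] == task_name and split not in r["scores"])
    ]


# Illustrative records only; model names and scores are made up.
example_results = [
    {"model": "model-a", "revision": "abc123", "task_name": "MSMARCO",
     "scores": {"test": 0.40}},
    {"model": "model-a", "revision": "external", "task_name": "MSMARCO",
     "scores": {"dev": 0.41, "test": 0.40}},
    {"model": "model-b", "revision": "def456", "task_name": "NQ",
     "scores": {"test": 0.55}},
]

print(results_missing_split(example_results, "MSMARCO", "dev"))
```

With the problematic entries dropped, only result files that contain the dev split remain for MSMARCO, and entries for other tasks are left untouched.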
I've managed to find another pretty burning issue that we need to fix before launching the leaderboard: #1886
ArxivClusteringS2S.domains = None
AskUbuntuDupQuestions.domains = None
BIOSSES.domains = None
CQADupstackAndroidRetrieval.domains = None
CQADupstackEnglishRetrieval.domains = None
CQADupstackGamingRetrieval.domains = None
CQADupstackGisRetrieval.domains = None
CQADupstackMathematicaRetrieval.domains = None
CQADupstackPhysicsRetrieval.domains = None
CQADupstackStatsRetrieval.domains = None
CQADupstackTexRetrieval.domains = None
CQADupstackUnixRetrieval.domains = None
CQADupstackWebmastersRetrieval.domains = None
CQADupstackWordpressRetrieval.domains = None
ClimateFEVER.domains = None
FEVER.domains = None
FiQA2018.domains = None
NQ.domains = None
QuoraRetrieval.domains = None
RedditClustering.domains = None
RedditClusteringP2P.domains = None
STSBenchmark.domains = None
StackExchangeClustering.domains = None
StackExchangeClusteringP2P.domains = None
StackOverflowDupQuestions.domains = None
TwitterSemEval2015.domains = None
TwitterURLCorpus.domains = None
MSMARCO.domains = None |
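A listing like the one above could be produced by a check along the following lines. This is only a sketch under stated assumptions: `TaskInfo` is a hypothetical stand-in for a task's metadata rather than the library's own class, and the example entries (including `SomeAnnotatedTask` and its domain value) are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskInfo:
    """Hypothetical stand-in for task metadata; not the mteb TaskMetadata class."""
    name: str
    domains: Optional[List[str]] = None


def tasks_missing_domains(tasks):
    """Return the names of tasks whose `domains` field is None, in input order."""
    return [t.name for t in tasks if t.domains is None]


def format_report(tasks):
    """Render one 'Name.domains = None' line per task with missing domains."""
    return "\n".join(f"{name}.domains = None" for name in tasks_missing_domains(tasks))


# Illustrative data only.
example_tasks = [
    TaskInfo("MSMARCO"),
    TaskInfo("SomeAnnotatedTask", domains=["Written"]),
    TaskInfo("STSBenchmark"),
]

print(format_report(example_tasks))
```

Running the same check after the metadata has been filled in should print nothing, which makes it usable as a quick pre-release sanity check.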
@x-tabdeveloping I'll fill them out, I think I did some of them for the paper and forgot to add them to TaskMetadata, my bad 😅 |
I think main, @imenelydiaker, so that we can release the leaderboard. |
@isaac-chung @Samoed I might be able to fix the issue with the results in code, I will update you about it. |
Okay, so I have fixed the cases where the results are present, but in the external results folder. |
I believe this is a bug, as the |
So, in conclusion, it is a bug with the old leaderboard, and the only way to go about fixing it is for us to run MSMARCO's dev split on these models. Is this a correct assessment? |
Unfortunately yes |
@x-tabdeveloping I can work on the banner, except if there is something more pressing I can help with (seems like other items are in the works). |
Sure thing @wissam-sib! By all means go ahead |
Cool, I've started here: #1908 |
Looks like we're down to the last vital issue before the release! |
@Muennighoff Can you help us out with it? Some models don't have MSMARCO results at all on the dev split, and we might need to run them. |
Yes will try to run them this weekend! Amazing work on everything 🚀🚀🚀 |
The leaderboard is getting really close to being ready. @x-tabdeveloping and I manually reviewed each leaderboard to compare and found a few remaining issues. They generally stem from specification differences between benchmarks.py and the current v1 of the leaderboard. We also have a few missing results: some are being rerun by Niklas (@Muennighoff), while others belong to newer model releases (<1 month old). For the latter, we have reached out to the authors to let them know about the changes. For the inconsistencies, we have asked the benchmark contacts (e.g., @imenelydiaker for French) to clarify which version is desired. We are planning to do the release on Tuesday next week. There are a few missing scores and inconsistencies:
|
Since we would like to release the leaderboard as soon as possible (especially since the paper got accepted for ICLR), I would love to open a discussion about what we consider to be the minimum requirements for publishing the new leaderboard.
I highly doubt that we will be able to fix all issues right away, but we should in any case focus on the few that are crucial for the new leaderboard to be in a releasable state.
Here are some of my criteria:
VITAL PROBLEMS:
MTEB(eng, classic)
Missing metadata for tasks in MTEB(eng, classic)
#1886 (@imenelydiaker is working on this fix: Filling missing metadata for leaderboard release #1895)
I have tried implementing as many model metas as humanly possible over the last couple of days, but this has been incredibly time-consuming. If you still see models missing that you think should definitely be there, then feel free to comment here.
Nice to haves:
THIS IS JUST MY JUDGEMENT, PLEASE FEEL FREE TO ADD THINGS, I TOTALLY MIGHT BE MISSING SOMETHING
@Samoed @KennethEnevoldsen @Muennighoff @orionw @isaac-chung @imenelydiaker @tomaarsen