-
Notifications
You must be signed in to change notification settings - Fork 150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DRAFT] Quac Benchmark - ASET #333
base: main
Are you sure you want to change the base?
Conversation
f"Human F1 score {human_f1} is below threshold {MIN_HUMAN_F1}. Skipping evaluation." | ||
) | ||
return Score( | ||
value=1, # TODO: How can we skip current state instead of returning 1? If not we need to do some workaround... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we skip the current state instead of returning f1 score of 1? If not I need to implement some workaround
if prediction.upper() == "CANNOTANSWER": | ||
return Score( | ||
value=CORRECT | ||
if any(t.upper() == "CANNOTANSWER" for t in targets) | ||
else INCORRECT, | ||
answer="CANNOTANSWER", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm currently explicitly checking for CANNOTANSWER. Maybe that can be removed / simplified though. Need to be tested
|
||
|
||
# TODO: Can we use _f1 from the core library instead? | ||
def f1_score(prediction: str, ground_truth: str) -> float: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use _f1
from the core library instead?
# TODO: Can we use _normalize from the core library and also add the following after: | ||
def remove_stopwords(text): | ||
return [word for word in text.split() if word not in STOP_WORDS] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use _normalize
from the core library and also add this on top in the custom scorer? We want this since in the paper they mention that they have removed stop words. (That being said, in one implementation I found they seem to remove only adjectives: https://github.com/my89/co-squac/blob/master/evals/squad2_eval.py#L45C1-L58C1 )
wikipedia_title = f"""Wikipedia Page Title: {record["wikipedia_page_title"]}""" | ||
background = f"""Background: {record["background"]}""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a bit unclear if wikipedia_title
and background
were also added in the original implementation of the paper
@@ -25,3 +25,4 @@ shortuuid | |||
tenacity | |||
typing_extensions>=4.9.0 | |||
zipp>=3.19.1 # not directly required, pinned by Snyk to avoid a vulnerability | |||
nltk==3.8.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added to filter out stop words as mentioned in the paper
benchmarks/quac/test_quac.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests might not work after adding the human f1 score (Also not sure if we should keep them here or at all).
return Task( | ||
dataset=dataset, | ||
scorer=f1_scorer(), | ||
config=GenerateConfig(temperature=0.5), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we set to 0 to make it deterministic, as done for the commonsense_qa benchmark?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cant see anything in the paper or the alternative implementation repo you linked above about the temperature. Perhaps make it deterministic and also expose it as a param for the user to set?
remove todos
@dragonstyle Could you review and work with @lauritowal on resolving the various questions and possible refinements? |
@dragonstyle A few questions in here about re-using pieces of the f1 scorer (or maybe the f1 scorer needs to get some additional flexibility to handle these cases?). Could you review and suggest recommended courses of action? |
@lauritowal thank you so much for your work on this! We are contemplating whether we should enhance our f1 scoring to better accommodate the paper/eval, suggest a scheme where you create a custom scorer that delegates to our f1 scorer, or possibly just export some of these internal functions for external use. We may not have time to sort all of this out in the very short term (this week or next). Would you be okay with our holding off merging this until we have a better idea how to handle this properly? |
It's alright to hold off on merging until this is clear. :) Then I can clean up the draft pull request.
Also happy to help if needed |
This PR contains:
What is the current behavior? (You can also link to an open issue here)
Implementation of Quac Benchmark
Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
No
Other information:
Questions and TODOS:
Development decisions: