[DRAFT] Quac Benchmark - ASET #333

Draft · wants to merge 18 commits into main
Conversation

@lauritowal (Contributor) commented Sep 5, 2024:

This PR contains:

  • New features
  • Changes to dev-tools e.g. CI config / github tooling
  • Docs
  • Bug fixes
  • Code refactor

What is the current behavior? (You can also link to an open issue here)

Implementation of Quac Benchmark

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No

Other information:

Questions and TODOs:

  • Can we easily skip the current state in a custom scorer? If not, I need to create a workaround or implement that in the core library
  • Evaluation of the full dataset is still missing because of the previous point
  • Add benchmark to benchmarks/README.md

Development decisions:

  • I'm only evaluating on the last question instead of iterating over all questions in the text. This is, of course, a simplification, but it may be enough (?)
  • I haven't implemented the Human Equivalence Score for questions (HEQ-Q) and for dialogs (HEQ-D) reported in the paper. However, it seems like other implementations haven't done that either (GPT-3 paper, Llama 3 paper)
  • I implemented my own custom F1 scorer for now, since some custom code was needed (e.g. skipping states in the scorer, returning an F1 score of 1 for correctly returning "CANNOTANSWER", ...). However, a few things can probably be reused and refactored.

f"Human F1 score {human_f1} is below threshold {MIN_HUMAN_F1}. Skipping evaluation."
)
return Score(
value=1, # TODO: How can we skip current state instead of returning 1? If not we need to do some workaround...
@lauritowal (Contributor, Author) commented Sep 5, 2024:
How can we skip the current state instead of returning an F1 score of 1? If not, I need to implement some workaround.
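One possible workaround, assuming the scorer cannot truly skip a sample: mark such samples in `Score.metadata` and pair the scorer with a custom metric that leaves them out of the average. This is only a sketch of the idea, not behaviour the library provides out of the box:

```python
from inspect_ai.scorer import Metric, Score, metric

# Sketch of a metric that ignores samples the scorer flagged as skipped
# (e.g. Score(value=0, metadata={"skipped": True}) for low human F1).
@metric
def mean_ignoring_skipped() -> Metric:
    def compute(scores: list[Score]) -> float:
        kept = [
            score.as_float()
            for score in scores
            if not (score.metadata or {}).get("skipped", False)
        ]
        return sum(kept) / len(kept) if kept else 0.0

    return compute
```

The custom scorer could then declare this via `@scorer(metrics=[mean_ignoring_skipped()])` instead of returning 1 for samples below the human F1 threshold.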

Comment on lines +28 to +34
if prediction.upper() == "CANNOTANSWER":
return Score(
value=CORRECT
if any(t.upper() == "CANNOTANSWER" for t in targets)
else INCORRECT,
answer="CANNOTANSWER",
)
@lauritowal (Contributor, Author):
I'm currently explicitly checking for CANNOTANSWER. Maybe that can be removed / simplified, though. Needs to be tested.
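For the "needs to be tested" part, a couple of small unit tests could pin down the intended behaviour. The `score_cannot_answer` helper below is hypothetical (not from this PR); it just isolates the branch shown above so it can be tested directly:

```python
from inspect_ai.scorer import CORRECT, INCORRECT, Score

# Hypothetical helper that mirrors the CANNOTANSWER branch above.
def score_cannot_answer(prediction: str, targets: list[str]) -> Score:
    return Score(
        value=CORRECT
        if any(t.upper() == "CANNOTANSWER" for t in targets)
        else INCORRECT,
        answer="CANNOTANSWER",
    )

def test_cannot_answer_match():
    assert score_cannot_answer("CANNOTANSWER", ["CANNOTANSWER"]).value == CORRECT

def test_cannot_answer_mismatch():
    assert score_cannot_answer("CANNOTANSWER", ["a real answer"]).value == INCORRECT
```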



# TODO: Can we use _f1 from the core library instead?
def f1_score(prediction: str, ground_truth: str) -> float:
@lauritowal (Contributor, Author):
Can we use _f1 from the core library instead?
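For reference, the token-overlap F1 used for QuAC (and SQuAD) is standard; here is a minimal sketch, assuming prediction and ground truth have already been normalized into comparable token lists:

```python
from collections import Counter

# Standard SQuAD/QuAC-style token-overlap F1 between one prediction and
# one reference answer (the scorer would take the max over references).
def token_f1(prediction_tokens: list[str], ground_truth_tokens: list[str]) -> float:
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(prediction_tokens)
    recall = num_same / len(ground_truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

Whether the core library's `_f1` already does exactly this (and can therefore be reused) is the open question here.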

Comment on lines +96 to +98
# TODO: Can we use _normalize from the core library and also add the following after:
def remove_stopwords(text):
return [word for word in text.split() if word not in STOP_WORDS]
@lauritowal (Contributor, Author) commented Sep 5, 2024:
Can we use _normalize from the core library and also add this on top in the custom scorer? We want this since the paper mentions that stop words were removed. (That said, in one implementation I found, they seem to remove only adjectives: https://github.com/my89/co-squac/blob/master/evals/squad2_eval.py#L45C1-L58C1 )
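A sketch of what that combination could look like, assuming NLTK's English stop-word list (the `normalize` name here is illustrative, not the core library's `_normalize`):

```python
import re
import string

from nltk.corpus import stopwords

# Requires a one-time download of the stop-word list:
#     python -c "import nltk; nltk.download('stopwords')"
STOP_WORDS = set(stopwords.words("english"))

def normalize(text: str) -> list[str]:
    # Lowercase, drop punctuation and articles (the usual SQuAD-style
    # normalization), then remove stop words as described in the QuAC paper.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return [word for word in text.split() if word not in STOP_WORDS]
```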

Comment on lines +34 to +35
wikipedia_title = f"""Wikipedia Page Title: {record["wikipedia_page_title"]}"""
background = f"""Background: {record["background"]}"""
@lauritowal (Contributor, Author):
It's a bit unclear whether wikipedia_title and background were also added in the original implementation of the paper.
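For context, a hypothetical `record_to_sample` that uses these fields and evaluates only the last question (per the development decisions above) might look like this. Field names beyond `wikipedia_page_title` and `background` follow the Hugging Face `quac` schema and are assumptions about this PR's implementation, not a copy of it:

```python
from inspect_ai.dataset import Sample

def record_to_sample(record: dict) -> Sample:
    wikipedia_title = f"""Wikipedia Page Title: {record["wikipedia_page_title"]}"""
    background = f"""Background: {record["background"]}"""
    passage = f"""Passage: {record["context"]}"""

    # Earlier turns provide conversational context; only the final question is scored.
    history = "\n".join(
        f"Q: {question}\nA: {answer}"
        for question, answer in zip(
            record["questions"][:-1], record["orig_answers"]["texts"][:-1]
        )
    )

    return Sample(
        input="\n\n".join(
            [wikipedia_title, background, passage, history, f"Q: {record['questions'][-1]}"]
        ),
        # Reference answers for the last turn (may include "CANNOTANSWER").
        target=record["answers"]["texts"][-1],
    )
```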

@@ -25,3 +25,4 @@ shortuuid
tenacity
typing_extensions>=4.9.0
zipp>=3.19.1 # not directly required, pinned by Snyk to avoid a vulnerability
nltk==3.8.1
@lauritowal (Contributor, Author):
Added to filter out stop words, as mentioned in the paper.

@lauritowal (Contributor, Author) commented Sep 6, 2024:
Tests might not work after adding the human F1 score. (Also not sure whether we should keep them here, or at all.)

return Task(
    dataset=dataset,
    scorer=f1_scorer(),
    config=GenerateConfig(temperature=0.5),
@lauritowal (Contributor, Author) commented Sep 6, 2024:
Should we set this to 0 to make it deterministic, as done for the commonsense_qa benchmark?

Contributor:
I can't see anything in the paper or the alternative implementation repo you linked above about the temperature. Perhaps make it deterministic and also expose it as a parameter for the user to set?
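A sketch of that suggestion, reusing the task pieces from this PR (`dataset`, `f1_scorer`): default to temperature 0 for determinism and expose it as a task parameter the user can override:

```python
from inspect_ai import Task, task
from inspect_ai.model import GenerateConfig

@task
def quac(temperature: float = 0.0) -> Task:
    # dataset and f1_scorer() as defined elsewhere in this PR.
    return Task(
        dataset=dataset,
        scorer=f1_scorer(),
        config=GenerateConfig(temperature=temperature),
    )
```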

remove todos
@jjallaire-aisi (Collaborator):
@dragonstyle Could you review and work with @lauritowal on resolving the various questions and possible refinements?

@jjallaire-aisi (Collaborator):
@dragonstyle A few questions in here about re-using pieces of the f1 scorer (or maybe the f1 scorer needs to get some additional flexibility to handle these cases?). Could you review and suggest recommended courses of action?

@jjallaire (Collaborator):
@lauritowal thank you so much for your work on this! We are contemplating whether we should enhance our f1 scoring to better accommodate the paper/eval, suggest a scheme where you create a custom scorer that delegates to our f1 scorer, or possibly just export some of these internal functions for external use.

We may not have time to sort all of this out in the very short term (this week or next). Would you be okay with our holding off merging this until we have a better idea how to handle this properly?

@lauritowal (Contributor, Author) commented Sep 9, 2024:

@jjallaire

> We may not have time to sort all of this out in the very short term (this week or next). Would you be okay with our holding off merging this until we have a better idea how to handle this properly?

It's alright to hold off on merging until this is clear. :) Then I can clean up the draft pull request.

> We are contemplating whether we should enhance our f1 scoring to better accommodate the paper/eval, suggest a scheme where you create a custom scorer that delegates to our f1 scorer, or possibly just export some of these internal functions for external use.

Also happy to help if needed
