[DRAFT] Quac Benchmark - ASET #333

Draft · wants to merge 18 commits into main
Conversation

@lauritowal (Contributor) commented Sep 5, 2024:

This PR contains:

  • New features
  • Changes to dev-tools e.g. CI config / github tooling
  • Docs
  • Bug fixes
  • Code refactor

What is the current behavior? (You can also link to an open issue here)

Implementation of Quac Benchmark

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No

Other information:

Questions and TODOs:

  • Can we easily skip the current state in a custom scorer? If not, I need to create a workaround or implement that in the core library
  • Evaluation of the full dataset is still missing because of the previous point
  • Add benchmark to benchmarks/README.md

Development decisions:

  • I'm only evaluating on the last question instead of iterating over all questions in the text. This is, of course, a simplification, but it may be enough (?)
  • I haven't implemented the Human Equivalence Score for questions (HEQ-Q) and for dialogs (HEQ-D) reported in the paper. However, it seems like other implementations haven't done that either (GPT-3 paper, Llama 3 paper)
  • I implemented my own custom F1 scorer for now, since some custom code was needed (e.g. skipping states in the scorer, returning an F1 score of 1 for correctly returning "CANNOTANSWER", ...). However, a few things can probably be reused and refactored.

f"Human F1 score {human_f1} is below threshold {MIN_HUMAN_F1}. Skipping evaluation."
)
return Score(
value=1, # TODO: How can we skip current state instead of returning 1? If not we need to do some workaround...
@lauritowal (Contributor, Author) commented Sep 5, 2024:
How can we skip the current state instead of returning an F1 score of 1? If not, I need to implement some workaround.
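One possible workaround, assuming the scorer cannot truly skip a sample: mark such samples in `Score.metadata` and pair the scorer with a custom metric that leaves them out of the average. This is only a sketch of the idea, not behaviour the library provides out of the box:

```python
from inspect_ai.scorer import Metric, Score, metric

# Sketch of a metric that ignores samples the scorer flagged as skipped
# (e.g. Score(value=0, metadata={"skipped": True}) for low human F1).
@metric
def mean_ignoring_skipped() -> Metric:
    def compute(scores: list[Score]) -> float:
        kept = [
            score.as_float()
            for score in scores
            if not (score.metadata or {}).get("skipped", False)
        ]
        return sum(kept) / len(kept) if kept else 0.0

    return compute
```

The custom scorer could then declare this via `@scorer(metrics=[mean_ignoring_skipped()])` instead of returning 1 for samples below the human F1 threshold.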

Comment on lines +28 to +34
if prediction.upper() == "CANNOTANSWER":
return Score(
value=CORRECT
if any(t.upper() == "CANNOTANSWER" for t in targets)
else INCORRECT,
answer="CANNOTANSWER",
)
@lauritowal (Contributor, Author):
I'm currently explicitly checking for CANNOTANSWER. Maybe that can be removed / simplified, though. Needs to be tested.
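For the "needs to be tested" part, a couple of small unit tests could pin down the intended behaviour. The `score_cannot_answer` helper below is hypothetical (not from this PR); it just isolates the branch shown above so it can be tested directly:

```python
from inspect_ai.scorer import CORRECT, INCORRECT, Score

# Hypothetical helper that mirrors the CANNOTANSWER branch above.
def score_cannot_answer(prediction: str, targets: list[str]) -> Score:
    return Score(
        value=CORRECT
        if any(t.upper() == "CANNOTANSWER" for t in targets)
        else INCORRECT,
        answer="CANNOTANSWER",
    )

def test_cannot_answer_match():
    assert score_cannot_answer("CANNOTANSWER", ["CANNOTANSWER"]).value == CORRECT

def test_cannot_answer_mismatch():
    assert score_cannot_answer("CANNOTANSWER", ["a real answer"]).value == INCORRECT
```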



# TODO: Can we use _f1 from the core library instead?
def f1_score(prediction: str, ground_truth: str) -> float:
@lauritowal (Contributor, Author):
Can we use _f1 from the core library instead?
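For reference, the token-overlap F1 used for QuAC (and SQuAD) is standard; here is a minimal sketch, assuming prediction and ground truth have already been normalized into comparable token lists:

```python
from collections import Counter

# Standard SQuAD/QuAC-style token-overlap F1 between one prediction and
# one reference answer (the scorer would take the max over references).
def token_f1(prediction_tokens: list[str], ground_truth_tokens: list[str]) -> float:
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(prediction_tokens)
    recall = num_same / len(ground_truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

Whether the core library's `_f1` already does exactly this (and can therefore be reused) is the open question here.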

Comment on lines +96 to +98
# TODO: Can we use _normalize from the core library and also add the following after:
def remove_stopwords(text):
return [word for word in text.split() if word not in STOP_WORDS]
@lauritowal (Contributor, Author) commented Sep 5, 2024:
Can we use _normalize from the core library and also add this on top in the custom scorer? We want this since the paper mentions that stop words were removed. (That said, in one implementation I found, they seem to remove only adjectives: https://github.com/my89/co-squac/blob/master/evals/squad2_eval.py#L45C1-L58C1 )
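A sketch of what that combination could look like, assuming NLTK's English stop-word list (the `normalize` name here is illustrative, not the core library's `_normalize`):

```python
import re
import string

from nltk.corpus import stopwords

# Requires a one-time download of the stop-word list:
#     python -c "import nltk; nltk.download('stopwords')"
STOP_WORDS = set(stopwords.words("english"))

def normalize(text: str) -> list[str]:
    # Lowercase, drop punctuation and articles (the usual SQuAD-style
    # normalization), then remove stop words as described in the QuAC paper.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return [word for word in text.split() if word not in STOP_WORDS]
```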

Comment on lines +34 to +35
wikipedia_title = f"""Wikipedia Page Title: {record["wikipedia_page_title"]}"""
background = f"""Background: {record["background"]}"""
@lauritowal (Contributor, Author):
It's a bit unclear whether wikipedia_title and background were also added in the original implementation of the paper.
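For context, a hypothetical `record_to_sample` that uses these fields and evaluates only the last question (per the development decisions above) might look like this. Field names beyond `wikipedia_page_title` and `background` follow the Hugging Face `quac` schema and are assumptions about this PR's implementation, not a copy of it:

```python
from inspect_ai.dataset import Sample

def record_to_sample(record: dict) -> Sample:
    wikipedia_title = f"""Wikipedia Page Title: {record["wikipedia_page_title"]}"""
    background = f"""Background: {record["background"]}"""
    passage = f"""Passage: {record["context"]}"""

    # Earlier turns provide conversational context; only the final question is scored.
    history = "\n".join(
        f"Q: {question}\nA: {answer}"
        for question, answer in zip(
            record["questions"][:-1], record["orig_answers"]["texts"][:-1]
        )
    )

    return Sample(
        input="\n\n".join(
            [wikipedia_title, background, passage, history, f"Q: {record['questions'][-1]}"]
        ),
        # Reference answers for the last turn (may include "CANNOTANSWER").
        target=record["answers"]["texts"][-1],
    )
```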

@@ -25,3 +25,4 @@ shortuuid
tenacity
typing_extensions>=4.9.0
zipp>=3.19.1 # not directly required, pinned by Snyk to avoid a vulnerability
nltk==3.8.1
@lauritowal (Contributor, Author):
Added to filter out stop words, as mentioned in the paper.

@lauritowal (Contributor, Author) commented Sep 6, 2024:
Tests might not work after adding the human F1 score. (Also not sure whether we should keep them here, or at all.)

return Task(
    dataset=dataset,
    scorer=f1_scorer(),
    config=GenerateConfig(temperature=0.5),
@lauritowal (Contributor, Author) commented Sep 6, 2024:
Should we set this to 0 to make it deterministic, as done for the commonsense_qa benchmark?

Contributor:
I can't see anything in the paper or the alternative implementation repo you linked above about the temperature. Perhaps make it deterministic and also expose it as a parameter for the user to set?
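A sketch of that suggestion, reusing the task pieces from this PR (`dataset`, `f1_scorer`): default to temperature 0 for determinism and expose it as a task parameter the user can override:

```python
from inspect_ai import Task, task
from inspect_ai.model import GenerateConfig

@task
def quac(temperature: float = 0.0) -> Task:
    # dataset and f1_scorer() as defined elsewhere in this PR.
    return Task(
        dataset=dataset,
        scorer=f1_scorer(),
        config=GenerateConfig(temperature=temperature),
    )
```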

remove todos
@jjallaire-aisi (Collaborator):
@dragonstyle Could you review and work with @lauritowal on resolving the various questions and possible refinements?

@jjallaire-aisi (Collaborator):
@dragonstyle A few questions in here about re-using pieces of the f1 scorer (or maybe the f1 scorer needs to get some additional flexibility to handle these cases?). Could you review and suggest recommended courses of action?

@jjallaire (Collaborator):
@lauritowal thank you so much for your work on this! We are contemplating whether we should enhance our f1 scoring to better accommodate the paper/eval, suggest a scheme where you create a custom scorer that delegates to our f1 scorer, or possibly just export some of these internal functions for external use.

We may not have time to sort all of this out in the very short term (this week or next). Would you be okay with our holding off merging this until we have a better idea how to handle this properly?

@lauritowal (Contributor, Author) commented Sep 9, 2024:

@jjallaire

> We may not have time to sort all of this out in the very short term (this week or next). Would you be okay with our holding off merging this until we have a better idea how to handle this properly?

It's alright to hold off on merging until this is clear. :) Then I can clean up the draft pull request.

> We are contemplating whether we should enhance our f1 scoring to better accommodate the paper/eval, suggest a scheme where you create a custom scorer that delegates to our f1 scorer, or possibly just export some of these internal functions for external use.

Also happy to help if needed
