Rating methods comparison #33

amirrr · 2024-11-19T14:58:42Z

amirrr · 2024-11-19T15:03:54Z

gpt vs llama3	llama3 vs llama2	gpt vs llama2

joshnguyen99 · 2024-12-03T16:58:20Z

@amirrr @markwhiting Here's a comparison between humans and LLaMA-3.1 in labeling 6 dimensions for statements in our previous corpus.

Note that figure_of_speech is the negative class for literal language. If it were the positive class, the F-1 score would just be 0.48.

markwhiting · 2024-12-03T17:03:52Z

Thanks, do these look right to you? I note that we did have reversed classes for some features that we reverse in the start of the R script from the prior experiment, and just want to ensure that the ratings you're comparing with reflect the correct state.

joshnguyen99 · 2024-12-03T17:06:08Z

I think they should be correct. In the original data file (and R script), only everyday was coded as 1. Every other variable should be flipped. I flipped because the coded 1 value corresponds to the direction which has higher commonsensicality for humans.

markwhiting · 2024-12-03T17:13:58Z

OK, thanks. Qualitatively do you have any qualms with this?

Also, how much will this change the design points we can cover with our current corpus?

joshnguyen99 · 2024-12-03T17:24:46Z

The numbers are lower than I expected, although upon manual inspection some human-given labels may not be reasonable.

For example:

"We should ensure police have body cameras on them." -> Humans said fact, LLM said opinion
"The first thing you do when you have an asthma attack is use the inhaler." -> Humans said social, LLM said physical

I'm not sure how much the design points will have to change, but I think we should do a manual evaluation of LLaMA-3.1's labels in the new dataset. Perhaps we can sample 200 statements, label them ourselves independently (at least by me and Amir), and then check whether we agree with each other and with the LLM.

markwhiting · 2024-12-03T19:08:40Z

Thanks — yeah a manual pass seems worth it. At worst we will then have some ground truth to work with. If we want to scale that a bit more we can also put a more sophisticated version of the task in front of turkers.

joshnguyen99 · 2024-12-10T20:32:55Z

TODO:

Dataset for statements is here.
Select only statements for which published is True. These are statements that are on the platform already.
Randomly select 200 statements. Try multiple ways.
Coordinate labeling.
Prompt different LLMs for classification.

joshnguyen99 · 2024-12-20T21:43:01Z

@markwhiting @amirrr —

I have prompted LLaMA-3-8B to classify all N = 10,110 statements here.

The same features previously obtained by Amir are in this file. Note that they are only available for the first N = 8,814 statements.

For comparison, I have put them together. For feature X, X_old is Amir's version, while X is mine. Note that X_old is only available for the first 8,814 rows.

statements_features_comparison.csv

joshnguyen99 · 2024-12-20T21:50:19Z

Here are 200 statements randomly sampled from those for which published is True. I will label them myself and hope to get @amirrr's ratings soon as well.

to_label_200.csv

amirrr · 2024-12-20T22:20:54Z

Here are 200 statements randomly sampled from those for which published is True. I will label them myself and hope to get @amirrr's ratings soon as well.

to_label_200.csv

Will start working on them now.

joshnguyen99 · 2025-02-05T21:13:02Z

@markwhiting @amirrr

This is for the 200-statement dataset that Amir and I annotated independently.

I used LLaMA-3.1-8B to make predictions for the 6 dimensions. The script can be found here.

Essentially I asked LLaMA to output either of the two tokens: 0 and 1. Then I performed a softmax transformation on the logit scores of only these two tokens to get the associated probabilities.

The figure attached here shows the ROC curves for these scores. There are two settings: when using my labels as the ground truth (black), and when using Amir's labels (blue).

joshnguyen99 · 2025-02-05T21:17:34Z

I performed the same analysis for the original 4,407-statement dataset. Here I treat the mode of human annotations as the ground truth, similar to what we've done so far.

I also use the mean of these annotations. For example, if a statement gets 5 annotations, 3 of which say "physical" and 2 say "social", then the human score for "physical" is 3 / 5 = 60%. The figure below shows the correlation between LLaMA scores and human scores. I use the Spearman correlation here, and all of them are statistically significant.

markwhiting · 2025-02-05T21:20:28Z

Interesting. Thanks for this. It feels like we still don't have something that is strongly aligned, but I wonder what the best next step is for getting to a place where we are satisfied.

markwhiting · 2025-02-05T21:20:50Z

Would you mind doing the average analysis on the 200 that you two rated?

joshnguyen99 · 2025-02-05T21:29:24Z

Sure. Since there are only two annotators, the "human scores" are too coarse to be treated as continuous. Therefore, I determine a statement to be positive if at least Amir or I think it's positive, effectively doing an OR operation.

The ROC curves are as follows.

joshnguyen99 · 2025-02-05T21:33:57Z

I would say the following categories are pretty robust: fact, physical and literal language. Perhaps everyday as well.

The hardest for me to annotate were:

positive/normative: because some statements do not contain directive words like "should", "must" and "ought to", but could the normativity of them could still be inferred.

-knowledge/reasoning: similarly, should we only be relying on connectives like therefore, thus or hence?

markwhiting · 2025-02-06T01:33:47Z

I think even the resolution of three possible values 0, .5 or 1, I'm hoping that we see improved alignment over the binary case, because I think that would let us argue that the continuous categories from the model are sufficiently more informative that it's worth treating those as our standard.

So I'd still be interested to look at that if you can easily run it.

Another option would be to take these 200 statements and a few rounds of human raters on them to see if we get more consistent responses that way too.

Lastly, I really don't mind if we want to try to improve our definitions of these concepts as I think that will benefit us as well as the models.

joshnguyen99 · 2025-02-06T15:36:38Z

Sure, here it is!

I agree with you. I think we can all improve on our ratings in the next iteration, with a different set of statements. I certainly would.

markwhiting · 2025-02-06T16:09:34Z

Thank you.

Interesting that the model sometimes captures disagreement nicely and sometimes does not. Also interesting that LLaMA's probability for some is skewed so high. I wonder if we should post tune those to have better coverage and if that would help at all.

positive seems like the hardest one here because neither direction of tuning will help us.

Is this after a revisit to the ratings (I think @amirrr was planning to look at positive again?)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rating methods comparison #33

Rating methods comparison #33

amirrr commented Nov 19, 2024

amirrr commented Nov 19, 2024

joshnguyen99 commented Dec 3, 2024

markwhiting commented Dec 3, 2024

joshnguyen99 commented Dec 3, 2024 •

edited

Loading

markwhiting commented Dec 3, 2024

joshnguyen99 commented Dec 3, 2024

markwhiting commented Dec 3, 2024 via email

joshnguyen99 commented Dec 10, 2024

joshnguyen99 commented Dec 20, 2024 •

edited

Loading

joshnguyen99 commented Dec 20, 2024

amirrr commented Dec 20, 2024

joshnguyen99 commented Feb 5, 2025

joshnguyen99 commented Feb 5, 2025

markwhiting commented Feb 5, 2025

markwhiting commented Feb 5, 2025

joshnguyen99 commented Feb 5, 2025

joshnguyen99 commented Feb 5, 2025 •

edited

Loading

markwhiting commented Feb 6, 2025 •

edited

Loading

joshnguyen99 commented Feb 6, 2025

markwhiting commented Feb 6, 2025

Rating methods comparison #33

Rating methods comparison #33

Comments

amirrr commented Nov 19, 2024

amirrr commented Nov 19, 2024

joshnguyen99 commented Dec 3, 2024

markwhiting commented Dec 3, 2024

joshnguyen99 commented Dec 3, 2024 • edited Loading

markwhiting commented Dec 3, 2024

joshnguyen99 commented Dec 3, 2024

markwhiting commented Dec 3, 2024 via email

joshnguyen99 commented Dec 10, 2024

joshnguyen99 commented Dec 20, 2024 • edited Loading

joshnguyen99 commented Dec 20, 2024

amirrr commented Dec 20, 2024

joshnguyen99 commented Feb 5, 2025

joshnguyen99 commented Feb 5, 2025

markwhiting commented Feb 5, 2025

markwhiting commented Feb 5, 2025

joshnguyen99 commented Feb 5, 2025

joshnguyen99 commented Feb 5, 2025 • edited Loading

markwhiting commented Feb 6, 2025 • edited Loading

joshnguyen99 commented Feb 6, 2025

markwhiting commented Feb 6, 2025

joshnguyen99 commented Dec 3, 2024 •

edited

Loading

joshnguyen99 commented Dec 20, 2024 •

edited

Loading

joshnguyen99 commented Feb 5, 2025 •

edited

Loading

markwhiting commented Feb 6, 2025 •

edited

Loading