Feedback as a guinea pig #1

Open
lbeyers opened this issue Aug 28, 2024 · 0 comments
lbeyers commented Aug 28, 2024

Not an issue, just feedback as someone who has never worked with this type of data before.

I find the notebook easy to follow given the explanation I received. It takes some time to get into, and I added a few comments for myself locally. To be specific, in the "filtering on precursor mass" section I noted that the lowest-index beam prediction within tolerance is chosen, which makes sense because confidence decreases with increasing beam index; in other words, breaking out of the filtering loop at the first option that meets tolerance is not an arbitrary choice. Nothing here was anything I couldn't work out easily enough.
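For the record, here is a minimal sketch of how I read that filtering step. The function name, the ppm tolerance, and the idea of passing candidate masses as a list are my own illustration, not the notebook's actual API; the one load-bearing assumption is that beam predictions are ordered by decreasing confidence, so the first candidate within tolerance is also the most confident one:

```python
# Hypothetical sketch of "filtering on precursor mass": beam predictions are
# assumed ordered by decreasing confidence, so returning at the FIRST
# candidate within tolerance also returns the most confident valid one.

def pick_prediction(beam_masses, observed_mass, tol_ppm=50.0):
    """Return the index of the lowest-index (most confident) beam
    prediction whose mass is within tol_ppm of the observed precursor
    mass, or None if no candidate qualifies."""
    for i, mass in enumerate(beam_masses):
        if abs(mass - observed_mass) / observed_mass * 1e6 <= tol_ppm:
            return i  # break out as soon as the first option meets tolerance
    return None

# toy example: beam 0 is ~466 ppm off, beam 1 is well within tolerance
print(pick_prediction([1500.9, 1500.20001, 1500.2], 1500.2))  # -> 1
```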

As a side investigation, I had a look at the limits of the AUC scores both for the train set and the test set. Essentially I wanted to answer the question: with perfect selection of predictions from those available in the pred_beam_i columns, how large can my AUC be?

On the train set, I found a max AUC of 0.715, so that's the highest AUC we can get without adding new prediction options into the dataset. Using Reference.csv as the ground truth, the highest AUC we can get on the test set is 0.419. That leaves a maximum improvement of 0.065 on the train set and 0.093 on the test set if only filtering is used!
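The core of the upper-bound argument can be sketched as below. This is illustrative only: the data layout, names, and exact-match criterion are my assumptions, not the notebook's code, and I'm reducing "perfect selection" to its simplest form (filtering can only choose among existing pred_beam_i candidates, so no strategy can beat an oracle that picks a correct candidate whenever one exists):

```python
# Hedged sketch of the "oracle" upper bound: the score of any filtering
# strategy is capped by how often ANY beam candidate is correct, because
# filtering only selects among existing candidates. All names are
# illustrative, not taken from the actual notebook.

def oracle_hit_rate(beam_candidates, ground_truth):
    """Fraction of spectra for which at least one beam candidate matches
    the ground-truth sequence -- an upper bound on any selection-only
    (i.e. filtering) strategy."""
    hits = sum(
        any(cand == truth for cand in beams)
        for beams, truth in zip(beam_candidates, ground_truth)
    )
    return hits / len(ground_truth)

# toy data: rows 1 and 2 contain a correct candidate, row 3 does not
beams = [["PEPTIDE", "PEPTLDE"], ["QWERTY", "QWERTZ"], ["ABC", "ABD"]]
truth = ["PEPTIDE", "QWERTZ", "XYZ"]
print(oracle_hit_rate(beams, truth))  # -> 0.666...
```

The actual AUC bound additionally ranks spectra by confidence, but the same principle applies: the oracle's choice per spectrum caps what selection alone can achieve.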

Assuming I used the right methods to get there, this seems useful to know, since it bounds the "legal" values participants should be getting if they don't supplement the data. As a hackathon facilitator, I may lean toward suggesting data supplementation as a strategy with more potential than filtering, and encourage a deep dive into Prosit.
