Demo prefilter rule for Nextstrain GISAID build #1128
This seems to be part of a general theme of "we have too much data, how do we restrict it so our tooling is more performant?". As a PR on its own it seems simple and will improve performance (assuming you are happy that 500k is representative), and it seems fine to merge essentially as-is. Placing it into a wider context, there are a few things that came to mind:
Thank you for following up on this, @jameshadfield! This got pushed off my todo list by other projects. I think we could merge this, if I drop the change to the GISAID builds config and avoid changing how our regular builds work. Otherwise, we'd want to test the specific prefilter number used here with GISAID and open builds before committing to this approach. I'm pretty out of touch with this repo these days, so I'd prefer to have someone who is keeping up with SC2 evaluate the quality of the prefiltered trees.
I brought this up during dev chat – it seemed like something I could pick up, but the next steps were unclear.
I got the impression that we want to go the other way: apply this to all profiles and not just GISAID.
We discussed this. Notes extracted from dev chat doc:
Additional points:
Yes, sorry I wasn't as clear as I should have been above. What I meant to say is that this PR should not be merged as it is, because doing so would change only the GISAID profile without any verification that the number of subsampled sequences I picked actually works for our ncov builds. If we want to merge this to keep the prefilter rule, I meant to propose that we drop the change to the GISAID profile and follow up later with the kind of verification of 1m, 2m GISAID and open builds you describe above, @victorlin. If we want to do that verification in this PR, that's fine, too, but I don't have the bandwidth currently to take on that work. :-/
Thanks for clarifying! I think it'd be good to sort out in this PR. I don't see any reason to merge without changing the Nextstrain profiles since the added rule would be untested and unused. I should have some bandwidth to take it on and will update here when I do, but others can feel free to jump in too.
Adds a prefilter rule to reduce the size of the input metadata for the GISAID build before running the whole workflow.
This makes it easier to inspect the effect of a prefilter rule.
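As a rough illustration of what a prefilter step does, here is a minimal Python sketch that randomly subsamples a metadata TSV down to a fixed number of rows. It is not the rule added in this PR, whose implementation is not shown in this conversation. The function name `prefilter_metadata`, the column names, and the example data are all hypothetical.

```python
import csv
import io
import random


def prefilter_metadata(tsv_text, max_sequences, seed=0):
    """Return a TSV string with at most `max_sequences` randomly chosen rows.

    Hypothetical helper for illustration only; not the rule from this PR.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)

    # Only subsample when the input is larger than the requested size.
    if len(rows) > max_sequences:
        rows = random.Random(seed).sample(rows, max_sequences)

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up metadata with ten records.
example = "strain\tdate\n" + "".join(
    f"s{i}\t2021-01-{i + 1:02d}\n" for i in range(10)
)
filtered = prefilter_metadata(example, max_sequences=4)
print(filtered)
```

A fixed seed keeps the subsample reproducible between runs, which makes it easier to compare builds with and without the prefilter.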
4919079 to ce86db2
Rebased onto latest master. Running locally, I noticed the new |
The run with prefilter failed because the root was filtered out. c8bb915 should fix it, but we can already check the output files. Here is a diff of the A "lossless" prefilter should result in no diff in the
This needs more work to achieve a "lossless" prefilter. It's not easy* to do in a single * it should be possible with a carefully crafted |
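The two problems described above (the root being filtered out, and the prefilter not being "lossless") can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `prefilter_strains` and `lost_strains` are hypothetical helpers, the strain names are made up, and "lossless" is approximated here as "every strain the full build selected survives the prefilter", which may be narrower than the file diff used in this thread.

```python
import random


def prefilter_strains(strains, max_sequences, force_include=(), seed=0):
    """Keep every force-included strain, then fill up to max_sequences at random.

    Hypothetical helper; force-including the root avoids dropping it.
    """
    forced = set(force_include)
    kept = [s for s in strains if s in forced]
    candidates = [s for s in strains if s not in forced]
    n_random = max(0, max_sequences - len(kept))
    rng = random.Random(seed)
    kept += rng.sample(candidates, min(n_random, len(candidates)))
    return kept


def lost_strains(selected_without_prefilter, prefiltered):
    """Strains the full build selected that the prefilter dropped.

    An empty result means the prefilter was lossless for that selection.
    """
    return sorted(set(selected_without_prefilter) - set(prefiltered))


# Made-up data: a root plus twenty other strains.
all_strains = ["root"] + [f"s{i}" for i in range(20)]
# Assumed selection from a build run without any prefilter.
selected = ["root", "s3", "s7"]

kept = prefilter_strains(all_strains, max_sequences=5, force_include=["root"])
print("root kept:", "root" in kept)
print("lost:", lost_strains(selected, kept))
```

Force-including only the root keeps the build from failing, but a random prefilter can still drop strains that the downstream subsampling would have chosen, which is why a small prefilter size shows up as a diff.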
Thank you for the clear demonstration here @victorlin. My guess that 500k sequences would be enough was way off. There might still be a worthwhile trade-off where we prefilter to something like 1M or 10M sequences during I do agree with @jameshadfield that the more generally useful implementation would be to push this to |
Description of proposed changes
Adds a prefilter rule to reduce the size of the input metadata for the GISAID build before running the whole workflow.
Related issue(s)
Related to #814
Testing
What steps should be taken to test the changes you've proposed?
If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?
Release checklist
If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:
Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
If this pull request introduces new features, complete the following steps:
Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.