Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Demo prefilter rule for Nextstrain GISAID build #1128

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Jul 25, 2024

Description of proposed changes

Adds a prefilter rule to reduce the size of the input metadata for the GISAID build before running the whole workflow.

Related issue(s)

Related to #814

Testing

What steps should be taken to test the changes you've proposed?
If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

  • Determine the version number for the new release by incrementing the most recent release (e.g., "v2" from "v1").
  • Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
  • After merging, create a new GitHub release with the new version number as the tag and release title.

If this pull request introduces new features, complete the following steps:

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

@jameshadfield
Copy link
Member

Adds a prefilter rule to reduce the size of the input metadata for the GISAID build before running the whole workflow.

This seems to be part of a general theme of "we have too much data, how do we restrict it so our tooling is more performant?". As a PR on its own it seems simple and will improve performance (assuming you are happy that 500k is representative) and it seems fine to merge essentially as-is. Placing it into a wider context there's a few things that came to mind:

  • If we made this subsampling step part of ncov-ingest and provisioned 500k files (TSV + FASTA) that would speed up all n invocations of this workflow at the expense of a single additional step within ncov-ingest. It'd also reduce S3 egress and disk-space needed for the workflow-runner.
  • We're already provisioning 100k datasets. Is 100k not representative enough (cf. 500k), or is it that the 100k datasets are only regenerated weekly? It seems we could be using one or the other.
  • I continue to think the long-term solution is to have augur filter running (using a db) on a server such that augur filter calls in workflows can make an API call to the server and only download the subsetted data.

@huddlej
Copy link
Contributor Author

huddlej commented Nov 8, 2024

Thank you for following up on this, @jameshadfield! This got pushed off my todo list by other projects.

I think we could merge this, if I drop the change to the GISAID builds config and avoid changing how our regular builds work.

Otherwise, we'd want to test the specific prefilter number used here with GISAID and open builds before committing to this approach. I'm pretty out of touch with this repo these days, so I'd prefer to have someone who is keeping up with SC2 evaluate the quality of the prefiltered trees.

@victorlin
Copy link
Member

I brought this up during dev chat – it seemed like something I could pick up, but the next steps were unclear.


drop the change to the GISAID builds config and avoid changing how our regular builds work.

I got the impression that we want to go the other way: apply this to all profiles and not just GISAID.


we'd want to test the specific prefilter number used here with GISAID and open builds before committing to this approach

We discussed this. Notes extracted from dev chat doc:

Probably the 2m build target will be requesting the largest division / month counts that could be larger than 100k, note also that as time continues to accrue that 100k division / month will be increasingly sparse relative to 2m build target

Additional points:

  • Probably the 1m build target, not 2m, since it is a narrower window targeting the same amount of total sequences.
  • This can be investigated with augur filter --output-group-by-sizes which reports the amount of sequences targeted vs. number of sequences available per group. I think a good comparison will simply be diffing the --output-group-by-sizes file between current and downsampled runs (from the same input data) and making sure there are no differences.

@huddlej
Copy link
Contributor Author

huddlej commented Nov 12, 2024

I got the impression that we want to go the other way: apply this to all profiles and not just GISAID.

Yes, sorry I wasn't as clear as I should have been above. What I meant to say is that this PR should not be merged as it is because doing so would change only the GISAID profile without any verification that the number of subsampled sequences I picked actually work for our ncov builds.

If we want to merge this to keep the prefilter rule, I meant to propose that we drop the change to the GISAID profile and follow up later with the kind of verification of 1m, 2m GISAID and open builds you describe above, @victorlin.

If we want to do that verification in this PR, that's fine, too, but I don't have the bandwidth currently to take on that work. :-/

@victorlin
Copy link
Member

Thanks for clarifying! I think it'd be good to sort out in this PR. I don't see any reason to merge without changing the Nextstrain profiles since the added rule would be untested and unused.

I should have some bandwidth to take it on and will update here when I do, but others can feel free to jump in too.

huddlej and others added 2 commits November 15, 2024 10:19
Adds a prefilter rule to reduce the size of the input metadata for the
GISAID build before running the whole workflow.
This makes it easier to inspect the effect of a prefilter rule.
@victorlin victorlin force-pushed the demo-custom-prefilter-rule branch from 4919079 to ce86db2 Compare November 16, 2024 02:12
@victorlin
Copy link
Member

Rebased onto latest master. Running locally, I noticed the new rule prefilter resolved to 12 sequences per group (division, month) which might be too low especially for the 1m builds. I added ce86db2 to inspect what the individual builds are requesting per group in rule subsample and started GISAID trial builds both with prefilter and without prefilter. Once these are done, I will compare the outputs.

@victorlin
Copy link
Member

victorlin commented Nov 18, 2024

The run with prefilter failed because the root was filtered out. c8bb915 should fix it, but we can already check the output files.

Here is a diff of the --output-group-by-sizes file for the europe/1m's recent sample after sorting and prettifying. Left is current master, right is prefilter (this PR).

prefilter-diff

A "lossless" prefilter should result in no diff in the _augur_filter_subsampling_output_size column. Unfortunately, there are many. A breakdown of all diffs:

  • The off-by-one diffs in _augur_filter_target_size are due to probabilistic rounding and can be ignored.
  • The diffs in _augur_filter_input_size indicate a significant drop in sample availability for most groups.
  • The diffs in _augur_filter_subsampling_output_size indicate a significant increase in under-sampling. This is due to the decrease in _augur_filter_input_size to a value that is below (or was already below) _augur_filter_target_size.
  • Some groups (Portugal, 2024-42 and Slovenia, 2024-43) are missing entirely. This can be explained by a couple factors: (1) the grouping columns division month and (2) the small amount of 12 sequences per group resolved by requesting a maximum of 500k sequences. The grouping by month instead of week combined with the sparseness meant that some country/week groups were not sampled.

This needs more work to achieve a "lossless" prefilter. It's not easy* to do in a single augur filter call: for europe/1m's recent sample (left side of diff), it's apparent that we want all almost all the data available in the full metadata. The alternative is to use multiple augur filter calls to account for subsampling logic across all builds. This would be similar to rule subsample + rule combine_samples in the main workflow. I'm not sure if the added complexity is worth it.

* it should be possible with a carefully crafted --group-by-weights file, but that means a similar amount of added complexity as the approach which uses multiple augur filter calls.

@trvrb
Copy link
Member

trvrb commented Dec 5, 2024

Thank you for the clear demonstration here @victorlin. My guess that 500k sequences would be enough was way off. There might still be a worthwhile trade-off where we prefilter to something like 1M or 10M sequences during ncov-ingest and provide a 100k dataset, a 1M dataset and a 10M dataset along with the full dataset. Using the logic of --group-by "division year month" this would still cut out significantly more sequences from 2021 and 2022 and should leave significantly more sequences in 2023 and 2024. I bet we could get to a suitable tradeoff.

I do agree with @jameshadfield that the more generally useful implementation would be to push this to ncov-ingest and make the subsampled datasets be produced with every run of ncov-ingest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants