Use weighted sampling for other builds #1151

Merged Sep 30, 2024 (3 commits)

Conversation

victorlin
Member

@victorlin victorlin commented Aug 26, 2024

Description of proposed changes

Extend the weighted sampling approach from Asia builds to all other builds.

Trial build links

Related issue(s)

Closes #1141

Checklist

  • Run trial builds
  • Merge base PR Fix Asia weighted sampling #1150
  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

@victorlin victorlin self-assigned this Aug 26, 2024
This was referenced Aug 26, 2024
@victorlin victorlin changed the title Use weighted sampling other builds Use weighted sampling for other builds Aug 27, 2024
@victorlin victorlin marked this pull request as ready for review August 27, 2024 00:39
@victorlin victorlin requested a review from trvrb August 30, 2024 18:37
Base automatically changed from victorlin/fix-asia-weighted-sampling to master September 26, 2024 23:24
@trvrb
Member

trvrb commented Sep 26, 2024

This is great @victorlin! Please go ahead and rebase and merge whenever you'd like. I did want to note for the future that the switch to country-level only for North America has resulted in US states with more samples getting over-represented relative to population size. Here's the trial build: https://nextstrain.org/staging/ncov/gisaid/trial/victorlin-all-builds-weighted/north-america/6m?f_region=North%20America

Screenshot 2024-09-26 at 4 32 31 PM

Here's the comparison to previous behavior:

Screenshot 2024-09-26 at 4 32 59 PM

Overall, I think it's a big improvement to have a country like Costa Rica, which has many admin divisions, get down-weighted to match its population size.

We could consider implementing division population weights at some point in the future, but I wouldn't worry too much about it immediately. As it stands, states with large populations like CA, NY and TX tend to submit more sequences anyway, so the resulting sampling may already be fairly representative.
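To make the down-weighting effect concrete, here is a minimal sketch of splitting a sequence budget in proportion to population, compared with giving every admin division an equal share. It is an illustration only, not the workflow's or Augur's actual implementation: the function names, the countries `X` and `Y`, and all numbers are hypothetical, and the largest-remainder rounding is an assumption made to keep the example deterministic.

```python
def allocate_by_weight(weights, max_sequences):
    """Split max_sequences across groups in proportion to their weights.

    Uses largest-remainder rounding so the allocations sum exactly to
    max_sequences. (Hypothetical helper, not part of any real tool.)
    """
    total = sum(weights.values())
    raw = {group: max_sequences * w / total for group, w in weights.items()}
    alloc = {group: int(r) for group, r in raw.items()}
    leftover = max_sequences - sum(alloc.values())
    # Hand out the remaining sequences to the largest fractional parts.
    by_remainder = sorted(raw, key=lambda g: raw[g] - alloc[g], reverse=True)
    for group in by_remainder[:leftover]:
        alloc[group] += 1
    return alloc


def allocate_uniform_per_division(divisions_by_country, max_sequences):
    """Give each division an equal share, then total the shares by country."""
    n_divisions = sum(divisions_by_country.values())
    per_division = max_sequences // n_divisions
    return {c: per_division * k for c, k in divisions_by_country.items()}


# Hypothetical inputs: X is small but has many divisions, Y is the reverse.
populations = {"X": 5, "Y": 40}  # made-up population weights
divisions = {"X": 7, "Y": 2}     # made-up division counts

print(allocate_uniform_per_division(divisions, 90))  # {'X': 70, 'Y': 20}
print(allocate_by_weight(populations, 90))           # {'X': 10, 'Y': 80}
```

With equal shares per division, the small country with many divisions takes most of the budget. With population weights its share drops to match its size. Division-level weights would apply the same idea one level down.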

Extend the weighted sampling approach from Asia builds to other regional
builds. This comes with the added benefit of reducing redundancy in
subsampling schemes.

Extend the weighted sampling approach from regional builds to global
builds. This comes with the added benefit of simplifying logic to avoid
region/country-specific max_sequences.
@victorlin victorlin force-pushed the victorlin/use-weighted-sampling-other-builds branch from d5c9d19 to 7d9f31c Compare September 30, 2024 18:06
@victorlin victorlin merged commit 8399290 into master Sep 30, 2024
7 checks passed
@victorlin victorlin deleted the victorlin/use-weighted-sampling-other-builds branch September 30, 2024 18:12
Development

Successfully merging this pull request may close these issues.

Use weighted sampling
2 participants