Improve ingest for Andersen lab/SRA sequences #114

Merged: 5 commits, Dec 30, 2024

@joverlee521 commented Dec 30, 2024

Description of proposed changes

Improve ingest for Andersen lab/SRA sequences to avoid unexpected strain name modifications in the phylogenetic workflow:

  1. Replace invalid characters in strain names
  2. Parse the country field as a GenBank location

Includes updates to geolocation rules, host, and annotations for data issues that I noticed during testing.

Related issue(s)

Resolves #113

Checklist

Replace invalid characters with `_` to match iqtree¹ so augur tree will
not modify strain names and cause a mismatch between the tree and the
alignment FASTA.²

Keeps the existing whitespace removal in place so that previously
valid strain names do not change.

¹ <https://github.com/iqtree/iqtree2/blob/74da454bbd98d6ecb8cb955975a50de59785fbde/utils/tools.cpp#L607>
² <#113>
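
The replacement described in this commit message can be sketched roughly as below. This is an illustration only, not the code from this PR: the function name `sanitize_strain_name` is hypothetical, and the allowed character set is an assumption that should be checked against the linked iqtree source before relying on it. Whitespace is removed first, as before, and any remaining disallowed character becomes `_`.

```python
import re

# ASSUMPTION: characters treated as valid in a strain name. This set is an
# illustrative guess and is not copied from iqtree; verify it against the
# linked tools.cpp before use.
ALLOWED_CHARS = r"A-Za-z0-9_\-.|/"
INVALID_CHAR = re.compile(rf"[^{ALLOWED_CHARS}]")


def sanitize_strain_name(name: str) -> str:
    """Hypothetical helper: drop whitespace, then replace invalid characters with '_'."""
    # Keep the pre-existing behavior: whitespace is removed, not replaced.
    without_whitespace = re.sub(r"\s+", "", name)
    # Every other character outside the allowed set becomes an underscore.
    return INVALID_CHAR.sub("_", without_whitespace)


# Made-up strain names, used only to show the two steps.
print(sanitize_strain_name("A/cat/USA:OR/1/2024"))
print(sanitize_strain_name("A/dairy cow/Texas/1/2024"))
```

Removing whitespace before the replacement keeps names that were already valid under the old rule byte-for-byte identical, which is the point of leaving that step in place.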

Use the augur.curate.parse_genbank_location function to parse the
country field since we've now seen an example of it formatted in
the GenBank geolocation format ("USA:OR") in the SRA data.¹

¹ <#113>
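
To illustrate what parsing the GenBank geolocation format means here, the sketch below splits a value such as "USA:OR" into its parts. It is a standalone approximation and does not call or reproduce augur's implementation; the function name `split_genbank_location` and the output field names are assumptions made for this example.

```python
def split_genbank_location(value: str) -> dict:
    """Hypothetical helper: split 'country:division, location' into separate fields.

    ASSUMPTION: the input follows the 'country[:division[, location]]' shape.
    Missing parts come back as empty strings.
    """
    # Everything before the first ':' is the country.
    country, _, remainder = value.partition(":")
    # The remainder is 'division' optionally followed by ', location'.
    division, _, location = remainder.partition(",")
    return {
        "country": country.strip(),
        "division": division.strip(),
        "location": location.strip(),
    }


print(split_genbank_location("USA:OR"))
print(split_genbank_location("USA"))
```

With the country reduced to "USA", a strain name built from that field no longer carries the ":OR" suffix, which is consistent with the three strain name changes reported in the testing comment.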
@joverlee521

Locally tested the NCBI ingest and diffed output with prod metadata.

There were 446 record changes in total. Most were expected geolocation and host updates, with only 3 strain name changes:

  • A/PETFOOD/USA:OR/24-037325-013/2024 -> A/PETFOOD/USA/24-037325-013/2024
  • A/PETFOOD/USA:OR/24-037325-012/2024 -> A/PETFOOD/USA/24-037325-012/2024
  • A/PETFOOD/USA:OR/24-037325-011/2024 -> A/PETFOOD/USA/24-037325-011/2024
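
A comparison like the one described above could be done along these lines. This is a minimal sketch, not the procedure actually used for this PR: the column names `accession`, `strain`, and `country`, the accession values, and the helper names are all assumptions, and the inline TSV stands in for the real local and prod metadata files.

```python
import csv
import io


def load_records(handle, id_column="accession"):
    """Index tab-separated metadata rows by their ID column (column name assumed)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column]: row for row in reader}


def diff_records(old, new):
    """Return {record_id: {field: (old_value, new_value)}} for shared records that differ."""
    changes = {}
    for record_id in sorted(old.keys() & new.keys()):
        changed_fields = {
            field: (old[record_id].get(field), value)
            for field, value in new[record_id].items()
            if old[record_id].get(field) != value
        }
        if changed_fields:
            changes[record_id] = changed_fields
    return changes


# Hypothetical stand-ins for the prod and locally generated metadata.
# The accessions and the second record are made up.
PROD_TSV = (
    "accession\tstrain\tcountry\n"
    "X1\tA/PETFOOD/USA:OR/24-037325-013/2024\tUSA:OR\n"
    "X2\tA/example/USA/1/2024\tUSA\n"
)
LOCAL_TSV = (
    "accession\tstrain\tcountry\n"
    "X1\tA/PETFOOD/USA/24-037325-013/2024\tUSA\n"
    "X2\tA/example/USA/1/2024\tUSA\n"
)

changes = diff_records(
    load_records(io.StringIO(PROD_TSV)),
    load_records(io.StringIO(LOCAL_TSV)),
)
# Pull out only the records whose strain name changed.
strain_changes = {rid: f["strain"] for rid, f in changes.items() if "strain" in f}
for record_id, (before, after) in strain_changes.items():
    print(f"{record_id}: {before} -> {after}")
```

Keying on a stable ID column rather than the strain name matters here, because the strain name is one of the fields expected to change.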

@joverlee521

I'm confident in these changes, so I'm going to merge and re-run the workflows. Post-merge reviews are always welcome!

@joverlee521 merged commit fe37067 into master on Dec 30, 2024
14 checks passed
@joverlee521 deleted the improve-ingest-andersen branch on December 30, 2024 22:05
@joverlee521 added a commit that referenced this pull request on Dec 31, 2024
@joverlee521 mentioned this pull request on Dec 31, 2024
Development

Successfully merging this pull request may close these issues.

Automated h5n1-cattle-outbreak phylo build failure