WIP: Ingest from NCBI Datasets + Entrez #39

Closed
wants to merge 7 commits into from

joverlee521
Contributor

Description of proposed changes

Documenting my WIP for ingesting data from NCBI Datasets + Entrez in case we ever need it again. I was going down this route when NCBI Virus downloads were broken for H5N1, so I purposely avoided touching the NCBI Virus vvsearch2 API in this work.

The NCBI Virus vvsearch2 API seems more straightforward, so I will be pivoting to focus on that route in a separate PR.

[Figures: workflow DAG for Datasets + Entrez; workflow DAG for NCBI Virus + Datasets]

Usage

The workflow can be run with

nextstrain build ingest ingest_ncbi --configfile build-configs/ncbi/defaults/config.yaml

The workflow runs in ~14 minutes, about 10 of which are spent fetching data with Entrez.
It produces a single metadata.tsv and a single sequences.fasta that include many serotypes and all segments.

Outstanding tasks if I want to pick up this workflow again

  • Filter down to serotype = H5N1 records
  • Filter down to collection dates after 2023-12-31
  • Transform numeric segment values to segment names
  • Format final output to be 1 metadata.tsv + 8 segment sequence.fasta files
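
The first three tasks above could be sketched roughly as follows. This is only an illustration, not code from this PR: the column names (`serotype`, `date`, `segment`), the lowercase segment names, and the helper names are all assumptions, and records with incomplete collection dates are simply dropped.

```python
from datetime import date

# Assumed mapping from influenza A segment numbers to segment names.
SEGMENT_NAMES = {
    "1": "pb2", "2": "pb1", "3": "pa", "4": "ha",
    "5": "np", "6": "na", "7": "mp", "8": "ns",
}

# Keep only records collected strictly after this date.
CUTOFF = date(2023, 12, 31)


def parse_date(value):
    """Return a date for a complete YYYY-MM-DD value, otherwise None."""
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None


def keep_record(record):
    """True for H5N1 records with a complete collection date after CUTOFF."""
    collected = parse_date(record.get("date") or "")
    return (
        record.get("serotype") == "H5N1"
        and collected is not None
        and collected > CUTOFF
    )


def curate(records):
    """Yield the kept records with numeric segments replaced by names."""
    for record in records:
        if keep_record(record):
            segment = record.get("segment", "")
            # Unknown segment values are passed through unchanged.
            yield {**record, "segment": SEGMENT_NAMES.get(segment, segment)}
```

Grouping the curated records by their `segment` value would then be one way to approach writing the 8 per-segment FASTA files.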

Related issue(s)

Related to #37

Prepping to start the ingest workflow for NCBI data.
Copying over the ingest workflow from the pathogen-repo-guide¹ as a
"custom" build for the ingest workflow because the default ingest
for avian-flu uses fauna.

I will make changes to adapt the workflow to this repo in subsequent
commits.

¹ <https://github.com/nextstrain/pathogen-repo-guide/tree/f33c43edd9ebad10aa0e8d2b0791755ddbe2f5c8/ingest>
1. Remove `ingest/build-configs/ncbi/vendored` -- We will be using the
shared `ingest/vendored` scripts that already exist in this repo.
No need for vendored inception!

2. Remove `ingest/build-configs/ncbi/build-configs` -- the NCBI workflow
itself is already a "custom" build. No need for build-configs inception!

3. Remove files and rules related to Nextclade. We can add them back
later if we want to integrate Nextclade outputs with the metadata.

4. Remove the extra geolocation_rules.tsv and use the shared
ingest/defaults/geolocation_rules.tsv instead, since these fixes should
be data-source agnostic.

5. Remove boilerplate rules related to fetching from NCBI Entrez.

6. Remove boilerplate text from pathogen-repo-guide.
Use the `custom_rules` to include the NCBI ingest Snakefile with
the main workflow. Include file path updates to namespace the
NCBI ingest outputs under `ncbi/`.
Using NCBI taxon id `11320` for Influenza A viruses to include all
GenBank records that are then further labeled with serotypes (e.g. H5N1).

Filtering down with the `--released-after` and `--geo-location`
options for NCBI datasets to get a smaller subset of data that is
centered on the H5N1 outbreak in the U.S. Based on the filter URL for
NCBI Virus that was shared by @trvrb:

<https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&EnvSample_s=include&VirusLineage_ss=Alphainfluenzavirus,%20taxid:197911&Serotype_s=H5N1%20H5N*&CollectionDate_dr=2023-12-31T00:00:00.00Z%20TO%202024-05-08T23:59:59.00Z&Region_s=North%20America>
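
For illustration only, the download described above might be assembled like this. The helper names are hypothetical, and the date, location, and output filename in the commented example are placeholders borrowed from the NCBI Virus filter URL, not necessarily the values used in this workflow. The date format accepted by `--released-after` may depend on the installed `datasets` CLI version.

```python
import shlex


def build_datasets_command(taxon_id, released_after, geo_location, filename):
    """Build the argument list for an NCBI Datasets virus genome download."""
    return [
        "datasets", "download", "virus", "genome", "taxon", str(taxon_id),
        "--released-after", released_after,
        "--geo-location", geo_location,
        "--filename", filename,
    ]


def format_command(args):
    """Render the argument list as a shell-quoted string for logging."""
    return shlex.join(args)


# Example with placeholder values (assumptions, see the note above):
# build_datasets_command(11320, "2023-12-31", "North America", "ncbi_dataset.zip")
```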
Adds script to fetch GenBank records with Entrez for the list of
accessions that we've downloaded from NCBI Datasets. This pulls out
extra metadata that are currently not included in the NCBI Datasets,
i.e. strain, serotype, and segment. The extra metadata is then joined
with the usual NCBI Datasets metadata using `tsv-join`.
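
As a rough picture of what the join does, here is a minimal stdlib sketch. It is not the `tsv-join` invocation used in the workflow. The key column name `accession`, the helper names, and the choice to keep Datasets rows that have no Entrez match (filled with empty values) are assumptions.

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_metadata(datasets_rows, entrez_rows, key="accession",
                  extra_fields=("strain", "serotype", "segment")):
    """Append the Entrez-only fields to each Datasets row, matched on key."""
    entrez_by_key = {row[key]: row for row in entrez_rows}
    for row in datasets_rows:
        extra = entrez_by_key.get(row[key], {})
        joined = dict(row)
        for field in extra_fields:
            # Rows without an Entrez match get empty values.
            joined[field] = extra.get(field, "")
        yield joined
```

Accessions need to be in the same form on both sides (for example, both with or both without the version suffix) for the lookup to match.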

Subsequent commits will make changes to the curation pipeline to account
for the extra metadata.
Include the additional columns that are joined from Entrez.
Give precedence to the `strain` field from Entrez and fill in with
backup values from the Datasets `isolate` field.
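
The precedence rule amounts to a simple coalesce. A minimal sketch, assuming records are plain dicts and that blank or whitespace-only values count as missing (`pick_strain` is a hypothetical helper, not a function from this repo):

```python
def pick_strain(record, primary="strain", fallback="isolate"):
    """Return the first non-empty value among the primary and fallback fields."""
    for field in (primary, fallback):
        value = (record.get(field) or "").strip()
        if value:
            return value
    # Neither field has a usable value.
    return ""
```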
@joverlee521 joverlee521 deleted the ingest-ncbi-entrez branch May 28, 2024 17:31