Merge pull request #50 from nextstrain/ingest-join-ncbi-andersen
ingest: Join NCBI GenBank and Andersen Lab data
joverlee521 authored Jun 7, 2024
2 parents a561b86 + e4a25ba commit 047a3a2
Showing 13 changed files with 184 additions and 17 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -9,6 +9,7 @@ benchmarks/
ingest/fauna/
ingest/andersen-lab/
ingest/ncbi/
ingest/joined-ncbi/

# Sensitive environment variables
environment*
2 changes: 1 addition & 1 deletion README.md
@@ -74,7 +74,7 @@ nextstrain build \
--env AWS_SECRET_ACCESS_KEY \
. \
--snakefile Snakefile.genome \
--config s3_src=s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi
--config s3_src=s3://nextstrain-data/files/workflows/avian-flu/h5n1
```

Currently this is only set up for the "h5n1-cattle-outbreak" build using NCBI data,
6 changes: 4 additions & 2 deletions Snakefile.genome
@@ -1,8 +1,10 @@
include: "rules/common.smk"

SUPPORTED_LOCAL_SOURCES = ["ncbi", "andersen-lab", "joined-ncbi"]

if LOCAL_INGEST:
    assert INGEST_SOURCE == "ncbi", \
        "Full genome build is only set up for local ingest from 'ncbi'."
    assert INGEST_SOURCE in SUPPORTED_LOCAL_SOURCES, \
        f"Full genome build is only set up for local ingest from {SUPPORTED_LOCAL_SOURCES}."
else:
    assert S3_SRC.startswith("s3://nextstrain-data/"), \
        "Full genome build is only set up for data from the public S3 bucket"
8 changes: 7 additions & 1 deletion config/auspice_config_h5n1-cattle-outbreak.json
@@ -100,6 +100,11 @@
      "key": "submitting_lab",
      "title": "Submitting Lab",
      "type": "categorical"
    },
    {
      "key": "data_source",
      "title": "Data Source",
      "type": "categorical"
    }
  ],
  "geo_resolutions": [
@@ -123,6 +128,7 @@
    "submitting_lab"
  ],
  "metadata_columns": [
    "genbank_accession"
    "genbank_accession",
    "sra_accessions"
  ]
}
3 changes: 3 additions & 0 deletions config/description_h5n1-cattle-outbreak.md
@@ -1,3 +1,6 @@
We gratefully acknowledge the authors, originating and submitting laboratories of the genetic sequences and metadata for sharing their work. Please note that although data generators have generously shared data in an open fashion, that does not mean there should be free license to publish on this data. Data generators should be cited where possible and collaborations should be sought in some circumstances. Please try to avoid scooping someone else's work. Reach out if uncertain.

Genomic data from the ongoing outbreak of H5N1 in cattle in the US was shared by the [National Veterinary Services Laboratories (NVSL)](https://www.aphis.usda.gov/labs/about-nvsl) of the [Animal and Plant Health Inspection Service (APHIS)](https://www.aphis.usda.gov/) of the U.S. Department of Agriculture (USDA) in an open fashion to NCBI GenBank.

NCBI GenBank data is supplemented with publicly available consensus sequences and metadata
from the Andersen lab's [avian-influenza repo](https://github.com/andersen-lab/avian-influenza).
12 changes: 11 additions & 1 deletion config/dropped_strains_h5n1-cattle-outbreak.txt
@@ -1,3 +1,14 @@
# Duplicated entries from Andersen Lab
A/Cattle/USA/24-008749-004-v/2024 # Duplicate of A/cattle/Texas/24-008749-004/2024
A/Cattle/USA/24-008749-006-original/2024 # Duplicate of A/cattle/Texas/24-008749-006-original/2024
A/Cattle/USA/24-008749-006-v/2024 # Duplicate of A/cattle/Texas/24-008749-006-original/2024
A/Cattle/USA/24-008766-001-original/2024 # Duplicate of A/cattle/Kansas/24-008766-001/2024
A/Cattle/USA/24-008766-001-v/2024 # Duplicate of A/cattle/Kansas/24-008766-001/2024
A/Cattle/USA/24-009027-002-original/2024 # Duplicate of A/cattle/Michigan/24-009027-002/2024
A/Cattle/USA/24-009027-002-v/2024 # Duplicate of A/cattle/Michigan/24-009027-002/2024
A/PEFA/USA/24-005915-001-original/2024 # Duplicate of A/Peregrinefalcon/California/24-005915-001/2024
A/Skunk/USA/24-006483-001-original/2024 # Duplicate of A/skunk/NewMexico/24-006483-001/2024

# Dropping these strains from the include list due to excess private mutations
A/cattle/NorthCarolina/24-010327-002/2024
A/cattle/Texas/24-009495-007/2024
@@ -86,4 +97,3 @@ A/turkey/SouthDakota/24-007379-002/2024
A/turkeyvulture/California/24-007419-001/2024
A/westerngull/California/24-004708-001/2024
A/woodduck/NorthCarolina/W24-026/2024

19 changes: 16 additions & 3 deletions ingest/README.md
@@ -35,6 +35,18 @@ nextstrain build . ingest_andersen_lab --configfile build-configs/ncbi/defaults/config.yaml

The results will be available in `andersen-lab/results/`.

### Ingest and join NCBI GenBank and Andersen lab data

To ingest and join NCBI GenBank and Andersen lab data, we deduplicate records using SRA accessions.
Andersen lab records whose SRA accessions are _not_ found in the NCBI GenBank data are appended
to the metadata TSV and sequence FASTA files by running:

```
nextstrain build . ingest_joined_ncbi --configfile build-configs/ncbi/defaults/config.yaml
```

This produces `metadata.tsv`, `sequences_ha.fasta`, etc. under `joined-ncbi/results/`.
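
Under the hood, the join keeps the NCBI GenBank data as the base and appends only the Andersen
lab records missing from it. Below is a minimal sketch of that logic using the same `tsv-utils`
and `seqkit` commands as the workflow rules (the actual rules live in
`ingest/build-configs/ncbi/rules/join_ncbi_andersen.smk`, shown later in this diff; file paths
here are illustrative, only the HA segment is shown, and `strain` stands in for the workflow's
configured `output_id_field`):

```
# Keep only Andersen lab records whose sra_accessions are absent from NCBI
tsv-join -H --exclude \
    --filter-file ncbi/results/metadata.tsv \
    --key-fields sra_accessions \
    andersen-lab/results/metadata.tsv > missing_metadata.tsv

# Pull the matching sequences, then append everything to the NCBI files
tsv-select -H -f strain missing_metadata.tsv > missing_ids.txt
seqkit grep -f missing_ids.txt andersen-lab/results/sequences_ha.fasta > missing_ha.fasta

tsv-append --source-header data_source \
    --file genbank=ncbi/results/metadata.tsv \
    --file andersen-lab=missing_metadata.tsv > metadata.tsv
cat ncbi/results/sequences_ha.fasta missing_ha.fasta > sequences_ha.fasta
```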

#### Upload to S3

To run both NCBI GenBank and Andersen Lab ingests _and_ upload results to S3,
@@ -49,9 +61,10 @@ nextstrain build \
--configfile build-configs/ncbi/defaults/config.yaml
```

The workflow compresses and uploads the local files to S3 to corresponding paths
under `s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi` and
`s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab`.
The workflow compresses and uploads the local files to S3 under the corresponding paths:
- joined-ncbi = `s3://nextstrain-data/files/workflows/avian-flu/h5n1/`
- ncbi = `s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi`
- andersen-lab = `s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab`

### Ingest and upload data from fauna to S3

2 changes: 1 addition & 1 deletion ingest/Snakefile
@@ -3,7 +3,7 @@ path_to_fauna = '../fauna'
# Use default configuration values. Override with Snakemake's --configfile/--config options.
configfile: "defaults/config.yaml"

SUPPORTED_DATA_SOURCES = ["fauna", "ncbi", "andersen-lab"]
SUPPORTED_DATA_SOURCES = ["fauna", "ncbi", "andersen-lab", "joined-ncbi"]

wildcard_constraints:
segment = "|".join(config["segments"]),
11 changes: 10 additions & 1 deletion ingest/build-configs/ncbi/Snakefile
@@ -8,7 +8,7 @@ configfile: "build-configs/ncbi/defaults/config.yaml"
# Sanity check that the requested segments match our ncbi_segments map
assert all(segment in config["ncbi_segments"].keys() for segment in config["segments"])

NCBI_DATA_SOURCES = ["ncbi", "andersen-lab"]
NCBI_DATA_SOURCES = ["ncbi", "andersen-lab", "joined-ncbi"]

rule ingest_ncbi:
    input:
@@ -26,6 +26,14 @@ rule ingest_andersen_lab:
        "andersen-lab/results/metadata.tsv",


rule ingest_joined_ncbi:
    input:
        expand([
            "joined-ncbi/results/sequences_{segment}.fasta",
        ], segment=config["segments"]),
        "joined-ncbi/results/metadata.tsv",


# Uploads all results for NCBI and Andersen Lab ingests
rule upload_all_ncbi:
    input:
@@ -39,3 +47,4 @@ rule upload_all_ncbi:
include: "rules/ingest_andersen_lab.smk"
include: "rules/fetch_from_ncbi.smk"
include: "rules/curate.smk"
include: "rules/join_ncbi_andersen.smk"
@@ -30,6 +30,8 @@ NEXTSTRAIN_RECORD = {
    'PMID': '?',
    'gisaid_clade': '?',
    'h5_clade': '?',
    'genbank_accession': '?',
    'sra_accessions': '?',
}


@@ -40,6 +42,7 @@ def create_new_record(anderson_record: dict) -> dict:
    """
    new_record = copy.deepcopy(NEXTSTRAIN_RECORD)
    new_record['isolate_id'] = anderson_record['Run']
    new_record['sra_accessions'] = anderson_record['Run']
    new_record['division'] = anderson_record['US State']
    new_record['location'] = anderson_record['US State']

8 changes: 8 additions & 0 deletions ingest/build-configs/ncbi/defaults/config.yaml
@@ -126,9 +126,17 @@ curate:
    - gisaid_clade
    - h5_clade
    - genbank_accession
    - sra_accessions

s3_dst:
  ncbi: s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi
  andersen-lab: s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab
  joined-ncbi: s3://nextstrain-data/files/workflows/avian-flu/h5n1

cloudfront_domain: data.nextstrain.org

join_ncbi_andersen:
  match_field: sra_accessions
  source_column_name: data_source
  ncbi_source: genbank
  andersen_source: andersen-lab
39 changes: 32 additions & 7 deletions ingest/build-configs/ncbi/rules/ingest_andersen_lab.smk
@@ -83,7 +83,7 @@ rule curate_metadata:
        """
        augur curate normalize-strings \
            --metadata {input.metadata} \
            | ./build-configs/ncbi/bin/curate_andersen_lab_data \
            | ./build-configs/ncbi/bin/curate-andersen-lab-data \
            | ./vendored/apply-geolocation-rules \
                --geolocation-rules {input.geolocation_rules} \
            | augur curate passthru \
@@ -99,22 +99,47 @@ rule match_metadata_and_segment_fasta:
        metadata = "andersen-lab/data/metadata.tsv",
        fasta = "andersen-lab/data/{segment}.fasta"
    output:
        metadata = "andersen-lab/data/metadata_{segment}.tsv",
        fasta = "andersen-lab/results/sequences_{segment}.fasta"
        metadata = "andersen-lab/data/matched_metadata_{segment}.tsv",
        fasta = "andersen-lab/results/sequences_{segment}.fasta",
    params:
        input_id_field="isolate_id",
        sequence_field="sequence",
        output_id_field=config["curate"]["output_id_field"],
    log:
        "andersen-lab/logs/match_segment_metadata_and_fasta/{segment}.txt",
    shell:
        """
        augur curate passthru \
            --metadata {input.metadata} \
            --fasta {input.fasta} \
            --seq-id-column isolate_id \
            --seq-field sequence \
            --seq-id-column {params.input_id_field} \
            --seq-field {params.sequence_field} \
            --unmatched-reporting warn \
            --duplicate-reporting warn \
            --output-metadata {output.metadata} \
            --output-fasta {output.fasta} \
            --output-id-field strain \
            --output-seq-field sequence \
            --output-id-field {params.output_id_field} \
            --output-seq-field {params.sequence_field} \
            2> {log}
        """


rule reorder_metadata_columns:
    """
    Use tsv-select to reorder the columns of the Andersen lab metadata to
    exactly match the NCBI metadata columns. This ensures that we don't
    accidentally append the wrong columns in the joining steps.
    tsv-select will exit with an error if a column does not exist.
    """
    input:
        metadata = "andersen-lab/data/matched_metadata_{segment}.tsv",
    output:
        reordered_metadata = "andersen-lab/data/metadata_{segment}.tsv"
    params:
        metadata_fields=",".join(config["curate"]["metadata_columns"]),
    shell:
        """
        tsv-select -H -f {params.metadata_fields} \
            {input.metadata} > {output.reordered_metadata}
        """
87 changes: 87 additions & 0 deletions ingest/build-configs/ncbi/rules/join_ncbi_andersen.smk
@@ -0,0 +1,87 @@
"""
This part of the workflow handles the joining and deduplication of NCBI and
Andersen lab data.
"""


rule select_missing_metadata:
    """
    Uses the tsv-join --exclude flag to drop matching records, leaving
    only the Andersen lab metadata that is missing from NCBI GenBank.
    """
    input:
        ncbi_metadata = "ncbi/results/metadata.tsv",
        andersen_metadata = "andersen-lab/results/metadata.tsv",
    output:
        missing_metadata = "joined-ncbi/data/missing_metadata.tsv",
    params:
        match_field = config["join_ncbi_andersen"]["match_field"],
    shell:
        """
        tsv-join -H \
            --exclude \
            --filter-file {input.ncbi_metadata} \
            --key-fields {params.match_field} \
            {input.andersen_metadata} > {output.missing_metadata}
        """


rule select_missing_strain_names:
    input:
        missing_metadata = "joined-ncbi/data/missing_metadata.tsv",
    output:
        missing_sequence_ids = "joined-ncbi/data/missing_sequence_ids.txt",
    params:
        sequence_id_column = config["curate"]["output_id_field"],
    shell:
        """
        tsv-select -H -f {params.sequence_id_column} \
            {input.missing_metadata} \
            > {output.missing_sequence_ids}
        """


rule select_missing_sequences:
    input:
        missing_sequence_ids = "joined-ncbi/data/missing_sequence_ids.txt",
        andersen_sequences = "andersen-lab/results/sequences_{segment}.fasta",
    output:
        missing_sequences = "joined-ncbi/data/missing_sequences_{segment}.fasta",
    shell:
        """
        seqkit grep -f {input.missing_sequence_ids} \
            {input.andersen_sequences} \
            > {output.missing_sequences}
        """


rule append_missing_metadata_to_ncbi:
    input:
        ncbi_metadata = "ncbi/results/metadata.tsv",
        missing_metadata = "joined-ncbi/data/missing_metadata.tsv",
    output:
        joined_metadata = "joined-ncbi/results/metadata.tsv",
    params:
        source_column_name = config["join_ncbi_andersen"]["source_column_name"],
        ncbi_source = config["join_ncbi_andersen"]["ncbi_source"],
        andersen_source = config["join_ncbi_andersen"]["andersen_source"],
    shell:
        """
        tsv-append \
            --source-header {params.source_column_name} \
            --file {params.ncbi_source}={input.ncbi_metadata} \
            --file {params.andersen_source}={input.missing_metadata} \
            > {output.joined_metadata}
        """


rule append_missing_sequences_to_ncbi:
    input:
        ncbi_sequences = "ncbi/results/sequences_{segment}.fasta",
        missing_sequences = "joined-ncbi/data/missing_sequences_{segment}.fasta",
    output:
        joined_sequences = "joined-ncbi/results/sequences_{segment}.fasta",
    shell:
        """
        cat {input.ncbi_sequences} {input.missing_sequences} > {output.joined_sequences}
        """
