Merge pull request #50 from nextstrain/ingest-join-ncbi-andersen
ingest: Join NCBI GenBank and Andersen Lab data
Showing 13 changed files with 184 additions and 17 deletions.
@@ -1,3 +1,6 @@
We gratefully acknowledge the authors, originating and submitting laboratories of the genetic sequences and metadata for sharing their work. Please note that although data generators have generously shared data in an open fashion, that does not mean there should be free license to publish on this data. Data generators should be cited where possible and collaborations should be sought in some circumstances. Please try to avoid scooping someone else's work. Reach out if uncertain.

Genomic data from the ongoing outbreak of H5N1 in cattle in the US was shared by the [National Veterinary Services Laboratories (NVSL)](https://www.aphis.usda.gov/labs/about-nvsl) of the [Animal and Plant Health Inspection Service (APHIS)](https://www.aphis.usda.gov/) of the U.S. Department of Agriculture (USDA) in an open fashion to NCBI GenBank.

NCBI GenBank data is supplemented with publicly available consensus sequences and metadata
from Andersen lab's [avian-influenza repo](https://github.com/andersen-lab/avian-influenza).
@@ -0,0 +1,87 @@
"""
This part of the workflow handles the joining and deduplication of NCBI and
Andersen lab data.
"""


rule select_missing_metadata:
    """
    Uses tsv-join --exclude flag to exclude matching records so we are
    left with any missing metadata from the Andersen lab.
    """
    input:
        ncbi_metadata = "ncbi/results/metadata.tsv",
        andersen_metadata = "andersen-lab/results/metadata.tsv",
    output:
        missing_metadata = "joined-ncbi/data/missing_metadata.tsv",
    params:
        match_field = config["join_ncbi_andersen"]["match_field"],
    shell:
        """
        tsv-join -H \
            --exclude \
            --filter-file {input.ncbi_metadata} \
            --key-fields {params.match_field} \
            {input.andersen_metadata} > {output.missing_metadata}
        """


rule select_missing_strain_names:
    input:
        missing_metadata = "joined-ncbi/data/missing_metadata.tsv",
    output:
        missing_sequence_ids = "joined-ncbi/data/missing_sequence_ids.txt",
    params:
        sequence_id_column = config["curate"]["output_id_field"],
    shell:
        """
        tsv-select -H -f {params.sequence_id_column} \
            {input.missing_metadata} \
            > {output.missing_sequence_ids}
        """


rule select_missing_sequences:
    input:
        missing_sequence_ids = "joined-ncbi/data/missing_sequence_ids.txt",
        andersen_sequences = "andersen-lab/results/sequences_{segment}.fasta",
    output:
        missing_sequences = "joined-ncbi/data/missing_sequences_{segment}.fasta",
    shell:
        """
        seqkit grep -f {input.missing_sequence_ids} \
            {input.andersen_sequences} \
            > {output.missing_sequences}
        """


rule append_missing_metadata_to_ncbi:
    input:
        ncbi_metadata = "ncbi/results/metadata.tsv",
        missing_metadata = "joined-ncbi/data/missing_metadata.tsv",
    output:
        joined_metadata = "joined-ncbi/results/metadata.tsv",
    params:
        source_column_name = config["join_ncbi_andersen"]["source_column_name"],
        ncbi_source = config["join_ncbi_andersen"]["ncbi_source"],
        andersen_source = config["join_ncbi_andersen"]["andersen_source"],
    shell:
        """
        tsv-append \
            --source-header {params.source_column_name} \
            --file {params.ncbi_source}={input.ncbi_metadata} \
            --file {params.andersen_source}={input.missing_metadata} \
            > {output.joined_metadata}
        """


rule append_missing_sequences_to_ncbi:
    input:
        ncbi_sequences = "ncbi/results/sequences_{segment}.fasta",
        missing_sequences = "joined-ncbi/data/missing_sequences_{segment}.fasta",
    output:
        joined_sequences = "joined-ncbi/results/sequences_{segment}.fasta",
    shell:
        """
        cat {input.ncbi_sequences} {input.missing_sequences} > {output.joined_sequences}
        """