Add LASV GPC Nextclade dataset #47
Thanks for pointing out the 60:X in the scaffold tree. I manually fixed the alignment in the following dataset:
nextclade/data/metadata.tsv
@@ -1,7 +1,6 @@
accession accession_version strain date region country division location length host is_lab_host date_released date_updated sra_accessions authors abbr_authors institution clade_membership
KM822128 KM822128.1 Pinneo-NIG-1969 1969-XX-XX Africa Nigeria 3428 2014-10-14 2014-10-14 "Andersen,K.G.,Shapiro,B.J.,Matranga,C.B.,Gire,S.K.,Sealfon,R.,England,E.M.,Winnicki,S.,Moses,L.M.,Stremlau,M.,Folarin,O.,Odia,I.,Ehiane,P.,Goba,A.,Momoh,M.,Gnirke,A.,Birren,B.,Hensley,L.,Levin,J.Z.,Happi,C.T.,Garry,R.F.,Sabeti,P.C." Andersen et al. Broad Institute LI
KM821976 KM821976.1 ISTH2066-NIG-2012 2012-XX-XX Africa Nigeria 3406 Homo sapiens 2014-10-14 2014-10-14 "Andersen,K.G.,Shapiro,B.J.,Matranga,C.B.,Gire,S.K.,Sealfon,R.,England,E.M.,Winnicki,S.,Moses,L.M.,Stremlau,M.,Folarin,O.,Odia,I.,Ehiane,P.,Goba,A.,Momoh,M.,Gnirke,A.,Birren,B.,Hensley,L.,Levin,J.Z.,Happi,C.T.,Garry,R.F.,Sabeti,P.C." Andersen et al. Broad Institute LII
KM821977 KM821977.1 ISTH2069-NIG-2012 2012-XX-XX Africa Nigeria 3397 Homo sapiens 2014-10-14 2014-10-14 "Andersen,K.G.,Shapiro,B.J.,Matranga,C.B.,Gire,S.K.,Sealfon,R.,England,E.M.,Winnicki,S.,Moses,L.M.,Stremlau,M.,Folarin,O.,Odia,I.,Ehiane,P.,Goba,A.,Momoh,M.,Gnirke,A.,Birren,B.,Hensley,L.,Levin,J.Z.,Happi,C.T.,Garry,R.F.,Sabeti,P.C." Andersen et al. Broad Institute LII
Were the sequences removed causing misalignment?
In this case, KM821977 aligned fine but was dropped because it had a run of Ns in the nucleotide sequence. I mostly wanted to avoid a Nextclade scaffold tree with missing nucleotide information, as I wasn't sure how that would influence query sequence placement.
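The kind of filter described here (dropping any record that contains a run of ambiguous bases) could be sketched roughly as below. This is only an illustration, not the script used for this dataset: the function names, the toy records, and the `min_run=10` threshold are all assumptions.

```python
import re


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def has_n_run(sequence, min_run=10):
    """True if the sequence contains at least min_run consecutive N/n characters.

    min_run is a hypothetical threshold chosen for illustration only.
    """
    return re.search(r"[Nn]{%d,}" % min_run, sequence) is not None


def drop_n_runs(text, min_run=10):
    """Keep only the records that have no long run of Ns."""
    return [(h, s) for h, s in parse_fasta(text) if not has_n_run(s, min_run)]


# Toy records with made-up names and sequences, not real data.
toy_fasta = ">seq_ok\nACGTACGT\n>seq_with_ns\nACG" + "N" * 12 + "GT\n"
kept = drop_n_runs(toy_fasta)
```

Lowering `min_run` makes the filter stricter; the right value depends on how much missing data the scaffold tree can tolerate.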
refine:
  coalescent: "opt"
  date_inference: "marginal"
  clock_rate: 0.0006
Did we pre-calculate the clock rate?
Yes, but I'm struggling to find the history of why 0.0006 was chosen. I'll get back to you by the end of next week.
Thanks for raising the concern that the Nextclade alignment appeared to introduce gaps into sequences where a codon existed but was being removed, and for capturing the screenshot of the alignment! I'm documenting my corroborating exploration of the issue below and will post some suggested paths forward.
To investigate this issue, I took the following steps:
nextstrain build phylogenetic data/gpc/sequences.fasta
smof grep OM735982 phylogenetic/data/gpc/sequences.fasta > OM.fasta # Changed header to 'OM735982_fasta'
nextclade3 run \
--input-dataset nextclade/dataset.zip \
--output-all test_out \
OM.fasta
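Rather than paging through the aligned output by eye, the internal gaps in an aligned record can be listed programmatically. The sketch below is an assumption-laden illustration (the helper names and the toy sequence are hypothetical); it reports each internal gap run and whether its length is a multiple of three, which is a quick way to spot a whole codon being removed.

```python
import re


def gap_runs(aligned_sequence):
    """Return (start, length) for every run of '-' (start is 0-based)."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"-+", aligned_sequence)]


def internal_gap_runs(aligned_sequence):
    """Gap runs that are not leading or trailing alignment padding."""
    first = len(aligned_sequence) - len(aligned_sequence.lstrip("-"))
    last = len(aligned_sequence.rstrip("-"))
    return [(start, length) for start, length in gap_runs(aligned_sequence)
            if start >= first and start + length <= last]


def describe_gaps(aligned_sequence):
    """Return (start, length, is_codon_multiple) for each internal gap run."""
    return [(start, length, length % 3 == 0)
            for start, length in internal_gap_runs(aligned_sequence)]


# Toy aligned sequence (made up): padding at both ends, one 3-nt internal gap.
toy_aligned = "--ATGGCA---TTTGA-"
report = describe_gaps(toy_aligned)
```

To use it on real output, read the record for the query sequence from `test_out/nextclade.aligned.fasta` and pass its sequence string to `describe_gaps`.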
Potential Solutions
[later edit]
How to explore Nextclade alignment parameters
For Option 2, I wrote up the default command with explicit Nextclade parameters (based on
# What I'm inferring are the default values
nextclade3 run \
--input-dataset nextclade/dataset.zip \
--output-all test_out \
OM735982.fasta \
--min-length 100 \
--penalty-gap-extend 0 \
--penalty-gap-open 6 \
--penalty-gap-open-in-frame 7 \
--penalty-gap-open-out-of-frame 8 \
--penalty-mismatch 1 \
--score-match 3 \
--retry-reverse-complement true \
--no-translate-past-stop false \
--gap-alignment-side left \
--excess-bandwidth 9 \
--terminal-bandwidth 50 \
--allowed-mismatches 8 \
--min-match-length 40 \
--max-alignment-attempts 3 \
--include-reference true \
--include-nearest-node-info true
# Still contains the gap
less test_out/nextclade.aligned.fasta
[later edit] I was trying to find out why the default values are not listed in the Looks like it's due to a dependency on clap to generate the CLI help message.
Copy the nextclade workflow directory structure from the pathogen-repo-guide. Subsequent commits will be used to modify the workflow to work with the lassa data.
Use the reference lineage viruses to create input metadata and sequences files. The sequences are restricted to GPC region and from the manually fixed alignment.
1. Build an all GPC sequences phylogenetic tree with lineage annotations pulled from: https://github.com/JoiRichi/LASV_ML_manuscript_data/tree/2118de0d28283b04d07c5c8dbb7aa381ffda2e8d/lineage_annotation
2. Midpoint and color tree by lineages in FigTree
3. Copy tree strain names into text file by lineage
4. Pull every Nth strain to roughly sample the genomic diversity across the tree
5. Aim for ~20 samples per lineage if available
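Steps 4 and 5 above could be approximated in code as follows. This is a minimal sketch assuming each lineage's strain names are already listed in tree order; the function names, the toy lineage labels, and the way N is derived from the target of 20 are illustrative assumptions, not the procedure actually used.

```python
import math


def every_nth(strains, target=20):
    """Pick roughly `target` strains by taking every Nth name in tree order.

    Lineages with `target` or fewer strains are returned unchanged.
    """
    strains = list(strains)
    if len(strains) <= target:
        return strains
    step = math.ceil(len(strains) / target)  # the "N" in "every Nth strain"
    return strains[::step]


def sample_by_lineage(strains_by_lineage, target=20):
    """Apply every_nth to each lineage's ordered list of strain names."""
    return {lineage: every_nth(names, target)
            for lineage, names in strains_by_lineage.items()}


# Toy input with made-up lineage labels and strain names.
toy = {
    "lineage_a": ["a%03d" % i for i in range(100)],
    "lineage_b": ["b%03d" % i for i in range(7)],
}
sampled = sample_by_lineage(toy)
```

Because the step is rounded up, a lineage can end up with somewhat fewer than 20 strains (45 names give 15), which is consistent with the "roughly" in step 4.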
Necessary for the Nextclade extension to run without errors.
Pivot away from option 2 (adjusting Nextclade params) to option 1 (adding
Description of Proposed Changes
This PR introduces a draft GPC Nextclade dataset for the Lassa virus, enabling rapid lineage assignment and real-time mutation tracking. The dataset aims to support ongoing efforts to manage Lassa fever outbreaks by providing an efficient tool for identifying lineages, a critical step in understanding and mitigating the spread of this disease.
The description below is copied from parts of #48 (comment).
Building on the success of the Nextclade dataset for SARS-CoV-2, this PR incorporates lineage assignment from Daodu et al., 2024 (or more specifically these files) into a dedicated LASV GPC Nextclade dataset. This resource aims to fill gaps in LASV outbreak response tools, facilitating mutation tracking, lineage assignment, and broader outbreak management.
Draft Resources
Scaffold tree: View scaffold tree
This tree includes representative sequences from each lineage placed in a phylogenetic structure. New query sequences can be added and assigned to lineages based on their placement.
Nextclade dataset: Test dataset
This link allows users to explore and test the dataset features.
Related Issue(s)
Checklist