Add LASV GPC Nextclade dataset #47

Draft
wants to merge 15 commits into
base: main
j23414
Contributor

@j23414 j23414 commented Nov 29, 2024

Description of Proposed Changes

This PR introduces a draft GPC Nextclade dataset for the Lassa virus, enabling rapid lineage assignment and real-time mutation tracking. The dataset aims to support ongoing efforts to manage Lassa fever outbreaks by providing an efficient tool for identifying lineages, a critical step in understanding and mitigating the spread of this disease.

The description below is copied from parts of #48 (comment).

The Lassa virus (LASV), the causative agent of Lassa fever (LF), is currently categorized into seven distinct lineages circulating in specific geographic regions (Garry, 2023). Lineages 1, 2, and 3 are primarily found in Nigeria, while lineages 4 and 5 are prevalent in Sierra Leone and Mali (Garry, 2023). These lineages not only circulate in different regions but also exhibit significant variations in immune response (Buck et al., 2022) and disease outcomes (Anderson et al., 2015). For instance, Anderson et al. demonstrated that the Sierra Leonean strain tends to be more fatal than the Nigerian strains.

In an effort to address this gap, we developed a tool for fast lineage assignment (Daodu et al., 2024). However, the tool is still limited in its capabilities, including the lack of a user-friendly interface. The success of the Nextclade dataset and rapid lineage assignment in managing the SARS-CoV-2 pandemic highlights the potential value of such resources for LASV control.

Building on the success of the Nextclade dataset for SARS-CoV-2, this PR incorporates lineage assignment from Daodu et al., 2024 (or more specifically these files) into a dedicated LASV GPC Nextclade dataset. This resource aims to fill gaps in LASV outbreak response tools, facilitating mutation tracking, lineage assignment, and broader outbreak management.

Draft Resources

  • Scaffold tree: View scaffold tree
    This tree includes representative sequences from each lineage placed in a phylogenetic structure. New query sequences can be added and assigned to lineages based on their placement.

  • Nextclade dataset: Test dataset
    This link allows users to explore and test the dataset features.

Related Issue(s)

Checklist

  • All checks pass

@j23414 j23414 changed the title from "WIP: Add nextclade" to "Add LASV GPC Nextclade dataset" on Jan 9, 2025
@j23414
Contributor Author

j23414 commented Jan 14, 2025

Thanks for pointing out the 60:X in the scaffold tree. I manually fixed the alignment in the following dataset:

@@ -1,7 +1,6 @@
accession accession_version strain date region country division location length host is_lab_host date_released date_updated sra_accessions authors abbr_authors institution clade_membership
KM822128 KM822128.1 Pinneo-NIG-1969 1969-XX-XX Africa Nigeria 3428 2014-10-14 2014-10-14 "Andersen,K.G.,Shapiro,B.J.,Matranga,C.B.,Gire,S.K.,Sealfon,R.,England,E.M.,Winnicki,S.,Moses,L.M.,Stremlau,M.,Folarin,O.,Odia,I.,Ehiane,P.,Goba,A.,Momoh,M.,Gnirke,A.,Birren,B.,Hensley,L.,Levin,J.Z.,Happi,C.T.,Garry,R.F.,Sabeti,P.C." Andersen et al. Broad Institute LI
KM821976 KM821976.1 ISTH2066-NIG-2012 2012-XX-XX Africa Nigeria 3406 Homo sapiens 2014-10-14 2014-10-14 "Andersen,K.G.,Shapiro,B.J.,Matranga,C.B.,Gire,S.K.,Sealfon,R.,England,E.M.,Winnicki,S.,Moses,L.M.,Stremlau,M.,Folarin,O.,Odia,I.,Ehiane,P.,Goba,A.,Momoh,M.,Gnirke,A.,Birren,B.,Hensley,L.,Levin,J.Z.,Happi,C.T.,Garry,R.F.,Sabeti,P.C." Andersen et al. Broad Institute LII
KM821977 KM821977.1 ISTH2069-NIG-2012 2012-XX-XX Africa Nigeria 3397 Homo sapiens 2014-10-14 2014-10-14 "Andersen,K.G.,Shapiro,B.J.,Matranga,C.B.,Gire,S.K.,Sealfon,R.,England,E.M.,Winnicki,S.,Moses,L.M.,Stremlau,M.,Folarin,O.,Odia,I.,Ehiane,P.,Goba,A.,Momoh,M.,Gnirke,A.,Birren,B.,Hensley,L.,Levin,J.Z.,Happi,C.T.,Garry,R.F.,Sabeti,P.C." Andersen et al. Broad Institute LII
Collaborator

Were the removed sequences causing misalignment?

Contributor Author

In this case, KM821977 aligned fine but was dropped because it had a run of N's in the nucleotide sequence. I mostly wanted to avoid having a Nextclade scaffold tree with missing nucleotide information, as I wasn't sure how that would influence query sequence placement.
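
As a rough illustration of that kind of filter, here is a minimal Python sketch that drops any sequence containing a long run of N's. The threshold (`max_n_run`), the function names, and the toy records are all assumptions made for this example; the comment above does not say how long the run in KM821977 was or how the sequence was actually removed.

```python
import re


def parse_fasta(text):
    """Parse FASTA text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record name.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def longest_n_run(sequence):
    """Length of the longest consecutive run of N (or n) in a sequence."""
    runs = re.findall(r"[Nn]+", sequence)
    return max((len(run) for run in runs), default=0)


def drop_ambiguous(records, max_n_run=3):
    """Keep only records whose longest N run is at most max_n_run.

    max_n_run is a hypothetical threshold chosen for this sketch.
    """
    return {
        name: seq
        for name, seq in records.items()
        if longest_n_run(seq) <= max_n_run
    }


# Toy records (not real LASV sequences).
toy = parse_fasta(">toy_clean\nATGGGACAA\n>toy_masked\nATGNNNNNNCAA\n")
kept = drop_ambiguous(toy, max_n_run=3)
print(sorted(kept))
```

A filter like this only flags candidates; whether a masked sequence should stay in the scaffold tree is still a judgment call.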

refine:
  coalescent: "opt"
  date_inference: "marginal"
  clock_rate: 0.0006
Collaborator

Did we pre-calculate the clock rate?

Contributor Author

Yes, but I'm struggling to find the history of why 0.0006. I'll get back to you by end of next week.

@JoiRichi
Collaborator

Thanks.

I tested the staged version of the LASV Nextclade with some data.

However, I am worried that the pipeline may be deleting some nucleotides in sequences, and I don't have an explanation for it yet.

I have attached 3 images that show the amino acid translation for OM735982. Visualizing the alignment generated by Nextclade reveals that position 209 is missing. However, comparing this alignment with an alignment curated earlier shows that a codon was removed by the pipeline: there should be an 'N' at position 208 before the 'G' at position 209. This 'N' is present in the official translation on NCBI, as shown in the third image. This may mean that the codon was deleted without an explanation (yet).

We may need to review how alignments are handled generally.

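[Attached images: amino acid translation of OM735982 in the Nextclade alignment, in the earlier curated alignment, and in the NCBI translation]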

@j23414
Contributor Author

j23414 commented Jan 28, 2025

Thanks for raising the concern that the Nextclade alignment appears to introduce a gap where a codon exists in the original sequence, and for capturing the screenshots of the alignment! I'm documenting my corroborating exploration of the issue below and will then post some suggested paths forward.

@j23414
Contributor Author

j23414 commented Jan 28, 2025

To investigate this issue, I took the following steps:

  1. Downloaded the sequence OM735982 from NCBI and imported it into Geneious.
  2. Retrieved the same sequence from our Nextstrain-ingested data/gpc/sequences.fasta file:
nextstrain build phylogenetic data/gpc/sequences.fasta
smof grep OM735982 phylogenetic/data/gpc/sequences.fasta > OM.fasta # Changed header to 'OM735982_fasta'
  3. Aligned the sequence against the dataset using Nextclade:
nextclade3 run \
  --input-dataset nextclade/dataset.zip \
  --output-all test_out \
  OM.fasta
  4. Loaded all files into Geneious and performed a new alignment (confirming that the gap was present in the Nextclade alignment).
  5. Loaded the Nextclade scaffold sequences into Geneious and performed another alignment.
[Image: OM735982_nextcladegap]
  • The gap appears at approximately amino acid position 207 in the GPC gene.
  • My hypothesis is that the codon CAA is being interpreted as a triplet indel compared to the scaffold tree alignment, which seems to be predominantly GGG at that position (or at least not CAA).
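
To check for this kind of gap without opening Geneious, one option is to scan the aligned query for runs of `-` and convert each run to a codon number. The sketch below is a minimal illustration under stated assumptions: the query is already aligned to the reference coordinates (as in an aligned FASTA output), `cds_start` is the 0-based start of the coding region within that alignment, and the function names and toy sequence are hypothetical.

```python
def find_gap_runs(aligned_query):
    """Return (start, length) for each run of '-' in an aligned sequence.

    start is 0-based and counted in reference (alignment) coordinates.
    """
    runs = []
    i = 0
    n = len(aligned_query)
    while i < n:
        if aligned_query[i] == "-":
            j = i
            while j < n and aligned_query[j] == "-":
                j += 1
            runs.append((i, j - i))
            i = j
        else:
            i += 1
    return runs


def describe_gaps(aligned_query, cds_start=0):
    """Describe each gap run relative to a coding region.

    cds_start is the assumed 0-based offset of the first codon.
    """
    described = []
    for start, length in find_gap_runs(aligned_query):
        offset = start - cds_start
        described.append(
            {
                "nuc_start": start + 1,  # 1-based nucleotide position
                "length": length,
                # True when the gap removes exactly whole codons.
                "whole_codons": length % 3 == 0 and offset % 3 == 0,
                "codon": offset // 3 + 1,  # 1-based codon of the gap start
            }
        )
    return described


# Toy aligned query (not real LASV data): the second codon is deleted.
print(describe_gaps("ATG---AAATTT"))
```

Gaps at the very start or end of an alignment usually reflect missing coverage rather than a deletion, so those runs would need separate handling in real data.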

Potential Solutions

  1. Add a CAA example sequence to the Nextclade scaffold tree to test if it resolves the indel issue.
    • Notes: This would be a straightforward solution if successful, but it raises questions about the need for frequent updates to the Nextclade dataset and how we flag new mutations.
  2. Adjust the Nextclade alignment parameters.
    • Notes: While this approach is more time-consuming, it could offer a more robust solution for handling novel mutations in Lassa sequences. The current parameters might inadvertently treat novel codons as deletions. However, relaxing the alignment parameters may lead to challenges with trying to replicate the results of codon-aware alignments.

[later edit]

  3. Since the dataset seems to be annotating the correct lineage calls, defer fixing the mutation calling to a separate PR. Perhaps include a disclaimer message about the mutation calling in the meantime.

@j23414
Contributor Author

j23414 commented Jan 28, 2025

How to explore Nextclade alignment parameters

For Option 2, I wrote out the default command with explicit Nextclade parameters (based on the nextclade help output and on reading the code). This should make it easier for others to reproduce the gappy behavior, explore the Nextclade parameters themselves, and look for a fix.

# What I'm inferring are the default values
nextclade3 run \
  --input-dataset nextclade/dataset.zip \
  --output-all test_out \
  OM735982.fasta \
  --min-length 100 \
  --penalty-gap-extend 0 \
  --penalty-gap-open 6 \
  --penalty-gap-open-in-frame 7 \
  --penalty-gap-open-out-of-frame 8 \
  --penalty-mismatch 1 \
  --score-match 3 \
  --retry-reverse-complement true \
  --no-translate-past-stop false \
  --gap-alignment-side left \
  --excess-bandwidth 9 \
  --terminal-bandwidth 50 \
  --allowed-mismatches 8 \
  --min-match-length 40 \
  --max-alignment-attempts 3 \
  --include-reference true \
  --include-nearest-node-info true 

# Still contains the gap
less test_out/nextclade.aligned.fasta
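
For anyone who wants to try several parameter combinations in one go, the following Python sketch builds one `nextclade3 run` command per combination of values. It only prints the commands and does not run Nextclade. The flag names are taken from the command above; the candidate values in `grid`, the numbered output directories, and the function name are assumptions for illustration, not recommended settings.

```python
import itertools
import shlex


def build_commands(fasta, dataset, grid, out_prefix="test_out"):
    """Build one argv list per combination of flag values in grid.

    grid maps a flag name to the list of values to try for that flag.
    """
    flags = sorted(grid)
    commands = []
    combos = itertools.product(*(grid[flag] for flag in flags))
    for n, values in enumerate(combos):
        argv = [
            "nextclade3", "run",
            "--input-dataset", dataset,
            # Separate output directory per combination (hypothetical naming).
            "--output-all", f"{out_prefix}_{n}",
        ]
        for flag, value in zip(flags, values):
            argv += [flag, str(value)]
        argv.append(fasta)
        commands.append(argv)
    return commands


# Candidate values are arbitrary examples, not tuned settings.
grid = {
    "--penalty-gap-open": [6, 10],
    "--penalty-gap-open-in-frame": [7, 12],
}
commands = build_commands("OM735982.fasta", "nextclade/dataset.zip", grid)
for argv in commands:
    print(shlex.join(argv))
```

Each printed line can be pasted into a shell, and the resulting aligned FASTA files can then be compared to see whether the gap persists.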

[later edit] I was trying to find out why the default values are not listed in nextclade run -h, which is why we need to dig into the code to find them, and I found a closed issue: nextstrain/nextclade#1253

It looks like this is due to a dependency on clap for generating the CLI help message.

j23414 added 15 commits February 4, 2025 11:08
Copy the nextclade workflow directory structure from the pathogen-repo-guide.
Subsequent commits will be used to modify the workflow to work with the lassa data.
Use the reference lineage viruses to create input metadata and sequences files.
The sequences are restricted to GPC region and from the manually fixed alignment.
1. Build an all GPC sequences phylogenetic tree with lineage annotations pulled from:
    https://github.com/JoiRichi/LASV_ML_manuscript_data/tree/2118de0d28283b04d07c5c8dbb7aa381ffda2e8d/lineage_annotation
2. Midpoint and color tree by lineages in FigTree
3. Copy tree strain names into text file by lineage
4. Pull every Nth strain to roughly sample the genomic diversity across the tree
5. Aim for ~20 samples per lineage if available
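
Steps 4 and 5 can be approximated in code. The sketch below is a minimal Python illustration, assuming each lineage's strain names are already listed in tree order (as copied from FigTree in step 3); the function names and toy strain names are hypothetical, and the commit itself did this selection manually.

```python
def pick_every_nth(strains, target=20):
    """Pick about `target` evenly spaced strains from an ordered list.

    Returns the whole list when it has `target` or fewer entries.
    """
    if len(strains) <= target:
        return list(strains)
    step = len(strains) / target
    # Evenly spaced indices; step > 1 guarantees they are distinct.
    return [strains[int(i * step)] for i in range(target)]


def subsample_by_lineage(strains_by_lineage, target=20):
    """Apply pick_every_nth to every lineage independently."""
    return {
        lineage: pick_every_nth(names, target)
        for lineage, names in strains_by_lineage.items()
    }


# Toy strain names in tree order (not real strains).
toy = {
    "LI": [f"LI_strain_{i}" for i in range(5)],
    "LII": [f"LII_strain_{i}" for i in range(100)],
}
picked = subsample_by_lineage(toy, target=20)
print({lineage: len(names) for lineage, names in picked.items()})
```

Because the input is in tree order, evenly spaced picks roughly spread the sample across each lineage's clades rather than clustering in one part of the tree.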
Necessary for the Nextclade extension to run without errors.
@j23414
Contributor Author

j23414 commented Feb 5, 2025

Pivot away from option 2 (adjusting Nextclade params) to option 1 (adding CAA samples to the scaffold tree)

I'm not finding a set of Nextclade parameters to rescue the CAA deletion. If someone else does and can demonstrate that it works for OM735982, please post the command here!

I'm pivoting to try option 1 of adding a few CAA samples to the scaffold tree.
