Add LASV GPC Nextclade dataset #47

j23414 · 2024-11-29T17:17:39Z

Description of Proposed Changes

This PR introduces a draft GPC Nextclade dataset for the Lassa virus, enabling rapid lineage assignment and real-time mutation tracking. The dataset aims to support ongoing efforts to manage Lassa fever outbreaks by providing an efficient tool for identifying lineages, a critical step in understanding and mitigating the spread of this disease.

The description below is copied from parts of #48 (comment).

The Lassa virus (LASV), the causative agent of Lassa fever (LF), is currently categorized into seven distinct lineages circulating in specific geographic regions (Garry, 2023). Lineages 1, 2, and 3 are primarily found in Nigeria, while lineages 4 and 5 are prevalent in Sierra Leone and Mali (Garry, 2023). These lineages not only circulate in different regions but also exhibit significant variations in immune response (Buck et al., 2022) and disease outcomes (Anderson et al., 2015). For instance, Anderson et al. demonstrated that the Sierra Leonean strain tends to be more fatal than the Nigerian strains.

In an effort to address this gap, we developed a tool for fast lineage assignment (Daodu et al., 2024). However, the tool is still limited in its capabilities, including the lack of a user-friendly interface. The success of the Nextclade dataset and rapid lineage assignment in managing the SARS-CoV-2 pandemic highlights the potential value of such resources for LASV control.

Building on the success of the Nextclade dataset for SARS-CoV-2, this PR incorporates lineage assignment from Daodu et al., 2024 (or more specifically these files) into a dedicated LASV GPC Nextclade dataset. This resource aims to fill gaps in LASV outbreak response tools, facilitating mutation tracking, lineage assignment, and broader outbreak management.

Draft Resources

Scaffold tree: View scaffold tree
This tree includes representative sequences from each lineage placed in a phylogenetic structure. New query sequences can be added and assigned to lineages based on their placement.
Nextclade dataset: Test dataset
This link allows users to explore and test the dataset features.

Related Issue(s)

Add Nextclade dataset(s) #48

Checklist

All checks pass

j23414 · 2025-01-14T01:28:33Z

Thanks for pointing out the 60:X in the scaffold tree. I manually fixed the alignment in the following dataset:

JoiRichi · 2025-01-16T11:58:07Z

nextclade/data/metadata.tsv

@@ -1,7 +1,6 @@
 accession	accession_version	strain	date	region	country	division	location	length	host	is_lab_host	date_released	date_updated	sra_accessions	authors	abbr_authors	institution	clade_membership
 KM822128	KM822128.1	Pinneo-NIG-1969	1969-XX-XX	Africa	Nigeria			3428			2014-10-14	2014-10-14		"Andersen,K.G.,Shapiro,B.J.,Matranga,C.B.,Gire,S.K.,Sealfon,R.,England,E.M.,Winnicki,S.,Moses,L.M.,Stremlau,M.,Folarin,O.,Odia,I.,Ehiane,P.,Goba,A.,Momoh,M.,Gnirke,A.,Birren,B.,Hensley,L.,Levin,J.Z.,Happi,C.T.,Garry,R.F.,Sabeti,P.C."	Andersen et al.	Broad Institute	LI
 KM821976	KM821976.1	ISTH2066-NIG-2012	2012-XX-XX	Africa	Nigeria			3406	Homo sapiens		2014-10-14	2014-10-14		"Andersen,K.G.,Shapiro,B.J.,Matranga,C.B.,Gire,S.K.,Sealfon,R.,England,E.M.,Winnicki,S.,Moses,L.M.,Stremlau,M.,Folarin,O.,Odia,I.,Ehiane,P.,Goba,A.,Momoh,M.,Gnirke,A.,Birren,B.,Hensley,L.,Levin,J.Z.,Happi,C.T.,Garry,R.F.,Sabeti,P.C."	Andersen et al.	Broad Institute	LII
-KM821977	KM821977.1	ISTH2069-NIG-2012	2012-XX-XX	Africa	Nigeria			3397	Homo sapiens		2014-10-14	2014-10-14		"Andersen,K.G.,Shapiro,B.J.,Matranga,C.B.,Gire,S.K.,Sealfon,R.,England,E.M.,Winnicki,S.,Moses,L.M.,Stremlau,M.,Folarin,O.,Odia,I.,Ehiane,P.,Goba,A.,Momoh,M.,Gnirke,A.,Birren,B.,Hensley,L.,Levin,J.Z.,Happi,C.T.,Garry,R.F.,Sabeti,P.C."	Andersen et al.	Broad Institute	LII


Were the sequences removed causing misalignment?

j23414 · 2025-01-24

In this case, KM821977 aligned fine but was dropped because it had a run of Ns in its nucleotide sequence. I mostly wanted to avoid having a Nextclade scaffold tree with missing nucleotide information, as I wasn't sure how that would influence query sequence placement.
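For anyone wanting to reproduce this kind of filtering, here is a minimal sketch of dropping FASTA records that contain a run of Ns. It is not the rule used in the workflow: the function name `drop_n_runs` and the default run length of 10 are assumptions made for illustration.

```shell
# Hypothetical helper (not part of the workflow): print only the FASTA records
# whose sequence has no run of at least MIN_RUN consecutive N/n characters.
# Sequences are printed on a single line (unwrapped).
drop_n_runs() {
  # $1: FASTA file
  # $2: minimum run length that triggers removal (assumed default: 10)
  awk -v min_run="${2:-10}" '
    function emit() {
      # Keep the record only if the N-run pattern does not occur in it
      if (header != "" && index(toupper(seq), pat) == 0) {
        print header
        print seq
      }
    }
    BEGIN { for (i = 0; i < min_run; i++) pat = pat "N" }
    /^>/ { emit(); header = $0; seq = ""; next }
    { seq = seq $0 }
    END { emit() }
  ' "$1"
}
```

Example usage, with placeholder file names: `drop_n_runs input.fasta 10 > filtered.fasta`.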

JoiRichi · 2025-01-16T12:26:16Z

nextclade/defaults/config.yaml

+refine:
+  coalescent: "opt"
+  date_inference: "marginal"
+  clock_rate: 0.0006


Did we pre-calculate the clock rate?

j23414 · 2025-01-28

Yes, but I'm struggling to find the history of why 0.0006. I'll get back to you by end of next week.

JoiRichi · 2025-01-16T13:54:52Z

Thanks.

I tested the staged version of the LASV Nextclade with some data.

I am however worried that the pipeline may be deleting some nucleotides in
sequences - which I don't seem to have an answer for yet.

I have attached 3 images which show amino acid translation for OM735982.
Visualization of the alignment generated by Nextclade reveals that position
209 is missing. However, when this alignment was juxtaposed with another
alignment curated earlier, it revealed that a codon was removed by the
pipeline as there was supposed to be an 'N' in position 208 before the 'G'
in 209. This 'N' is present in the official translation on NCBI as shown in
the third image. This may mean that the codon was deleted without an
explanation (yet).

We may need to review how alignments are handled generally.

j23414 · 2025-01-28T18:28:33Z

Thanks for raising the concern that the Nextclade alignment appeared to introduce a gap where a codon exists in the sequence, and for capturing the screenshots of the alignment! I'm documenting my corroborating exploration of the issue below and will post some suggested paths forward.

j23414 · 2025-01-28T18:30:38Z

To investigate this issue, I took the following steps:

1. Downloaded the sequence OM735982 from NCBI and imported it into Geneious.
2. Retrieved the same sequence from our Nextstrain-ingested data/gpc/sequences.fasta file:

nextstrain build phylogenetic data/gpc/sequences.fasta
smof grep OM735982 phylogenetic/data/gpc/sequences.fasta > OM.fasta # Changed header to 'OM735982_fasta'

3. Aligned the sequence against the dataset using Nextclade:

nextclade3 run \
  --input-dataset nextclade/dataset.zip \
  --output-all test_out \
  OM.fasta

4. Loaded all files into Geneious and performed a new alignment (confirming that the gap was present in the Nextclade alignment).
5. Loaded the Nextclade scaffold sequences into Geneious and performed another alignment.

The gap appears at approximately amino acid position 207 in the GPC gene.
My hypothesis is that the codon CAA is being interpreted as a triplet indel relative to the scaffold tree alignment, which seems to be predominantly GGG (or at least not CAA) at that position.
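One way to check this hypothesis without Geneious is to print the codon at a given amino acid position for every record in an aligned FASTA. The sketch below assumes the alignment is in reference coordinates and that the coding sequence starts at the nucleotide given by the third argument (default 1). The function name `codon_at` is hypothetical.

```shell
# Hypothetical helper: print "<name><TAB><codon>" for each record of an
# aligned FASTA, where <codon> is the three aligned characters covering the
# requested amino acid position.
codon_at() {
  # $1: aligned FASTA file
  # $2: 1-based amino acid position
  # $3: 1-based nucleotide position where the CDS starts (assumed default: 1)
  awk -v aa="$2" -v cds_start="${3:-1}" '
    function emit() {
      if (header != "")
        print substr(header, 2) "\t" substr(seq, cds_start + 3 * (aa - 1), 3)
    }
    /^>/ { emit(); header = $0; seq = ""; next }
    { seq = seq $0 }
    END { emit() }
  ' "$1"
}
```

For example, `codon_at test_out/nextclade.aligned.fasta 207` would list the codon each sequence carries at that position. A `---` entry marks a codon-sized gap. Since the position reported above is approximate, neighbouring positions are worth checking too.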

Potential Solutions

1. Add a CAA example sequence to the Nextclade scaffold tree to test if it resolves the indel issue.
   - Notes: This would be a straightforward solution if successful, but it raises questions about the need for frequent updates to the Nextclade dataset and how we flag new mutations.
2. Adjust the Nextclade alignment parameters.
   - Notes: While this approach is more time-consuming, it could offer a more robust solution for handling novel mutations in Lassa sequences. The current parameters might inadvertently treat novel codons as deletions. However, relaxing the alignment parameters may make it harder to replicate the results of codon-aware alignments.

[later edit]

Since the dataset seems to be making the correct lineage calls, defer fixing the mutation calling to a separate PR. In the meantime, perhaps include a disclaimer message about the mutation calling.

j23414 · 2025-01-28T20:14:40Z

How to explore Nextclade alignment parameters

For Option 2, I wrote out the default command with explicit Nextclade parameters (based on the nextclade help output and diving into the code) to make it easier for others to explore the gappy behavior, experiment with Nextclade parameters, and work out how to fix it.

# What I'm inferring are the default values
nextclade3 run \
  --input-dataset nextclade/dataset.zip \
  --output-all test_out \
  OM735982.fasta \
  --min-length 100 \
  --penalty-gap-extend 0 \
  --penalty-gap-open 6 \
  --penalty-gap-open-in-frame 7 \
  --penalty-gap-open-out-of-frame 8 \
  --penalty-mismatch 1 \
  --score-match 3 \
  --retry-reverse-complement true \
  --no-translate-past-stop false \
  --gap-alignment-side left \
  --excess-bandwidth 9 \
  --terminal-bandwidth 50 \
  --allowed-mismatches 8 \
  --min-match-length 40 \
  --max-alignment-attempts 3 \
  --include-reference true \
  --include-nearest-node-info true 

# Still contains the gap
less test_out/nextclade.aligned.fasta
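Rather than paging through the alignment by eye after each parameter change, the gap runs can be listed directly. This is a minimal sketch; the function name `gap_runs` is hypothetical and it treats only `-` as a gap character.

```shell
# Hypothetical helper: for each record of an aligned FASTA, print one line per
# run of '-' characters as "<name><TAB><1-based start><TAB><length>".
gap_runs() {
  # $1: aligned FASTA file
  awk '
    function emit(    i, c, start, len) {
      if (header == "") return
      len = 0
      # Scan one position past the end so a trailing gap run is also reported
      for (i = 1; i <= length(seq) + 1; i++) {
        c = substr(seq, i, 1)
        if (c == "-") {
          if (len == 0) start = i
          len++
        } else if (len > 0) {
          print substr(header, 2) "\t" start "\t" len
          len = 0
        }
      }
    }
    /^>/ { emit(); header = $0; seq = ""; next }
    { seq = seq $0 }
    END { emit() }
  ' "$1"
}
```

Running `gap_runs test_out/nextclade.aligned.fasta` before and after a parameter change shows whether the three-nucleotide gap is still there. A run of length 3 starting at a codon boundary would match the codon-sized deletion described above.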

[later edit] I was trying to find out why the default values are not listed in `nextclade run -h`, which is why we need to dig into the code, and found a closed issue: nextstrain/nextclade#1253

Looks like it's due to a dependency on clap to generate the CLI help message.

1. Build an all GPC sequences phylogenetic tree with lineage annotations pulled from: https://github.com/JoiRichi/LASV_ML_manuscript_data/tree/2118de0d28283b04d07c5c8dbb7aa381ffda2e8d/lineage_annotation
2. Midpoint and color tree by lineages in FigTree
3. Copy tree strain names into text file by lineage
4. Pull every Nth strain to roughly sample the genomic diversity across the tree
5. Aim for ~20 samples per lineage if available
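Steps 4 and 5 can be sketched as a small helper that takes the per-lineage text file of strain names (in tree order) and keeps every Nth name. The way N is derived from the target count is an assumption for illustration, not necessarily how the sampling was done here. The function name `sample_every_nth` is hypothetical.

```shell
# Hypothetical helper: keep every Nth strain name so that roughly TARGET names
# remain. N is computed as floor(total / TARGET), with a minimum of 1.
sample_every_nth() {
  # $1: text file with one strain name per line, in tree order
  # $2: target number of samples, must be positive (assumed default: 20)
  awk -v target="${2:-20}" '
    { names[NR] = $0 }
    END {
      step = int(NR / target)
      if (step < 1) step = 1
      for (i = 1; i <= NR; i += step) print names[i]
    }
  ' "$1"
}
```

With 100 names and a target of 20 this keeps every 5th name. Lineages with fewer names than the target are kept whole, which matches the "if available" caveat.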

j23414 · 2025-02-05T00:37:00Z

Pivot away from option 2 (adjusting Nextclade params) to option 1 (adding `CAA` samples to the scaffold tree)

I'm not finding a set of Nextclade parameters to rescue the CAA deletion. If someone else does and can demonstrate that it works for OM735982, please post the command here!

I'm pivoting to try option 1 of adding a few CAA samples to the scaffold tree.

j23414 force-pushed the add-nextclade branch from 95f8ab6 to 62d6e84 Compare January 7, 2025 17:49

j23414 changed the title ~~WIP: Add nextclade~~ Add LASV GPC Nextclade dataset Jan 9, 2025

JoiRichi reviewed Jan 16, 2025

j23414 mentioned this pull request Jan 17, 2025

Ingest: Update split by segment reference to use Josiah strain #51

Merged

1 task

j23414 added 15 commits February 4, 2025 11:08

Initial copy of nextclade workflow from pathogen-repo-guide

4a77cbc

Copy the nextclade workflow directory structure from the pathogen-repo-guide. Subsequent commits will be used to modify the workflow to work with the lassa data.

Reference lineage viruses for Nextclade dataset

ca46089

Nextclade: Input data

53895a0

Use the reference lineage viruses to create input metadata and sequences files. The sequences are restricted to GPC region and from the manually fixed alignment.

Nextclade: Build lassa GPC scaffold tree

4515455

Nextclade: Add nextclade extension

d170d8f

Nextclade: Update input data

f55d5d2

Nextclade: Add root or reference sequence

45e016e

Necessary for the Nextclade extension to run without errors.

Nextclade: Add auxillary files

1d688ab

Nextclade: Assemble Nextclade dataset

75b9ece

Nextclade: Initial dataset

503de4a

Nextclade: optionally propagate various nextclade params into the auspice config

8828e50

Fixup: Manually corrected alignment to be codon aware

8d50fbc

Nextclade: Update dataset

336f2a9

Nextclade: Drop sequences with N or potential frameshifts from scaffold tree

7ea8f3a

j23414 force-pushed the add-nextclade branch from 006f8db to 7ea8f3a Compare February 4, 2025 19:09
