Bioinformatics Solutions for SARS-CoV-2 Genomic Analysis

PHA4GE Bioinformatics Pipelines & Visualization Working Group
Libuit KG, Park D, van Heusden P, Neher R, Kapsak CJ, Southgate J, Bridges D, Mboowa G, Lunn S, Langhorst B

Overview

Genomic analysis of SARS-CoV-2 (SC2) samples is an increasingly critical function for public health laboratories around the world. Integrating the appropriate bioinformatics solutions to support this work, however, can be an overwhelming challenge.

In an attempt to assist this integration process, the Bioinformatics Pipelines and Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the major bioinformatics challenges for SC2 genomic analysis and suggest various open-source and freely available bioinformatics resources to address them.

Please note that the bioinformatics resources listed in this document are simply an attempt to highlight the most accessible solutions in the opinion of our working group and in no way represent a comprehensive list of all available SC2 bioinformatics resources. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull requests and/or GitHub issues.

Bioinformatics Challenges for Public Health

The PHA4GE Bioinformatics Pipelines and Visualization Working Group has defined four key public health bioinformatics challenges for genomic analysis of SC2 samples:

  1. Generating consensus assemblies from PCR tiling NGS data: Tiled amplicon sequencing (through the ARTIC V3 protocol, for example) is the most commonly adopted method for generating SC2 sequencing data. These sequencing experiments generate thousands of amplicon reads that represent fragments of the original SC2 genome present in a sample. As a result, one of the initial bioinformatics challenges laboratories face is the assembly of PCR tiling NGS data into a contiguous SC2 genome. Analyses such as lineage typing and genomic epidemiology studies, which help inform public health decision making, all start from this assembly.

  2. Submitting raw sequence data (fastq), consensus assemblies (fasta), and relevant sample metadata to internationally-accessible databases: Sharing of sample read and assembly data through internationally accessible databases allows insights to be drawn about how the virus is spreading and mutating across the globe; the more freely available these data are to international researchers and public health scientists, the stronger our decision making can be.

  3. Screening sequenced SC2 samples for variants of concern: The detection of certain genetic variants of the SARS-CoV-2 virus may have a significant impact on the decisions of public health officials. Thus, the ability to accurately and reliably screen for variants of interest (VoI) and variants of concern (VoC), such as B.1.1.7 (Alpha) or B.1.617.2 (Delta), is a critical component of the bioinformatics analysis of SC2 genomes.

  4. Performing phylogenetic analysis of SC2 datasets: Genetic relatedness as inferred through phylogenetic analysis of SC2 datasets can be a powerful proxy for epidemiological associations that help resolve transmission networks, enable real-time surveillance, provide insights into how SC2 samples vary over time, and support local outbreak investigations.

Open-Access/Source Bioinformatics Solutions & Resources

1. Generating consensus assemblies from PCR tiling NGS data

The bioinformatics resources listed below are open-source pipelines that run on general-purpose, containerized workflow infrastructure to generate consensus SC2 assemblies from PCR tiling NGS data. While some parameters and modules may differ slightly, each pipeline will perform read mapping to the Wuhan-1 reference genome, remove primer regions from the mapped read data, and generate a consensus assembly based on conserved and variant positions identified in the resulting alignment. These resources have been organized into three categories: Terra and Galaxy Workflows, Web-Accessible Software as a Service (SaaS) Solutions, and Command-Line Interface (CLI) tools and are listed in no particular order.
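As a rough illustration of the final consensus step described above, the sketch below calls a consensus from per-position base counts taken from primer-trimmed, reference-aligned reads. It is not the method of any pipeline listed here: the function name `call_consensus`, the `pileup` data structure, and the `min_depth` and `min_fraction` thresholds are all hypothetical choices made for this example, and it ignores insertions and deletions.

```python
from collections import Counter


def call_consensus(reference, pileup, min_depth=10, min_fraction=0.6):
    """Call a consensus sequence from per-position base counts.

    reference    -- reference sequence; used here only to fix the genome length
    pileup       -- dict mapping 0-based position -> Counter of observed bases
    min_depth    -- hypothetical threshold: positions with fewer reads become 'N'
    min_fraction -- hypothetical threshold: the most frequent base must reach
                    this share of the reads, otherwise the position becomes 'N'
    """
    consensus = []
    for pos in range(len(reference)):
        counts = pileup.get(pos, Counter())
        depth = sum(counts.values())
        if depth < min_depth:
            # Too little coverage to trust any call: mask the position.
            consensus.append("N")
            continue
        base, count = counts.most_common(1)[0]
        # Keep the majority base only when it is well supported.
        consensus.append(base if count / depth >= min_fraction else "N")
    return "".join(consensus)
```

Masking low-coverage and ambiguous positions with `N` rather than falling back to the reference base keeps amplicon dropouts visible in the assembly instead of hiding them behind reference sequence.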

Terra and Galaxy Workflows
  • Broad viral-ngs
    • Brief Description: The viral-ngs workflow collection contains many tools for viral analysis. The consensus genome caller is called assemble_refbased and should work for any low-diversity microbial genome and is appropriate for viruses stemming from a single point-source outbreak, such as SARS-CoV-2. Accepts Illumina paired, single, or mixed reads, as well as ONT reads. Accepts metagenomic or amplicon-based reads with primer trimming.
    • Developed/supported by: Broad Institute Viral Genomics
    • Documentation: Technical documentation (ReadTheDocs)
    • User base: H3Africa West African sites (RUN, KGH, UCAD)
    • Workflow language: WDL
      • Web/Cloud GUI Platforms: Terra, DNAnexus
      • CLI Platforms: Cromwell (local HPC, cloud), miniWDL
  • Theiagen's Public Health Viral Genomics WDL Workflows
    • Brief Description: Theiagen's Public Health Viral Genomics WDL Workflows include four separate WDL workflows (Titan_Illumina_PE, Titan_Illumina_SE, Titan_ClearLabs, and Titan_ONT) that process NGS read data from four different sequencing approaches: Illumina paired-end, Illumina single-end, Clear Labs, and Oxford Nanopore Technologies (ONT). Each workflow generates a consensus assembly, produces relevant quality-control metrics for both the input read data and the generated assembly, and assigns each sample a lineage and clade designation using Pangolin and Nextclade, respectively.
    • Developed/supported by: Theiagen Genomics
    • Documentation: Technical documentation (ReadTheDocs), step-by-step protocols (Protocols.io), and video tutorials (YouTube Playlist)
    • User base: US PHLs
    • Workflow language: WDL
      • Web/Cloud GUI Platforms: Terra
      • CLI Platforms: Cromwell (local HPC, cloud), miniWDL
  • COVID-19 Galaxy Workflows
Web-Accessible SaaS Solutions
  • IDSeq
    • Brief Description: User-friendly software platform originally developed for metagenomics studies that has since been repurposed to include SC2 consensus assembly from Oxford Nanopore or paired-end Illumina data
    • Developed/supported by: Chan Zuckerberg Initiative (CZI)
    • User base: CZ Biohub & partners; access available on request to other users
    • User-interface: Web application on CZI-funded AWS
  • EDGE COVID-19
    • Brief Description: EDGE COVID-19 is a derivative of the original EDGE Bioinformatics software (Li et al. 2017) that was developed to perform reference-based SC2 assemblies and quality assessment of Illumina or Nanopore read data.
    • Developed/supported by: Los Alamos National Laboratory (LANL)
    • Documentation: EDGE COVID-19 User Guide
    • User base: LANL & partners
    • User-interface: Web application on LANL hardware, local instance using Docker
Command-line interface (CLI) Tools
  • SIGNAL (SARS-CoV-2 Illumina GeNome Assembly Line; CanCOGeN/OnCOV)
    • Brief Description: Quality control, assembly, and analysis Snakemake workflow for Illumina-based viral amplicon sequencing. Includes de-hosting via competitive mapping, freebayes variant and consensus generation, lineage assignment, interactive HTML run summaries, and integration with the ncov-tools QC workflow.
    • Developed/supported by: CARD/McArthur Lab, lead maintainers: Jalees Nasir & Finlay Maguire
    • Documentation: Technical Documentation (GitHub README)
    • User base: CA PHLs & academic partners
    • User-interface: CLI (Snakemake)
  • ARTIC nCOV19 (ARTIC Network; Connor-lab)
    • Brief Description: Configured conda environment that enables access to Oxford Nanopore or Illumina consensus sequence assemblers: Medaka (ONT), NanoPolish (ONT) or BWA (Illumina)
    • Developed/supported by: COG UK / ARTIC
    • Documentation: Technical Documentation (GitHub README)
    • User base: COG UK
    • Workflow language: Nextflow
      • CLI Platforms: Nextflow CLI client, Nextflow Tower (local HPC, cloud, etc.)
  • StaPH-B ToolKit
    • Brief Description: Two StaPH-B workflows for performing SC2 consensus genome assembly have been available: Cecret, a pipeline developed for the analysis of single or paired-end Illumina reads; and Monroe, a workflow with various subcommands that perform consensus genome assembly from either Illumina or Nanopore read data.
    • Developed/supported by: StaPH-B
    • Documentation: https://staph-b.github.io/staphb_toolkit/, Python Package Index (PyPI)
    • User base: US PHLs
    • User-interface: CLI (Python package)

2. Submitting raw sequence data (fastq), consensus assemblies (fasta), and relevant sample metadata to internationally-accessible databases

Below is a list of resources developed to assist in the preparation and submission of raw NGS read data (fastq files), SC2 consensus assemblies (fasta files), and contextual sample metadata to internationally-accessible databases such as NCBI, ENA, and GISAID. We have also included a list of bioinformatics software designed to assess the quality of SC2 data; we recommend the use of such software prior to submission to avoid the inadvertent sharing of poor quality, contaminated, or otherwise misleading SC2 data. Additional information regarding the interpretation of read and assembly quality metrics for SC2 data will be made available as a separate document.
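To make the idea of a pre-submission quality check concrete, here is a minimal sketch that reports the length and the fraction of ambiguous (`N`) bases for each consensus assembly in a FASTA file. The function names and the `max_n_fraction` and `length_tolerance` thresholds are hypothetical and are not taken from any tool or database policy; the default `expected_length` of 29903 is the length of the Wuhan-Hu-1 reference genome. Each database sets its own acceptance criteria, so treat the pass/fail flag as a placeholder.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of header -> uppercase sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {header: "".join(parts) for header, parts in records.items()}


def assembly_qc(sequence, expected_length=29903,
                max_n_fraction=0.1, length_tolerance=0.05):
    """Summarize basic assembly metrics; both thresholds are illustrative only."""
    length = len(sequence)
    n_count = sequence.count("N")
    n_fraction = n_count / length if length else 1.0
    length_ok = abs(length - expected_length) <= expected_length * length_tolerance
    return {
        "length": length,
        "n_count": n_count,
        "n_fraction": n_fraction,
        "passes": length_ok and n_fraction <= max_n_fraction,
    }
```

In use, one would read a consensus file with `read_fasta(open(path).read())` and call `assembly_qc` on each sequence, holding back any sample that fails for manual review rather than submitting it.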

Recommended SC2 Sample Metadata Specifications
  • PHA4GE Contextual Data Specifications
    • Database Target(s): GISAID, ENA, SRA, GenBank
    • Brief Description: A SARS-CoV-2 contextual data specification based on harmonizable, publicly available, community standards. The specification is implementable via a collection template, as well as an array of protocols and tools to support the harmonization and submission of sequence data and contextual information to public repositories.
    • Developed/supported by: PHA4GE
    • Documentation: Technical documentation (GitHub README)
    • User base: Global public health community
    • Protocols: NCBI Submission, ENA Submission, & GISAID Submission
Bioinformatics Solutions to Prepare and/or Submit SC2 Sample Data
Bioinformatics Solutions to Assess Data Quality Prior to Submission

3. Screening sequenced SC2 samples for variants of concern & general lineage typing

These tools either assign a clade or lineage descriptor to consensus sequences or provide databases for looking up information on variants in the SARS-CoV-2 genome. Because variants of concern are listed by their lineage descriptor (typically a PANGO lineage, sometimes a Nextclade clade), these tools help identify variants of concern.
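Because variants of concern are identified by lineage descriptor, screening can be as simple as a lookup on the output of a lineage assignment tool. The sketch below assumes a CSV with `taxon` and `lineage` columns and a hand-maintained table of lineages to flag; the table contents, function names, and column names are assumptions for illustration, and the example does not resolve PANGO aliases (for example, AY.* sublineages of B.1.617.2), which a real screen would need to handle.

```python
import csv
import io

# Hypothetical lookup table; a real screen would track current designations.
VOC_LABELS = {
    "B.1.1.7": "Alpha",
    "B.1.617.2": "Delta",
}


def voc_label(lineage):
    """Return the label if the lineage is a listed variant or a direct
    dot-delimited descendant of one, otherwise None."""
    for parent, label in VOC_LABELS.items():
        if lineage == parent or lineage.startswith(parent + "."):
            return label
    return None


def screen_assignments(csv_text):
    """Map each sample (taxon) in a lineage report to a label or None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["taxon"]: voc_label(row["lineage"]) for row in reader}
```

Matching on the parent name followed by a dot avoids false positives from lineages that merely share a prefix, such as a name that begins with `B.1.1.7` but continues with another digit.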

Bioinformatics tools for SC2 lineage or clade assignment
Public Health Resources that Track & Visualize SC2 Variants Over Time
  • PANGO cov-lineages
    • Brief Description: Track global prevalence of PANGO lineages
    • Developed/supported by: Pangolin Network
  • Covariants
    • Brief Description: Track global prevalence of Nextclade-annotated lineages
    • Developed/supported by: Nextstrain team
  • Outbreak.info
    • Brief Description: Epidemiological info including PANGO lineage prevalence
    • Developed/supported by: Su, Wu, and Andersen labs at Scripps Research
  • CoV-GLUE
    • Brief Description: CoV-GLUE contains a database of amino acid replacements, insertions, and deletions that have been observed in GISAID hCoV-19 sequences sampled from the pandemic.
    • Developed/supported by: COG-UK
  • 2019nCoVR
    • Brief Description: 2019nCoVR features comprehensive integration of genomic and proteomic sequences, as well as their metadata, from GISAID, NCBI, NMDC, and CNCB/NGDC. It also incorporates a wide range of relevant information, including scientific literature, news, and popular articles for science dissemination, and provides visualization functionalities for genome variation analysis results based on all collected SARS-CoV-2 strains.
    • Developed/supported by: China National Center for Bioinformation (CNCB)
  • CoVizu
  • Annotation of SARS-2 Coronavirus Genome (Observable)
    • Brief Description: Annotation of variation in the genome with some notes on what is known about the various amino acids
    • Developed/supported by: Delphine Lariviere (Penn State University)
Bioinformatics Tools to Track & Visualize Your Own SC2 Variants Over Time

4. Performing phylogenetic analysis of SC2 datasets

The tools listed below perform phylogenetic analyses of different complexity, ranging from web-apps to command-line tools that need to run on HPC facilities. The selected tools are integrated with visualization features that facilitate the interrogation of the results, but beware that such inferences might be uncertain and often require careful interpretation.

Public Health Resources Performing Global SC2 Phylogenetic Analysis
  • Nextstrain
    • Brief Description: Nextstrain is an open-source project to harness the scientific and public health potential of pathogen genome data.
    • Developed/supported by: Fred Hutch/Basel (Nextstrain team)
    • User base: US-based groups
    • Documentation: docs
    • Help/community/discussion: discussion.nextstrain.org
    • Implementations for compute steps ("augur"):
  • Microreact
    • Brief Description: Open data visualization and sharing for genomic epidemiology
    • Developed/supported by: Centre for Genomic Pathogen Surveillance (CGPS)
    • User base: COG-UK, New Zealand, etc.
    • User-interface: Web application / centrally hosted service
Offlineable Browser-Based Web Applications
  • Auspice
  • MicrobeTrace
    • Brief Description: The Visualization Multitool for Molecular Epidemiology and Bioinformatics
    • Developed/supported by: US CDC
    • Documentation: https://github.com/CDCgov/MicrobeTrace
    • User-interface: offlineable browser-based web app
  • UShER
    • Brief Description: Places user provided sequences on very large reference trees, extracts the relevant subtree, and provides a visualization
    • Developed/supported by: UCSC
    • User-interface: offlineable browser-based web app
Command-line interface (CLI) Tools
  • Grinch
    • Brief Description: Generates reports for the international distribution of PANGO lineages that can be viewed in a web browser.
    • Developed/supported by: PANGO, cov-lineages
    • User-interface: command-line tool
  • Phylopipe
    • Brief Description: Generates a downsampled global tree using FastTree, updates it daily using UShER, and cleans and annotates the tree; can be run on output from Datapipe.
    • Developed/supported by: Virus Group (University of Edinburgh)
    • User-interface: command-line tool, Nextflow pipeline
    • User base: COG-UK