AWS Tutorial Resources


Overview of Page Contents


Biomedical Workflows on AWS

There are many ways to run workflows on AWS. Here we list a few possibilities, each of which may suit different research aims. As you walk through the tutorials below, think about how you could run each workflow more efficiently using one of the other methods listed here. If you are unfamiliar with any of the terms or concepts here, please review the AWS 101 page.

  • The simplest is probably to spin up an EC2 instance and run your command interactively, inside a screen session, or as a startup script attached as user data. See the GWAS tutorial below for more info on how to run a pipeline using EC2.
  • You could also run your pipeline via a SageMaker notebook, either by splitting out each command as a different block, or by running a workflow manager (Nextflow etc.). See here about scheduling a notebook to let it run longer. You can find some example notebooks in the tutorials below.
  • If you are running bioinformatic workflows, you can leverage the serverless functionality of AWS using Amazon Omics. Read this blog for more detailed information, and check whether any newer blogs have come out: because Omics is a new product, AWS is publishing a lot of new content about it. AWS also offers managed workflows called Ready2Run workflows, which you can read more about here. Note that you can use the Omics service from the Console as well as through the APIs.
  • If you are using a workflow manager other than WDL or Nextflow (e.g. Snakemake), use AWS Genomics CLI (AGC), which is a wrapper for genomics workflow managers and AWS Batch (serverless computing cluster). See our docs on how to set up AGC for Cloud Lab.
  • Finally, one benefit of the cloud is access to GPUs for workflow acceleration. While much of the attention on GPUs goes to AI/ML workflows, NVIDIA has software called Parabricks that accelerates genomic workflows at fairly low cost. See the full list of command line options here to check whether your specific workflow is accelerated. For specific details on how to use Parabricks within Cloud Lab, see our guide.
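
As a rough illustration of the first option above, the sketch below assembles the arguments one might pass to boto3's `ec2.run_instances` so that a pipeline command runs as a startup script (user data). The AMI ID, instance type, log path, and pipeline command are placeholders rather than values from any of the tutorials, and the launch call is left commented out so that nothing is started by accident.

```python
import textwrap


def build_user_data(command, log_path="/var/log/pipeline.log"):
    """Return a bash startup script that runs a single-line `command` at first boot."""
    return textwrap.dedent(f"""\
        #!/bin/bash
        # User data runs as root on first boot; capture output for later inspection.
        {command} > {log_path} 2>&1
        # Shut the machine down when the pipeline finishes to limit compute cost.
        shutdown -h now
        """)


def build_run_instances_params(image_id, instance_type, user_data):
    """Collect keyword arguments for boto3's ec2.run_instances (nothing is launched here)."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "UserData": user_data,
        # An OS-level shutdown stops (rather than terminates) the instance.
        "InstanceInitiatedShutdownBehavior": "stop",
    }


params = build_run_instances_params(
    image_id="ami-PLACEHOLDER",       # hypothetical: substitute a real AMI ID
    instance_type="t3.medium",        # assumption: choose a type that fits your workflow
    user_data=build_user_data("echo 'pipeline command goes here'"),
)
print(params["UserData"])

# To actually launch (requires boto3 and valid credentials):
# import boto3
# boto3.client("ec2", region_name="us-east-1").run_instances(**params)
```

Keeping the parameter construction separate from the launch call makes it easy to inspect the startup script before any billable resource is created.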

For many of these tutorials, you will need Short Term Access Keys to create and use resources, particularly whenever a tutorial calls for "access key ID" and "secret key." Use this guide for an explanation of how to obtain and use Short Term Access Keys.
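
For orientation, short-term credentials come as three values: an access key ID, a secret key, and a session token. The sketch below renders them in the INI layout used by the AWS shared credentials file. The profile name `cloudlab-temp` and all key values are placeholders invented for this example; follow the guide above for how to obtain real values.

```python
import configparser
import io


def render_credentials_profile(access_key_id, secret_access_key, session_token,
                               profile="cloudlab-temp"):
    """Render one profile in the INI layout of the AWS shared credentials file."""
    # Interpolation is disabled so that any '%' in a value is kept literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser[profile] = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
        # Short-term keys only work together with their session token.
        "aws_session_token": session_token,
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


profile_text = render_credentials_profile(
    access_key_id="ASIA-PLACEHOLDER",        # placeholder, not a real key
    secret_access_key="SECRET-PLACEHOLDER",  # placeholder
    session_token="TOKEN-PLACEHOLDER",       # placeholder
)
print(profile_text)
```

Because short-term keys expire, expect to regenerate and replace these values periodically.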

Please also note that GPU machines cost more than most CPU machines, so be sure to shut these machines down after use, or apply an EC2 lifecycle configuration. You may also encounter service quotas, which protect you from the accidental use of expensive machine types. If that happens, and you still want to use a certain instance type, follow these instructions.

Download Data From the Sequence Read Archive (SRA)

Next-generation sequencing data are housed in the NCBI Sequence Read Archive (SRA). You can access these data using the SRA Toolkit. We walk you through this in this notebook, which also shows how to set up and search Athena tables to generate an accession list. You can also read this guide for more information on available dataset tables. Additional example notebooks can be found at this NCBI repo. In particular, we recommend this notebook (https://github.com/ncbi/ASHG-Workshop-2021/blob/main/3_Biology_Example_AWS_Demo.ipynb), which goes into more detail on using Athena to access the results of the SRA Taxonomic Analysis Tool. Those results often differ from the user-submitted species name because of contamination, error, or samples that are metagenomic in nature.
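
To make the accession-list step more concrete, here is a minimal sketch that builds an Athena SQL query and writes the returned accessions to a text file. It assumes the SRA metadata is exposed as a table named `sra.metadata` with columns `acc`, `organism`, and `assay_type`; check these names against the tables you actually set up in the notebook. The accession values are placeholders, and running the query itself (in the Athena console or through an API) is left out.

```python
def quote_sql(value):
    """Quote a string literal for SQL, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_accession_query(organism, assay_type, limit=10, table="sra.metadata"):
    """Build a query selecting run accessions for one organism and assay type."""
    return (
        f"SELECT acc FROM {table} "
        f"WHERE organism = {quote_sql(organism)} "
        f"AND assay_type = {quote_sql(assay_type)} "
        f"LIMIT {int(limit)}"
    )


def write_accession_list(accessions, path):
    """Write one accession per line, skipping blanks and duplicates."""
    seen = []
    for acc in accessions:
        acc = acc.strip()
        if acc and acc not in seen:
            seen.append(acc)
    with open(path, "w") as handle:
        handle.write("\n".join(seen) + "\n")
    return seen


query = build_accession_query("Mycobacteroides abscessus", "WGS", limit=5)
print(query)

# Placeholder accessions standing in for the rows Athena would return.
saved = write_accession_list(["SRR_PLACEHOLDER_1", "SRR_PLACEHOLDER_2"], "accession_list.txt")
print(saved)
```

A file in this one-accession-per-line layout can then be passed to the SRA Toolkit, for example with `prefetch --option-file accession_list.txt`.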

Genome Wide Association Studies

Genome-wide association studies (GWAS) are large-scale investigations that analyze the genomes of many individuals to identify common genetic variants associated with traits, diseases, or other phenotypes.

  • This NIH CFDE written tutorial walks you through running a simple GWAS using EC2. The tutorial asks you to select the Ohio region; make sure you change your region to N. Virginia instead, otherwise you will have network issues. Note that the CFDE page has a few other bioinformatics-related tutorials, such as BLAST and Illumina read simulation. We also converted the GWAS tutorial to a simplified notebook version if you prefer that format. See our notebook guide for help with setting up a Jupyter environment.

Medical Imaging Analysis

Medical imaging analysis involves processing large image files and often requires elastic storage and accelerated computing.

  • Most medical imaging analyses are done using notebooks, so we recommend accessing this Jupyter Notebook and cloning it into SageMaker. The tutorial walks through image segmentation.
  • AWS has a nice intro to Machine Learning in a SageMaker notebook that predicts breast cancer from features extracted from image data. It walks you through both image analysis and some of the ML functionality of SageMaker. The notebook is found here.
  • You can also view this AWS blog on how to annotate DICOM images and build a custom AI model with the data.
  • You can learn to deidentify medical images following this AWS tutorial.

RNAseq

RNA-seq analysis is a high-throughput sequencing method that allows the measurement and characterization of gene expression levels and transcriptome dynamics. Workflows are typically run using workflow managers, and final results can often be visualized in notebooks.

  • You can run this Nextflow tutorial for RNAseq in a variety of ways on AWS. Following the instructions outlined above, you could use EC2, SageMaker, or AWS Batch (/docs/Genomics_Workflows.md).
  • For a notebook version of a complete RNAseq pipeline, from Fastq to Salmon quantification, from the King Lab of the University of Maine INBRE, use this notebook, which we rewrote to work on AWS. You can also use any of Ben King's excellent notebooks, but they were originally written for GCP.

Single Cell RNAseq

Single-cell RNA sequencing (scRNA-seq) is a technique that enables the analysis of gene expression at the individual cell level, providing insights into cellular heterogeneity, identifying rare cell types, and revealing cellular dynamics and functional states within complex biological systems.

  • This AWS blog lays out a potential method that integrates a lot of the AWS native tools for running an scRNAseq pipeline. It is less of a tutorial, and more of a demo of what is possible.
  • This NVIDIA blog details how to run an accelerated scRNAseq pipeline using RAPIDS. You can find a link to the GitHub repository, which has lots of example notebooks, here. For each example use case they show some nice benchmarking data with time and cost for each machine type. You will see that most runs cost less than $1.00 with GPU machines. If you want a CPU version that uses Scanpy, you can use this notebook. Pay careful attention to the environment setup, as there are a lot of dependencies for these notebooks. Create a conda environment in the terminal, then run the notebook. Consider using mamba to speed up environment creation. We created a guide for conda environment setup as well.

ElasticBLAST

NCBI BLAST (Basic Local Alignment Search Tool) is a widely used bioinformatics program provided by the National Center for Biotechnology Information (NCBI) that compares nucleotide or protein sequences against a large database to identify similar sequences and infer evolutionary relationships, functional annotations, and structural information. The NCBI team has written a version of BLAST for the cloud called ElasticBLAST, and you can read all about it here. Essentially, ElasticBLAST helps you submit BLAST jobs to AWS Batch and write the results back to S3. Feel free to experiment with the example tutorial in Cloud Shell, or try our notebook version.
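
As an illustration of what a submission to AWS Batch involves, the sketch below generates an ElasticBLAST-style INI configuration with Python's `configparser`. The section and key names follow the layout used in the ElasticBLAST quickstart as we understand it, so confirm them against the current documentation. The bucket names, database, and node count are placeholders chosen for this example.

```python
import configparser
import io


def build_elastic_blast_config(program, db, queries, results,
                               region="us-east-1", num_nodes=1):
    """Return the text of an ElasticBLAST-style configuration file."""
    parser = configparser.ConfigParser(interpolation=None)
    parser["cloud-provider"] = {"aws-region": region}
    parser["cluster"] = {"num-nodes": str(num_nodes)}
    parser["blast"] = {
        "program": program,   # e.g. blastp or blastn
        "db": db,             # name of the BLAST database to search
        "queries": queries,   # S3 path to the query FASTA file
        "results": results,   # S3 path where results are written
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


config_text = build_elastic_blast_config(
    program="blastp",
    db="swissprot",                                # assumption: any database you can access
    queries="s3://my-example-bucket/queries.fa",   # hypothetical bucket and file
    results="s3://my-example-bucket/results",      # hypothetical bucket
    num_nodes=2,
)
print(config_text)
```

Saved to a file, a configuration like this is what you would hand to the ElasticBLAST command line tool (for example `elastic-blast submit --cfg my_search.ini`, where the file name is hypothetical).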

Long Read Sequence Analysis

Long read DNA sequence analysis involves analyzing sequencing reads typically longer than 10,000 base pairs (bp), compared with short read sequencing, where reads are about 150 bp in length. Oxford Nanopore has a fairly complete offering of notebook tutorials for handling long read data, covering variant calling, RNAseq, SARS-CoV-2 analysis, and much more. Access the notebooks here. These notebooks expect that you are running locally and accessing the epi2me notebook server. To run them in Cloud Lab, skip the first cell that connects to the server; the rest of the notebook should then run correctly, with a few tweaks. If you are just looking to try out notebooks, don't start with these. If you are interested in long read sequence analysis, then some troubleshooting may be needed to adapt these to the Cloud Lab environment. You may even need to rewrite them in a fresh notebook by adapting the commands. Feel free to reach out to our support team for help.

AI/ML Pipelines

Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms and models that enable computers to learn from data and make predictions or decisions without being explicitly programmed. Artificial intelligence and machine learning algorithms are being applied to a variety of biomedical research questions, ranging from image classification to genomic variant calling. AWS is moving all AI/ML workflows to SageMaker, the Jupyter notebook platform we have used a few times already. AWS has a very general tutorial here on how to build out an AI pipeline on SageMaker. You can also look at the breast cancer tutorial from the imaging section above for a more applied example. You can also submit a training job to SageMaker using PyTorch, TensorFlow, or Apache MXNet, and have your final model uploaded to S3.

Open Data

AWS has a lot of public data that you can integrate into your testing or use in your own research. You can access these datasets at the Registry of Open Data on AWS. There you can click on any of the datasets to view the S3 path to the data, as well as publications that have used those data and tutorials if available. To demonstrate, we can open the gnomAD dataset page (https://registry.opendata.aws/broad-gnomad/), copy the S3 path listed there, and view the files at the command line.
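
For readers who prefer Python to the command line, the sketch below shows one way to browse a public bucket without credentials: build an S3 `ListObjectsV2` URL and parse the XML listing it returns. The bucket name is a placeholder (copy the real one from the dataset's registry page), anonymous listing only works when the bucket owner allows it, and the sample XML is a hand-written stand-in so the example runs offline.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# XML namespace used by S3 listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def build_list_url(bucket, prefix="", max_keys=20):
    """Build an anonymous ListObjectsV2 request URL for a public bucket."""
    query = urllib.parse.urlencode({
        "list-type": "2",
        "prefix": prefix,
        "delimiter": "/",   # group deeper keys into folder-like prefixes
        "max-keys": str(max_keys),
    })
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (folder-like prefixes, object keys) from a listing response."""
    root = ET.fromstring(xml_text)
    prefixes = [e.text for e in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)]
    keys = [e.text for e in root.findall("s3:Contents/s3:Key", S3_NS)]
    return prefixes, keys


# Hand-written stand-in for a real response, so the example runs offline.
SAMPLE_XML = (
    '<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">'
    "<CommonPrefixes><Prefix>release/</Prefix></CommonPrefixes>"
    "<Contents><Key>README.txt</Key></Contents>"
    "</ListBucketResult>"
)

print(build_list_url("example-open-data-bucket", prefix="release/"))
print(parse_listing(SAMPLE_XML))

# To query a real public bucket (requires network access):
# import urllib.request
# with urllib.request.urlopen(build_list_url("<bucket-name>")) as response:
#     print(parse_listing(response.read()))
```

Using a delimiter keeps each response small, which matters for large datasets where a full recursive listing could contain millions of keys.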