- Purpose
- Dependencies
- GCP Commands
- GCP Architecture
- Machine Types
- Single VM
- Cloud SDK
- Kubernetes Cluster
- Life Sciences API
This workflow runs an RNAseq pipeline on Google Cloud Platform (GCP) that starts with raw fastq files and ends with transcript quantification tables generated by Salmon. The pipeline can be run three different ways on GCP: on a single virtual machine (VM), through the Google Life Sciences API, or distributed across a Kubernetes cluster. It relies on the workflow manager snakemake and the conda-compatible package manager mamba to handle dependencies and run the bioinformatic steps.
The only two dependencies needed to run this pipeline are snakemake and mamba. On a single VM, both are installed by the shell script that launches the instance. On the Life Sciences API or Kubernetes, snakemake handles them remotely, but both methods require a local installation to submit the job (more below). One great thing about snakemake is that it uses conda to manage all necessary dependencies by creating a conda environment for each step of the pipeline. This means that you do not have to install any additional programs or tools on the VM, and it also gives you easy control over the version used for each step. Alternatively, if you want to run the pipeline interactively rather than using a shell script, you will need to install mambaforge and snakemake before you can run the pipeline (this also applies to a local install). These can be installed with the following commands.
# Install Mambaforge
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge
# Add conda/mamba to your PATH; double-check the install path before running this
export PATH="$HOME/mambaforge/bin:$PATH"
# Check that the install succeeded and the PATH is set up
which conda
# Install snakemake
mamba install -c conda-forge -c bioconda snakemake
GCP has a few classes of command-line tools that relate to different functions in the cloud. The two big ones we will use here are gcloud and gsutil.
gcloud commands control everything related to Compute Engine, such as gcloud compute instances create for starting a VM instance. gcloud also launches Kubernetes clusters. To control a running Kubernetes cluster you would use the kubectl class of commands.
gsutil commands control everything related to Cloud Storage. For example, gsutil cp can copy files to and from a bucket.
This tutorial assumes that you have access to the GCP console and have already been assigned a project. If you do not have a project, see this tutorial on how to set one up.
Although GCP has more tools and features than could be covered in 10+ tutorials, this RNAseq pipeline only leverages a few. For the single VM method we use Compute Engine and Cloud Storage. The Life Sciences API method uses the Cloud Life Sciences API, Cloud Storage, and the Cloud SDK. Finally, the Kubernetes method uses the Google Kubernetes Engine, Cloud Storage, and the Cloud SDK.
Most of the setup for all three methods involves getting the structure of files within Cloud Storage right. For the single VM solution, we will copy this 'directory' structure onto our virtual machine so that the paths to the necessary files are ready for Snakemake to find them. In actuality, Cloud Storage is blob/object storage, not folder/directory storage, so all items within the bucket (including the 'folders') are considered immutable objects, but you can still organize things into a hierarchy that can be replicated on the instance or accessed remotely via the SDK.
For the sake of this tutorial we are going to create the following 'folder' organization. Note that the name of the bucket has to be globally unique. There are two ways to accomplish this directory structure. The first is to go into the Cloud Storage Console and manually make folders and add files. The other is to add files using gsutil with the desired path. For example, to add my 'snakefile' to the scripts folder before that directory exists, I could run gsutil cp snakefile gs://bucket/scripts/snakefile and it would create the scripts directory for me.
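As a minimal sketch of the gsutil approach, the helper below builds the gs:// destination for a file inside a bucket 'folder' and prints the copy command as a dry run. The function name bucket_dest, the bucket name my-unique-bucket, and the scripts folder are placeholders chosen for illustration, not the tutorial's actual layout.

```shell
# Build the gs:// destination for a local file inside a bucket 'folder'.
# The bucket and folder names passed in below are placeholders only.
bucket_dest() {
  bucket="$1"
  folder="$2"
  file="$3"
  printf 'gs://%s/%s/%s\n' "$bucket" "$folder" "$(basename "$file")"
}

# Dry run: print the copy command rather than executing it.
echo "gsutil cp snakefile $(bucket_dest my-unique-bucket scripts snakefile)"
```

Removing the echo and the surrounding quotes would run the copy for real, assuming gsutil is installed and authenticated.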
GCP has a wide array of machine families available. In the general purpose category, E2 machines are cost-optimized and offer up to 32 vCPUs with up to 128 GB of memory, with a maximum of 8 GB per vCPU. These machines are drawn from a pool that includes some older processors, so they can be a bit slower than the N2 machines (but a lot cheaper). N2 VMs offer up to 80 vCPUs, 8 GB of memory per vCPU, and are available with the Intel Cascade Lake CPU platform. In layman's terms, this means that they are a lot faster, but will end up costing more. N2D machines offer up to 224 vCPUs, while N1 VMs offer up to 96 vCPUs. There are also workload-specific machine types, such as the compute-optimized C2 family (the fastest machines), the memory-optimized M1 and M2 machines, and the accelerator-optimized A2 family.
Choosing a machine type should be a tradeoff between compute power, memory requirements, and cost. For most workflows the E2/N2 general purpose machines will work fine, but some use cases may benefit from the other machine types.
Sometimes the desired machine type is not available in your region, either because of an outage in the region or because another customer is running a really big job. The best thing to do in this case is to pick another machine size or family, and it will usually work fine. For example, if you wanted n2-standard-32 and it isn't available, you can switch to n2-standard-16 or to e2-standard-32.
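The fallback idea in the example above can be sketched as a small helper. This is only an illustration: fallback_types is a hypothetical function, it assumes the FAMILY-CLASS-VCPUS naming pattern of predefined machine types, and it does not check that the suggested types exist or are available in your zone.

```shell
# Suggest fallbacks for a predefined type named FAMILY-CLASS-VCPUS,
# e.g. n2-standard-32. Assumes that naming pattern; it does not check
# that the suggested types actually exist in your zone.
fallback_types() {
  family="${1%%-*}"     # e.g. n2
  rest="${1#*-}"        # e.g. standard-32
  class="${rest%%-*}"   # e.g. standard
  vcpus="${rest##*-}"   # e.g. 32
  # Same family, half the vCPUs
  echo "${family}-${class}-$((vcpus / 2))"
  # Same size in the cheaper E2 family (E2 tops out at 32 vCPUs)
  if [ "$family" != "e2" ] && [ "$vcpus" -le 32 ]; then
    echo "e2-${class}-${vcpus}"
  fi
}

fallback_types n2-standard-32
```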
The idea behind this method is that you launch a VM with a shell script that installs the necessary software, copies over the necessary files from Cloud Storage, runs the snakemake workflow, copies the results back to Cloud Storage, and then shuts down the VM. You will need to modify the shell script called launch-instance.sh to use the paths of your bucket so that data gets copied over correctly. Your gcloud command will have metadata flags that are used by the shell script to set parameters such as the home directory and the number of threads. Note that a shell script attached to an instance runs as root, so it will not start in your home directory. This has implications for where mambaforge is installed (as root) and how you append the path of its bin directory to your $PATH.
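The contents of launch-instance.sh are not shown here, so the following is only a sketch of how a startup script could read a custom metadata value such as homedir. It uses the Compute Engine metadata server, which serves custom attributes under the URL below when the Metadata-Flavor: Google header is sent. The helper names metadata_url and get_metadata are made up for this example.

```shell
# Base URL of the metadata server for custom instance attributes.
METADATA_BASE="http://metadata.google.internal/computeMetadata/v1/instance/attributes"

# Print the URL for one custom metadata key (e.g. homedir).
metadata_url() {
  printf '%s/%s\n' "$METADATA_BASE" "$1"
}

# Fetch a custom metadata value; this only works from inside a GCP VM.
get_metadata() {
  curl -s -H "Metadata-Flavor: Google" "$(metadata_url "$1")"
}

# Inside a startup script one might then write (not executed here):
#   HOMEDIR="$(get_metadata homedir)"
#   export PATH="$HOMEDIR/mambaforge/bin:$PATH"
echo "$(metadata_url homedir)"
```

Reading the home directory from metadata rather than relying on $HOME matters because the startup script runs as root.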
You can think about a VM as a brand new computer: it has very little pre-installed. This means that we either need to install programs and copy over files, or we can attach our VM to an existing disk image that has all our data and programs saved. There are cost and speed tradeoffs to both approaches, but generally it is best practice to keep your compute (VM) and storage (Cloud Storage) separate. The exception may be when you have really large files, like a 100 GB 30x human genome; in that case it may be faster to keep your fastq or BAM files stored on a disk image that you attach when you launch a fresh VM. These decisions will end up being very project-specific.
The gcloud command for launching the VM via a shell script is similar to the following, but make sure you change the variables to match your situation:
The easiest way to get this command for your project is to go to the console, then go to 'Compute Engine', click 'Create Instance' at the top, name the instance, choose the region, and choose the machine type. For Boot Disk, click 'Change' and give it 100GB instead of 10. You can also change the operating system if desired. Under Access Scopes, select 'Allow full access to all Cloud APIs', otherwise you won't be able to access files in Cloud Storage. Then at the very bottom, click on 'command line' and you will have your gcloud command.
All you have to do before launching the instance is to add the metadata flag, --metadata startup-script-url=gs://BUCKET/scripts/launch-instance.sh,homedir=/home/USER
gcloud beta compute --project=$PROJECTNAME instances create $VMNAME --zone=us-east4-c --machine-type=e2-standard-8 --subnet=default --network-tier=PREMIUM --maintenance-policy=MIGRATE --service-account=$SERVICEACCOUNTNAME --scopes=https://www.googleapis.com/auth/cloud-platform --image=debian-10-buster-v20210609 --image-project=debian-cloud --boot-disk-size=200GB --boot-disk-type=pd-balanced --boot-disk-device-name=$BOOTDISKNAME --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --reservation-affinity=any --metadata startup-script-url=gs://BUCKET/scripts/launch-instance.sh,homedir=/home/USER
To actually run this command, you have two options. The easier of the two is to open the Cloud Shell and paste it in. Make sure your startup script is in the bucket so that gcloud can find it. The other option is to run the gcloud command from your local terminal using the Cloud SDK (see next section).
First, install the Cloud SDK following the instructions here.
Next, type gcloud init and follow the onscreen instructions. This starts with specifying a new or existing configuration, then the user, then the project, then the Region and Zone.
At that point you are all set up and can use regular GCP commands like gcloud or gsutil.
Kubernetes is an open-source application (invented by Google) that allows you to run containerized workflows in a distributed fashion. A Kubernetes cluster is similar to an on-premises cluster that you may be familiar with: essentially, you are tying together a bunch of compute nodes and distributing your process across them. A Kubernetes cluster is a group of containers, which are very similar to VMs except that they share a single operating system and are generally controlled centrally.
This method has several advantages over the single VM approach. First, you have a lot more computational power available. While a single VM will eventually reach a maximum size, a cluster can be scaled up or down as needed. Second, there is a lot less setup involved. You launch the cluster, and then send the snakemake job there. There are no shell scripts, no long and complex gcloud commands, and no disk images. It is simpler.
In this method, and in the Life Sciences API described in the next section, compute and storage are kept completely separate. The application will copy files from Cloud Storage as needed to run the pipeline, and then move output files to Cloud Storage according to the directory structure outlined in the snakefile.
To run snakemake on Kubernetes, a few steps are involved. These can be run from the cloud shell, or a VM, or via the SDK. If using the SDK, make sure to follow the steps above, starting with gcloud init in order to connect to the project. We recommend using the SDK if possible.
Before we create the cluster, a bit of local setup is needed. First, make sure your Cloud Storage is set up correctly following the section GCP Cloud Storage Architecture.
Next, you need to have the snakefile, the envs directory, and the config.yaml file all stored locally in a directory.
Make sure that the paths in the snakefile reflect the path to the bucket, rather than the local path, which is different from running snakemake on a single VM. For example, in the rule 'fastqc_raw', instead of saying outdir='qc/raw/fastqc', make sure to say outdir='Bucket/qc/raw/fastqc'
First set up your local directory to push to the cloud
git init
Now you are ready to initiate the cluster. --num-nodes sets the number of containers in the cluster, and --machine-type describes the specs for each container (think VM) as far as CPU and RAM are concerned. You can also specify the disk space for each container with --disk-size.
It is important that you specify machines with enough cores to run all your snakemake rules, otherwise the workflow will fail. For example, if you tell snakemake to use 8 cores for bwa mem and then launch two containers with 2 cores each, no single container can run that rule and the workflow will fail.
gcloud container clusters create $CLUSTERNAME --num-nodes=4 --scopes storage-rw --machine-type "e2-standard-16" --disk-size "200"
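The sizing rule described above (each rule's thread count must fit on one container) can be checked before the cluster is created. The helper below is a hypothetical sketch, not part of snakemake or gcloud. The thread and core counts are the ones from the bwa mem example.

```shell
# Return success only if a rule's thread request fits on a single node.
# threads:    value of 'threads' in the snakemake rule
# node_cores: vCPUs per node, e.g. 16 for e2-standard-16
rule_fits_node() {
  threads="$1"
  node_cores="$2"
  [ "$threads" -le "$node_cores" ]
}

if rule_fits_node 8 2; then
  echo "ok: rule fits on one node"
else
  echo "too big: an 8-thread rule cannot run on 2-core nodes"
fi
```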
Next, authenticate the cluster. The second command will launch a new browser window.
gcloud container clusters get-credentials $CLUSTERNAME
gcloud auth application-default login
Next, turn on versioning for your bucket so that snakemake can overwrite files in it. If you do not do this, the workflow will run the first time, but snakemake will throw an error on subsequent runs about not being able to find files if they already exist in the bucket. Versioning allows snakemake to simply overwrite the files.
gsutil versioning set on gs://BUCKETNAME
Finally, we can launch the snakemake job and point it to your cluster. Notice that the command is very similar to the one used on a single VM, but here we specify kubernetes to tell snakemake to launch on the cluster. For both kubernetes and the LS API below, we need to tell snakemake where to find the remote files in the Bucket with the --default-remote-prefix. Make sure if your bucket name is bucket, you say --default-remote-prefix bucket, not --default-remote-prefix gs://bucket. You do not need the gs:// prefix. -j is the number of jobs you want to run concurrently. This should be the number of cores available, so if your cluster is 4 containers with 16 cores, this can be as high as -j 64. You can see how kubernetes can really allow you to scale in this way.
snakemake --cores --kubernetes --use-conda --default-remote-provider GS --default-remote-prefix $BUCKETNAME -j 32 --forceall -p
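Two details of this command are easy to get wrong: the remote prefix must not include gs://, and -j should not exceed the total cores in the cluster. The sketch below shows one way to derive both values and prints the resulting command as a dry run. The helper names remote_prefix and max_jobs are invented for this example, and the bucket name is a placeholder.

```shell
# Strip a leading gs:// so the value suits --default-remote-prefix.
remote_prefix() {
  printf '%s\n' "${1#gs://}"
}

# Upper bound for -j: nodes in the cluster times cores per node.
max_jobs() {
  echo $(( $1 * $2 ))
}

# Dry run: print the command instead of launching it.
BUCKET="gs://bucket"   # placeholder bucket name
echo "snakemake --kubernetes --use-conda --default-remote-provider GS" \
     "--default-remote-prefix $(remote_prefix "$BUCKET") -j $(max_jobs 4 16)"
```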
Status updates will show up in your terminal but you can also go check on things in the GCP console under 'Kubernetes Engine' which can be reached via the main navigation on the left (hamburger menu/3 horizontal lines). On the left side of the Kubernetes Engine window, you will see 'Clusters' which will show you your cluster and any status updates, and then you can click on 'Workloads' to see the snakemake jobs running.
Snakemake will also give you the kubectl commands for checking the log and status files.
The Life Sciences API is a suite of services and tools for managing, processing, and transforming life sciences data. You can think of it as a prepackaged VM that is capable of handling genomic workflows, without you having to set up any of the infrastructure. As long as your data is organized in Cloud Storage, you can send a snakemake job straight to the LS API without any upfront work, and it will run your workflow. If the Kubernetes approach was simpler than the single VM approach, the LS API is simpler still. Similar to Kubernetes, you need to have the same files stored locally within a local git repository. You also need to have your Cloud Storage organized ahead of time. The same snakefile will work for both Kubernetes and the LS API.
In testing, this method was a bit slower than running on Kubernetes: with test data the workflow took about 2x as long on the LS API as on Kubernetes. This may be an artifact of the LS API still being in beta and not being available in the same region as the Cloud Storage bucket, so more time was spent copying data. This is something to consider: though the LS API is easier to run, it may take a bit longer than Kubernetes, even given the same number of cores. However, once the API is available in more regions, this discrepancy may be reduced.
Again, similar to Kubernetes, a bit of setup is required the first time you run the LS API. First, you need to tell it what service account you are using. To do this, go to 'IAM and Admin' in the console. On the left go to 'Service Accounts', then click on the Compute Engine default service account (its name ends in -compute@developer.gserviceaccount.com). At the top click 'KEYS', then 'ADD KEY', then 'CREATE NEW KEY'. This will download a JSON file. Make sure you put the downloaded file somewhere safe because you cannot re-download the key for that service account. Now you can export the key path to tell GCP to use the Compute Engine service account. Make sure you change the path and file name to match your file path.
export GOOGLE_APPLICATION_CREDENTIALS="$KEYDIR/$FILE.json"
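Because a wrong key path only surfaces later as an authentication error, it can help to verify the file before exporting it. The following is a minimal sketch. use_service_key is a hypothetical helper, and the path in the usage comment is a placeholder.

```shell
# Export the key path only if the file exists and is readable.
use_service_key() {
  if [ -r "$1" ]; then
    export GOOGLE_APPLICATION_CREDENTIALS="$1"
  else
    echo "key file not found or unreadable: $1" >&2
    return 1
  fi
}

# Example usage (placeholder path, not executed here):
#   use_service_key "$HOME/keys/my-service-key.json"
```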
Export your project name.
export GOOGLE_CLOUD_PROJECT=$PROJECTNAME
Next, turn on versioning for your bucket so that snakemake can overwrite files in it (if you haven't already done so in the Kubernetes section). If you do not do this, the workflow will run the first time, but snakemake will throw an error on subsequent runs about not being able to find files if they already exist in the bucket. Versioning allows snakemake to simply overwrite the files.
gsutil versioning set on gs://BUCKETNAME
Now, just run snakemake. Notice that it is very similar to the Kubernetes command, but you specify --google-lifesciences, and you also have to give it a --google-lifesciences-region. Because the API is in Beta, it is not available in all regions, but you can see what is available in the Life Sciences API section in the GCP Console. To find it, go to the main menu, and it says 'Life Sciences' under the 'BIG DATA' Section. You can also monitor the status of the run in that console in the same place, but the terminal also will have a lot of status updates that are helpful.
snakemake --google-lifesciences --default-remote-prefix $BUCKETNAME --use-conda --google-lifesciences-region us-central1 --default-resources "machine_type=n2-standard-32" --cores -j 32 --rerun-incomplete --forceall