Updated to containerized nextflow pipeline
msauria committed Jul 18, 2024
1 parent 5fd728f commit e219c6a
Showing 8 changed files with 410 additions and 30 deletions.
32 changes: 32 additions & 0 deletions .github/workflows/build_docker.yml
@@ -0,0 +1,32 @@
# Build dockerfile on change
name: Build Docker (env/beagle.Dockerfile)

on:
  push:
    paths:
    - 'env/beagle.Dockerfile'
    - '.github/workflows/build_docker.yml'
  pull_request:
    paths:
    - 'env/beagle.Dockerfile'
    - '.github/workflows/build_docker.yml'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2

    # Build Tools
    - name: Build and Publish
      uses: elgohr/Publish-Docker-Github-Action@master
      with:
        name: andersenlab/beagle
        tag: "${{ steps.current-time.formattedTime }}"
        username: ${{ secrets.KSE_DOCKER_USER }}
        password: ${{ secrets.KSE_DOCKER_PASS }}
        snapshot: true
        dockerfile: beagle.Dockerfile
        workdir: "env"
        tags: "latest"
        cache: true
136 changes: 135 additions & 1 deletion README.md
@@ -1 +1,135 @@
# impute-nf
# VCF Imputation

The [impute-nf](https://github.com/AndersenLab/impute-nf) pipeline subsets isotype reference strains from a hard-filtered VCF file, creates a SNV-only VCF, and imputes a new VCF. This step is required for fine-mapping with NemaScan.

This page details how to run the pipeline.

# Pipeline overview

```
########## # ##
## ##### # #
## # #
## # ## ## ### # # # ## # ## ####
## # # # # # # # # # # ### # # #
## # # # # # # # # ### # # #
## # # # # # # # # # # # #
########## # # # ### ### # ### # # #
#
#
#
parameters description Set/Default
========== =========== ========================
--debug Set to 'true' to test (optional)
--species Species: 'c_elegans', 'c_tropicalis' or 'c_briggsae' (required)
--vcf hard filtered vcf to calculate variant density (required)
--out output folder name (optional)
--chrI | chrII | chrIII... Window and overlap for each chromosome (optional)
```

## Software Requirements

* The latest update requires Nextflow version 23+. On Rockfish, you can access this version by loading the `nf23_env` conda environment prior to running the pipeline command:

```
module load python/anaconda
source activate /data/eande106/software/conda_envs/nf23_env
```
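
Before launching the pipeline, it can help to confirm that the active environment really provides Nextflow 23 or newer. The following is a minimal sketch and not part of the pipeline: `nf_major_version` and `check_nextflow` are hypothetical helper names, and the parsing assumes the version string contains a token of the form `23.10` (as printed by `nextflow -v`).

```shell
# Hypothetical helper: extract the major version from a version string
# such as "nextflow version 23.10.1.5891".
nf_major_version() {
    # Take the first token shaped like digits.digits and keep the part before the dot.
    printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+' | head -n 1 | cut -d. -f1
}

# Hypothetical helper: succeed only if the major version is at least 23.
check_nextflow() {
    major="$(nf_major_version "$1")"
    [ -n "$major" ] && [ "$major" -ge 23 ]
}

# Only query Nextflow if it is actually on the PATH.
if command -v nextflow >/dev/null 2>&1; then
    if check_nextflow "$(nextflow -v 2>/dev/null)"; then
        echo "Nextflow version OK"
    else
        echo "Nextflow 23+ required; activate nf23_env first" >&2
    fi
fi
```

If the check fails, load the `nf23_env` conda environment as shown above and try again.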

### Relevant Docker Images

*Note: Before 20220301, this pipeline was run using existing conda environments on QUEST. However, these have since been migrated to docker images to allow for better control and reproducibility across platforms. If you need to access the conda version, you can always run an old commit with `nextflow run andersenlab/post-gatk-nf -r 20220216-Release`*

* `andersenlab/beagle:5.2` ([link](https://hub.docker.com/r/andersenlab/beagle)): This Docker image is built within this pipeline using GitHub Actions. Whenever a change is made to `env/beagle.Dockerfile` or `.github/workflows/build_docker.yml`, GitHub Actions will build a new Docker image and push it if the build succeeds.

Add the following line to your `~/.bash_profile`. It ensures that any Singularity images you download are stored in a shared location on `/vast/eande106`, so other users can reuse them without downloading the same image again.

```
# add singularity cache
export SINGULARITY_CACHEDIR='/vast/eande106/singularity/'
```
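
If you set up several pipelines, it is easy to end up with the same export repeated in your profile. The sketch below appends a line only when it is not already present. `add_line_once` is a hypothetical helper name, not something provided by the pipeline or by Rockfish.

```shell
# Hypothetical helper: append a line to a file only if it is not already there.
# $1: the exact line to add; $2: the file to modify.
add_line_once() {
    line="$1"
    file="$2"
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: quiet
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example usage (uncomment to modify your own profile):
# add_line_once "export SINGULARITY_CACHEDIR='/vast/eande106/singularity/'" "$HOME/.bash_profile"
```

Running the helper twice with the same arguments leaves a single copy of the line in the file.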

>[!Note]
>If you need to work with the docker container, you will need to create an interactive session as singularity can't be run on Rockfish login nodes.
>
>```
>interact -n1 -pexpress
>module load singularity
>singularity shell [--bind local_dir:container_dir] /vast/eande106/singularity/<image_name>
>```
# Usage
*Note: if you are having issues running Nextflow or need reminders, check out the [Nextflow](http://andersenlab.org/dry-guide/rockfish/rf-nextflow/) page.*
## Testing on Rockfish
*This command uses a test dataset*
```
nextflow run -latest andersenlab/impute-nf --debug
```
## Running on Rockfish
You should run this in a screen or tmux session.
```
nextflow run -latest andersenlab/impute-nf --vcf <path_to_vcf> --species <species>
```
# Parameters
## -profile
There are three configuration profiles for this pipeline.
* `rockfish` - Used for running on Rockfish (default).
* `quest` - Used for running on Quest.
* `local` - Used for local development.
>[!Note]
>If you forget to add a `-profile`, the `rockfish` profile will be chosen as default
## --debug
You should use `--debug` for testing/debugging purposes. This will run the debug test set (located in the `test_data` folder).
For example:
```
nextflow run -latest andersenlab/impute-nf --debug
```
## --species
Options: c_elegans, c_briggsae, or c_tropicalis
## --vcf
Path to the hard-filtered vcf output from [`wi-gatk`](https://github.com/AndersenLab/wi-gatk). VCF should contain **ALL** strains.
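
A typo in `--species` or `--vcf` is cheaper to catch before a job is submitted. The following is a minimal pre-flight sketch, assuming only the three species names and the flags documented on this page. `preflight` is a hypothetical helper; it prints the command it would run rather than running it.

```shell
# Hypothetical pre-flight check; not part of the pipeline itself.
# $1: species name; $2: path to the hard-filtered VCF.
preflight() {
    case "$1" in
        c_elegans|c_briggsae|c_tropicalis) ;;
        *) echo "unknown species: $1" >&2; return 1 ;;
    esac
    if [ ! -r "$2" ]; then
        echo "cannot read VCF: $2" >&2
        return 1
    fi
    # Print the command that would be run, using the flags documented above.
    echo "nextflow run -latest andersenlab/impute-nf --vcf $2 --species $1"
}
```

For example, `preflight c_elegans <path_to_vcf>` prints the full `nextflow run` command when both checks pass and returns a non-zero status otherwise.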
## --chrI|chrII|chrIII|chrIV|chrV|chrX|MtDNA (optional)
The window size and overlap to use as inputs to Beagle for each chromosome. The current values were checked and chosen by previous lab members and Erik. Some chromosomes might require a window size of 3 and an overlap of 1. In a recent conversation, the person who maintains Beagle suggested that we should probably use Beagle's default values unless we have run simulations showing that these values perform better. This is worth revisiting in the future.
## --out (optional)
__default__ - `impute-YYYYMMDD`
A directory in which to output results. If you have set `--debug`, the default output directory will be `impute-YYYYMMDD-debug`.
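
The pipeline builds this name itself, so nothing below is required. As a small sketch of the naming rule described above, the hypothetical helper `default_outdir` reproduces it in the shell, which can be handy when a wrapper script needs to know where results will land.

```shell
# Hypothetical helper: build the default output directory name.
# $1: date stamp in YYYYMMDD form; $2: "true" when --debug is set.
default_outdir() {
    stamp="$1"
    debug="${2:-false}"
    if [ "$debug" = "true" ]; then
        printf 'impute-%s-debug\n' "$stamp"
    else
        printf 'impute-%s\n' "$stamp"
    fi
}

# Today's default, using the system clock.
default_outdir "$(date +%Y%m%d)"
```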
# Output
```
└── variation
    ├── WI.20240718.hard-filter.isotype.SNV.vcf.gz
    ├── WI.20240718.hard-filter.isotype.SNV.vcf.gz.tbi
    ├── WI.20240718.impute.isotype.SNV.vcf.gz
    └── WI.20240718.impute.isotype.SNV.vcf.gz.tbi
```
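
After a run finishes, a quick check that all four files exist and are non-empty can catch a silently failed step. This is a sketch under the assumption that the file names follow the pattern in the tree above, with the date stamp supplied by you. `check_outputs` is a hypothetical helper.

```shell
# Hypothetical helper: report any expected output file that is missing or empty.
# $1: output directory (e.g. impute-20240718); $2: date stamp used in the file names.
check_outputs() {
    outdir="$1"
    stamp="$2"
    missing=0
    for kind in hard-filter impute; do
        for ext in vcf.gz vcf.gz.tbi; do
            f="$outdir/variation/WI.$stamp.$kind.isotype.SNV.$ext"
            if [ ! -s "$f" ]; then
                echo "missing or empty: $f" >&2
                missing=1
            fi
        done
    done
    return "$missing"
}
```

Calling `check_outputs impute-20240718 20240718` returns 0 only when the two VCFs and their `.tbi` indexes are all present.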
44 changes: 44 additions & 0 deletions conf/local.config
@@ -0,0 +1,44 @@

/*
    LOCAL

    For running the pipeline locally
*/

process {

    withLabel: xs {
        cpus = 2
        memory = 1.GB
    }

    withLabel: sm {
        cpus = 2
        memory = 2.GB
    }

    withLabel: md {
        cpus = 4
        memory = 2.GB
    }

    withLabel: lg {
        cpus = 4
        memory = 2.GB
    }

    withLabel: ml {
        cpus = 4
        memory = 2.GB
    }

    withLabel: xl {
        cpus = 4
        memory = 4.GB
    }

}

docker {
    enabled = true
}
67 changes: 67 additions & 0 deletions conf/quest.config
@@ -0,0 +1,67 @@
/*
    Quest Configuration
*/

params{
    baseDir = '/projects/b1042/AndersenLab'
    workDir = '/projects/b1042/AndersenLab/work'
    dataDir = '/projects/b1042/AndersenLab/data'
    softwareDir = '/projects/b1042/AndersenLab/software'
}

process {
    executor = 'slurm'
    queue = 'genomicsguestA'
    errorStrategy='retry'
    maxRetries=3

    withLabel: xs {
        clusterOptions = '-A b1042 -t 4:00:00 -e errlog.txt'
        cpus = 1
        memory = "4.GB"
    }

    withLabel: sm {
        clusterOptions = '-A b1042 -t 4:00:00 -e errlog.txt'
        cpus = 2
        memory = "8.GB"
    }

    withLabel: md {
        clusterOptions = '-A b1042 -t 4:00:00 -e errlog.txt'
        cpus = 4
        memory = "16.GB"
    }

    withLabel: ml {
        clusterOptions = '-A b1042 -t 12:00:00 -e errlog.txt'
        cpus = 16
        memory = "64.GB"
    }

    withLabel: lg {
        clusterOptions = '-A b1042 -t 24:00:00 -e errlog.txt'
        cpus = 48
        memory = "190.GB"
    }

    withLabel: xl {
        clusterOptions = '-A b1042 -t 24:00:00 -e errlog.txt'
        cpus = 48
        memory = "1500.GB"
    }
}

executor {
    queueSize=500
    submitRateLimit=10
}

singularity {
    enabled = true
    autoMounts = true
    cacheDir = "${params.baseDir}/singularity"
    pullTimeout = '20 min'
}


72 changes: 72 additions & 0 deletions conf/rockfish.config
@@ -0,0 +1,72 @@
/*
    Rockfish Configuration
*/

params {
    baseDir = '/vast/eande106'
    workDir = '/vast/eande106/work'
    dataDir = '/vast/eande106/data'
    softwareDir = '/data/eande106/software'
}

process {
    executor = 'slurm'
    queueSize = 100

    withLabel: xs {
        clusterOptions = '-A eande106 -t 2:00:00 -e errlog.txt -N 1'
        cpus = 1
        memory = "4G"
        queue = "shared"
    }

    withLabel: sm {
        clusterOptions = '-A eande106 -t 2:00:00 -e errlog.txt -N 1'
        cpus = 2
        memory = "8G"
        queue = "shared"
    }

    withLabel: md {
        clusterOptions = '-A eande106 -t 2:00:00 -e errlog.txt -N 1'
        cpus = 4
        memory = "16G"
        queue = "shared"
    }

    withLabel: ml {
        clusterOptions = '-A eande106 -t 30:00:00 -e errlog.txt -N 1'
        cpus = 16
        memory = "64G"
        queue = "shared"
    }

    withLabel: lg {
        clusterOptions = '-A eande106 -t 2:00:00 -e errlog.txt -N 1 --ntasks-per-node 1 --cpus-per-task 48'
        // cpus = 48
        //memory = "190G"
        queue = "parallel"
    }

    withLabel: xl {
        clusterOptions = '-A eande106_bigmem -t 4:00:00 -e errlog.txt -N 1'
        cpus = 48
        memory = "1500G"
        queue = "bigmem"
    }

}

executor {
    queueSize=500
    submitRateLimit=10
}

singularity {
    enabled = true
    autoMounts = true
    cacheDir = "${params.baseDir}/singularity"
    pullTimeout = '20 min'
}


27 changes: 27 additions & 0 deletions env/beagle.Dockerfile
@@ -0,0 +1,27 @@
FROM openjdk:8-jre
MAINTAINER Mike Sauria <[email protected]>

RUN wget https://faculty.washington.edu/browning/beagle/beagle.28Jun21.220.jar -O /beagle.28Jun21.220.jar && \
    echo "#!/bin/bash" > /usr/local/sbin/beagle && \
    echo "java -Xmx98g -jar /beagle.28Jun21.220.jar \$*" >> /usr/local/sbin/beagle && \
    chmod a+rx /usr/local/sbin/beagle

RUN apt-get --allow-releaseinfo-change update && \
    apt-get install -y libbz2-dev libvcflib-tools libvcflib-dev procps autoconf automake make gcc \
    perl zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev libncurses5-dev && \
    rm -rf /var/lib/apt/lists/* && \
    wget https://github.com/samtools/bcftools/releases/download/1.3.1/bcftools-1.3.1.tar.bz2 -O bcftools.tar.bz2 && \
    tar -xjvf bcftools.tar.bz2 && \
    cd bcftools-1.3.1 && \
    make && \
    make prefix=/usr/local/bin install && \
    mv /usr/local/bin/bin/bcftools /usr/bin/bcftools && \
    rm -rf /usr/local/bin/bin && \
    cd /usr/local/bin && \
    wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2 && \
    tar -vxjf htslib-1.9.tar.bz2 && \
    cd htslib-1.9 && \
    make && \
    mv bgzip ../ && \
    cd ../ && \
    rm -rf htslib-1.9