
# FLORA: Usage


## Introduction

Nextflow handles job submission on SLURM or other environments and supervises the running jobs. The Nextflow process must therefore keep running until the pipeline has finished. We recommend running the process in the background through `screen`, `tmux` or a similar tool. Alternatively, you can run Nextflow within a cluster job submitted to your job scheduler.
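For example, one way to keep the Nextflow process alive after you log out is `nohup` (a sketch only; a `screen` or `tmux` session works equally well, and the log file name here is an arbitrary choice):

```shell
# Hypothetical example: run Nextflow in the background so it survives logout.
# A screen/tmux session is an equally good alternative.
nohup nextflow run main.nf > flora_nextflow.log 2>&1 &
echo "Nextflow started in the background (PID $!)"
```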

It is also recommended to limit the memory used by the Nextflow Java virtual machine. We recommend adding the following line to your environment (typically in `~/.bashrc` or `~/.bash_profile`):

```bash
NXF_OPTS='-Xms1g -Xmx4g'
```

## Install the pipeline

### Local installation

How to install FLORA:

```bash
git clone https://gitlab.ifremer.fr/cn7ab95/FLORA.git
```

## Running the pipeline

The simplest command for running the workflow is:

```bash
nextflow run main.nf
```

This will launch the workflow using local configurations.

For our own usage, we adapt the configuration to our supercomputer and submit the workflow through our scheduler using the provided PBS script:

```bash
qsub run-main.nf
```

Note that the pipeline will create the following files in your working directory:

```
$SCRATCH/flora_workdir		# Directory containing the Nextflow working files
$PWD/results			# Finished results (configurable, see below)
$PWD/.nextflow.log		# Log file from Nextflow
# Other Nextflow hidden files, e.g. history of pipeline runs and old logs.
```

## Updating the pipeline

When you run the above command, Nextflow executes the pipeline code from your git clone, even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, update your clone regularly:

```bash
cd FLORA
git pull
```

## Reproducibility

It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.

First, go to the FLORA releases page and find the latest version number (e.g. `v1.0.0`), then check out that tag:

```bash
cd FLORA
git checkout v1.0.0
```

## Mandatory arguments

### `--rawdata`

Path to the RNAseq raw data files in FASTQ format.

### `--rrna_db`

Path to the Bowtie2 index of the SILVA rRNA database.

### `--samples_file`

Path to a text file that describes the data (condition, replicate), like the following example:

```
cond_A	cond_A_rep1	reads_A_rep1_R1.fq	reads_A_rep1_R2.fq
cond_A	cond_A_rep2	reads_A_rep2_R1.fq	reads_A_rep2_R2.fq
cond_B	cond_B_rep1	reads_B_rep1_R1.fq	reads_B_rep1_R2.fq
cond_B	cond_B_rep2	reads_B_rep2_R1.fq	reads_B_rep2_R2.fq
```
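The file is tab-separated, with one row per replicate and four columns: condition, replicate name, R1 FASTQ, R2 FASTQ. A small sketch for sanity-checking such a file before launching the pipeline (the file name `samples.txt` is an assumption; here we first write two example rows so the check can be run standalone):

```shell
# Write two example rows to samples.txt, then verify that every row
# has exactly four tab-separated fields (condition, replicate, R1, R2).
printf 'cond_A\tcond_A_rep1\treads_A_rep1_R1.fq\treads_A_rep1_R2.fq\n' >  samples.txt
printf 'cond_B\tcond_B_rep1\treads_B_rep1_R1.fq\treads_B_rep1_R2.fq\n' >> samples.txt

awk -F'\t' 'NF != 4 { printf "line %d: expected 4 fields, found %d\n", NR, NF; bad = 1 }
            END { exit bad }' samples.txt && echo "samples.txt: format OK"
```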

### `--min_length`

The minimum length of reads to keep after quality trimming.

### `--min_quality`

The minimum quality of bases in each read.

### `--stringency`

The overlap with adapter sequence required to trim a sequence.

### `--error_rate`

The maximum allowed error rate.
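Putting the mandatory arguments together, a full invocation might look like the following sketch. Every path and threshold value below is a placeholder chosen for illustration, not a tested default:

```shell
# Hypothetical invocation combining the mandatory arguments above;
# all paths and values are placeholders.
nextflow run main.nf \
  --rawdata 'data/*_R{1,2}.fq' \
  --rrna_db /path/to/silva_bowtie2_index \
  --samples_file samples.txt \
  --min_length 50 \
  --min_quality 20 \
  --stringency 3 \
  --error_rate 0.1
```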

## Job resources

Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps, if the job exits with an error code of 143 (exceeded requested resources) it will automatically be resubmitted with higher requests (2 × the original, then 3 × the original). If a step still fails after three attempts, the pipeline stops.
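The code 143 is not arbitrary: a process killed by SIGTERM exits with status 128 + 15 = 143, and SIGTERM is what schedulers typically send when a job exceeds its requested resources. A quick demonstration:

```shell
# A shell killed by SIGTERM exits with status 128 + 15 = 143,
# the same code a scheduler produces when it terminates an over-budget job.
sh -c 'kill -TERM $$'
echo "exit status: $?"
```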

## Other command line parameters

### `--outdir`

The output directory where the results will be published.

### `-w`/`--work-dir`

The temporary directory where intermediate data will be written.

### `-name`

Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.

### `-resume`

Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.

You can also supply a run name to resume a specific run: `-resume [run-name]`. Use the `nextflow log` command to show previous run names.

**NB:** Single hyphen (core Nextflow option).

### `--max_memory`

Use to set a top-limit for the default memory requirement of each process. Should be a string in the format integer-unit, e.g. `--max_memory '8.GB'`.

### `--max_time`

Use to set a top-limit for the default time requirement of each process. Should be a string in the format integer-unit, e.g. `--max_time '2.h'`.

### `--max_cpus`

Use to set a top-limit for the default CPU requirement of each process. Should be an integer, e.g. `--max_cpus 1`.
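The three caps can be combined on a single command line; a sketch, where the specific values are examples only and should be adapted to your cluster:

```shell
# Hypothetical example: cap per-process resources for a constrained cluster.
nextflow run main.nf --max_memory '16.GB' --max_time '48.h' --max_cpus 8
```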