Nextflow handles job submissions on SLURM or other environments, and supervises the running jobs. The Nextflow process must therefore keep running until the pipeline is finished. We recommend running it in the background through screen / tmux or a similar tool. Alternatively, you can run Nextflow within a cluster job submitted to your job scheduler.
It is recommended to limit the memory of the Nextflow Java virtual machine. We recommend adding the following line to your environment (typically in ~/.bashrc or ~/.bash_profile):
NXF_OPTS='-Xms1g -Xmx4g'
The typical command for running the pipeline is as follows:
nextflow run main.nf --readPathsFile data/read_pathes_GEUVADIS_GBR_20samples.tsv -profile singularity
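As a minimal sketch, the memory limit and the typical command above can be combined into one detached launch. The `DRY_RUN` switch and the tmux session name `nf_run` are assumptions made for this illustration and are not part of the pipeline. By default the sketch only prints the command it would run.

```shell
# JVM memory limits for the Nextflow process itself (values from the text above).
nxf_opts='-Xms1g -Xmx4g'

# The typical pipeline command, kept as a single string so it can be handed to tmux.
cmd="nextflow run main.nf --readPathsFile data/read_pathes_GEUVADIS_GBR_20samples.tsv -profile singularity"

# Prefix the command with the environment assignment so the limit applies
# inside the detached session as well.
launch="NXF_OPTS='${nxf_opts}' ${cmd}"

# DRY_RUN is a hypothetical switch for this sketch: 1 (default) prints, 0 executes.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "$launch"
else
  # Start a detached tmux session; reattach later with: tmux attach -t nf_run
  tmux new-session -d -s nf_run "$launch"
fi
```

Setting `DRY_RUN=0` starts the run in the background, so closing the terminal does not stop the Nextflow process.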
Note that the pipeline will create the following files in your working directory:
work # Directory containing the nextflow working files
results # Finished results (configurable, see below)
.nextflow.log # Log file from Nextflow
# Other nextflow hidden files, eg. history of pipeline runs and old logs.
Enables quantification of gene expression
Enables quantification of transcript usage
Enables quantification of transcriptional event expression (TxRevise)
Enables quantification of exon expression
Enables quantification of intron-splicing expression
Note: If none of the run_[quantification method] parameters is set to true, the pipeline will only align the reads and then stop.
Use this parameter to choose a configuration profile. Each profile is designed for a different compute environment - follow the links below to see instructions for running on that system. Available profiles are:
docker
- A generic configuration profile to be used with Docker
- Runs using the local executor and pulls software from dockerhub:nfcore/rnaseq
uppmax, uppmax_modules, uppmax_devel
- Designed to be used on the Swedish UPPMAX clusters such as milou, rackham, bianca and irma
- See docs/configuration/uppmax.md
hebbe
- Designed to be run on the c3se Hebbe cluster in Chalmers, Gothenburg.
- See docs/configuration/c3se.md
binac, cfc
- Profiles for clusters at QBiC in Tübingen, Germany
- See docs/configuration/qbic.md
awsbatch
- Profile for running on AWS Batch, specific parameters are described below
aws
- A starter configuration for running the pipeline on Amazon Web Services. Uses docker and Spark.
- See docs/configuration/aws.md
standard
- The default profile, used if -profile is not specified at all. Runs locally and expects all software to be installed and available on the PATH.
- This profile is mainly designed to be used as a starting point for other configurations and is inherited by most of the other profiles.
none
- No configuration at all. Useful if you want to build your own config from scratch and want to avoid loading in the default base config profile (not recommended).
Use this to specify the location of your input FastQ files. For example:
--readPathsFile 'path/to/data/file.tsv'
This file should have 3 columns for paired-end data and 2 columns for single-end data. Make sure the columns are separated by tabs, not spaces. Please see the example of the file here
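Because a space in place of a tab is easy to miss, a quick check of the column count before launching can help. The sketch below is a hypothetical helper, not part of the pipeline. The sample names and paths are made up, and it only checks the number of tab-separated fields, not whether the files exist.

```shell
# check_read_paths FILE EXPECTED_COLS
# Succeeds only if every non-empty line has EXPECTED_COLS tab-separated fields
# (3 for paired-end data, 2 for single-end data).
check_read_paths() {
  awk -F'\t' -v n="$2" '
    NF > 0 && NF != n {
      printf "line %d: %d column(s), expected %d\n", NR, NF, n
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Illustrative paired-end file with made-up sample names and paths.
tmp_tsv="$(mktemp)"
printf 'sampleA\tdata/sampleA_1.fastq.gz\tdata/sampleA_2.fastq.gz\n' > "$tmp_tsv"
printf 'sampleB\tdata/sampleB_1.fastq.gz\tdata/sampleB_2.fastq.gz\n' >> "$tmp_tsv"

if check_read_paths "$tmp_tsv" 3; then
  echo "read paths file looks OK"
fi
```

A line that uses spaces instead of tabs is reported as having a single column, which makes the mistake visible.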
By default, the pipeline expects paired-end data. If you have single-end data, you need to specify --singleEnd
on the command line when you launch the pipeline.
--singleEnd
Three command line flags / config parameters set the library strandedness for a run:
--forward_stranded
--reverse_stranded
--unstranded
If not set, the pipeline will be run as unstranded.
You can set a default in a custom Nextflow configuration file such as one saved in ~/.nextflow/config (see the nextflow docs for more). For example:
params {
reverse_stranded = true
}
If you have a default strandedness set in your personal config file, you can use --unstranded to override it for a given run.
These flags affect the commands used for several steps in the pipeline - namely HISAT2, featureCounts, leafcutter:
--forward_stranded
- HISAT2: --rna-strandness F / --rna-strandness FR
- featureCounts: -s 1
- leafcutter: 1
--reverse_stranded
- HISAT2: --rna-strandness R / --rna-strandness RF
- featureCounts: -s 2
- leafcutter: 2
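The mapping above can be sketched as a small lookup. This is an illustration only and does not show how the pipeline implements it. The function name is hypothetical. It assumes, following the HISAT2 manual, that F / R apply to single-end reads and FR / RF to paired-end reads, and that featureCounts uses -s 0 for unstranded data. The leafcutter values are left out.

```shell
# strand_opts STRAND LAYOUT
#   STRAND: forward | reverse | unstranded
#   LAYOUT: single | paired
# Prints the HISAT2 and featureCounts strandedness options on two lines.
strand_opts() {
  strand="$1"
  layout="$2"
  case "$strand" in
    forward) single="F"; paired="FR"; fc="1" ;;
    reverse) single="R"; paired="RF"; fc="2" ;;
    *)       single="";  paired="";   fc="0" ;;  # unstranded (assumed default)
  esac
  if [ "$layout" = "single" ]; then code="$single"; else code="$paired"; fi
  if [ -n "$code" ]; then
    echo "HISAT2: --rna-strandness $code"
  else
    echo "HISAT2: (no strandness option)"
  fi
  echo "featureCounts: -s $fc"
}

strand_opts reverse paired
```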
By default, the pipeline uses gene_names as an additional gene identifier alongside the ENSEMBL identifiers.
This behaviour can be modified by specifying --fcExtraAttributes
when running the pipeline, which is passed on to featureCounts as an --extraAttributes
parameter.
See the user guide of the Subread package here.
Note that you can also specify more than one desired value, separated by a comma:
--fcExtraAttributes gene_id,...
The only supported aligner is HISAT2. Developed by the same group behind the popular TopHat aligner, HISAT2 has a much smaller memory footprint.
If you prefer, you can specify the full path to your reference genome when you run the pipeline:
--hisat2_index '[path to HISAT2 index]' \
--gtf_hisat2_index '[path to gtf file to build HISAT2 index]' \
--fasta '[path to Fasta reference]' \
--gtf_fc '[path to GTF file to be used by featureCounts]' \
--txrevise_gffs '[GFF reference files for txRevise]' \
--tx_fasta '[path to the Fasta file to be used by Salmon to quantify transcript usage]'
Supply this parameter to save any generated reference genome files to your results folder. These can then be used for future pipeline runs, reducing processing times.
By default, trimmed FastQ files will not be saved to the results directory. Specify this flag (or set to true in your config file) to copy these files when complete.
As above, by default intermediate BAM files from the alignment will not be saved. Set to true to also copy out BAM files from HISAT2 and sorting steps.
By default, the quantification outputs for each individual sample are not saved; only the merged quantification matrices are. Set to true to also keep the individual quantification files.
By default, info and log files generated during quantification are not saved. Set to true to also keep the info and log files.
The pipeline contains a large number of quality control steps. Sometimes, it may not be desirable to run all of them if time and compute resources are limited. The following options make this easy:
--skip_edger
- Skip edgeR MDS plot and heatmap
Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with an error code of 143
(exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original). If it still fails after three times then the pipeline is stopped.
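The arithmetic of the resubmission scheme can be sketched as follows. This is a simplified model, not the pipeline's own code. The function name and the numbers (an 8 GB base request and a 20 GB ceiling) are illustrative, and the assumption that the scaled request is capped by an upper limit such as --max_memory is ours.

```shell
# scaled_request BASE ATTEMPT MAX
# Prints min(BASE * ATTEMPT, MAX): the request grows with each attempt
# (1 x, 2 x, 3 x the original) but never exceeds the configured ceiling.
scaled_request() {
  base="$1"
  attempt="$2"
  max="$3"
  req=$(( base * attempt ))
  if [ "$req" -gt "$max" ]; then
    req="$max"
  fi
  echo "$req"
}

# Illustrative values: 8 GB base request, 20 GB ceiling.
for attempt in 1 2 3; do
  echo "attempt $attempt: $(scaled_request 8 "$attempt" 20) GB"
done
```

With these values the three attempts request 8, 16 and 20 GB, the last one being limited by the ceiling rather than reaching 24 GB.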
Wherever process-specific requirements are set in the pipeline, the default value can be changed by creating a custom config file. See the files in conf
for examples.
Running the pipeline on AWS Batch requires a couple of specific parameters to be set according to your AWS Batch configuration. Please use the awsbatch profile (-profile awsbatch) and then specify all of the following parameters.
The JobQueue that you intend to use on AWS Batch.
The AWS region to run your job in. Default is set to eu-west-1
but can be adjusted to your needs.
Please make sure to also set the -w/-work-dir and --outdir parameters to an S3 storage bucket of your choice - you'll get an error message notifying you if you didn't.
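A pre-flight check along these lines can catch a local path before the run is submitted. It is a sketch under assumptions: the helper name and the bucket name are hypothetical, and the check only looks at the `s3://` prefix, not at whether the bucket exists or is accessible.

```shell
# is_s3_uri VALUE
# Succeeds if VALUE starts with "s3://" followed by at least one character.
is_s3_uri() {
  case "$1" in
    s3://?*) return 0 ;;
    *)       return 1 ;;
  esac
}

workdir="s3://my-example-bucket/work"     # hypothetical bucket
outdir="s3://my-example-bucket/results"   # hypothetical bucket

for dir in "$workdir" "$outdir"; do
  if ! is_s3_uri "$dir"; then
    echo "not an S3 location: $dir" >&2
  fi
done
```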
The output directory where the results will be saved.
Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.
This is used in the MultiQC report (if not default) and in the summary HTML / e-mail (always).
NB: Single hyphen (core Nextflow option)
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.
You can also supply a run name to resume a specific run: -resume [run-name]
. Use the nextflow log
command to show previous run names.
NB: Single hyphen (core Nextflow option)
Specify the path to a specific config file (this is a core Nextflow option).
NB: Single hyphen (core Nextflow option)
Use to set a top-limit for the default memory requirement for each process. Should be a string in the format integer-unit, e.g. --max_memory '8.GB'
Use to set a top-limit for the default time requirement for each process.
Should be a string in the format integer-unit. eg. --max_time '2.h'
Use to set a top-limit for the default CPU requirement for each process.
Should be an integer, e.g. --max_cpus 1
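The expected shape of these three values can be checked before launching. The helpers below are hypothetical and deliberately strict: they accept only the dotted form used in the examples (such as 8.GB and 2.h) with a short list of units, whereas Nextflow itself may accept other spellings.

```shell
# Each helper succeeds only if its argument matches the documented shape.
is_valid_memory() { printf '%s\n' "$1" | grep -Eq '^[0-9]+\.(B|KB|MB|GB|TB)$'; }
is_valid_time()   { printf '%s\n' "$1" | grep -Eq '^[0-9]+\.(ms|s|m|h|d)$'; }
is_valid_cpus()   { printf '%s\n' "$1" | grep -Eq '^[1-9][0-9]*$'; }

# Values taken from the examples in the text.
max_memory='8.GB'
max_time='2.h'
max_cpus='1'

is_valid_memory "$max_memory" || echo "unexpected --max_memory value: $max_memory" >&2
is_valid_time "$max_time"     || echo "unexpected --max_time value: $max_time" >&2
is_valid_cpus "$max_cpus"     || echo "unexpected --max_cpus value: $max_cpus" >&2
```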
Submit arbitrary cluster scheduler options (not available for all config profiles). For instance, you could use --clusterOptions '-p devcore' to run on the development node (though this won't work with the default process time requests).
The bin
directory contains some scripts used by the pipeline which may also be run manually:
dexseq/*
- Script used to prepare annotation for exon expression
edgeR_heatmap_MDS.r
- edgeR script used in the Sample Correlation process