
3). Running

Duncan Berger edited this page Nov 11, 2024 · 2 revisions


Usage

There is only one mandatory parameter for running SOMA, an input file (format detailed below).

./run_soma --input input.csv

Input file structure

The input file (e.g. 'input.csv') is a six-column comma-separated file with the following structure:

run_id,sample_id,sample_type,read1,read2,group
  • run_id: Run identifier, will determine the highest level directory name in the results directory

  • sample_id: Sample identifier, will determine the subdirectory where results are stored per-sample

  • sample_type: Sample description, will be added to the reports, but doesn't change how the sample is processed.

  • read1: Location of input forward read FASTQ files.

  • read2: Location of input reverse read FASTQ files.

  • group: Group ID, can be any string.

    ℹ️ Input file formatting

    • Any number of samples can be included, provided no two rows share both the same run_id and sample_id.
    • Inputs containing spaces should be enclosed in quotation marks (").
    • Periods ('.') will automatically be replaced with underscores ('_') in the output.

Example input file:

run_id,sample_id,sample_type,read1,read2,group
RUN01,SAMPLE1,BLOOD,/data/reads/RUN01.SAMPLE_1_R1.BLOOD.fq.gz,/data/reads/RUN01.SAMPLE_1_R2.BLOOD.fq.gz,G1
RUN01,SAMPLE2,BLOOD,/data/reads/RUN01.SAMPLE_2_R1.BLOOD.fq.gz,/data/reads/RUN01.SAMPLE_2_R2.BLOOD.fq.gz,G1
RUN01,SAMPLE3,SALIVA,/data/reads/RUN01.SAMPLE_3_R1.SALIVA.fq.gz,/data/reads/RUN01.SAMPLE_3_R2.SALIVA.fq.gz,G1
RUN02,SAMPLE1,SKIN,/data/reads/RUN02.SAMPLE_1_R1.SKIN.fq.gz,/data/reads/RUN02.SAMPLE_1_R2.SKIN.fq.gz,G1
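
Before launching a run, it can help to check the input file against the rules above. The following is a minimal sketch, not part of SOMA: the function name `check_input_sheet` is hypothetical, and it assumes that no field contains an embedded comma. It reports rows that do not have six fields and rows that repeat an earlier run_id/sample_id pair.

```shell
# check_input_sheet FILE
# Report rows without six comma-separated fields and rows that repeat an
# earlier run_id + sample_id pair. Returns non-zero if any problem is found.
# Assumption: fields contain no embedded commas (quoted commas are not parsed).
check_input_sheet() {
  awk -F',' '
    NR == 1 { next }                      # skip the header row
    NF == 0 { next }                      # ignore blank lines
    NF != 6 {
      printf "line %d: expected 6 fields, found %d\n", NR, NF
      bad = 1
      next
    }
    seen[$1 FS $2]++ {                    # true from the second occurrence on
      printf "line %d: duplicate run_id/sample_id pair %s/%s\n", NR, $1, $2
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}
```

Calling `check_input_sheet input.csv` prints nothing and returns 0 when the file passes both checks.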

Further examples can be found here.

Optional parameters

Commonly used optional parameters include:

Help options:
  --help                                       Display help text.
  --validationShowHiddenParams                 Display all parameters.
  --print_modules                              Print the full list of available modules.

Run options:
  -profile                                     Executor to use. (accepted: singularity, apptainer, docker) [default: singularity] 
  -resume                                      If possible resume the pipeline at the last completed process. 
  --outdir                                     The output directory where the results will be saved (use absolute paths on cloud infrastructure). [default: results] 
   
Resource usage:
  --max_cpus                                   Maximum number of CPUs that can be requested for any single job. [default: 36]
  --max_memory                                 Maximum amount of memory that can be requested for any single job. [default: 90.GB]
  --max_time                                   Maximum amount of time that can be requested for any single job. [default: 240.h]

Execution options:
  --skip_assembly                              Skip read assembly.
  --skip_taxonomic_profiling                   Skip read-based taxonomic profiling.
  --skip_bacterial_typing                      Skip metagenome assembled genome analyses.
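
As an illustration of how these options combine, the sketch below assembles a single command from flags listed above. The values (the `docker` profile, the output path, the resource caps) are placeholders chosen for the example, not recommendations. The command is printed rather than executed so it can be inspected first.

```shell
# Assemble the command as positional parameters so it can be inspected first.
# All paths and values below are placeholders.
set -- ./run_soma \
  --input input.csv \
  -profile docker \
  --outdir /path/to/results \
  --max_cpus 24 \
  --max_memory "80.GB" \
  --skip_assembly

# Dry run: print the command without executing it.
echo "$*"

# To launch the pipeline, run: "$@"
```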

Tips for improving speed and efficiency

Skipping major analysis steps

When specified, the following parameters will skip substantial sections of the pipeline, saving resources if the results are not of interest:

  --skip_assembly                                 Skip read assembly.
  --skip_taxonomic_profiling                      Skip read-based taxonomic profiling.
  --skip_prokarya_typing                          Skip metagenome assembled genome analyses.

Skipping read-based taxonomic annotation

Setting a taxonomic database path to an empty string skips the associated step, reducing overall runtime.

  --TAXONOMIC_PROFILING.krakendb=""               Skip Kraken2 taxonomic profiling
  --TAXONOMIC_PROFILING.centrifugerdb=""          Skip Centrifuger taxonomic profiling

Adjust RAM/CPU usage

Depending on your available computing resources, it may be necessary to change the preset resource usage defaults. The maximum RAM and CPU usage can be changed with command line arguments as follows:

  --max_cpus 24                                   Maximum number of CPUs that can be requested for any single job. [default: 36]
  --max_memory "80.GB"                            Maximum amount of memory that can be requested for any single job. [default: 90.GB]

If you are running out of memory, or have limited swap memory, you can alter the preset resource usage for individual processes in conf/base.config. Raising the RAM and CPU requirements of intensive processes to more than 50% of the total CPU/RAM allocation stops SOMA from running several intensive jobs simultaneously, because two such jobs can no longer fit within the allocation at once.

Specifically, in the section:

withLabel:process_high {
   cpus   = { check_max( 22    * task.attempt, 'cpus'    ) }
   memory = { check_max( 86.GB * task.attempt, 'memory'  ) }
   time   = { check_max( 16.h  * task.attempt, 'time'    ) }
}
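
The ">50%" threshold can be worked out with simple integer arithmetic. The sketch below is illustrative only: the helper name `exclusive_share` is hypothetical, and the totals are assumptions that should be replaced with the CPUs and memory actually available to your executor.

```shell
# exclusive_share TOTAL
# Print the smallest whole number strictly greater than half of TOTAL.
# Two jobs that each request this much cannot fit within TOTAL at once.
exclusive_share() {
  echo $(( $1 / 2 + 1 ))
}

total_cpus=36     # assumption: CPUs available to the executor
total_mem_gb=90   # assumption: memory (in GB) available to the executor

echo "process_high cpus   >= $(exclusive_share "$total_cpus")"
echo "process_high memory >= $(exclusive_share "$total_mem_gb").GB"
```

With the default caps of 36 CPUs and 90 GB, this gives minimums of 19 CPUs and 46 GB. The values in the excerpt above (22 CPUs and 86 GB) already exceed both.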

Other parameters

Skip geNomad neural network-based classification. This reduces runtime at the cost of accuracy:

  --GENOMAD_ENDTOEND.args="--disable-nn-classification"