3). Running
There is only one mandatory parameter for running SOMA, an input file (format detailed below).
./run_soma --input input.csv
The input file (e.g. 'input.csv') is a six-column comma-separated file with the following structure:
run_id,sample_id,sample_type,read1,read2,group
- run_id: Run identifier, will determine the highest level directory name in the results directory.
- sample_id: Sample identifier, will determine the subdirectory where results are stored per-sample.
- sample_type: Sample description, will be added to the reports, but doesn't change how the sample is processed.
- read1: Location of input forward read FASTQ files.
- read2: Location of input reverse read FASTQ files.
- group: Group ID, can be any string.
ℹ️ Input file formatting
- Any number of samples can be included, provided no two samples share both the same run_id and sample_id.
- Inputs containing spaces should be enclosed in quotation marks (").
- Periods ('.') will automatically be replaced with underscores ('_') in the output.
run_id,sample_id,sample_type,read1,read2,group
RUN01,SAMPLE1,BLOOD,/data/reads/RUN01.SAMPLE_1_R1.BLOOD.fq.gz,/data/reads/RUN01.SAMPLE_1_R2.BLOOD.fq.gz,G1
RUN01,SAMPLE2,BLOOD,/data/reads/RUN01.SAMPLE_2_R1.BLOOD.fq.gz,/data/reads/RUN01.SAMPLE_2_R2.BLOOD.fq.gz,G1
RUN01,SAMPLE3,SALIVA,/data/reads/RUN01.SAMPLE_3_R1.SALIVA.fq.gz,/data/reads/RUN01.SAMPLE_3_R2.SALIVA.fq.gz,G1
RUN02,SAMPLE1,SKIN,/data/reads/RUN02.SAMPLE_1_R1.SKIN.fq.gz,/data/reads/RUN02.SAMPLE_1_R2.SKIN.fq.gz,G1
Further examples can be found here.
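The uniqueness rule above (no two rows sharing both run_id and sample_id) can be checked before launching a run. The following is a minimal sketch, not part of SOMA: `find_duplicate_samples` is a hypothetical helper name, and it assumes the file has a header row and no quoted fields that contain commas.

```shell
# Hypothetical helper (not part of SOMA): print each run_id,sample_id pair
# that appears more than once in a sample sheet.
# Assumes a header row and no quoted fields containing commas.
find_duplicate_samples() {
    awk -F',' 'NR > 1 {
        key = $1 "," $2
        # Report a pair once, on its second occurrence.
        if (seen[key]++ == 1) print key
    }' "$1"
}
```

Running `find_duplicate_samples input.csv` prints nothing when every run_id/sample_id pair is unique.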
Commonly used optional parameters include:
Help options:
--help Display help text.
--validationShowHiddenParams Display all parameters.
--print_modules Print the full list of available modules.
Run options:
-profile Executor to use. (accepted: singularity, apptainer, docker) [default: singularity]
-resume If possible resume the pipeline at the last completed process.
--outdir The output directory where the results will be saved (use absolute paths on cloud infrastructure). [default: results]
Resource usage:
--max_cpus Maximum number of CPUs that can be requested for any single job. [default: 36]
--max_memory Maximum amount of memory that can be requested for any single job. [default: 90.GB]
--max_time Maximum amount of time that can be requested for any single job. [default: 240.h]
Execution options:
--skip_assembly Skip read assembly.
--skip_taxonomic_profiling Skip read-based taxonomic profiling.
--skip_bacterial_typing Skip metagenome assembled genome analyses.
When specified, the following parameters will skip substantial sections of the pipeline, saving resources if the results are not of interest:
--skip_assembly Skip read assembly.
--skip_taxonomic_profiling Skip read-based taxonomic profiling.
--skip_prokarya_typing Skip metagenome assembled genome analyses.
Excluding taxonomic databases will skip the associated step, reducing overall runtime.
--TAXONOMIC_PROFILING.krakendb="" Skip Kraken2 taxonomic profiling
--TAXONOMIC_PROFILING.centrifugerdb="" Skip Centrifuger taxonomic profiling
Depending on your available computing resources, it may be necessary to change the preset resource usage defaults. The maximum RAM and CPU usage can be changed with command line arguments as follows:
--max_cpus 24 Maximum number of CPUs that can be requested for any single job. [default: 36]
--max_memory "80.GB" Maximum amount of memory that can be requested for any single job. [default: 90.GB]
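As an illustration, these limits can be combined with the run options listed earlier in a single invocation. The sketch below only uses flags documented in this section. The wrapper function `soma_limited` and the `SOMA_RUN` variable are hypothetical conveniences, and the chosen values (docker profile, 24 CPUs, 80 GB) are examples rather than recommendations.

```shell
# Hypothetical wrapper (not part of SOMA): assemble a run with reduced
# resource limits. By default it only prints the command so it can be
# inspected; set SOMA_RUN=1 to actually launch it.
soma_limited() {
    # "$1" is the sample sheet; it is expanded before the arguments are reset.
    set -- ./run_soma --input "$1" -profile docker \
        --max_cpus 24 --max_memory "80.GB" -resume
    if [ "${SOMA_RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

soma_limited input.csv
```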
If you are running out of memory, or have limited swap memory, you can alter the preset resource usage for individual processes in conf/base.config. By raising the RAM and CPU requirements of intensive processes to more than 50% of the total CPU/RAM allocation, you can stop SOMA from running multiple intensive jobs simultaneously.
Specifically, in the section:
withLabel:process_high {
    cpus = { check_max( 22 * task.attempt, 'cpus' ) }
    memory = { check_max( 86.GB * task.attempt, 'memory' ) }
    time = { check_max( 16.h * task.attempt, 'time' ) }
}
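The reasoning behind the 50% threshold is simple integer arithmetic: a scheduler can only start as many jobs as fit into the available pool, so a request larger than half the pool leaves room for exactly one. A rough sketch follows, assuming a machine with 36 CPUs and 90 GB of RAM (these happen to match the documented defaults, but the real totals depend on your system). `max_concurrent_jobs` is a hypothetical helper, not part of SOMA.

```shell
# Hypothetical helper: how many jobs of a given size fit into a resource pool.
# $1 = total units available (CPUs or GB), $2 = units requested per job.
max_concurrent_jobs() {
    echo $(( $1 / $2 ))
}

# Assumed machine: 36 CPUs and 90 GB RAM.
max_concurrent_jobs 36 22   # CPUs, process_high request of 22    -> 1
max_concurrent_jobs 90 86   # memory in GB, request of 86.GB      -> 1
max_concurrent_jobs 36 12   # a smaller, hypothetical 12-CPU job  -> 3
```

The effective limit is the smaller of the CPU and memory results, so with the values above only one process_high job would run at a time.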
Skipping geNomad's neural network-based classification will reduce runtime at the cost of accuracy:
--GENOMAD_ENDTOEND.args="--disable-nn-classification"