# A quick introduction to input variables in WDL

There are two kinds of user-accessible input variables in WDL: workflow-level inputs and task-level inputs. If you are using Terra, you probably don't need to know anything about the difference between them, except that task-level inputs get sorted (alphabetically) below workflow-level inputs in Terra's UI.

This pipeline uses a lot of external tools, and I tend to WDLize every possible input variable, so there are a lot of input variables in this pipeline. The vast majority of them are optional. What's most important is your FASTQs.
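If you run the pipeline outside Terra (for example, with Cromwell directly), the distinction mostly shows up in how inputs are keyed in your inputs JSON: workflow-level inputs are addressed as `workflow_name.input_name`, while task-level inputs are addressed as `workflow_name.task_name.input_name`. Below is a minimal sketch; the workflow name `myco_raw` and these exact key paths are assumptions for illustration, so check the WDL itself for the real fully-qualified names.

```python
import json

# Hypothetical inputs JSON showing workflow-level vs task-level keys.
# "myco_raw" and these exact key paths are illustrative assumptions.
inputs = {
    # workflow-level input (see "Miscellaneous workflow-level inputs" below)
    "myco_raw.subsample_cutoff": 450,
    # task-level input on one specific task (see "Task-level inputs" below)
    "myco_raw.decontam_each_sample.subsample_cutoff": -1,
}

with open("myco_inputs.json", "w") as handle:
    json.dump(inputs, handle, indent=2)
```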

## Workflow-level inputs

### FASTQs

Each version of myco has a slightly different way of inputting FASTQs. A basic explanation for each workflow is in the table below. You can find more detailed explanations in each workflow's workflow-level readme.

| name | type | workflow | description |
|------|------|----------|-------------|
| biosample_accessions | File | myco_sra | File of BioSample accessions to pull, one accession per line |
| paired_decontaminated_fastq_sets | Array | myco_simple | Nested array of decontaminated and merged FASTQ pairs. Each inner array represents one sample; each sample needs precisely one forward read and one reverse read. |
| paired_fastq_sets | Array | myco_raw | Nested array of paired FASTQs, each inner array representing one sample's worth of paired FASTQs |
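To make the "nested array" shape concrete, here is a sketch of what these inputs could look like in an inputs JSON. Everything here is placeholder data, and the workflow name `myco_raw` is an assumption; check each workflow's readme for the exact expectations (e.g. how reads should be ordered within an inner array).

```python
import json

# Placeholder paths; each inner list is one sample.
paired_fastq_sets = [
    ["gs://my-bucket/sampleA_R1.fastq.gz", "gs://my-bucket/sampleA_R2.fastq.gz"],
    ["gs://my-bucket/sampleB_R1.fastq.gz", "gs://my-bucket/sampleB_R2.fastq.gz"],
]
# For myco_simple's paired_decontaminated_fastq_sets, each inner list must
# contain exactly one forward and one reverse read, as noted above.

print(json.dumps({"myco_raw.paired_fastq_sets": paired_fastq_sets}, indent=2))
```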

Regardless of which version of myco you use, please make sure your FASTQs:

* are Illumina paired-end data
* are grouped per sample
* have len(quality scores) = len(nucleotides) for every read
* are actually MTBC

myco_sra.wdl is able to detect these issues and will throw out the affected samples without erroring. Other forms of myco are not able to detect these issues. It is also recommended that you keep an eye on the total size of your FASTQs. Individual files over subsample_cutoff (default 450 MB; -1 disables this check) will be downsampled, but keep an eye on the cumulative size of samples. For example, a sample like SAMEA968096 has 12 run accessions associated with it. Individually, none of these run accessions' FASTQs are over 1 GB in size, but the sum total of these FASTQs could quickly fill up your disk space. (You probably should not be using SAMEA968096 anyway, because it is in a sample group, which can cause other issues.)

myco_simple expects that the FASTQs you put into it have already been cleaned and decontaminated, but this isn't a hard requirement. What is a hard requirement is that you have precisely one forward read and one reverse read per sample -- if you have multi-lane samples split across several FASTQs, they will need to be merged first.
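Since only myco_sra can catch malformed input on its own, it can be worth sanity-checking FASTQs locally before launching the other workflows. The sketch below is not part of the pipeline; it only checks the length rule and that a pair of files has matching read counts (it cannot tell you whether the reads are actually MTBC).

```python
import gzip

def fastq_records(path):
    """Yield (header, sequence, quality) tuples from a FASTQ, gzipped or not."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                return
            seq = handle.readline().rstrip()
            handle.readline()  # "+" separator line
            qual = handle.readline().rstrip()
            yield header, seq, qual

def check_pair(r1_path, r2_path):
    """Raise if any read's quality string length differs from its sequence length,
    or if the two files contain different numbers of reads."""
    counts = []
    for path in (r1_path, r2_path):
        n_reads = 0
        for header, seq, qual in fastq_records(path):
            if len(seq) != len(qual):
                raise ValueError(f"{path}: {header} has {len(seq)} bases "
                                 f"but {len(qual)} quality scores")
            n_reads += 1
        counts.append(n_reads)
    if counts[0] != counts[1]:
        raise ValueError(f"read count mismatch: {counts[0]} vs {counts[1]}")
    return counts[0]

# placeholder filenames:
# print(check_pair("sampleB_R1.fastq.gz", "sampleB_R2.fastq.gz"), "read pairs look OK")
```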

### Quality control

NOTE: This section is outdated and does not reflect the latest changes. It is still here because the pipeline is undergoing a major refactoring; documentation will be updated once inputs are more "stable".

| name | type | myco_sra default | description |
|------|------|------------------|-------------|
| covstatsQC_minimum_coverage | Float | 10 | If covstats thinks coverage is below this, throw out this sample |
| covstatsQC_max_percent_unmapped | Float | 2 | If covstats thinks this percentage (50 = 50%) of data does not map to H37Rv, throw out this sample |
| covstatsQC_skip_entirely | Boolean | false | Should we avoid running covstats? Does not affect other forms of QC. |
| diffQC_max_percent_low_coverage | Float | 0.05 (myco_sra), 0.20 (myco_raw) | If more than this percent (0.5 = 50%) of a sample's sites get masked for being below diff_min_coverage_per_site, throw out the whole sample. |
| diff_min_coverage_per_site | Int | 10 | Positions with coverage below this value will be masked in diff files |
| early_qc_apply_cutoffs | Boolean | false | If true, run fastp + TBProfiler on decontaminated FASTQs and apply cutoffs to determine which samples should be thrown out. |
| early_qc_cutoff_q30 | Float | 0.9 | Decontaminated samples with less than this percentage (as float, 0.5 = 50%) of reads above qual score of 30 will be discarded, iff early_qc_apply_cutoffs is also true. |
| earlyQC_skip_entirely | Boolean | false | Do not run early QC (fastp + fastq-TBProfiler) at all. Does not affect whether or not TBProfiler is later run on BAMs. Overrides early_qc_apply_cutoffs. |
| fastqc_on_timeout | Boolean | false | (myco_sra only) If true, fastqc one read from a sample when decontamination or variant calling times out |

Note that all forms of QC will throw out entire samples, with two exceptions:

* diff_min_coverage_per_site works on a per-site basis, masking individual sites below that value but not throwing out the entire sample (unless so many sites get masked that diffQC_max_percent_low_coverage kicks in)
* fastqc_on_timeout will run fastQC if true, but does not parse its output -- the samples have already been discarded upstream when they timed out
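For intuition, here is a toy re-implementation of how the per-site mask and the sample-level cutoff interact, using the myco_sra defaults from the table above. This is only an illustration of the documented logic, not the pipeline's actual code.

```python
def diff_qc(per_site_coverage, diff_min_coverage_per_site=10,
            diffQC_max_percent_low_coverage=0.05):
    """Mask low-coverage sites; discard the sample only if too many get masked.

    per_site_coverage: list of read depths, one per reference position.
    Returns (keep_sample, masked_positions).
    """
    masked = [i for i, depth in enumerate(per_site_coverage)
              if depth < diff_min_coverage_per_site]
    fraction_masked = len(masked) / len(per_site_coverage)
    keep_sample = fraction_masked <= diffQC_max_percent_low_coverage
    return keep_sample, masked

# Toy example: 2 of 10 sites fall below coverage 10, so 20% of sites get masked,
# which exceeds the 0.05 (5%) default -- the whole sample would be thrown out.
keep, masked = diff_qc([30, 25, 3, 40, 9, 22, 18, 50, 33, 27])
print(keep, masked)  # False [2, 4]
```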

### Timeouts

NOTE: In myco_raw, this is now handled by guardrail mode.

When working with data of unknown quality, it can be helpful to quickly remove samples that are likely low-quality. While developing myco on SRA data, we noticed that if a given sample took an unusually long time in the decontamination or variant calling step, it was likely to end up filtered out by the final quality control steps of the pipeline anyway. This is especially true of the decontamination step -- the more contamination a sample has, the more work that step has to do. This heuristic was defined using the default runtime attributes and Terra as a backend, so straying from those defaults is likely to make the default timeout values less useful. This includes changing from SSDs to HDDs!

| name | type | myco_sra default | description |
|------|------|------------------|-------------|
| timeout_decontam_part1 | Int | 20 | Discard any sample that is still running in clockwork map_reads after this many minutes (set to 0 to never time out) |
| timeout_decontam_part2 | Int | 15 | Discard any sample that is still running in clockwork rm_contam after this many minutes (set to 0 to never time out) |
| timeout_variant_caller | Int | 120 | Discard any sample that is still running in clockwork variant_call_one_sample after this many minutes (set to 0 to never time out) |

myco_raw and myco_simple default to not using this heuristic at all, so their defaults are 0.
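Conceptually, each timeout behaves like running the step under a time limit and treating "ran out of time" as "discard this sample" rather than "fail the workflow". The sketch below is only an illustration of that idea in Python; it is not how the WDL tasks actually enforce their timeouts.

```python
import subprocess

def run_step_with_timeout(cmd, timeout_minutes):
    """Run a command; return True if it finished in time, False if it timed out.

    Mirrors the documented behaviour: a timeout means "discard this sample and
    move on", and a value of 0 means "never time out".
    """
    timeout_seconds = timeout_minutes * 60 if timeout_minutes > 0 else None
    try:
        subprocess.run(cmd, check=True, timeout=timeout_seconds)
        return True
    except subprocess.TimeoutExpired:
        return False

# Illustrative only -- see the clockwork documentation for real arguments:
# run_step_with_timeout(["clockwork", "map_reads", "..."], timeout_minutes=20)
```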

### Variant caller inputs

NOTE: In myco_raw, these are once again task-level attributes. Usually, I write WDLs in a way that makes their runtime attributes and rarely-used optional arguments task-level, and everything else workflow-level. However, myco uses some workarounds that require it to technically have three copies of the variant caller task, which means that if I didn't make the variant caller's inputs workflow-level, there would be three sets of task-level inputs for the variant caller.

| name | type | myco_sra default | description |
|------|------|------------------|-------------|
| variantcalling_addl_disk | Int | 100 | Additional disk size, in GB, on top of auto-scaling disk size. |
| variantcalling_cpu | Int | 16 | Number of CPUs (cores) to request from GCP. |
| variantcalling_crash_on_error | Boolean | false | If this task errors out, should it stop the whole pipeline (true), or should we just discard this sample and move on (false)? Note that errors that crash the VM (such as running out of space on a GCP instance) will stop the whole pipeline regardless of this setting. |
| variantcalling_crash_on_timeout | Boolean | false | If this task times out, should it stop the whole pipeline (true), or should we just discard this sample and move on (false)? |
| variantcalling_debug | Boolean | false | Do not clean up any files and be verbose |
| variantcalling_mem_height | Int? | | cortex mem_height option. Must match what was used when reference_prepare was run (in other words, do not set this variable unless you are also adjusting the reference preparation task) |
| variantcalling_memory | Int | 32 | Amount of memory, in GB, to request from GCP. |
| variantcalling_preemptibles | Int | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |
| variantcalling_retries | Int | 1 | How many times should we retry this task if it fails after it exhausts all uses of preemptibles? |
| variantcalling_ssd | Boolean | true | If true, use SSDs for this task instead of HDDs |

### Miscellaneous workflow-level inputs

NOTE: This section is outdated and does not reflect the latest changes. It is still here because the pipeline is undergoing a major refactoring; documentation will be updated once inputs are more "stable".

| name | type | default | description |
|------|------|---------|-------------|
| diff_force | Boolean | false | If true and if decorate_tree is false, generate diff files. (Diff files will always be created if decorate_tree is true.) |
| diffQC_mask_bedfile | File? | this CRyPTIC mask file | Bed file of regions to mask when making diff files |
| quick_tasks_disk_size | Int | 10 | Disk size, in GB, to use for quick file-processing tasks; increasing this might slightly speed up file localization |
| subsample_cutoff | Int | 450 | If a FASTQ file is larger than this size in MB, subsample it with seqtk (set to -1 to disable) |
| subsample_seed | Int | 1965 | Seed used for subsampling with seqtk |
| tbprofiler_on_bam | Boolean | varies\* | If true, run TBProfiler on BAMs |
| tree_decoration | Boolean | false | Should usher, taxonium, and NextStrain trees be generated? |
| tree_to_decorate | File? | this draft tree | Base tree to use if decorate_tree = true |

\*tbprofiler_on_bam defaults to true for myco_sra and false for myco_raw for historical reasons
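As a rough illustration of the per-file downsampling check (which is why the cumulative size of a sample still matters), here is a sketch that shells out to seqtk. The read-count target is an arbitrary example and the real task's exact seqtk invocation may differ.

```python
import os
import subprocess

def maybe_subsample(fastq_path, out_path, subsample_cutoff=450,
                    subsample_seed=1965, target_reads=1_000_000):
    """Subsample a FASTQ with seqtk if it is larger than subsample_cutoff MB.

    Loosely mirrors the documented behaviour; target_reads is an arbitrary
    placeholder, and a cutoff of -1 disables the check entirely.
    """
    size_mb = os.path.getsize(fastq_path) / (1024 * 1024)
    if subsample_cutoff == -1 or size_mb <= subsample_cutoff:
        return False  # small enough, leave as-is
    with open(out_path, "w") as out:
        subprocess.run(
            ["seqtk", "sample", f"-s{subsample_seed}", fastq_path, str(target_reads)],
            stdout=out, check=True,
        )
    return True

# maybe_subsample("big_sample_R1.fastq", "big_sample_R1.subsampled.fastq")
```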

## Task-level inputs

Many of these settings just change the names of output files or runtime attributes, but for the sake of ease of use, we wanted to include them all in a single list, in the alphabetical order they show up in on Terra. Settings that do NOT relate to output filenames or runtime attributes are in bold.

For more info on runtime settings, see https://cromwell.readthedocs.io/en/stable/RuntimeAttributes/

| task | name | type | default | description |
|------|------|------|---------|-------------|
| check_fastqs | **output_fastps_cleaned_fastqs** | Boolean | false | Use fastp's cleaned fastqs for subsequent tasks (true) instead of discarding them (false). Setting this to false means you are effectively only using fastp to check if a sample is valid, keeping or throwing out the entire sample based on this information. |
| collate_depth | disk_size | Int | 10 | Disk size, in GB. This task cannot autoscale as it cannot anticipate the size of reads from SRA. |
| collate_resistance | disk_size | Int | 10 | Disk size, in GB. This task cannot autoscale as it cannot anticipate the size of reads from SRA. |
| collate_strains | disk_size | Int | 10 | Disk size, in GB. This task cannot autoscale as it cannot anticipate the size of reads from SRA. |
| decontam_each_sample | addldisk | Int | 100 | Additional disk size, in GB, on top of auto-scaling disk size. |
| decontam_each_sample | cpu | Int | 8 | Number of CPUs (cores) to request from GCP. |
| decontam_each_sample | contam_out_1 | String? | | |
| decontam_each_sample | contam_out_2 | String? | | |
| decontam_each_sample | counts_out | String? | | |
| decontam_each_sample | **crash_on_timeout** | Boolean | false | If this task times out, should it stop the whole pipeline (true), or should we just discard this sample and move on (false)? |
| decontam_each_sample | done_file | String? | | |
| decontam_each_sample | memory | Int | 16 | Amount of memory, in GB, to request from GCP. |
| decontam_each_sample | no_match_out_1 | String? | | |
| decontam_each_sample | no_match_out_2 | String? | | |
| decontam_each_sample | preempt | Int | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |
| decontam_each_sample | ssd | Boolean | true | If true, use SSDs for this task instead of HDDs |
| decontam_each_sample | **subsample_cutoff** | Int | -1 | If a FASTQ file is larger than this size in MB, subsample it with seqtk (set to -1 to disable) |
| decontam_each_sample | **subsample_seed** | Int | 1965 | Seed used for subsampling with seqtk |
| decontam_each_sample | threads | Int? | | Try to use this many threads for decontamination. Note that the actual number of threads also relies on your hardware. |
| decontam_each_sample | **verbose** | Boolean | true | Enable/Disable debug mode |
| FastqcWF | **limits** | File? | | Limits file defining fastQC cutoffs |
| get_sample_IDs | preempt | Int | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |
| make_mask_and_diff | addldisk | Int | 10 | Additional disk size, in GB, on top of auto-scaling disk size. |
| make_mask_and_diff | cpu | Int | 8 | Number of CPUs (cores) to request from GCP. |
| make_mask_and_diff | **histograms** | Boolean | false | Should coverage histograms be output? |
| make_mask_and_diff | memory | Int | 16 | Amount of memory, in GB, to request from GCP. |
| make_mask_and_diff | preempt | Int | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |
| make_mask_and_diff | retries | Int | 1 | How many times should we retry this task if it fails after it exhausts all uses of preemptibles? |
| merge_reports | disk_size | Int | 10 | Disk size, in GB. This task cannot autoscale as it cannot anticipate the size of reads from SRA. |
| profile | bam_suffix | String? | | |
| profile | cpu | Int | 2 | Number of CPUs (cores) to request from GCP. |
| profile | memory | Int | 4 | Amount of memory, in GB, to request from GCP. |
| profile | preempt | Int | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |
| profile | ssd | Boolean | false | If true, use SSDs for this task instead of HDDs |
| pull | disk_size | Int | 100 | Disk size, in GB. This task cannot autoscale as it cannot anticipate the size of reads from SRA. |
| pull | preempt | Int | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |
| trees | **detailed_clades** | Boolean | false | If true, run usher sampled diff with -D flag |
| trees | in_prefix_summary | String? | | |
| trees | **make_nextstrain_subtrees** | Boolean | true | Should our Nextstrain (Auspice-compatible) output tree consist of multiple subtrees (true), or just one big tree (false)? Setting this to true might be useful if you have trouble loading very large trees into Auspice. |
| trees | **metadata_tsv** | File? | | Metadata TSV to annotate the input tree with |
| trees | out_diffs | String | '_combined' | |
| trees | out_prefix | String | 'tree' | |
| trees | out_prefix_summary | String? | | |
| trees | out_tree_annotated_pb | String | '_annotated' | |
| trees | out_tree_nextstrain | String | '_auspice' | |
| trees | out_tree_nwk | String | '_nwk' | |
| trees | out_tree_raw_pb | String | '_raw' | |
| trees | out_tree_taxonium | String | '_taxonium' | |
| trees | **reroot_to_this_node** | String? | | ID of the node you want to reroot to. Usually, this ID is a BioSample accession. |
| trees | **subtree_only_new_samples** | Boolean | true | Iff make_nextstrain_subtrees is true, the subtrees are focused only on newly-added samples (e.g., samples that went through this run of the pipeline, rather than samples that were already on the pre-existing tree) |
| trees | **summarize_input_mat** | Boolean | true | Run matutils summary on the input tree before adding new samples to it, and output that summary |