diff --git a/doc/inputs.md b/doc/inputs.md index 1cd83e7..c25d895 100644 --- a/doc/inputs.md +++ b/doc/inputs.md @@ -34,7 +34,7 @@ myco_cleaned expects that the FASTQs you are putting into have already been clea | diff_min_coverage_per_site | Int | 10 | Positions with coverage below this value will be masked in diff files | | early_qc_apply_cutoffs | Boolean | false | If true, run fastp + TBProfiler on decontaminated fastqs and apply cutoffs to determine which samples should be thrown out. | | early_qc_cutoff_q30 | Float | 0.9 | Decontaminated samples with less than this percentage (as float, 0.5 = 50%) of reads above qual score of 30 will be discarded iff early_qc_apply_cutoffs is also true. | -| fastpQC_skip_entirely | Boolean | false | Do not run early QC (fastp + fastq-TBProfiler) at all. Does not affect whether or not TBProfiler is later run on bams. Overrides early_qc_apply_cutoffs. | +| earlyQC_skip_entirely | Boolean | false | Do not run early QC (fastp + fastq-TBProfiler) at all. Does not affect whether or not TBProfiler is later run on bams. Overrides early_qc_apply_cutoffs. | | fastqc_on_timeout | Boolean | false | (myco_sra only) If true, fastqc one read from a sample when decontamination or variant calling times out | Note that all forms of QC will throw out entire samples, with two exceptions: diff --git a/doc/qc_and_filtering.md b/doc/qc_and_filtering.md index 24a4df7..60d5d4f 100644 --- a/doc/qc_and_filtering.md +++ b/doc/qc_and_filtering.md @@ -13,11 +13,11 @@ If a FASTQ is above `subsample_cutoff` MB, it will get downsampled by seqtk. `su ### decontamination An entire WDL task of myco (except myco_cleaned) is dedicated just to decontaminating reads. The decontamination workflow starts with `clockwork map_reads` to map to a decontamination reference, and then uses `clockwork rm_contam` to generate decontaminated FASTQs. It is worth noting that how long a sample spends in this decontamination step roughly correlates with how much contamination is in it, but input file size is also a factor. If you're seeing a batch of samples that are roughly the same size (or subject to default downsampling settings) as typical, but take unusually long to decontaminate, that batch of samples might be considered suspect. -### fastpQC -fastpQC runs fastp as both a QC step and a trimming step. Earlier versions of myco had "earlyQC" which merged fastp and fastq-TBProfiler into one subworkflow, but after version 5.0.2 Ash realized it's less confusing to just seperate these out. +### earlyQC (aka TBfastProfiler) +EarlyQC merges TBProfiler (in fastq-input-mode) and fastp into one WDL step which will run unless `earlyQC_skip_entirely` is true or you are running myco_cleaned. TBProfiler does no site-specific filtering of its own, but if `earlyQC_skip_trimming` is false, fastp will further clean your FASTQs as a form of site-specific filtering. #### removing low-quality read pairs -`fastpQC_trim_qual_below` is piped into fastp's `average_qual`. If one read's average quality score is < `average_qual`, then that read/pair is discarded. You can disable this by setting `fastpQC_trim_qual_below` to 0 or `fastpQC_skip_trimming` to true. +`earlyQC_trim_qual_below` is piped into fastp's `average_qual`. If one read's average quality score is < `average_qual`, then that read/pair is discarded. You can disable this by setting `earlyQC_trim_qual_below` to 0 or `earlyQC_skip_trimming` to true. ### variant calling The variant caller used by all forms of myco uses clockwork, which itself leverages minos. minos will generate VCFs using two different methods, then compare the two of them, then output a final ajudicated VCF. @@ -45,8 +45,8 @@ Notes: ### decontamination Entire samples do not get filtered out here unless the decontamination task errors out, or you have timeouts -- specifically `timeout_decontam_part1` and `timeout_decontam_part2` -- set to a nonzero value. The reason for timeouts filtering out samples is that a sample taking a long time is itself a sign that the sample is heavily contaminated, and a heavily decontaminated sample is more likely to have too many sites removed for variant calling to work properly, which is useful if processing tens of thousands of samples from SRA of varying degrees of quality. It is, however, a lot fuzzier than most other forms of QC in this pipeline, so timeouts are turned off (set to 0) by default for myco_raw. For more information on the circumstances that can cause the decontamination task to error out, please see [status_codes.md](./status_codes.md). -### fastpQC -If more than `fastpQC_minimum_percent_q30` (as float where 0.5=50%) of your decontaminated FASTQs's calls are below Q30, and if `fastpQC_skip_QC` is false, and if `fastpQC_skip_QC` is also false, the sample will be removed with status `fastpQC_TOO_MANY_BELOW_Q30`. This is independent of fastp's site-specific filtering (eg, `fastpQC_skip_trimming`, and `fastpQC_trim_qual_below`). +### earlyQC +If more than `earlyQC_minimum_percent_q30` (as float where 0.5=50%) of your decontaminated FASTQs's calls are below Q30, and if `earlyQC_skip_QC` is false, and if `earlyQC_skip_QC` is also false, the sample will be removed with status `EARLYQC_TOO_MANY_BELOW_Q30`. This is independent of fastp's site-specific filtering (eg, `earlyQC_skip_trimming`, and `earlyQC_trim_qual_below`). ### variant calling As with decontamination, entire samples do not get filtered out here unless the variant caller has an error or times out. diff --git a/doc/status_codes.md b/doc/status_codes.md index 5d75e5d..6225d0a 100644 --- a/doc/status_codes.md +++ b/doc/status_codes.md @@ -12,12 +12,12 @@ Status codes represent the status of a given sample after it has completed the p | DECONTAMINATION_RM_CONTAM_KILLED | The rm_contam part of the decontamination process was killed (return code 137) | yes | Set the decontamination task's memory runtime attribute to a higher value (default: 16 GB) and rerun. | | DECONTAMINATION_RM_CONTAM_UNKNOWN_ERROR | The rm_contam part of the decontamination process had an unknown error | no | Open an issue on GitHub | | DECONTAMINATION_RM_CONTAM_TIMEOUT | The rm_contam part of the decontamination process went over `timeout_decontam_part2` minutes | yes, but the sample is suspect | This could be a sign your sample is very heavily contaminated. If you wish to continue attempting to use it, set `timeout_decontam_part2` to 0 and rerun. | -| fastpQC_TOO_MANY_BELOW_Q30 | fastp detected `early_qc_cutoff_q30`*100 percent of your FASTQs's calls have a quality score below 30 | yes, but the sample is suspect | This could be a sign your sample is very low quality, possibly due issues in sample purification or during sequencing. If you wish to continue attempting to use it, adjust `early_qc_cutoff_q30` to a lower value (default: 0.90) | +| EARLYQC_TOO_MANY_BELOW_Q30 | fastp detected `early_qc_cutoff_q30`*100 percent of your FASTQs's calls have a quality score below 30 | yes, but the sample is suspect | This could be a sign your sample is very low quality, possibly due issues in sample purification or during sequencing. If you wish to continue attempting to use it, adjust `early_qc_cutoff_q30` to a lower value (default: 0.90) | | VARIANT_CALLING_KILLED | The variant calling task was killed (return code 137) | yes, but the sample is suspect | Set `variantcalling_memory` to a higher value (default: 32 GB) and rerun, but be aware that running out of memory on default settings is quite unusual and may indicate an issue with the data. | | VARIANT_CALLING_TIMEOUT | The variant calling task went over `timeout_variant_caller` minutes | yes, but the sample is suspect | This could be a sign your sample is very small or very large. If you wish to continue attempting to use it, set `timeout_variant_caller` to 0. | | VARIANT_CALLING_UNKNOWN_ERROR | The variant calling task returned 1 for unknown reasons | no | Your FASTQs might be corrupt or almost entirely empty. | | VARIANT_CALLING_UNKNOWN_ERROR_$rc | The variant calling task returned $rc for unknown reasons | no | Your FASTQS might be corrupt or almost entirely empty. | -| VARIANT_CALLING_ADJUDICATION_FAILURE | The variant calling task failed, and it appears your sample has enough sites for minimap2 but not Cortex | yes, if sample can be bigger | It appears Cortex cannot find any variants to call. It's possible too much of it was removed during the decontamination step, or there was never much of it in the first place. Check the size of this sample's input FASTQs and compare that to the size of the FASTQs after the decontamination step and fastpQC. You *might* be able to recover this sample by running myco_cleaned on raw, not-downsampled FASTQs. | +| VARIANT_CALLING_ADJUDICATION_FAILURE | The variant calling task failed, and it appears your sample has enough sites for minimap2 but not Cortex | yes, if sample can be bigger | It appears Cortex cannot find any variants to call. It's possible too much of it was removed during the decontamination step, or there was never much of it in the first place. Check the size of this sample's input FASTQs and compare that to the size of the FASTQs after the decontamination step and earlyQC. You *might* be able to recover this sample by running myco_cleaned on raw, not-downsampled FASTQs. | | VCF2DIFF_TOO_MANY_LOW_COVERAGE_SITES | VCF-to-diff task found ≥`diffQC_max_percent_low_coverage`*100 percent of sample's sites are below `diffQC_low_coverage_cutoff` coverage | yes, but the sample is suspect | A diff file can still be generated if `diff_min_coverage_per_site` (default: 10) is set to 0, but note that low coverage sites will not be masked in the resulting diff file. ### not visible to user, but defined in documentation @@ -33,6 +33,6 @@ Status codes represent the status of a given sample after it has completed the p \ No newline at end of file diff --git a/inputs/myco_sra_terra_earlyQC.json b/inputs/myco_sra_terra_earlyQC.json index 8ee0e36..713f884 100644 --- a/inputs/myco_sra_terra_earlyQC.json +++ b/inputs/myco_sra_terra_earlyQC.json @@ -4,7 +4,7 @@ "myco.tree_to_decorate": "gs://topmed_workflow_testing/tb/trees/alldiffs_mask2ref.L.fixed.pb", "myco.diffQC_low_coverage_cutoff": 10, "myco.diffQC_mask_bedfile": "gs://topmed_workflow_testing/tb/R00000039_repregions.bed", - "myco.fastpQC_skip_entirely": "false", - "myco.fastpQC_skip_trimming": "false", - "myco.fastpQC_minimum_percent_q30": "0.90" + "myco.earlyQC_skip_entirely": "false", + "myco.earlyQC_skip_trimming": "false", + "myco.earlyQC_minimum_percent_q30": "0.90" } \ No newline at end of file diff --git a/myco_raw.wdl b/myco_raw.wdl index a6cf572..3326c52 100644 --- a/myco_raw.wdl +++ b/myco_raw.wdl @@ -6,8 +6,7 @@ import "https://raw.githubusercontent.com/aofarrel/SRANWRP/v1.1.12/tasks/process import "https://raw.githubusercontent.com/aofarrel/tree_nine/0.0.10/tree_nine.wdl" as build_treesWF import "https://raw.githubusercontent.com/aofarrel/parsevcf/1.2.0/vcf_to_diff.wdl" as diff import "https://raw.githubusercontent.com/aofarrel/tb_profiler/0.2.2/tbprofiler_tasks.wdl" as profiler -import "https://raw.githubusercontent.com/aofarrel/fastp-wdl/0.0.1/fastp_tasks.wdl" as fastp # fastpQC -import "https://raw.githubusercontent.com/aofarrel/TBfastProfiler/0.0.8/TBfastProfiler.wdl" as qc_fastqsWF # aka fastpQC +import "https://raw.githubusercontent.com/aofarrel/TBfastProfiler/0.0.8/TBfastProfiler.wdl" as qc_fastqsWF # aka earlyQC import "https://raw.githubusercontent.com/aofarrel/goleft-wdl/0.1.2/goleft_functions.wdl" as goleft @@ -22,11 +21,11 @@ workflow myco { File? diffQC_mask_bedfile Int diffQC_max_percent_low_coverage= 20 Int diffQC_low_coverage_cutoff = 10 - Int fastpQC_minimum_percent_q30 = 90 - Boolean fastpQC_skip_entirely = false - Boolean fastpQC_skip_QC = false - Boolean fastpQC_skip_trimming = false - Int fastpQC_trim_qual_below = 30 + Int earlyQC_minimum_percent_q30 = 90 + Boolean earlyQC_skip_entirely = false + Boolean earlyQC_skip_QC = false + Boolean earlyQC_skip_trimming = false + Int earlyQC_trim_qual_below = 30 Int quick_tasks_disk_size = 10 Int subsample_cutoff = -1 Int subsample_seed = 1965 @@ -55,11 +54,11 @@ workflow myco { diffQC_mask_bedfile: "Bed file of regions to mask when making diff files (default: R00000039_repregions.bed)" diffQC_max_percent_low_coverage: "Samples who have more than this percent (as int, 50 = 50%) of positions with coverage below diffQC_low_coverage_cutoff will be discarded" diffQC_low_coverage_cutoff: "Positions with coverage below this value will be masked in diff files" - fastpQC_minimum_percent_q30: "Decontaminated samples with less than this percent (as int, 50 = 50%) of reads above qual score of 30 will be discarded. Negated by fastpQC_skip_QC or fastpQC_skip_entirely being false." - fastpQC_skip_entirely: "Do not run fastpQC at all - no trimming, no QC, nothing." - fastpQC_skip_QC: "Run fastpQC (unless fastpQC_skip_entirely is true), but do not throw out samples that fail QC. Independent of fastpQC_skip_trimming." - fastpQC_skip_trimming: "Run fastpQC (unless fastpQC_skip_entirely is true), and remove samples that fail QC (unless fastpQC_skip_QC is true), but do not use fastp's cleaned fastqs." - fastpQC_trim_qual_below: "Trim reads with an average quality score below this value. Independent of fastpQC_minimum_percent_q30. Overridden by fastpQC_skip_trimming or fastpQC_skip_entirely being true." + earlyQC_minimum_percent_q30: "Decontaminated samples with less than this percent (as int, 50 = 50%) of reads above qual score of 30 will be discarded. Negated by earlyQC_skip_QC or earlyQC_skip_entirely being false." + earlyQC_skip_entirely: "Do not run earlyQC (fastp + fastq-TBProfiler) at all - no trimming, no QC, nothing. Does not affect tbprofiler_on_bam." + earlyQC_skip_QC: "Run earlyQC (unless earlyQC_skip_entirely is true), but do not throw out samples that fail QC. Independent of earlyQC_skip_trimming." + earlyQC_skip_trimming: "Run earlyQC (unless earlyQC_skip_entirely is true), and remove samples that fail QC (unless earlyQC_skip_QC is true), but do not use fastp's cleaned fastqs." + earlyQC_trim_qual_below: "Trim reads with an average quality score below this value. Independent of earlyQC_minimum_percent_q30. Overridden by earlyQC_skip_trimming or earlyQC_skip_entirely being true." quick_tasks_disk_size: "Disk size in GB to use for quick file-processing tasks; increasing this might slightly speed up file localization" paired_fastq_sets: "Nested array of paired fastqs, each inner array representing one samples worth of paired fastqs" subsample_cutoff: "If a fastq file is larger than than size in MB, subsample it with seqtk (set to -1 to disable)" @@ -76,7 +75,7 @@ workflow myco { # convert percent integers to floats (except covstatsQC_max_percent_unmapped) Float diffQC_max_percent_low_coverage_float = diffQC_max_percent_low_coverage / 100.0 - Float fastpQC_minimum_percent_q30_float = fastpQC_minimum_percent_q30 / 100.0 + Float earlyQC_minimum_percent_q30_float = earlyQC_minimum_percent_q30 / 100.0 scatter(paired_fastqs in paired_fastq_sets) { call clckwrk_combonation.combined_decontamination_single_ref_included as decontam_each_sample { @@ -96,24 +95,24 @@ workflow myco { File real_decontaminated_fastq_1=select_first([decontam_each_sample.decontaminated_fastq_1, paired_fastqs[0]]) File real_decontaminated_fastq_2=select_first([decontam_each_sample.decontaminated_fastq_2, paired_fastqs[0]]) - if(!fastpQC_skip_entirely) { + if(!earlyQC_skip_entirely) { call qc_fastqsWF.TBfastProfiler as qc_fastqs { input: fastq1 = real_decontaminated_fastq_1, fastq2 = real_decontaminated_fastq_2, - q30_cutoff = fastpQC_minimum_percent_q30_float, - average_qual = fastpQC_trim_qual_below, - use_fastps_cleaned_fastqs = !(fastpQC_skip_trimming) + q30_cutoff = earlyQC_minimum_percent_q30_float, + average_qual = earlyQC_trim_qual_below, + use_fastps_cleaned_fastqs = !(earlyQC_skip_trimming) } - # if we are filtering out samples via fastpQC... - if(!(fastpQC_skip_QC)) { + # if we are filtering out samples via earlyQC... + if(!(earlyQC_skip_QC)) { # and this sample passes... if(qc_fastqs.pass_or_errorcode == pass) { File possibly_fastp_cleaned_fastq1_passed=select_first([qc_fastqs.cleaned_fastq1, real_decontaminated_fastq_1]) File possibly_fastp_cleaned_fastq2_passed=select_first([qc_fastqs.cleaned_fastq2, real_decontaminated_fastq_2]) - call clckwrk_var_call.variant_call_one_sample_ref_included as variant_call_after_fastpQC_filtering { + call clckwrk_var_call.variant_call_one_sample_ref_included as variant_call_after_earlyQC_filtering { input: reads_files = [possibly_fastp_cleaned_fastq1_passed, possibly_fastp_cleaned_fastq2_passed], addldisk = variantcalling_addl_disk, @@ -132,11 +131,11 @@ workflow myco { } } - # if we are not filtering out samples via the early qc step (but ran fastpQC anyway)... - if(fastpQC_skip_QC) { + # if we are not filtering out samples via the early qc step (but ran earlyQC anyway)... + if(earlyQC_skip_QC) { File possibly_fastp_cleaned_fastq1=select_first([qc_fastqs.cleaned_fastq1, real_decontaminated_fastq_1]) File possibly_fastp_cleaned_fastq2=select_first([qc_fastqs.cleaned_fastq2, real_decontaminated_fastq_2]) - call clckwrk_var_call.variant_call_one_sample_ref_included as variant_call_after_fastpQC_but_not_filtering_samples { + call clckwrk_var_call.variant_call_one_sample_ref_included as variant_call_after_earlyQC_but_not_filtering_samples { input: reads_files = [possibly_fastp_cleaned_fastq1, possibly_fastp_cleaned_fastq2], addldisk = variantcalling_addl_disk, @@ -156,8 +155,8 @@ workflow myco { } # if we ARE skipping early QC (but the samples did decontaminate without erroring/timing out) - if(fastpQC_skip_entirely) { - call clckwrk_var_call.variant_call_one_sample_ref_included as variant_call_without_fastpQC { + if(earlyQC_skip_entirely) { + call clckwrk_var_call.variant_call_one_sample_ref_included as variant_call_without_earlyQC { input: reads_files = [real_decontaminated_fastq_1, real_decontaminated_fastq_2], addldisk = variantcalling_addl_disk, @@ -198,21 +197,21 @@ workflow myco { # all-samples-get-dropped-before-variant-calling-because-they-suck does not break this section... but it also means # we cannot use anything here to test if any version of the variant caller actually ran at all! # - Array[File] minos_vcfs_if_fastpQC_filtered = select_all(variant_call_after_fastpQC_filtering.adjudicated_vcf) - Array[File] minos_vcfs_if_fastpQC_but_not_filtering = select_all(variant_call_after_fastpQC_but_not_filtering_samples.adjudicated_vcf) - Array[File] minos_vcfs_if_no_fastpQC = select_all(variant_call_without_fastpQC.adjudicated_vcf) + Array[File] minos_vcfs_if_earlyQC_filtered = select_all(variant_call_after_earlyQC_filtering.adjudicated_vcf) + Array[File] minos_vcfs_if_earlyQC_but_not_filtering = select_all(variant_call_after_earlyQC_but_not_filtering_samples.adjudicated_vcf) + Array[File] minos_vcfs_if_no_earlyQC = select_all(variant_call_without_earlyQC.adjudicated_vcf) - Array[File] bams_if_fastpQC_filtered = select_all(variant_call_after_fastpQC_filtering.bam) - Array[File] bams_if_fastpQC_but_not_filtering = select_all(variant_call_after_fastpQC_but_not_filtering_samples.bam) - Array[File] bams_if_no_fastpQC = select_all(variant_call_without_fastpQC.bam) + Array[File] bams_if_earlyQC_filtered = select_all(variant_call_after_earlyQC_filtering.bam) + Array[File] bams_if_earlyQC_but_not_filtering = select_all(variant_call_after_earlyQC_but_not_filtering_samples.bam) + Array[File] bams_if_no_earlyQC = select_all(variant_call_without_earlyQC.bam) - Array[File] bais_if_fastpQC_filtered = select_all(variant_call_after_fastpQC_filtering.bai) - Array[File] bais_if_fastpQC_but_not_filtering = select_all(variant_call_after_fastpQC_but_not_filtering_samples.bai) - Array[File] bais_if_no_fastpQC = select_all(variant_call_without_fastpQC.bai) + Array[File] bais_if_earlyQC_filtered = select_all(variant_call_after_earlyQC_filtering.bai) + Array[File] bais_if_earlyQC_but_not_filtering = select_all(variant_call_after_earlyQC_but_not_filtering_samples.bai) + Array[File] bais_if_no_earlyQC = select_all(variant_call_without_earlyQC.bai) - Array[File] minos_vcfs = flatten([minos_vcfs_if_fastpQC_filtered, minos_vcfs_if_fastpQC_but_not_filtering, minos_vcfs_if_no_fastpQC]) - Array[File] final_bams = flatten([bams_if_fastpQC_filtered, bams_if_fastpQC_but_not_filtering, bams_if_no_fastpQC]) - Array[File] final_bais = flatten([bais_if_fastpQC_filtered, bais_if_fastpQC_but_not_filtering, bais_if_no_fastpQC]) + Array[File] minos_vcfs = flatten([minos_vcfs_if_earlyQC_filtered, minos_vcfs_if_earlyQC_but_not_filtering, minos_vcfs_if_no_earlyQC]) + Array[File] final_bams = flatten([bams_if_earlyQC_filtered, bams_if_earlyQC_but_not_filtering, bams_if_no_earlyQC]) + Array[File] final_bais = flatten([bais_if_earlyQC_filtered, bais_if_earlyQC_but_not_filtering, bais_if_no_earlyQC]) Array[Array[File]] bams_and_bais = [final_bams, final_bais] Array[Array[File]] bam_per_bai = transpose(bams_and_bais) @@ -391,18 +390,18 @@ workflow myco { } } - # handle fastpQC (if it ran at all) + # handle earlyQC (if it ran at all) - Array[String] fastpQC_array_coerced = select_all(qc_fastqs.pass_or_errorcode) - Array[String] fastpQC_errorcode_array = flatten([fastpQC_array_coerced, ["PASS"]]) # will fall back to PASS if fastpQC was skipped - if(!(fastpQC_errorcode_array[0] == pass)) { - String fastpQC_ERR = fastpQC_errorcode_array[0] + Array[String] earlyqc_array_coerced = select_all(qc_fastqs.pass_or_errorcode) + Array[String] earlyqc_errorcode_array = flatten([earlyqc_array_coerced, ["PASS"]]) # will fall back to PASS if earlyQC was skipped + if(!(earlyqc_errorcode_array[0] == pass)) { + String earlyqc_ERR = earlyqc_errorcode_array[0] } # Unfortunately, to check if the variant caller ran, we have to check all three versions of the variant caller. # # But there seems to be a bug in Cromwell that causes it to incorrectly define variables even when a task hasn't run. - # For example, if(defined(variant_call_after_fastpQC_filtering.errorcode)) returns true even if NO variant callers + # For example, if(defined(variant_call_after_earlyQC_filtering.errorcode)) returns true even if NO variant callers # ran at all. And if you put a select_first() in that if block, it will fall back to the next variable, indicating # that select_first() knows it's undefined but defined() does not. I have not replicated this behavior with an optional # non-scattered task, so I think this is specific to scattered tasks, or I am a fool and something is wrong with my @@ -412,13 +411,13 @@ workflow myco { # select_first() and select_all() instead of the seemingly buggy defined(). It's less clean, but it seems to be necessary. # The WDL 1.0 spec does not say what happens if you give select_all() an array that only has optional values, but # the WDL 1.1 spec says you get an empty array. Thankfully, Cromwell handles 1.0-select_all() like the 1.1 spec. - Array[String] errorcode_if_fastpQC_filtered = select_all(variant_call_after_fastpQC_filtering.errorcode) - Array[String] errorcode_if_fastpQC_but_not_filtering = select_all(variant_call_after_fastpQC_but_not_filtering_samples.errorcode) - Array[String] errorcode_if_no_fastpQC = select_all(variant_call_without_fastpQC.errorcode) + Array[String] errorcode_if_earlyQC_filtered = select_all(variant_call_after_earlyQC_filtering.errorcode) + Array[String] errorcode_if_earlyQC_but_not_filtering = select_all(variant_call_after_earlyQC_but_not_filtering_samples.errorcode) + Array[String] errorcode_if_no_earlyQC = select_all(variant_call_without_earlyQC.errorcode) # if the variant caller did not run, the fallback pass will be selected, even though the sample shouldn't be considered a pass, so # the final-final-final error code needs to have decontam's error come before the variant caller error. - Array[String] varcall_errorcode_array = flatten([errorcode_if_fastpQC_filtered, errorcode_if_fastpQC_but_not_filtering, errorcode_if_no_fastpQC, ["PASS"]]) + Array[String] varcall_errorcode_array = flatten([errorcode_if_earlyQC_filtered, errorcode_if_earlyQC_but_not_filtering, errorcode_if_no_earlyQC, ["PASS"]]) if(!(varcall_errorcode_array[0] == pass)) { String varcall_ERR = varcall_errorcode_array[0] } @@ -457,8 +456,8 @@ workflow myco { } # final-final-final error code - # fastpQC is at the end (but before PASS) to account for fastpQC_skip_QC = true - String finalcode = select_first([decontam_ERR, varcall_ERR, covstats_ERR, vcfdiff_ERR, fastpQC_ERR, pass]) + # earlyQC is at the end (but before PASS) to account for earlyQC_skip_QC = true + String finalcode = select_first([decontam_ERR, varcall_ERR, covstats_ERR, vcfdiff_ERR, earlyqc_ERR, pass]) } @@ -497,7 +496,7 @@ workflow myco { # status of sample, only valid iff this ran on only one sample String error_code = select_first([finalcode, pass]) String? debug_decontam_ERR = decontam_ERR - String? debug_fastpQC_ERR = fastpQC_ERR + String? debug_earlyqc_ERR = earlyqc_ERR String? debug_varcall_ERR = varcall_ERR String? debug_covstats_ERR = covstats_ERR String? debug_vcfdiff_ERR = vcfdiff_ERR diff --git a/myco_sra.wdl b/myco_sra.wdl index a77b119..174e4ca 100755 --- a/myco_sra.wdl +++ b/myco_sra.wdl @@ -8,7 +8,7 @@ import "https://raw.githubusercontent.com/aofarrel/tree_nine/0.0.10/tree_nine.wd import "https://raw.githubusercontent.com/aofarrel/parsevcf/1.2.0/vcf_to_diff.wdl" as diff import "https://raw.githubusercontent.com/aofarrel/fastqc-wdl/0.0.2/fastqc.wdl" as fastqc import "https://raw.githubusercontent.com/aofarrel/tb_profiler/0.2.2/tbprofiler_tasks.wdl" as profiler -import "https://raw.githubusercontent.com/aofarrel/TBfastProfiler/0.0.6/TBfastProfiler.wdl" as qc_fastqsWF # aka fastpQC +import "https://raw.githubusercontent.com/aofarrel/TBfastProfiler/0.0.6/TBfastProfiler.wdl" as qc_fastqsWF # aka earlyQC import "https://raw.githubusercontent.com/aofarrel/goleft-wdl/0.1.2/goleft_functions.wdl" as goleft @@ -28,11 +28,11 @@ workflow myco { # QC Boolean fastqc_on_timeout = false - Boolean fastpQC_skip_QC = false - Int fastpQC_minimum_percent_q30 = 90 - Boolean fastpQC_skip_trimming = false - Int fastpQC_trim_qual_below = 30 - Boolean fastpQC_skip_entirely = true + Boolean earlyQC_skip_QC = false + Int earlyQC_minimum_percent_q30 = 90 + Boolean earlyQC_skip_trimming = false + Int earlyQC_trim_qual_below = 30 + Boolean earlyQC_skip_entirely = true # shrink large samples Int subsample_cutoff = 450 @@ -72,11 +72,11 @@ workflow myco { diffQC_mask_bedfile: "Bed file of regions to mask when making diff files (default: R00000039_repregions.bed)" diffQC_max_percent_low_coverage: "Samples who have more than this percent (as int, 50 = 50%) of positions with coverage below diffQC_low_coverage_cutoff will be discarded" diffQC_low_coverage_cutoff: "Positions with coverage below this value will be masked in diff files" - fastpQC_minimum_percent_q30: "Decontaminated samples with less than this percent (as int, 50 = 50%) of reads above qual score of 30 will be discarded. Negated by fastpQC_skip_QC or fastpQC_skip_entirely being false." - fastpQC_skip_entirely: "Do not run fastpQC at all - no trimming, no QC, nothing." - fastpQC_skip_QC: "Run fastpQC (unless fastpQC_skip_entirely is true), but do not throw out samples that fail QC. Independent of fastpQC_skip_trimming." - fastpQC_skip_trimming: "Run fastpQC (unless fastpQC_skip_entirely is true), and remove samples that fail QC (unless fastpQC_skip_QC is true), but do not use fastp's cleaned fastqs." - fastpQC_trim_qual_below: "Trim reads with an average quality score below this value. Independent of fastpQC_minimum_percent_q30. Overridden by fastpQC_skip_trimming or fastpQC_skip_entirely being true." + earlyQC_minimum_percent_q30: "Decontaminated samples with less than this percent (as int, 50 = 50%) of reads above qual score of 30 will be discarded. Negated by earlyQC_skip_QC or earlyQC_skip_entirely being false." + earlyQC_skip_entirely: "Do not run earlyQC (fastp + fastq-TBProfiler) at all - no trimming, no QC, nothing. Does not affect tbprofiler_on_bam." + earlyQC_skip_QC: "Run earlyQC (unless earlyQC_skip_entirely is true), but do not throw out samples that fail QC. Independent of earlyQC_skip_trimming." + earlyQC_skip_trimming: "Run earlyQC (unless earlyQC_skip_entirely is true), and remove samples that fail QC (unless earlyQC_skip_QC is true), but do not use fastp's cleaned fastqs." + earlyQC_trim_qual_below: "Trim reads with an average quality score below this value. Independent of earlyQC_minimum_percent_q30. Overridden by earlyQC_skip_trimming or earlyQC_skip_entirely being true." fastqc_on_timeout: "If true, fastqc one read from a sample when decontamination or variant calling times out" quick_tasks_disk_size: "Disk size in GB to use for quick file-processing tasks; increasing this might slightly speed up file localization" subsample_cutoff: "If a fastq file is larger than than size in MB, subsample it with seqtk (set to -1 to disable)" @@ -91,7 +91,7 @@ workflow myco { # convert percent integers to floats (except covstatsQC_max_percent_unmapped) Float diffQC_max_percent_low_coverage_float = diffQC_max_percent_low_coverage / 100 - Float fastpQC_minimum_percent_q30_float = fastpQC_minimum_percent_q30 / 100 + Float earlyQC_minimum_percent_q30_float = earlyQC_minimum_percent_q30 / 100 call sranwrp_processing.extract_accessions_from_file as get_sample_IDs { input: @@ -136,24 +136,24 @@ workflow myco { File real_decontaminated_fastq_1=select_first([decontam_each_sample.decontaminated_fastq_1, biosample_accessions]) File real_decontaminated_fastq_2=select_first([decontam_each_sample.decontaminated_fastq_2, biosample_accessions]) - # if we are NOT skipping fastpQC - if(!fastpQC_skip_entirely) { + # if we are NOT skipping earlyQC + if(!earlyQC_skip_entirely) { call qc_fastqsWF.TBfastProfiler as qc_fastqs { input: fastq1 = real_decontaminated_fastq_1, fastq2 = real_decontaminated_fastq_2, - q30_cutoff = fastpQC_minimum_percent_q30_float, - average_qual = fastpQC_trim_qual_below, - output_fastps_cleaned_fastqs = !(fastpQC_skip_trimming) + q30_cutoff = earlyQC_minimum_percent_q30_float, + average_qual = earlyQC_trim_qual_below, + output_fastps_cleaned_fastqs = !(earlyQC_skip_trimming) } - # if we are filtering out samples via fastpQC... - if(!(fastpQC_skip_QC)) { + # if we are filtering out samples via earlyQC... + if(!(earlyQC_skip_QC)) { if(qc_fastqs.did_this_sample_pass) { File possibly_fastp_cleaned_fastq1_passed=select_first([qc_fastqs.cleaned_fastq1, real_decontaminated_fastq_1]) File possibly_fastp_cleaned_fastq2_passed=select_first([qc_fastqs.cleaned_fastq2, real_decontaminated_fastq_2]) - call clckwrk_var_call.variant_call_one_sample_ref_included as variant_call_after_fastpQC_filtering { + call clckwrk_var_call.variant_call_one_sample_ref_included as variant_call_after_earlyQC_filtering { input: reads_files = [possibly_fastp_cleaned_fastq1_passed, possibly_fastp_cleaned_fastq2_passed], @@ -171,12 +171,12 @@ workflow myco { } } - # if we are not filtering out samples via the early qc step (but ran fastpQC anyway)... - if(fastpQC_skip_QC) { + # if we are not filtering out samples via the early qc step (but ran earlyQC anyway)... + if(earlyQC_skip_QC) { File possibly_fastp_cleaned_fastq1=select_first([qc_fastqs.cleaned_fastq1, real_decontaminated_fastq_1]) File possibly_fastp_cleaned_fastq2=select_first([qc_fastqs.cleaned_fastq2, real_decontaminated_fastq_2]) - call clckwrk_var_call.variant_call_one_sample_ref_included as variant_call_after_fastpQC_but_not_filtering_samples { + call clckwrk_var_call.variant_call_one_sample_ref_included as variant_call_after_earlyQC_but_not_filtering_samples { input: reads_files = [possibly_fastp_cleaned_fastq1, possibly_fastp_cleaned_fastq2], @@ -195,8 +195,8 @@ workflow myco { } # if we ARE skipping early QC (but the samples did decontaminate without erroring/timing out) - if(fastpQC_skip_entirely) { - call clckwrk_var_call.variant_call_one_sample_ref_included as variant_call_without_fastpQC { + if(earlyQC_skip_entirely) { + call clckwrk_var_call.variant_call_one_sample_ref_included as variant_call_without_earlyQC { input: reads_files = [real_decontaminated_fastq_1, real_decontaminated_fastq_2], @@ -217,21 +217,21 @@ workflow myco { } # do some wizardry to deal with optionals - Array[File] minos_vcfs_if_fastpQC_filtered = select_all(variant_call_after_fastpQC_filtering.adjudicated_vcf) - Array[File] minos_vcfs_if_fastpQC_but_not_filtering = select_all(variant_call_after_fastpQC_but_not_filtering_samples.adjudicated_vcf) - Array[File] minos_vcfs_if_no_fastpQC = select_all(variant_call_without_fastpQC.adjudicated_vcf) + Array[File] minos_vcfs_if_earlyQC_filtered = select_all(variant_call_after_earlyQC_filtering.adjudicated_vcf) + Array[File] minos_vcfs_if_earlyQC_but_not_filtering = select_all(variant_call_after_earlyQC_but_not_filtering_samples.adjudicated_vcf) + Array[File] minos_vcfs_if_no_earlyQC = select_all(variant_call_without_earlyQC.adjudicated_vcf) - Array[File] bams_if_fastpQC_filtered = select_all(variant_call_after_fastpQC_filtering.bam) - Array[File] bams_if_fastpQC_but_not_filtering = select_all(variant_call_after_fastpQC_but_not_filtering_samples.bam) - Array[File] bams_if_no_fastpQC = select_all(variant_call_without_fastpQC.bam) + Array[File] bams_if_earlyQC_filtered = select_all(variant_call_after_earlyQC_filtering.bam) + Array[File] bams_if_earlyQC_but_not_filtering = select_all(variant_call_after_earlyQC_but_not_filtering_samples.bam) + Array[File] bams_if_no_earlyQC = select_all(variant_call_without_earlyQC.bam) - Array[File] bais_if_fastpQC_filtered = select_all(variant_call_after_fastpQC_filtering.bai) - Array[File] bais_if_fastpQC_but_not_filtering = select_all(variant_call_after_fastpQC_but_not_filtering_samples.bai) - Array[File] bais_if_no_fastpQC = select_all(variant_call_without_fastpQC.bai) + Array[File] bais_if_earlyQC_filtered = select_all(variant_call_after_earlyQC_filtering.bai) + Array[File] bais_if_earlyQC_but_not_filtering = select_all(variant_call_after_earlyQC_but_not_filtering_samples.bai) + Array[File] bais_if_no_earlyQC = select_all(variant_call_without_earlyQC.bai) - Array[File] minos_vcfs = flatten([minos_vcfs_if_fastpQC_filtered, minos_vcfs_if_fastpQC_but_not_filtering, minos_vcfs_if_no_fastpQC]) - Array[File] final_bams = flatten([bams_if_fastpQC_filtered, bams_if_fastpQC_but_not_filtering, bams_if_no_fastpQC]) - Array[File] final_bais = flatten([bais_if_fastpQC_filtered, bais_if_fastpQC_but_not_filtering, bais_if_no_fastpQC]) + Array[File] minos_vcfs = flatten([minos_vcfs_if_earlyQC_filtered, minos_vcfs_if_earlyQC_but_not_filtering, minos_vcfs_if_no_earlyQC]) + Array[File] final_bams = flatten([bams_if_earlyQC_filtered, bams_if_earlyQC_but_not_filtering, bams_if_no_earlyQC]) + Array[File] final_bais = flatten([bais_if_earlyQC_filtered, bais_if_earlyQC_but_not_filtering, bais_if_no_earlyQC]) # Now we need to essentially scatter on three arrays: minos_vcfs, bams, and bais. This is trivial in CWL, but # isn't intutive in WDL -- in fact, it's arguably not possible! @@ -363,7 +363,7 @@ workflow myco { if(fastqc_on_timeout) { Array[File] bad_fastqs_decontam_ = select_all(decontam_each_sample.check_this_fastq) - Array[File] bad_fastqs_varcallr_ = select_all(flatten([variant_call_without_fastpQC.check_this_fastq, variant_call_after_fastpQC_but_not_filtering_samples.check_this_fastq])) + Array[File] bad_fastqs_varcallr_ = select_all(flatten([variant_call_without_earlyQC.check_this_fastq, variant_call_after_earlyQC_but_not_filtering_samples.check_this_fastq])) Array[Array[File]] bad_fastqs_ = [bad_fastqs_decontam_, bad_fastqs_varcallr_] if(length(decontam_each_sample.check_this_fastq)>=1 && length(bad_fastqs_varcallr_)>=1) { Array[File] bad_fastqs_both = flatten(bad_fastqs_)