Revert "Prepare for removal of TBfastProfiler (earlyQC)"
This reverts commit d9fa782.
aofarrel committed Sep 25, 2023
1 parent 012866b commit f6891b8
Showing 6 changed files with 99 additions and 100 deletions.
2 changes: 1 addition & 1 deletion doc/inputs.md
@@ -34,7 +34,7 @@ myco_cleaned expects that the FASTQs you are putting into have already been clea
| diff_min_coverage_per_site | Int | 10 | Positions with coverage below this value will be masked in diff files |
| early_qc_apply_cutoffs | Boolean | false | If true, run fastp + TBProfiler on decontaminated fastqs and apply cutoffs to determine which samples should be thrown out. |
| early_qc_cutoff_q30 | Float | 0.9 | Decontaminated samples with less than this percentage (as float, 0.5 = 50%) of reads above qual score of 30 will be discarded iff early_qc_apply_cutoffs is also true. |
| fastpQC_skip_entirely | Boolean | false | Do not run early QC (fastp + fastq-TBProfiler) at all. Does not affect whether or not TBProfiler is later run on bams. Overrides early_qc_apply_cutoffs. |
| earlyQC_skip_entirely | Boolean | false | Do not run early QC (fastp + fastq-TBProfiler) at all. Does not affect whether or not TBProfiler is later run on bams. Overrides early_qc_apply_cutoffs. |
| fastqc_on_timeout | Boolean | false | (myco_sra only) If true, fastqc one read from a sample when decontamination or variant calling times out |

Note that all forms of QC will throw out entire samples, with two exceptions:
10 changes: 5 additions & 5 deletions doc/qc_and_filtering.md
@@ -13,11 +13,11 @@ If a FASTQ is above `subsample_cutoff` MB, it will get downsampled by seqtk. `su
### decontamination
An entire WDL task of myco (except myco_cleaned) is dedicated just to decontaminating reads. The decontamination workflow starts with `clockwork map_reads` to map to a decontamination reference, and then uses `clockwork rm_contam` to generate decontaminated FASTQs. It is worth noting that how long a sample spends in this decontamination step roughly correlates with how much contamination is in it, but input file size is also a factor. If you're seeing a batch of samples that are roughly the same size (or subject to default downsampling settings) as typical, but take unusually long to decontaminate, that batch of samples might be considered suspect.

### fastpQC
fastpQC runs fastp as both a QC step and a trimming step. Earlier versions of myco had "earlyQC" which merged fastp and fastq-TBProfiler into one subworkflow, but after version 5.0.2 Ash realized it's less confusing to just separate these out.
### earlyQC (aka TBfastProfiler)
EarlyQC merges TBProfiler (in fastq-input-mode) and fastp into one WDL step which will run unless `earlyQC_skip_entirely` is true or you are running myco_cleaned. TBProfiler does no site-specific filtering of its own, but if `earlyQC_skip_trimming` is false, fastp will further clean your FASTQs as a form of site-specific filtering.

#### removing low-quality read pairs
`fastpQC_trim_qual_below` is piped into fastp's `average_qual`. If one read's average quality score is < `average_qual`, then that read/pair is discarded. You can disable this by setting `fastpQC_trim_qual_below` to 0 or `fastpQC_skip_trimming` to true.
`earlyQC_trim_qual_below` is piped into fastp's `average_qual`. If one read's average quality score is < `average_qual`, then that read/pair is discarded. You can disable this by setting `earlyQC_trim_qual_below` to 0 or `earlyQC_skip_trimming` to true.
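
The read-pair filter described above can be illustrated with a minimal Python sketch. It assumes Phred+33 quality strings and an arithmetic mean of per-base scores; the function names `mean_phred` and `keep_pair` are hypothetical and are not part of myco or fastp, whose internal computation may differ in detail.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores encoded in a FASTQ quality string.

    Assumes Phred+33 encoding (an assumption for this sketch).
    """
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def keep_pair(qual_r1: str, qual_r2: str, trim_qual_below: int) -> bool:
    """Return True if a read pair survives the average-quality filter.

    A threshold of 0 disables the filter, mirroring the documented
    behaviour of setting the trim-quality input to 0.
    """
    if trim_qual_below <= 0:
        return True
    # If either read's average quality is below the threshold,
    # the whole pair is discarded.
    return (mean_phred(qual_r1) >= trim_qual_below
            and mean_phred(qual_r2) >= trim_qual_below)
```

For example, with a threshold of 30, a pair whose second read is all `#` (Phred 2) is dropped even if the first read is all `I` (Phred 40).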

### variant calling
All forms of myco call variants with clockwork, which itself leverages minos. minos generates VCFs using two different methods, compares them, and then outputs a final adjudicated VCF.
@@ -45,8 +45,8 @@ Notes:
### decontamination
Entire samples do not get filtered out here unless the decontamination task errors out, or you have timeouts -- specifically `timeout_decontam_part1` and `timeout_decontam_part2` -- set to a nonzero value. The reason for timeouts filtering out samples is that a sample taking a long time is itself a sign that the sample is heavily contaminated, and a heavily decontaminated sample is more likely to have too many sites removed for variant calling to work properly, which is useful if processing tens of thousands of samples from SRA of varying degrees of quality. It is, however, a lot fuzzier than most other forms of QC in this pipeline, so timeouts are turned off (set to 0) by default for myco_raw. For more information on the circumstances that can cause the decontamination task to error out, please see [status_codes.md](./status_codes.md).

### fastpQC
If more than `fastpQC_minimum_percent_q30` (as float where 0.5=50%) of your decontaminated FASTQs' calls are below Q30, and if `fastpQC_skip_QC` is false, the sample will be removed with status `fastpQC_TOO_MANY_BELOW_Q30`. This is independent of fastp's site-specific filtering (e.g., `fastpQC_skip_trimming` and `fastpQC_trim_qual_below`).
### earlyQC
If more than `earlyQC_minimum_percent_q30` (as float where 0.5=50%) of your decontaminated FASTQs' calls are below Q30, and if `earlyQC_skip_QC` is false, the sample will be removed with status `EARLYQC_TOO_MANY_BELOW_Q30`. This is independent of fastp's site-specific filtering (e.g., `earlyQC_skip_trimming` and `earlyQC_trim_qual_below`).
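
A rough Python sketch of this cutoff is given here. It follows the wording in doc/inputs.md, where samples with less than the cutoff fraction of reads at or above Q30 are discarded. The function name `early_qc_status`, the `q30_rate` argument, and the treatment of `skip_entirely` as a second off-switch are assumptions made for illustration; the actual WDL logic may differ.

```python
from typing import Optional


def early_qc_status(
    q30_rate: float,
    minimum_percent_q30: float = 0.90,
    skip_entirely: bool = False,
    skip_qc: bool = False,
) -> Optional[str]:
    """Return a failure status if the sample misses the Q30 cutoff, else None.

    q30_rate is the fraction (0.0 to 1.0) of calls at or above Q30,
    as a tool such as fastp might report it (hypothetical input here).
    """
    # Neither check applies if early QC, or just its cutoff, is disabled.
    if skip_entirely or skip_qc:
        return None
    if q30_rate < minimum_percent_q30:
        return "EARLYQC_TOO_MANY_BELOW_Q30"
    return None
```

Read-level trimming is deliberately absent from this sketch, since the cutoff is described as independent of fastp's site-specific filtering.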

### variant calling
As with decontamination, entire samples do not get filtered out here unless the variant caller has an error or times out.
6 changes: 3 additions & 3 deletions doc/status_codes.md
@@ -12,12 +12,12 @@ Status codes represent the status of a given sample after it has completed the p
| DECONTAMINATION_RM_CONTAM_KILLED | The rm_contam part of the decontamination process was killed (return code 137) | yes | Set the decontamination task's memory runtime attribute to a higher value (default: 16 GB) and rerun. |
| DECONTAMINATION_RM_CONTAM_UNKNOWN_ERROR | The rm_contam part of the decontamination process had an unknown error | no | Open an issue on GitHub |
| DECONTAMINATION_RM_CONTAM_TIMEOUT | The rm_contam part of the decontamination process went over `timeout_decontam_part2` minutes | yes, but the sample is suspect | This could be a sign your sample is very heavily contaminated. If you wish to continue attempting to use it, set `timeout_decontam_part2` to 0 and rerun. |
| fastpQC_TOO_MANY_BELOW_Q30 | fastp detected `early_qc_cutoff_q30`*100 percent of your FASTQs' calls have a quality score below 30 | yes, but the sample is suspect | This could be a sign your sample is very low quality, possibly due to issues in sample purification or during sequencing. If you wish to continue attempting to use it, adjust `early_qc_cutoff_q30` to a lower value (default: 0.90) |
| EARLYQC_TOO_MANY_BELOW_Q30 | fastp detected `early_qc_cutoff_q30`*100 percent of your FASTQs' calls have a quality score below 30 | yes, but the sample is suspect | This could be a sign your sample is very low quality, possibly due to issues in sample purification or during sequencing. If you wish to continue attempting to use it, adjust `early_qc_cutoff_q30` to a lower value (default: 0.90) |
| VARIANT_CALLING_KILLED | The variant calling task was killed (return code 137) | yes, but the sample is suspect | Set `variantcalling_memory` to a higher value (default: 32 GB) and rerun, but be aware that running out of memory on default settings is quite unusual and may indicate an issue with the data. |
| VARIANT_CALLING_TIMEOUT | The variant calling task went over `timeout_variant_caller` minutes | yes, but the sample is suspect | This could be a sign your sample is very small or very large. If you wish to continue attempting to use it, set `timeout_variant_caller` to 0. |
| VARIANT_CALLING_UNKNOWN_ERROR | The variant calling task returned 1 for unknown reasons | no | Your FASTQs might be corrupt or almost entirely empty. |
| VARIANT_CALLING_UNKNOWN_ERROR_$rc | The variant calling task returned $rc for unknown reasons | no | Your FASTQs might be corrupt or almost entirely empty. |
| VARIANT_CALLING_ADJUDICATION_FAILURE | The variant calling task failed, and it appears your sample has enough sites for minimap2 but not Cortex | yes, if sample can be bigger | It appears Cortex cannot find any variants to call. It's possible too much of it was removed during the decontamination step, or there was never much of it in the first place. Check the size of this sample's input FASTQs and compare that to the size of the FASTQs after the decontamination step and fastpQC. You *might* be able to recover this sample by running myco_cleaned on raw, not-downsampled FASTQs. |
| VARIANT_CALLING_ADJUDICATION_FAILURE | The variant calling task failed, and it appears your sample has enough sites for minimap2 but not Cortex | yes, if sample can be bigger | It appears Cortex cannot find any variants to call. It's possible too much of it was removed during the decontamination step, or there was never much of it in the first place. Check the size of this sample's input FASTQs and compare that to the size of the FASTQs after the decontamination step and earlyQC. You *might* be able to recover this sample by running myco_cleaned on raw, not-downsampled FASTQs. |
| VCF2DIFF_TOO_MANY_LOW_COVERAGE_SITES | VCF-to-diff task found ≥`diffQC_max_percent_low_coverage`*100 percent of sample's sites are below `diffQC_low_coverage_cutoff` coverage | yes, but the sample is suspect | A diff file can still be generated if `diff_min_coverage_per_site` (default: 10) is set to 0, but note that low coverage sites will not be masked in the resulting diff file. |
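
As an illustration of how the variant calling rows in the table relate to return codes, here is a small Python sketch. The function `variant_calling_status` and its `timed_out` flag are hypothetical; the table does not say how a timeout is detected or in what order the checks run, so checking the timeout first is an assumption. `VARIANT_CALLING_ADJUDICATION_FAILURE` is not covered because it depends on inspecting the task's output rather than its return code.

```python
from typing import Optional


def variant_calling_status(return_code: int, timed_out: bool = False) -> Optional[str]:
    """Map a variant calling outcome to a status code from the table.

    Returns None when the task succeeded (return code 0, no timeout).
    """
    if timed_out:
        return "VARIANT_CALLING_TIMEOUT"
    if return_code == 0:
        return None
    if return_code == 137:
        # 137 usually means the process was killed, often for lack of memory.
        return "VARIANT_CALLING_KILLED"
    if return_code == 1:
        return "VARIANT_CALLING_UNKNOWN_ERROR"
    # Any other nonzero code is reported with the code appended.
    return f"VARIANT_CALLING_UNKNOWN_ERROR_{return_code}"
```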

### not visible to user, but defined in documentation
@@ -33,6 +33,6 @@ Status codes represent the status of a given sample after it has completed the p

<!---
| DECONTAMINATION_NOTHING_LEFT | Comparing the number of reads in your FASTQ before and after decontamination indicates that the vast majority of it was contamination | yes, but the sample is suspect | Your sample was heavily contaminated! If your sample started out large enough, there might be enough data left to continue, which you can attempt with AAAAAAA. |
| fastpQC_LOW_MEDIAN_COVERAGE | TBProfiler detected your sample has a median coverage below AAAAAAAAAAAAAAAAAA | yes, but the sample is suspect | It's very likely that your sample would be filtered out by later coverage checks even if this check was skipped. If you wish to continue attempting to use it anyway, adjust AAAAAAAAAAA |
| EARLYQC_LOW_MEDIAN_COVERAGE | TBProfiler detected your sample has a median coverage below AAAAAAAAAAAAAAAAAA | yes, but the sample is suspect | It's very likely that your sample would be filtered out by later coverage checks even if this check was skipped. If you wish to continue attempting to use it anyway, adjust AAAAAAAAAAA |
| TREE_TOO_MANY_LOW_COVERAGE_SITES |
--->
6 changes: 3 additions & 3 deletions inputs/myco_sra_terra_earlyQC.json
@@ -4,7 +4,7 @@
"myco.tree_to_decorate": "gs://topmed_workflow_testing/tb/trees/alldiffs_mask2ref.L.fixed.pb",
"myco.diffQC_low_coverage_cutoff": 10,
"myco.diffQC_mask_bedfile": "gs://topmed_workflow_testing/tb/R00000039_repregions.bed",
"myco.fastpQC_skip_entirely": "false",
"myco.fastpQC_skip_trimming": "false",
"myco.fastpQC_minimum_percent_q30": "0.90"
"myco.earlyQC_skip_entirely": "false",
"myco.earlyQC_skip_trimming": "false",
"myco.earlyQC_minimum_percent_q30": "0.90"
}