All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Clair3 HAC and SUP V5.0.0 models
- Update Spectre to v0.2.1, which generates a predicted karyotype that is included in the HTML report.
- Improved reliability of the haplocheck process.
- infer_sex causing SNP subworkflow to wait unnecessarily on completion of mosdepth
- Alignment QC report crashing due to missing unmapped histograms
- Unsupported [email protected] model
- Missing Clair3 mapping for v3.5.1 models
- The missing model mapping is provided for completeness; users are strongly encouraged to re-basecall data on newer models to take advantage of significant improvements.
- Workflow will exit early with an informative message and alignment report if there are no aligned reads in the BAM
- Workflow emitting `OPTIONAL_FILE` in some cases.
- File format of alignment files created by the workflow can be explicitly controlled with `--output_xam_fmt [bam|cram]`.
  - The default is CRAM to match existing behaviour; for BAM use `--output_xam_fmt bam`.
- Enable RUST_BACKTRACE to automatically provide improved log messages for modkit issues.
- JBrowse configuration as it is no longer supported.
- Unused `merge_threads` parameter.
- "No such file or directory" error for SV tandem repeat BED when user has assets in /data/
- Reconciled workflow with wf-template v5.1.4
- `modkit sample-probs` failing with downsampling and re-alignments.
- Workflow waiting for `mosdepth` and `readStats` processes when `--bam_min_coverage 0`.
- Output `{{sample}}.stats.json` file describing some key metrics for the analysis.
- Summary of gene coverage if a 4-column BED is provided.
- Automated sex determination using relative coverage of chrX and chrY.
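As a rough illustration of the coverage-based sex determination noted above, the sketch below classifies a sample from mean chrX and chrY depth. The function name `infer_sex_from_coverage` and the `ratio_threshold` value are assumptions made for this example; the workflow's actual rule and cut-off are not stated in this changelog.

```python
def infer_sex_from_coverage(
    chrx_cov: float, chry_cov: float, ratio_threshold: float = 4.0
) -> str:
    """Return "XX" or "XY" from mean chrX and chrY coverage.

    ratio_threshold is a hypothetical cut-off chosen for illustration,
    not a value taken from the workflow.
    """
    if chrx_cov <= 0:
        raise ValueError("chrX coverage must be positive")
    if chry_cov <= 0:
        # No reads on chrY at all: treat as XX.
        return "XX"
    # A high chrX:chrY ratio means chrY is (nearly) absent.
    return "XX" if chrx_cov / chry_cov >= ratio_threshold else "XY"


if __name__ == "__main__":
    print(infer_sex_from_coverage(30.0, 0.4))
    print(infer_sex_from_coverage(15.0, 12.0))
```

The return values use the `XX`/`XY` labels rather than "female"/"male", in line with the `--sex` parameter change recorded in this file.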
- Retry strategy added to `snp:aggregate_pileup_variants` to prevent out of memory error.
- `--GVCF --phased` will produce a phased GVCF.
- Changed default phasing algorithm to `whatshap`, with the possibility to change the phasing to `longphase` with `--use_longphase true`.
  - The intermediate phasing is still performed using `longphase`.
- Setting `--snp --sv --phased` will emit individually phased SNPs and SVs.
- Phased bedMethyl files now follow the pattern `{{ alias }}.wf_mods.{{ haplotype }}.bedmethyl.gz`.
- `--sex` parameter uses `XX` and `XY` rather than "female" and "male".
- Update `modkit` to v0.2.6.
- Improved modkit runtime by increasing default threads and increasing the default interval size.
- Improved modkit runtime by increasing the default interval size and running modkit on individual contigs.
- `modkit` is now run only on chromosomes 1-22, X, Y and MT, unless `--include_all_ctgs` is provided.
- Increased minimum CPU requirement for the workflow to 16.
- Filtering of SVs using a BED file now includes sites only partially overlapping the specified regions.
- `basecaller_cfg` will be inferred from the `basecall_model` DS key of input read groups, if available.
  - Providing `--basecaller_cfg` will not be required if `basecall_model` is present in the DS tag of the read groups of the input BAM.
  - `--basecaller_cfg` will be ignored if a `basecall_model` is found in the input BAM.
- Reconciled workflow with wf-template v5.1.2
- Update to Clair3 v1.0.8.
- Update to longphase v1.7.1.
- Update schema to allow selection of multiple BAM files per sample in EPI2ME.
- Spectre CNV report not handling cases when no CNVs detected.
- Lines denoting normal maximum and pathogenic minimum thresholds now correctly displayed on STR repeat content plots.
- Workflow will not emit `sample.cram` if `sample.haplotagged.cram` has been created by the workflow, to save storage.
- Emitting nonsense `input.1` file
- Single-step joint phasing of SV and SNP.
- `--output_separate_phased` as the workflow emits only individually phased VCFs.
- A copy of the reference and the generated reference cache is no longer output by the workflow.
  - The workflow encourages use of readily available standard reference sequences, so re-emitting the input reference as a workflow output is unnecessarily consuming disk space.
- ClinVar version in SnpEff container updated to version 20240307.
- Convert to BAM only when `--cnv --use_qdnaseq` is selected.
- Update to Clair3 v1.0.6.
- Update Spectre to fix an error when parsing Clair3 VCFs with multiple AFs.
- Support for an input folder of multiple BAM files per sample with `--bam` (instead of only allowing a single BAM per sample).
- `refine_with_sv` to be run by chromosome in order to reduce memory footprint.
- Force minimap2 to clean up memory more aggressively. Empirically this reduces peak-memory use over the course of execution.
- Handling of input VCF files with `--vcf_fn`.
- `--phased --sv --snp` generates a truncated VCF file when `#` appears in the VCF `INFO` field
- Some reporting scripts using too much memory.
- CRAM as supported input format.
- `old_ref` parameter as providing the reference of an existing CRAM is no longer needed.
- CNV calling with `--cnv` is now performed using Spectre, which is optimised for long reads.
  - Legacy CNV calling using QDNAseq may still be carried out with `--cnv --use_qdnaseq`.
  - The bin size parameter has been renamed from `--bin_size` to `--qdnaseq_bin_size`.
- Skip CNV CRAM to BAM conversion if downsampling is required, to avoid creating an unnecessary intermediate file.
- The output of `--depth_intervals` now has `.bedgraph.gz` extension.
- SV workflow outputs SVs in the autosomes, sex chromosomes and MT; use `--include_all_ctgs` to output calls on all the sequences.
- Output definitions for coverage files.
- N50 and mean coverage added to alignment report.
- EPI2ME Desktop incorrectly allowed selection of directory for `tr_bed`.
- `failedQCReport` failing to generate a report.
- Add an additional `whatshap haplotag` process after the final VCF phasing.
- Updates to the phasing subworkflow significantly impact the runtime and storage requirement for the workflow, as detailed here.
- Several performance improvements which should noticeably reduce the running time of the workflow
- Updated the version of Straglr, which addresses the following:
  - Repeats can now be called in RFC1
  - Start position of called STRs is 1-based rather than 0-based
  - VCF headers now match those in the `FORMAT` field
- Generate `allChromosomes.bed` using `samtools faidx` index instead of `pyfaidx`, to avoid a KeyError
- Inconsistent file ownership of bundled Clair3 model files which could lead to subuid errors in some environments
- Bug report form.
- Clair3 4.3.0 models.
- `--phase_vcf`, `--joint_phasing` and `--phase_mod` are now deprecated for `--phased`; see the README for more details.
- `--use_longphase_intermediate` is now deprecated, and `--use_longphase false` will use `whatshap` throughout consistently
- Running `--phase --snp --use_longphase false` will now phase indels too
- `--basecalling_cfg` currently provides the configuration to Clair3.
- The `clair3:` prefix to Clair3 specific models is no longer required.
- SNP workflow ignoring the mitochondrial genome.
- CNV report generation fails if there is no consensus on the copy number of a chromosome
  - `Undetermined` category has been added to the `Chromosome Copy Summary` to account for these cases
- `readStats` reports metrics on the downsampled BAM when `--downsample_coverage` is requested.
- Spurious warning for missing MM/ML tags when a BAM fails the coverage threshold
- wf-basecalling subworkflow
- fast5_dir input and other basecalling related options have been removed from the workflow parameters
- Users should run the standalone wf-basecalling workflow and provide the output to wf-human-variation
- Mapula statistics with `--mapula`
- `--joint_phasing` generating single-chromosome VCF files.
- ClinVar annotation of SVs has been temporarily removed due to not being correctly incorporated. SnpEff annotations are still produced as part of the final SV VCF.
- New documentation
- `--annotation_threads` parameter, as the SnpEff process does not support multithreading.
- Truncated SV VCF header generated from `vcfsort`.
- `sed` crashing with I/O error in some instances.
- Missing flagstats file in output directory.
- STR workflow report now includes additional plots which display repeat units and interruptions in each supporting read
- CNV workflow now outputs an indexed VCF file to the output directory
- Legend symbols in STR genotyping plot
- Unambiguous naming of bedMethyl files generated with `--mod`
  - Unphased outputs will have the pattern `[sample_name].wf_mods.bedmethyl.gz`
  - Phased outputs will have the pattern `[sample_name]_[1|2|ungrouped].wf_mods.bedmethyl.gz`
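The two naming patterns above can be expressed as a small helper. This is an illustrative sketch only: `bedmethyl_name` is a hypothetical function written for this example and is not part of the workflow.

```python
from typing import Optional

# Haplotype labels taken from the phased pattern above.
VALID_HAPLOTYPES = {"1", "2", "ungrouped"}


def bedmethyl_name(sample_name: str, haplotype: Optional[str] = None) -> str:
    """Build a bedMethyl file name following the patterns listed above."""
    if haplotype is None:
        # Unphased output: [sample_name].wf_mods.bedmethyl.gz
        return f"{sample_name}.wf_mods.bedmethyl.gz"
    if haplotype not in VALID_HAPLOTYPES:
        raise ValueError(f"unexpected haplotype label: {haplotype!r}")
    # Phased output: [sample_name]_[1|2|ungrouped].wf_mods.bedmethyl.gz
    return f"{sample_name}_{haplotype}.wf_mods.bedmethyl.gz"


if __name__ == "__main__":
    print(bedmethyl_name("demo"))
    print(bedmethyl_name("demo", "ungrouped"))
```

Note that later entries in this file change the phased pattern again, so the helper reflects only the release in which these patterns were introduced.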
- Report step failing if bcftools stats file has only some sub-sections
- Clair3 ignoring the bed file
- merge_haplotagged_contigs incorrectly generating intermediate CRAM when input is BAM
- STR content generation failing due to forward slash in disease name in `variant_catalog_hg38.json`
- Report name for the read alignment statistics now follows the pattern `[sample_name].wf-human-alignment-report.html`
- configure-jbrowse breaking on unescaped spaces
- The SNP workflow will filter the final VCF to only return calls in the regions specified with `--bed`
- Avoid clair3 calling variants in regions that flank those specified in the BED
- Add a process to sanitise BED files
- SNP subworkflow was ignoring BED file and analysing all regions
- Report SV crashing when generating the size plots with only large indels
- Downsampling not working when targeting regions with `--bed`
- `--phase_mod` not emitting the haplotagged bam files
- Report crashing when loading a clinvar-annotated VCF file with multiple `GENEINFO`/`CLNVC` entries
- Default local executor CPU and RAM limits
- SV size barchart causing extremely large reports.
- The plot has been replaced with a distribution plot fixed to +/- 5KBp. The summary table reports the INS and DEL min/max values.
- Workflow could not be launched from EPI2ME desktop app due to incorrectly quoted fields in schema
- replaced `--methyl` with `--mod`, and `--phase_methyl` with `--phase_mod`
- replaced `modbam2bed` with `modkit` (v0.1.12)
- `--phase_mod` will generate three bed files, one for each haplotype and one for the reads that are not tagged.
- Add locus for LRP12 to BED file of STR repeats (GRCh38)
- When `--bed` and `--bam_min_coverage` are specified, the workflow will process the regions passing the coverage filters
- `whatshap` v2.0 in base workflow
- Patch `whatshap stats` crashing when no heterozygote sites are found in a contig
- Patch `makeReport` crashing when loading empty ClinVar VCFs
- Increased minimum memory required for the workflow to 16 GB of RAM to reduce alignment failures
- sv.filterBam missing output file when creating BAM CSI
- VCFs generated by the `--sv` option are now automatically annotated with `SnpEff`, incorporating ClinVar annotations
  - This can be switched off with `--skip_annotation`
- `--skip_annotation` disables attempt to determine human genome version, enabling `--snp`, `--sv` and `--phase_methyl` to be called on genomes which aren't hg19 or hg38
- Updated example command displayed when running `--help`
- The ClinVar table in `--snp` and `--sv` reports is now sorted according to clinical significance, and includes HGVS cDNA and protein descriptions
- Workflow options for disabling steps have been updated for consistency:
  - `--skip_annotation` is now `--annotation false`
  - `--skip_refine_snp_with_sv` is now `--refine_snp_with_sv false`
- `--phase_methyl` also calls modifications using all reads to account for unphased regions
- `Input options` and `Output options` have been combined in the `Main options` category
- Updated Clair3 to v1.0.4
- SV workflow does not filter the SV calls by size and type
- Report for bam files failing the depth threshold is now consistent with the report of bam passing the hard threshold
- Perform Sniffles SV phasing with `--phase_sv`
- `--phase_vcf` now runs Sniffles SV in phasing mode
- 'mosdepth_downsampled' is defined more than once warning
- Workflow crashing when providing a reference with spaces/brackets in file name
- Workflow not emitting GVCF even when requested
- GVCF sample name not matching `sample_name`
- `--cnv` subworkflow sometimes reporting incorrect genetic sex, due to the way segment copy numbers were aggregated across chromosomes
- Add downsampling of large bam files
- `--joint_phasing` allows an additional joint SNP and SV phasing
- `--joint_phasing` and `--phase_vcf` now emit a phased block GTF file to facilitate visualization
- `--sv_types`, all SV types are returned without filtering
- `--max_sv_length`, all SVs are returned regardless of maximum size
- GPU tasks are limited to run in serial by default to avoid memory errors
  - Users in cluster and cloud environments where GPU devices are scheduled must use `-profile discrete_gpus` to parallelise GPU work
  - A warning will be printed if the workflow detects it is running non-local execution but the discrete_gpus profile is not enabled
  - Additional guidance on GPU support is provided in our Quickstart
- ModuleNotFoundError on callCNV step when transforming some VCFs
- Malformed VCF created by annotation step if Java emits a warning
- VCFs generated by the `--snp` option are now automatically annotated with `SnpEff`, incorporating ClinVar annotations
  - This can be switched off with `--skip_annotation`
- Bumped minimum required Nextflow version to 22.10.8
- Updated wf-basecalling subworkflow to 0.7.1 (Dorado 0.3.0)
- Enum choices are enumerated in the `--help` output
- Enum choices are enumerated as part of the error message when a user has selected an invalid choice
- Made it easier to see which basecaller configurations can be used for basecalling and which configurations are presented to allow small variant calling of existing datasets
  - Basecaller configurations prefixed with `clair3:` cannot be used for basecalling
- v4.2.0 basecalling models, which must be used for sequencing runs performed at new 5 kHz sampling rate
- v4.1.0 basecalling models replace v4.0.0 models and must be used for sequencing runs performed at 4 kHz sampling rate
- Clair3 260bps models
- Workflow `get_filter_calls_command` crashes with stranded interval BED
- CRAM inputs are now converted internally to BAM to avoid CNV subworkflow crash
- Basecalls and alignments are emitted in BAM to support CNV subworkflow if selected
- Workflow incorrectly prompting for an `--old_ref` when providing unaligned CRAM
- Configuration for running demo data in AWS
- Reports for wf-human-sv and wf-human-snp
- Coverage plot not working when a bed region file is provided
- Workflow crashing with `--bam_min_coverage 0`
- Subworkflows that require hg19/GRCh37 now correctly accept references where chromosome sequence names do not have the 'chr' prefix
- Depth of sequencing plots moved to alignment report
- Corrected bin size unit displayed on CNV report
- Workflow outputs alignment statistics report when alignment has been performed
- Updated Clair3 to v1.0.1 to bump WhatsHap dependency to v1.7
- Bumped base container to use samtools 1.17 to prevent user reported segfault during minimap2_ubam process
- `--depth_intervals` will output a bedGraph file with entries for each genomic interval featuring homogeneous depth
- `--phase_methyl` will output haplotype-level methylation calls, using the HP tags added by whatshap in the wf-human-snp workflow
- `--sv_benchmark` will benchmark SV calls with Truvari
- The workflow uses the 'NIST_SVs_Integration_v0.6' truth set and benchmarking should be carried out on HG002 data.
- Added coverage barrier to wf-human-variation
- When the SNP and SV subworkflows are both selected, the workflow will use the results of the SV subworkflow to refine the SNP calls
  - This new behaviour can be disabled with `--skip_refine_snp_with_sv`
- Updated to Oxford Nanopore Technologies PLC. Public License
- Workflow requires target regions to have 20x mean coverage to permit analysis
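One way to picture the coverage requirement is as a filter over per-region mean depth. The sketch below assumes the threshold is applied region by region and that mean depths are already available as `(chrom, start, end, mean_depth)` tuples; the function name and the input layout are hypothetical, and the changelog does not say whether the workflow checks regions individually or in aggregate.

```python
from typing import List, Tuple

# chrom, start, end, mean depth (hypothetical layout for this example)
Region = Tuple[str, int, int, float]


def regions_passing_coverage(
    regions: List[Region], min_coverage: float = 20.0
) -> List[Region]:
    """Keep only the regions whose mean depth meets the minimum coverage."""
    return [region for region in regions if region[3] >= min_coverage]


if __name__ == "__main__":
    example = [
        ("chr1", 0, 1000, 32.5),
        ("chr2", 0, 1000, 8.1),
        ("chrX", 0, 1000, 20.0),
    ]
    for chrom, start, end, depth in regions_passing_coverage(example):
        print(chrom, start, end, depth)
```

The default of 20 mirrors the 20x figure in the entry above; a region exactly at the threshold is kept in this sketch, which is also an assumption.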
- `report_sv.py` now handles empty VCF files
- `report_str.py` now handles mono-allelic STR
- WhatsHap processes do not block for long time with CRAM inputs due to missing REF_PATH
- filterCalls failure when input BED has more than three columns
- get_genome assumed STR subworkflow was always enabled, preventing CNV analysis with hg19
- "genome build (...) is not compatible with this workflow" was incorrectly thrown when the workflow stopped before calling get_genome
- `--str` enables STR genotyping using Straglr (compatible with genome build 38 only)
- Minor performance improvement when checking for empty VCF in aggregate pileup steps
- `--cnv` enables CNV calling using QDNAseq
- Updated Dorado container image to use Dorado v0.1.1
- Latest models are now v4.0.0
- Workflow prints a more helpful error when Dorado fails due to unknown model name
- Updated wf-human-snp container image to load new Clair3 models for v4 basecalling
  - Default `basecaller_cfg` set to [email protected]
- `--basecaller_args` may be used to provide custom arguments to the basecalling process
- Updated description in manifest
- `nextflow run epi2me-labs/wf-human-variation --version` will now print the workflow version number and exit
- `--modbam2bed_args` can be used to further configure the wf-methylation `modbam2bed` process
- `modbam2bed` outputs are now prefixed with `<sample_name>.methyl`
- `--basecall_cfg` is now required by the SNP calling subworkflow to automatically pick a suitable Clair3 model
  - Users no longer have to provide `--model` for the SNP calling subworkflow
- Tidied workflow parameter schema
  - Some advanced options that are primarily used for benchmarking are now hidden but can be listed with `--help --show_hidden_params`
- wf-basecalling subworkflow now separates reads into pass and fail CRAMs based on mean qscore
- The workflow will index and align both the pass and fail reads and provide a CRAM for each in the output directory
- Only pass reads are used for downstream variant calling
- Updated wf-human-variation-snp container to use Sniffles v2.0.7
- `-profile conda` is no longer supported, users should use `-profile standard` (Docker) or `-profile singularity` instead
- `--report_name` is no longer required and reports will be prefixed with `--sample_name` instead
- Workflow will exit with "No files match pattern" if no suitable files are found to basecall
  - Ensure to set `--dorado_ext` to `fast5` or `pod5` as appropriate
- JBrowse2 configuration failed to load AlignmentTrack for CRAM output
- Workflow will now output a JBrowse2 `jbrowse.json` configuration
- Workflow reports will be visible in the Nextflow Tower reports tab
- wf-basecalling 0.0.1 has been integrated to wf-human-variation
- Basecalling is now conducted with Dorado
- Basecalling options have changed, users are advised to check the basecalling options in `--help` for guidance
- GPU accelerated processes can now have their process directives generically modified in downstream user profiles using `withLabel:gpu`
- `mapula` will no longer run by default due to performance issues on large data sets and must be enabled with `--mapula`
- This step will be deprecated and replaced with an equivalent in a future release
- uBAM input no longer requires an index
- CRAM is written to the output directory in all cases where alignment was completed
- `check_for_alignment` did not support uBAM
- Set `PYTHONNOUSERSITE` to reduce risk of environment bleed
- Experimental guppy subworkflow
  - We do not provide a Docker container for Guppy at this time, users will need to override `process.withLabel:wf_guppy.container`
- `--ubam` option removed, users can now pass unaligned or aligned BAM (or CRAM) to `--bam`
  - If the input BAM is aligned and the provided `--ref` does not match the SQ lines (exact match to name and size) the file will be realigned to `--ref`
  - If the input CRAM is aligned and the provided `--ref` does not match the SQ lines (exact match to name and size) the file will be realigned to `--ref`, but will also require the old reference to be provided with `--old_ref` to avoid reference lookups
  - If the input is not aligned, the file will be aligned to `--ref`
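The three input cases above amount to a small decision rule, sketched below. This is a hedged illustration rather than the workflow's code: `plan_alignment` and its return labels are hypothetical, sequence dictionaries are modelled as plain name-to-length mappings (so ordering is ignored), and the "keep" outcome for a matching reference is implied by the entry rather than stated in it.

```python
from typing import Dict, Optional


def plan_alignment(
    is_aligned: bool,
    is_cram: bool,
    input_sq: Dict[str, int],
    ref_sq: Dict[str, int],
    old_ref: Optional[str] = None,
) -> str:
    """Decide what to do with an input file given its SQ lines and --ref.

    input_sq / ref_sq map sequence name -> length, standing in for the
    header SQ lines and the reference index respectively.
    """
    if not is_aligned:
        # Unaligned input is always aligned to --ref.
        return "align"
    if input_sq == ref_sq:
        # Exact match on every name and size: assumed to be usable as-is.
        return "keep"
    if is_cram and old_ref is None:
        # An aligned CRAM cannot be decoded without its original reference.
        raise ValueError("realigning an aligned CRAM requires --old_ref")
    return "realign"


if __name__ == "__main__":
    ref = {"chr1": 1000, "chr2": 800}
    print(plan_alignment(True, False, {"chr1": 1000, "chr2": 799}, ref))
```

A later entry in this file removes `old_ref`, so the CRAM branch applies only to the release that introduced this behaviour.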
- Chunks without variants will no longer terminate the workflow at the `create_candidates` step
- `mapula` used to generate basic alignment QC statistics in both CSV and JSON
- Updated Clair3 to 0.1.12 (build 6) which bundles Longphase 1.3 to enable CRAM support and small improvement to accuracy
- `mosdepth` artifacts are now written to the output directory
- Additionally outputs read counts at 1,10,20,30X coverage thresholds for each region in the input BAM (or each sequence of the reference if no BED is provided)
- Simplified `wf-human-sv` Docker image and conda definition
- Outdated conda environment definitions
- Docker based profiles no longer require an internet connection to fetch the `cram_cache` cache building script
- "No such property" when using the `minimap2_ubam` alignment step
- Slow performance on `minimap2_ubam` step when providing CRAM as `--ubam`
- Slow performance on `snp:readStats` process
- "Missing reference index" warning was unnecessary
- An experimental methylation subworkflow has been integrated, using `modbam2bed` to aggregate modified base counts (input BAM should have `MM` and `ML` tags), enable this with `--methyl`
- Workflow experimentally supports CRAM as input, uBAM input uses CRAM for intermediate files
- Reference FAI is now created if it does not exist, rather than raising an error
- `--sniffles_args` may be used to provide custom arguments to the `sniffles2` process
- Output files are uniformly prefixed with `--sample_name`
- Output alignment from `--ubam` is now CRAM formatted
- Existence of Clair3 model directory is checked before starting workflow
- `--GVCF` and `--include_all_ctgs` are correctly typed as booleans
- `--GVCF` now outputs GVCF files to the output directory as intended
- `--include_all_ctgs` no longer needs to be set as `--include_all_ctgs y`
- `--ubam_bam2fq_threads` and `--ubam_sort_threads` allow finer control over the resourcing of the alignment step
  - The number of CPU required for `minimap2_ubam` is sum(`ubam_bam2fq_threads`, `ubam_map_threads`, `ubam_sort_threads`)
- `--ubam_threads` is now `--ubam_map_threads` and only sets the threads for the mapping specifically
- `--ref` is now required by the schema, preventing obscure errors when it is not provided
- Print helpful warning if neither `--snp` or `--sv` have been specified
- Fastqingress metadata map
- Disable dag creation by default to suppress graphviz warning
- Tandem repeat BED is now correctly set by `--tr_bed`
- Update `fastcat` dependency to 0.4.11 to catch cases where input FASTQ cannot be read
- Sanitize fastq intermittent null object error
- Switched to "new flavour" CI, meaning that containers are released from workflows independently
- Ported wf-human-snp (v0.3.2) to modules/wf-human-snp
- Ported wf-human-sv (v0.1.0) to modules/wf-human-sv
- Initialised wf-human-variation from wf-template #195cab5