Dada2 MergePairs : consensus between merging & concatenating reads #799

weber8thomas · 2024-11-18T15:00:54Z

Authors

@weber8thomas - Thomas Weber (EMBL Heidelberg) - Technical implementation, method parameter configuration accessible from the pipeline config, testing, and ensuring compliance with nf-core requirements
@nhenry50 - Nicolas Henry (Station Biologique de Roscoff) - Concept & methods development, algorithm design, benchmarking
@lplanat - Laurine Planat (EMBL Heidelberg) - Testing, feedback provision, benchmarking
@ FloraVincent - Flora Vincent (EMBL Heidelberg) - Conceptualisation, project supervision

Context & description

This pull request enhances the dada2::mergePairs() process within the pipeline, enabling conditional merging or concatenation of sequences based on the overlap between forward and reverse reads. Previously, the pipeline supported only one of the two methods (merging or concatenating) for all paired-end reads, selected via the --concatenate_reads parameter. The new "consensus" method lets the pipeline choose between merging and concatenation based on the observed overlap, which is intended to improve sequence assembly accuracy and downstream analysis outcomes.

The core enhancement revolves around incorporating conditional logic to assess the overlap between paired-end reads and decide whether to merge or concatenate them. This decision is based on a specified overlap threshold, ensuring that only reads with adequate overlap are merged, while others are concatenated with a defined spacer.

Enhancement Highlights:

Dual Invocation of mergePairs:
- Merging: Invoked with justConcatenate = FALSE to attempt merging where possible.
- Concatenation: Invoked with justConcatenate = TRUE to concatenate reads where merging isn't feasible.
Overlap Threshold Calculation:
- Calculates a minimum overlap threshold (min_overlap_obs) based on accepted mergers.
- Uses the 0.1th percentile (quantile(min_overlap_obs, 0.001)) of these overlaps as the cutoff. Because this percentile sits at the very low end of the overlaps seen in accepted mergers, a rejected read pair is concatenated only when its overlap is shorter than that of nearly all accepted mergers; accepted mergers are kept as merged sequences.
Conditional Replacement:
- Iterates through each sample's mergers.
- Replaces non-accepted mergers with concatenated sequences if the overlap falls below the threshold.
- Filters out any non-accepted, non-concatenated sequences to maintain data integrity.
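
The three steps above can be sketched outside of R. The Python snippet below is a minimal illustration under stated assumptions, not the pipeline's code: the Pair record, its fields, the helper names, and the linear-interpolation quantile are all hypothetical choices made for the example. It derives the cutoff from the overlaps of accepted mergers, keeps accepted mergers as merged, concatenates rejected pairs whose overlap is below the cutoff, and drops the remaining rejected pairs.

```python
from dataclasses import dataclass
from math import floor


@dataclass
class Pair:
    """One denoised read pair (hypothetical record, not a dada2 structure)."""
    name: str
    overlap: int   # observed overlap length in the merge attempt
    accept: bool   # True if the merge attempt was accepted


def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    if not values:
        raise ValueError("no accepted mergers to derive a cutoff from")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lower = floor(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def consensus(pairs, q=0.001):
    """Return (name, action) for every pair that is kept."""
    cutoff = quantile([p.overlap for p in pairs if p.accept], q)
    kept = []
    for p in pairs:
        if p.accept:
            kept.append((p.name, "merged"))
        elif p.overlap < cutoff:
            kept.append((p.name, "concatenated"))
        # rejected pairs with overlap >= cutoff are dropped
    return kept


pairs = [
    Pair("a", 40, True),
    Pair("b", 35, True),
    Pair("c", 5, False),   # too little overlap to merge: concatenate
    Pair("d", 38, False),  # enough overlap but rejected: drop
]
result = consensus(pairs)
```

In this toy input, pair "c" ends up concatenated and pair "d" is filtered out, since its overlap is not below the cutoff derived from the accepted pairs.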

Parameters changed or introduced

concatenate_reads
- Options:
  - TRUE: Enable concatenation (already existing)
  - FALSE: Disable concatenation (already existing)
  - "consensus": Enables conditional merging or concatenation based on the overlap between reads.
minoverlap
- Description: Sets the minimum required overlap length for merging paired-end reads.
- Default Value: 12
- Usage: Determines the threshold below which reads will not be merged and may be concatenated instead.
maxmismatch
- Description: Defines the maximum allowed mismatches during the merging process.
- Default Value: 0
- Usage: Controls the stringency of the merging criteria by limiting the number of mismatches permitted between overlapping regions.
gap
- Description: Specifies the gap penalty used during the alignment process in merging.
- Default Value: -64
- Usage: Influences the alignment algorithm's handling of gaps, affecting the quality of merged sequences.
match
- Description: Sets the match score for the alignment algorithm.
- Default Value: 1
- Usage: Determines the scoring for matching bases during the alignment, impacting the alignment sensitivity.
mismatch
- Description: Sets the mismatch penalty for the alignment algorithm.
- Default Value: -64
- Usage: Determines the penalty for mismatched bases during alignment, affecting the alignment specificity.
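
To give a feel for how strict the default scores are, here is a rough Python illustration. It assumes a simplified, gap-free model in which the alignment score is a plain sum of per-position scores; the function name and the model are assumptions for this example and do not reproduce dada2's aligner.

```python
def overlap_score(n_match, n_mismatch, match=1, mismatch=-64):
    """Score of a gap-free overlap as a plain sum of per-position scores."""
    return n_match * match + n_mismatch * mismatch


perfect = overlap_score(n_match=50, n_mismatch=0)       # 50 * 1
one_mismatch = overlap_score(n_match=49, n_mismatch=1)  # 49 * 1 - 64
```

Under this simplified model a single mismatch cancels the contribution of 64 matching positions, which fits a default maxmismatch of 0.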

The consensus approach is particularly useful for datasets containing both prokaryotic and eukaryotic sequences, where amplicon lengths vary and the overlap between forward and reverse reads differs accordingly.

PR checklist


d4straub

Dear Thomas and colleagues,

that seems like a great addition! Thanks a lot for working on it and opening a PR!

It seems like you did some benchmarking, would you have a benchmarking dataset that could be used for routine CI tests (maybe after downsampling)?

I have a few reservations about the implementation though:

- The history seems to be messed up; maybe this is based on an older fork/branch? This shows in missing parameters and therefore failing CI tests. It also blocks me from testing the code myself. This is a breaking issue.
- The new parameters are very generic for sequence comparisons; maybe it would be good to name them more specifically, e.g. with a prefix.
- A little more documentation would be great.

More details on these points below.

d4straub · 2024-11-19T09:43:17Z

nextflow_schema.json

+                },
+                "match": {
+                    "type": "integer",
+                    "description": "",
+                    "help_text": ""
+                },
+                "mismatch": {
+                    "type": "integer",
+                    "description": "",
+                    "help_text": ""
+                },
+                "gap": {
+                    "type": "integer",
+                    "description": "",
+                    "help_text": ""
+                },
+                "minoverlap": {
+                    "type": "integer",
+                    "description": "",
+                    "help_text": ""
+                },
+                "maxmismatch": {
+                    "type": "integer",
+                    "description": "",
+                    "help_text": ""


A description for each parameter would be great.

I am not sure about the parameter names; they are very generic and could apply to any sequence comparison. Some might also fit taxonomic classification or future additions. So I think it would be helpful to prepend a prefix, e.g. asv_ or similar (and if so, also prepend it to concatenate_reads)?

d4straub · 2024-11-19T09:46:44Z

nextflow.config

The changes here seem unnecessarily extensive; at least save_intermediates is lost, and the ancombc params are also missing. I assume this comes from outdated files? Please revert and add only the new parameters.

d4straub · 2024-11-19T09:47:12Z

conf/modules.config

+
+

Suggested change

d4straub · 2024-11-19T09:47:38Z

CHANGELOG.md

That would need an update with a table of new params

d4straub · 2024-11-19T09:50:05Z

assets/multiqc_config.yml

That should not be changed, maybe something messed with the history?

d4straub · 2024-11-19T09:54:28Z

modules/local/dada2_denoising.nf

+
+            min_overlap_obs <- Reduce(c, min_overlap_obs)
+            min_overlap_obs <- min_overlap_obs[!is.na(min_overlap_obs)]
+            min_overlap_obs <- quantile(min_overlap_obs, 0.001)


Why is 0.001 hardcoded? Would it be beneficial to have that as a param or, if not, accessible via the config, i.e. via conf/modules.config?
Edit: maybe as def args3 = task.ext.args3 ?: '0.001'?
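
The fallback pattern suggested here (use a configured value if one is set, otherwise '0.001') can be illustrated in Python. The function name and the range check are assumptions for this sketch; the actual change would live in the Nextflow module.

```python
def resolve_cutoff(configured=None, default="0.001"):
    """Return the quantile cutoff as a float, falling back to the default."""
    raw = configured or default  # same idea as Groovy's `?:` operator
    value = float(raw)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"quantile cutoff must be within [0, 1], got {value}")
    return value


default_cutoff = resolve_cutoff()       # nothing configured
custom_cutoff = resolve_cutoff("0.01")  # value supplied via config
```

Validating the range up front gives a clear error at startup rather than a failure inside the quantile call.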

d4straub · 2024-11-19T09:55:10Z

modules/local/dada2_denoising.nf

+            }
+
+            # define the overlap threshold to decide if concatenation or not
+


For me, there are a few too many empty lines in there; could you remove some?


weber8thomas · 2024-11-20T13:24:06Z

Switched to #803 because the git history got messed up.

d4straub and others added 12 commits April 3, 2024 12:24

- 717abb8 Merge pull request nf-core#725 from nf-core/dev (Release 2.9.0)
- 3f40a1b Merge pull request nf-core#755 from nf-core/dev (Release 2.10.0)
- 0473e15 Merge pull request nf-core#771 from nf-core/dev (Release 2.11.0)
- 8f139ce Release 2.12.0
- 66652f6 bump version to 2.13.0dev
- 1d6196d add consensus strategy
- 245e469 increase min number of reads consensus 1000
- a7b47d5 fix one sample/run bug
- af82b3f fix consensus strategy
- 274c0c9 chore: add config parameters for dada2::denoising
- 821628a chore: add config parameters for dada2::denoising
- 3393dc5 chore: Update nextflow_schema.json to display changed parameters at startup

weber8thomas changed the title from "V0.0.3 params" to "Dada2 MergePairs : consensus between merging & concatenating reads" Nov 18, 2024

d4straub requested changes Nov 19, 2024


weber8thomas and others added 4 commits November 20, 2024 10:40

- df26046 schema update ; add asv_ prefix to parameters ; add asv_percentile_cutoff parameter
- 1024dff update qiime2 2023.7 to 2024.10
- 23f39da add note to about QIIME2 2024.10 in ref_databases.config
- ce6cd0a update CHANGELOG

weber8thomas closed this Nov 20, 2024

weber8thomas deleted the v0.0.3-params branch November 20, 2024 13:24
