-
Notifications
You must be signed in to change notification settings - Fork 123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dada2 MergePairs : consensus between merging & concatenating reads #799
Conversation
Release 2.9.0
Release 2.10.0
Release 2.11.0
Release 2.12.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dear Thomas and colleagues,
that seems like a great addition! Thanks a lot for working on it and opening a PR!
It seems like you did some benchmarking, would you have a benchmarking dataset that could be used for routine CI tests (maybe after downsampling)?
I have a few reservations with the implementation though:
- The history seems to be messed up, maybe this is based on an older fork/branch? This shows in missing parameters and therefore failing CI tests. This also blocks me from testing the code myself. This is a breaking issue.
- The new parameters are very generic for sequence comparisons, maybe it would be good to name them more specifically, e.g. by a prefix
- A little more documentation would be great
More details to that points below.
nextflow_schema.json
Outdated
}, | ||
"match": { | ||
"type": "integer", | ||
"description": "", | ||
"help_text": "" | ||
}, | ||
"mismatch": { | ||
"type": "integer", | ||
"description": "", | ||
"help_text": "" | ||
}, | ||
"gap": { | ||
"type": "integer", | ||
"description": "", | ||
"help_text": "" | ||
}, | ||
"minoverlap": { | ||
"type": "integer", | ||
"description": "", | ||
"help_text": "" | ||
}, | ||
"maxmismatch": { | ||
"type": "integer", | ||
"description": "", | ||
"help_text": "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A description to each parameter would be great.
I am not sure about the parameter names, they are very generic for any sequence comparison. Some might fit also taxonomic classifications or future additions. So I think it would be helpful to prepend by a prefix, e.g. asv_
or such (and in case yes, also prepend concatenate_reads
)?
nextflow.config
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change here seems unnecessarily excessive, also at least save_intermediates
is just lost (ancombc params are also not there)? I assume they are from outdated files? Please revert and only add the new parameters.
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CHANGELOG.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would need an update with a table of new params
assets/multiqc_config.yml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should not be changed, maybe something messed with the history?
|
||
min_overlap_obs <- Reduce(c, min_overlap_obs) | ||
min_overlap_obs <- min_overlap_obs[!is.na(min_overlap_obs)] | ||
min_overlap_obs <- quantile(min_overlap_obs, 0.001) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is 0.001 hardcoded? Would it be beneficial to have that as a param or, if not, accessible via the config, i.e. via conf/modules.config
?
Edit: maybe as def args3 = task.ext.args3 ?: '0.001'
?
} | ||
|
||
# define the overlap threshold to decide if concatenation or not | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for me, there are some empty lines too much in there, could you remove some?
Switch to #803 due to git history messing |
Authors
Context & description
This pull request introduces an enhancement to the
DADA2::mergePairs()
process within the pipeline, enabling conditional merge or concatenation of sequences based on the overlap between forward and reverse reads. Previously, the pipeline allowed only to use either one of the two methods (merging or concatenating) of paired-end reads, via the--concatenate_reads
parameter. This enhancement introduces the "consensus" method, allowing the pipeline to dynamically determine the appropriate method—merging or concatenation—thereby improving sequence assembly accuracy and downstream analysis outcomes.The core enhancement revolves around incorporating conditional logic to assess the overlap between paired-end reads and decide whether to merge or concatenate them. This decision is based on a specified overlap threshold, ensuring that only reads with adequate overlap are merged, while others are concatenated with a defined spacer.
Enhancement Highlights:
justConcatenate = FALSE
to attempt merging where possible.justConcatenate = TRUE
to concatenate reads where merging isn't feasible.min_overlap_obs
) based on accepted mergers.quantile(min_overlap_obs, 0.001)
) to determine a stringent cutoff. This ensures that only read pairs with exceptionally high overlap are merged into consensus sequences, while those with insufficient overlap are concatenated, thereby maintaining sequence accuracy and integrity.Parameters changed or introduced
concatenate_reads
minoverlap
maxmismatch
gap
match
mismatch
This approach is particularly beneficial for datasets containing both prokaryotic and eukaryotic sequences, where lengths vary, leading to differing overlap extents.
PR checklist
nf-core pipelines lint
).nextflow run . -profile test,docker --outdir <OUTDIR>
).nextflow run . -profile debug,test,docker --outdir <OUTDIR>
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).