Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dada2 MergePairs : consensus between merging & concatenating reads #799

Closed
wants to merge 16 commits into from

Conversation

weber8thomas
Copy link

@weber8thomas weber8thomas commented Nov 18, 2024

Authors

  • @weber8thomas - Thomas Weber (EMBL Heidelberg)- Technical implementation, method parameter configuration accessible from the pipeline config, testing, and ensuring compliance with nf-core requirements.
  • @nhenry50 - Nicolas Henry (Station Biologique de Roscoff) - Concept & Methods development, algorithm design, benchmarking
  • @lplanat - Laurine Planat (EMBL Heidelberg)- Testing, feedback provision, benchmarking
  • @ FloraVincent - Flora Vincent (EMBL Heidelberg) - Conceptualisation, project supervision

Context & description

This pull request introduces an enhancement to the DADA2::mergePairs() process within the pipeline, enabling conditional merge or concatenation of sequences based on the overlap between forward and reverse reads. Previously, the pipeline allowed only to use either one of the two methods (merging or concatenating) of paired-end reads, via the --concatenate_reads parameter. This enhancement introduces the "consensus" method, allowing the pipeline to dynamically determine the appropriate method—merging or concatenation—thereby improving sequence assembly accuracy and downstream analysis outcomes.

The core enhancement revolves around incorporating conditional logic to assess the overlap between paired-end reads and decide whether to merge or concatenate them. This decision is based on a specified overlap threshold, ensuring that only reads with adequate overlap are merged, while others are concatenated with a defined spacer.

Enhancement Highlights:

  • Dual Invocation of mergePairs:
    • Merging: Invoked with justConcatenate = FALSE to attempt merging where possible.
    • Concatenation: Invoked with justConcatenate = TRUE to concatenate reads where merging isn't feasible.
  • Overlap Threshold Calculation:
    • Calculates a minimum overlap threshold (min_overlap_obs) based on accepted mergers.
    • Utilizes the 0.1th percentile (quantile(min_overlap_obs, 0.001)) to determine a stringent cutoff. This ensures that only read pairs with exceptionally high overlap are merged into consensus sequences, while those with insufficient overlap are concatenated, thereby maintaining sequence accuracy and integrity.
  • Conditional Replacement:
    • Iterates through each sample's mergers.
    • Replaces non-accepted mergers with concatenated sequences if the overlap falls below the threshold.
    • Filters out any non-accepted, non-concatenated sequences to maintain data integrity.

Parameters changed or introduced

  • concatenate_reads

    • Options:
      • TRUE: Enable concatenation (already existing)
      • FALSE: Disable concatenation (already existing)
      • "consensus": Enables conditional merging or concatenation based on the overlap between reads.
  • minoverlap

    • Description: Sets the minimum required overlap length for merging paired-end reads.
    • Default Value: 12
    • Usage: Determines the threshold below which reads will not be merged and may be concatenated instead.
  • maxmismatch

    • Description: Defines the maximum allowed mismatches during the merging process.
    • Default Value: 0
    • Usage: Controls the stringency of the merging criteria by limiting the number of mismatches permitted between overlapping regions.
  • gap

    • Description: Specifies the gap penalty used during the alignment process in merging.
    • Default Value: -64
    • Usage: Influences the alignment algorithm's handling of gaps, affecting the quality of merged sequences.
  • match

    • Description: Sets the match score for the alignment algorithm.
    • Default Value: 1
    • Usage: Determines the scoring for matching bases during the alignment, impacting the alignment sensitivity.
  • mismatch

    • Description: Sets the mismatch penalty for the alignment algorithm.
    • Default Value: -64
    • Usage: Determines the penalty for mismatched bases during alignment, affecting the alignment specificity.

This approach is particularly beneficial for datasets containing both prokaryotic and eukaryotic sequences, where lengths vary, leading to differing overlap extents.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/ampliseq branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@weber8thomas weber8thomas changed the title V0.0.3 params Dada2 MergePairs : consensus between merging & concatenating reads Nov 18, 2024
Copy link
Collaborator

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dear Thomas and colleagues,

that seems like a great addition! Thanks a lot for working on it and opening a PR!

It seems like you did some benchmarking, would you have a benchmarking dataset that could be used for routine CI tests (maybe after downsampling)?

I have a few reservations with the implementation though:

  • The history seems to be messed up, maybe this is based on an older fork/branch? This shows in missing parameters and therefore failing CI tests. This also blocks me from testing the code myself. This is a breaking issue.
  • The new parameters are very generic for sequence comparisons, maybe it would be good to name them more specifically, e.g. by a prefix
  • A little more documentation would be great

More details to that points below.

Comment on lines 270 to 294
},
"match": {
"type": "integer",
"description": "",
"help_text": ""
},
"mismatch": {
"type": "integer",
"description": "",
"help_text": ""
},
"gap": {
"type": "integer",
"description": "",
"help_text": ""
},
"minoverlap": {
"type": "integer",
"description": "",
"help_text": ""
},
"maxmismatch": {
"type": "integer",
"description": "",
"help_text": ""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A description to each parameter would be great.

I am not sure about the parameter names, they are very generic for any sequence comparison. Some might fit also taxonomic classifications or future additions. So I think it would be helpful to prepend by a prefix, e.g. asv_ or such (and in case yes, also prepend concatenate_reads)?

nextflow.config Outdated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change here seems unnecessarily excessive, also at least save_intermediates is just lost (ancombc params are also not there)? I assume they are from outdated files? Please revert and only add the new parameters.

Comment on lines +240 to +241


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

CHANGELOG.md Outdated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would need an update with a table of new params

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should not be changed, maybe something messed with the history?


min_overlap_obs <- Reduce(c, min_overlap_obs)
min_overlap_obs <- min_overlap_obs[!is.na(min_overlap_obs)]
min_overlap_obs <- quantile(min_overlap_obs, 0.001)
Copy link
Collaborator

@d4straub d4straub Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is 0.001 hardcoded? Would it be beneficial to have that as a param or, if not, accessible via the config, i.e. via conf/modules.config ?
Edit: maybe as def args3 = task.ext.args3 ?: '0.001'?

}

# define the overlap threshold to decide if concatenation or not

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for me, there are some empty lines too much in there, could you remove some?

@weber8thomas
Copy link
Author

Switch to #803 due to git history messing

@weber8thomas weber8thomas deleted the v0.0.3-params branch November 20, 2024 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants