YAML-based contrast definition #370

Open
grst opened this issue Dec 3, 2024 · 14 comments
Assignees
Labels
enhancement New feature or request

Comments

@grst
Member

grst commented Dec 3, 2024

Description of feature

In #362, we agreed that a YAML-based contrast sheet is more flexible than a CSV-based one and is better suited to future developments of the pipeline. The aim of this issue is to agree on a specification for this format, which will then be expressed as a JSON Schema for validation.

A very minimal version for feature parity with the current contrasts.csv could look something like

models:
  - formula: "~ treatment + response"
    contrasts:
      - id: treatment_A_vs_B
        comparison: ["treatment", "A", "B"]
      - id: response_responder_vs_non_responder
        comparison: ["response", "responder", "non_responder"]

How exactly to specify contrasts will be the topic of a separate issue.
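
To make the minimal format concrete, here is a rough sketch of the checks a validator could apply once the YAML has been parsed into plain Python dicts and lists. The function name `validate_models` and the specific rules (formula starts with `~`, `comparison` has exactly three strings, ids are unique) are assumptions for illustration, not an agreed specification.

```python
def validate_models(doc):
    """Return a list of problems found; an empty list means the document passed.

    Assumes `doc` is the already-parsed YAML (dicts, lists, strings).
    """
    models = doc.get("models")
    if not isinstance(models, list) or not models:
        return ["'models' must be a non-empty list"]

    errors = []
    seen_ids = set()
    for m_idx, model in enumerate(models):
        where = f"models[{m_idx}]"
        formula = model.get("formula")
        if not isinstance(formula, str) or not formula.lstrip().startswith("~"):
            errors.append(f"{where}: 'formula' must be a string starting with '~'")

        contrasts = model.get("contrasts")
        if not isinstance(contrasts, list) or not contrasts:
            errors.append(f"{where}: 'contrasts' must be a non-empty list")
            continue

        for c_idx, contrast in enumerate(contrasts):
            cwhere = f"{where}.contrasts[{c_idx}]"
            cid = contrast.get("id")
            if not isinstance(cid, str) or not cid:
                errors.append(f"{cwhere}: 'id' must be a non-empty string")
            elif cid in seen_ids:
                errors.append(f"{cwhere}: duplicate id '{cid}'")
            else:
                seen_ids.add(cid)

            comparison = contrast.get("comparison")
            is_triple = (
                isinstance(comparison, list)
                and len(comparison) == 3
                and all(isinstance(item, str) for item in comparison)
            )
            if not is_triple:
                errors.append(f"{cwhere}: 'comparison' must be a list of three strings")
    return errors


# The minimal example from above, as it would look after YAML parsing.
EXAMPLE = {
    "models": [
        {
            "formula": "~ treatment + response",
            "contrasts": [
                {"id": "treatment_A_vs_B", "comparison": ["treatment", "A", "B"]},
                {
                    "id": "response_responder_vs_non_responder",
                    "comparison": ["response", "responder", "non_responder"],
                },
            ],
        }
    ]
}

print(validate_models(EXAMPLE))
```

In the pipeline itself these rules would more likely live in a JSON Schema; the sketch only spells out what such a schema would need to express.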

Beyond the minimal information, we could consider including additional parameters here:

models: 
  - formula: ...
    contrasts: ...
    gene_sets:
       - "MSigDB:HALLMARK" # could be queried from omnipath
       - "MSigDB:GO"
       - /path_to_custom_geneset.gmt
    method: DESeq2
    filtering:
      min_abundance: 0.125
      ...
    deseq2:
      lfc_threshold: 1.25
      ...

The question is really what information we want to allow setting at a method/contrast level, and what is fixed globally for the entire pipeline run.
Specifying information locally increases flexibility, but potentially leads to redundant information (parameters need to be repeated for different models) and increases pipeline complexity (more information has to be tracked through meta rather than read from global parameters).
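
One way to limit the redundancy would be to treat the global parameters as defaults and let each model override only what it needs. The sketch below illustrates that idea on plain dicts; `resolve_params`, the `min_samples` key and the global values are hypothetical, and the real pipeline would do this in Nextflow on the meta map rather than in Python.

```python
def resolve_params(global_params, model):
    """Overlay a model's local settings on top of the global defaults.

    Nested sections such as 'filtering' are merged key by key, so a model
    can override one value without repeating the whole section.
    """
    # Copy nested dicts so the global defaults are never mutated.
    resolved = {
        key: dict(value) if isinstance(value, dict) else value
        for key, value in global_params.items()
    }
    for key, value in model.items():
        if key in ("formula", "contrasts"):
            continue  # structural keys, not parameters
        if isinstance(value, dict) and isinstance(resolved.get(key), dict):
            resolved[key] = {**resolved[key], **value}
        else:
            resolved[key] = value
    return resolved


# Hypothetical global defaults for the whole run.
GLOBAL_PARAMS = {
    "method": "DESeq2",
    "filtering": {"min_abundance": 1.0, "min_samples": 1},
}

# A model that overrides a single filtering value locally.
MODEL = {
    "formula": "~ treatment + response",
    "contrasts": [],
    "filtering": {"min_abundance": 0.125},
}

RESOLVED = resolve_params(GLOBAL_PARAMS, MODEL)
print(RESOLVED)
```

With this kind of layering, a model that sets nothing locally behaves exactly like today's globally configured run.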

CC @apeltzer @tschwarzl @atrigila @nschcolnicov @alanmmobbs93 @suzannejin

@grst grst added the enhancement New feature or request label Dec 3, 2024
@grst
Member Author

grst commented Dec 3, 2024

Taking one step back, a first PR could replace contrasts.csv with contrasts.yaml without implementing any new features, such as formula.
In that case the yaml would look somewhat like

contrasts:
      - id: treatment_A_vs_B
        comparison: ["treatment", "A", "B"]
        blocking_factors: ["response"]

It would be a good first step, as it already provides the building blocks (e.g. JSON validation) for future additions.

@nschcolnicov nschcolnicov self-assigned this Dec 9, 2024
@nschcolnicov

@grst Created this POC for just making the transition from contrasts.csv to contrasts.yaml: #382

Tested it using the test_maxquant profile, I provided the following yaml file:

  • Old contrasts.csv file:
id,variable,reference,target,blocking
genotype_celltype_t1_t2,Celltype,T1,T2,
genotype_celltype_t1_FoB,Celltype,T1,FoB,
genotype_celltype_t1_MZ_fakeBatch,Celltype,T1,MZ,fakeBatch
fakebatch_fakeBatch_b1_b2,fakeBatch,b1,b2,
  • MaxQuant_contrasts.yaml file
contrasts:
  - id: genotype_celltype_t1_t2
    comparison: ["Celltype", "T1", "T2"]

  - id: genotype_celltype_t1_FoB
    comparison: ["Celltype", "T1", "FoB"]

  - id: genotype_celltype_t1_MZ_fakeBatch
    comparison: ["Celltype", "T1", "MZ"]
    blocking_factors: ["fakeBatch"]

  - id: fakebatch_fakeBatch_b1_b2
    comparison: ["fakeBatch", "b1", "b2"]
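
The mapping between the two files above is mechanical, so it could be scripted. The following is a minimal sketch using only the Python standard library; the function names are made up for illustration, and it assumes that multiple blocking variables in the CSV are separated by semicolons and that ids are simple identifiers that need no YAML quoting.

```python
import csv
import io
import json

OLD_CSV = """\
id,variable,reference,target,blocking
genotype_celltype_t1_t2,Celltype,T1,T2,
genotype_celltype_t1_FoB,Celltype,T1,FoB,
genotype_celltype_t1_MZ_fakeBatch,Celltype,T1,MZ,fakeBatch
fakebatch_fakeBatch_b1_b2,fakeBatch,b1,b2,
"""


def csv_to_contrasts(csv_text):
    """Convert contrasts CSV text into the structure used by the YAML file."""
    contrasts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = {
            "id": row["id"],
            "comparison": [row["variable"], row["reference"], row["target"]],
        }
        # Assumption: several blocking variables are separated by ';'.
        blocking = [b.strip() for b in (row.get("blocking") or "").split(";") if b.strip()]
        if blocking:
            entry["blocking_factors"] = blocking
        contrasts.append(entry)
    return {"contrasts": contrasts}


def to_yaml(doc):
    """Render the structure as YAML text without a YAML library.

    Lists are written with json.dumps, which produces a valid YAML flow sequence.
    """
    lines = ["contrasts:"]
    for entry in doc["contrasts"]:
        lines.append(f"  - id: {entry['id']}")
        lines.append("    comparison: " + json.dumps(entry["comparison"]))
        if "blocking_factors" in entry:
            lines.append("    blocking_factors: " + json.dumps(entry["blocking_factors"]))
    return "\n".join(lines) + "\n"


DOC = csv_to_contrasts(OLD_CSV)
print(to_yaml(DOC))
```

A real converter would use a YAML library for output; the hand-written emitter is only there to keep the sketch free of dependencies.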

Running it with
nextflow run ../../main.nf -profile docker,test_maxquant --outdir results -resume --contrasts MaxQuant_contrasts.yaml

This worked. I also ran the nf-test for this profile on this PR, passing it the new contrasts YAML file, and it passed without having to update the snapshots (screenshot not preserved).

Some caveats that I see:
The current validate_fom_components.R script and many of the custom functions it uses come from https://github.com/pinin4fjords/shinyngs, so we will have to either update that tool or move some of its functions into custom bin scripts. The VALIDATOR process is also an nf-core module, so we will have to update that as well.

@grst
Member Author

grst commented Dec 9, 2024

Thanks @nschcolnicov!

I think shinyngs and differentialabundance are deeply interwoven and both under the control of @pinin4fjords. So ultimately, the best way forward seems to be updating shinyngs rather than duplicating code into the bin folder.

As a next step, could you please create a JSON schema for contrasts.yaml and introduce some logic to validate it? nf-schema claims it can also work with YAML-based samplesheets. If this works, it might be the most elegant solution.
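
As a possible starting point for that schema, here is a draft written as a Python dictionary and dumped to JSON. It covers only the fields discussed so far (`id`, `comparison`, `blocking_factors`); the draft identifier, the `title`, and the choice to forbid unknown keys are assumptions, and whether nf-schema accepts a top-level object like this has not been verified.

```python
import json

# Draft schema for contrasts.yaml; every constraint here is a proposal.
CONTRASTS_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "contrasts.yaml (draft)",
    "type": "object",
    "required": ["contrasts"],
    "properties": {
        "contrasts": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "required": ["id", "comparison"],
                "additionalProperties": False,
                "properties": {
                    "id": {"type": "string", "minLength": 1},
                    # [variable, reference, target]
                    "comparison": {
                        "type": "array",
                        "items": {"type": "string"},
                        "minItems": 3,
                        "maxItems": 3,
                    },
                    "blocking_factors": {
                        "type": "array",
                        "items": {"type": "string"},
                    },
                },
            },
        }
    },
}

SCHEMA_JSON = json.dumps(CONTRASTS_SCHEMA, indent=2)
print(SCHEMA_JSON)
```

Uniqueness of `id` across contrasts cannot be expressed with these keywords alone and would need a separate check.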

@nschcolnicov

@grst Waiting on a fix to nf-schema plugin to get the yaml validation to work: nextflow-io/nf-schema#79

@pinin4fjords
Member

Thanks all! I'd like to avoid bin scripts and keep things associated with the modules.

Happy for anyone to make contributions to shinyngs though!

@nschcolnicov

nschcolnicov commented Dec 18, 2024

@grst @pinin4fjords
I created a POC for supporting yaml contrasts file: #382
It currently supports a yaml contrasts file that has this format:

contrasts:
  - id: genotype_celltype_t1_t2
    comparison: ["Celltype", "T1", "T2"]

  - id: genotype_celltype_t1_FoB
    comparison: ["Celltype", "T1", "FoB"]

  - id: genotype_celltype_t1_MZ_fakeBatch
    comparison: ["Celltype", "T1", "MZ"]
    blocking_factors: ["fakeBatch"]

  - id: fakebatch_fakeBatch_b1_b2
    comparison: ["fakeBatch", "b1", "b2"]

I also created an issue in the shinyngs package to include these changes: pinin4fjords/shinyngs#67
And I created a POC PR for the tool as well: pinin4fjords/shinyngs#68

Before proceeding with merging any of these PRs, we should align on exactly which format we would like the .yaml to have. @alanmmobbs93 proposed this format in this ticket #371 (comment):

models:
  - formula: "~ treatment"
    contrasts:
      - id: "treatment_mCherry_hND6"
        comparison: ["treatment", "mCherry", "hND6"]

      - id: "treatment_mCherry_hND6_sample_number"
        comparison: ["treatment", "mCherry", "hND6"]
        blocking_factors: ["sample_number"]

      - id: "treatment234"
        comparison: ["treatment", "mCherry", "hND6"]

I created this bin/ script to be able to test any changes to the validate_fom_components.R script from the shinyngs package: https://github.com/nf-core/differentialabundance/pull/382/files#diff-48cc6b0867b0868e90e7d5cd3e5b52ce4931590fe464e0f1314f9ba5eb972a5d

Once we have aligned on exactly what we want the YAML to look like, we can update the script in the PR and test the pipeline. Once that is done, we can proceed to update the shinyngs tool.

@grst
Member Author

grst commented Dec 19, 2024

I think we should first focus on the version without explicit model specification. The model is defined implicitly based on the comparison and blocking_factors.
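
To illustrate what "defined implicitly" could mean in practice, the sketch below derives a formula string from a single contrast entry. It assumes a purely additive model with the blocking factors listed before the contrast variable; the function name is hypothetical and the thread does not confirm that the pipeline orders terms this way.

```python
def implicit_formula(contrast):
    """Build an additive model formula from a contrast entry.

    Assumption: blocking factors come first, followed by the contrast variable.
    """
    variable = contrast["comparison"][0]
    terms = list(contrast.get("blocking_factors", [])) + [variable]
    return "~ " + " + ".join(terms)


WITH_BLOCKING = {
    "id": "genotype_celltype_t1_MZ_fakeBatch",
    "comparison": ["Celltype", "T1", "MZ"],
    "blocking_factors": ["fakeBatch"],
}
WITHOUT_BLOCKING = {
    "id": "genotype_celltype_t1_t2",
    "comparison": ["Celltype", "T1", "T2"],
}

print(implicit_formula(WITH_BLOCKING))     # ~ fakeBatch + Celltype
print(implicit_formula(WITHOUT_BLOCKING))  # ~ Celltype
```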

The format proposed by @alanmmobbs93 will be the next iteration: switching to an explicit model definition. But this will also require quite a few changes downstream in the pipeline, so in the interest of making the review process smoother for @pinin4fjords, I suggest separating these two steps.

EDIT: how are we doing with respect to the nf-schema issue?

@pinin4fjords
Member

Yep, agreed, always good to separate things that way

@nschcolnicov

  1. Moving forward with creating a local version of the "shinyngs/validatefomcomponents" module that will use a local bin/ script instead of the one coming from the tool.
  2. Add test profiles that use a yaml contrasts file in the nf-tests. Adding yaml contrasts file to test-datasets repository: nf-core/test-datasets@2a320ce
  3. Rebasing PR towards dev_tmp

@pinin4fjords
Member

  1. Moving forward with creating a local version of the "shinyngs/validatefomcomponents" module that will use a local bin/ script instead of the one coming from the tool.

I'd prefer we just updated it in shinyngs. I can do the release legwork etc.

@nschcolnicov

Hi @pinin4fjords!
The steps to do this would be the following:

  1. Merge the shinyngs PR. I opened one a few weeks ago in case we followed the approach you mention, and I'll add you as a reviewer: https://github.com/pinin4fjords/shinyngs/pull/68/files#diff-e75bd0106bcc8840c00f1505e58ddb3d251aa200b5316df7106a9d3183798561
  2. Release a new shinyngs version.
  3. Update the conda recipe so we can create wave containers from it.
  4. Create a shinyngs module PR in the modules repo.
  5. Finally, create a PR for differentialabundance to remove the bin script and update the module.

Keep in mind that we are looking into making more changes to the script in the near future, so we will likely need to repeat this process multiple times.
Because of the many steps involved and the number of PRs needed, I don't think this would be an efficient approach, and I would prefer to keep the custom bin script until we settle on a final version.
Is there a particular reason why you prefer having shinyngs updated at this stage of development?

@pinin4fjords
Member

Yeah, a couple of reasons:

  • The creation of parallel versions of the same script. That brings the risk of drift and divergence.
  • I just don't like the separation of module and code - that's why there isn't a bin dir in the workflow already.

Appreciate it's a pain development-wise, but it's nice for production. I've resisted allowing others to create new local components for the same reasons, so it wouldn't be fair for me to not object here as well. Hopefully we can bundle the changes on the shinyngs side so there aren't too many cycles of this.

Is your PR ready for review? I was watching it, but it's marked as draft currently.

@nschcolnicov

@pinin4fjords I see, OK, that makes sense then! Let me review it first: I just converted it into a PR and I already see an error in the CI tests. I'll address this issue and tag you once it's ready for review.

@pinin4fjords
Member

Thanks! I was OOO yesterday, but will take a look ASAP
