More flexible model and contrast definition #362

Open · Tracked by #385 · grst opened this issue Dec 2, 2024 · 11 comments
Labels: enhancement (New feature or request)

grst (Member) commented Dec 2, 2024

Description of feature

Currently, models and contrasts to compare are defined in the contrast.csv file as follows:

id,variable,reference,target,blocking

This limits contrasts to pairwise comparisons of a single variable, and the model definition is implicit, based on the variable and blocking variables. More complex comparisons, such as interaction terms or mixed effects, are not possible, or require workarounds such as constructing artificial variables that combine multiple variables.

I propose to make the model definition more flexible by

  • Explicitly specifying the model using a Wilkinson formula, e.g. ~ treatment + response (variable plus covariate) or ~ treatment * response (interaction model). This would also extend to mixed-effects models, should they be added in the future, e.g. ~ treatment * timepoint + (1 | patient_id).
  • Defining one or multiple contrasts per model, so that multiple tests can be performed for a single model. This is also computationally more efficient, since the model only needs to be fitted once.

How to define contrasts

There are different ways to define contrasts beyond simple comparisons. We need to discuss which ones to support and how.

  1. (variable, baseline, target) tuples -> I'd definitely keep this one way or another, as it covers a lot of cases and is very intuitive to define. However, we still need alternatives for more complex cases.
  2. Specifying coefficient names, e.g. treatmenta:responseresponder
  3. Contrast formula, e.g. responseresponder - responsenon_responder, or cond(response="responder") - cond(response="non_responder")
  4. Contrast vector (the most general case), e.g. (0, -1, 1, 0, 0, 0, 0)

The challenge for options 2 and 3 is to define them in a way that's method-agnostic. We'd likely need a step that converts a coefficient name or contrast formula into a contrast vector. One such way is limma's makeContrasts function; another is the cond() method implemented in glmGamPoi:

cond(response="responder", treatment="A") - cond(response="non_responder", treatment="B")

If anyone's aware of other alternatives, I'm happy to consider them.
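One way to make options 2 and 3 method-agnostic is to resolve everything down to the most general form, the contrast vector (option 4). As a rough sketch of the idea only (plain Python; the function name and coding assumptions are illustrative, not an existing API), a simple (variable, baseline, target) tuple could be resolved like this:

```python
def simple_contrast_vector(coefficients, variable, baseline, target):
    """Turn a (variable, baseline, target) tuple into a contrast vector.

    Assumes R-style treatment coding: coefficient names concatenate
    variable and level (e.g. 'treatmentB'), and the baseline level of
    each factor is absorbed into the intercept.
    """
    vec = [0] * len(coefficients)
    target_coef = f"{variable}{target}"
    baseline_coef = f"{variable}{baseline}"
    if target_coef not in coefficients and baseline_coef not in coefficients:
        raise ValueError(f"no coefficient found for {variable}: {target} vs {baseline}")
    if target_coef in coefficients:
        vec[coefficients.index(target_coef)] = 1
    # the baseline coefficient only exists if it is not the reference level
    if baseline_coef in coefficients:
        vec[coefficients.index(baseline_coef)] = -1
    return vec

coefs = ["(Intercept)", "treatmentB", "responseresponder"]
print(simple_contrast_vector(coefs, "treatment", "A", "B"))  # → [0, 1, 0]
print(simple_contrast_vector(coefs, "response", "non_responder", "responder"))  # → [0, 0, 1]
```

The contrast vector would then be the common denominator that each method-specific module consumes, regardless of how the user wrote the contrast down.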

Configuration format

In principle it would still be possible to use a CSV file, e.g.

| id | formula | contrast | comment |
|----|---------|----------|---------|
| treatment_a_vs_b | ~ treatment + response | treatment;a;b | simple comparison |
| response_responder_vs_non_responder | ~ treatment + response | response;responder;non_responder | same model, different contrast |
| response_responder_vs_non_responder2 | ~ treatment + response | responseresponder - responsenon_responder | same comparison, different way to specify the contrast |
| treatment_response_interaction | ~ treatment * response | treatmenta:responseresponder | listing a coefficient rather than a comparison |

Alternatively, I could see benefits in switching to JSON/YAML with an appropriate JSON schema for validation. IMO multiple contrasts per model and the different ways of specifying contrasts would be better represented by a hierarchical configuration format than by CSV. Having ;-separated fields within a CSV column, as is currently implemented for blocking, is a bit of a red flag, as it's hard to read and hard to validate.

models:
  - formula: "~ treatment + response"
    contrasts: 
      - id: treatment_a_vs_b
        type: simple
        comparison: ["treatment", "A", "B"]
      - id: response_responder_vs_non_responder2
        type: formula
        comparison: responseresponder - responsenon_responder
      - id: response_responder_vs_non_responder3
        type: cond
        comparison: 'cond(response="responder") - cond(response="non_responder")'
  - formula: "~ treatment * response"
    contrasts:
      - id: treatment_response_interaction
        type: coefficient
        comparison: treatmenta:responseresponder

LMK what you think

CC @apeltzer @tschwarzl @nschcolnicov @atrigila @alanmmobbs93
FYI @suzannejin

pinin4fjords (Member) commented:

Thanks for the issue! Feedback:

  • The YAML file seems like a no-brainer to me; I think that would work well. It might be a good place to start, providing a basis for further work.
  • Support for model types beyond what is currently possible can only be a good thing, and I would support changes to the interface necessary to accomplish that.

But:

  • I would not be in favour of supporting multiple ways of specifying the same things for the sake of it, only where necessary for the new use cases. We should pick the simplest extension to (or if necessary, replacement of) the existing interface (in YML form).
  • I've been a little worried about the suggestion of explicit model specification in the past, just because it would be easy for a user to supply a model incompatible with a sample sheet, for example. So if we allow explicit model input, we also need model validation that explains to users what's wrong early on.
    • A model validation step would actually be useful even now, since people e.g. try to use variables as batch effects when they are associated with the main contrast variables.
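Two such early checks could look like this (plain Python sketch; all names are hypothetical, not an existing implementation): verify that every variable in the formula exists in the sample sheet, and flag a covariate that is perfectly nested within the main contrast variable, which would make the design matrix rank-deficient:

```python
import re

def validate_model(formula, samples, main_variable=None):
    """samples: list of dicts, one per sample-sheet row.

    Variable extraction is deliberately naive (identifier regex over the
    formula); a real implementation would parse Wilkinson syntax properly.
    """
    errors = []
    variables = set(re.findall(r"[A-Za-z_]\w*", formula))
    for var in sorted(variables):
        if any(var not in row for row in samples):
            errors.append(f"variable '{var}' is missing from the sample sheet")
    if main_variable:
        # A covariate whose levels each occur in only one level of the main
        # variable is confounded with it (e.g. batch nested in treatment).
        for var in sorted(variables - {main_variable}):
            if all(var in row and main_variable in row for row in samples):
                levels = {}
                for row in samples:
                    levels.setdefault(row[var], set()).add(row[main_variable])
                if levels and all(len(v) == 1 for v in levels.values()):
                    errors.append(f"'{var}' is confounded with '{main_variable}'")
    return errors

samples = [
    {"treatment": "A", "batch": "b1"},
    {"treatment": "A", "batch": "b1"},
    {"treatment": "B", "batch": "b2"},
]
print(validate_model("~ treatment + batch", samples, main_variable="treatment"))
# → ["'batch' is confounded with 'treatment'"]
```

Both failures would otherwise only surface later as an opaque rank-deficiency error from the fitting method.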

I'd suggest a couple of POC PRs, and going from there.

grst (Member, Author) commented Dec 2, 2024

> Support for model types beyond what is currently possible can only be a good thing, and I would support changes to the interface necessary to accomplish that.

I agree, we should settle on one, or at most two, options. My point was mainly to provide an overview of what would be possible.

> I've been a little worried about the suggestion of explicit model specification in the past, just because it would be easy for a user to supply a model incompatible with a sample sheet (for example). So if we allow explicit model input we also need to take care to have validation for models to explain to users what's wrong early on.
>
> A model validation step would actually be useful even now, since people e.g. try to use variables as batch effects when they are associated with main contrast variables.

What kind of validation steps do you envisage here? Wouldn't such a model fail anyway because the design matrix is not full rank?


From your response, I deduce the following follow-up tasks:

pinin4fjords (Member) commented Dec 2, 2024

> What kind of validation steps do you envisage here? Wouldn't such a model fail anyway because the design matrix is not full rank?

It's just that those failures are hard for users to interpret. We should start building up a set of validation checks to make debugging easier. For example, specifying a model with variables that don't exist in the sample sheet should produce an error telling the user that.

There's already a validation step that e.g. checks that the contrasts are compatible with the sample sheet; we could maybe add model checks there as well, to avoid adding a new step: https://github.com/pinin4fjords/shinyngs/blob/develop/exec/validate_fom_components.R

suzannejin commented:

Hello @grst! Thank you for proposing this! It looks good to me.
Just one comment... is it possible to do something like this if required?

- id: treatment_a_vs_b_vs_c
  type: simple
  comparison: ["treatment", "A", "B", "C"]

This is because the method we are working on (i.e. propd) can work with multiple conditions (even though this is currently not allowed in the module, so that it stays consistent with the current contrast implementation).

nschcolnicov self-assigned this Dec 4, 2024
grst (Member, Author) commented Dec 4, 2024

Hi @suzannejin,

I'll post a separate issue to discuss the contrast specification in more detail, and we can certainly take it up there. Potentially we also need to accommodate different ways of specifying contrasts for different methods.

About your specific example

 ["treatment", "A", "B", "C"]

what would that mean? Compare both B and C separately to the baseline A? Or all against all?

suzannejin commented Dec 4, 2024

@grst Sure! The flexibility to accommodate different ways could be nice.

The specific case I was mentioning above is more about all against all. I just wanted to mention it as a potential option to consider; it's not so important to have it for the moment, though, as it asks a slightly different question: is a gene changing across the different conditions?

grst (Member, Author) commented Dec 4, 2024

> as it asks a slightly different question: is a gene changing across the different conditions.

so basically the ANOVA case?

suzannejin commented:

> > as it asks a slightly different question: is a gene changing across the different conditions.
>
> so basically the ANOVA case?

yes exactly!

grst (Member, Author) commented Dec 10, 2024

Suggested way forward:

  1. YAML-based contrast sheet MVP -> just switch to YAML + JSON schema, no other changes (POC contrasts csv -> yaml #382)
  2. Model validation MVP (Model validation #371)
  3. Formula MVP: switch to a formula instead of a blocking factor (no additional functionality, just the switch to a formula)

Then work on additional contrast types (#377).
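For the formula MVP, the current variable + blocking columns map mechanically onto a formula; a hypothetical sketch of that translation (illustrative helper, not the pipeline's actual code):

```python
def blocking_to_formula(variable, blocking=()):
    """Translate the current implicit model (contrast variable plus
    optional blocking variables) into an explicit Wilkinson formula."""
    terms = [variable] + [b for b in blocking if b]  # skip empty blocking fields
    return "~ " + " + ".join(terms)

print(blocking_to_formula("treatment"))             # → ~ treatment
print(blocking_to_formula("treatment", ["batch"]))  # → ~ treatment + batch
```

This keeps existing contrast sheets expressible unchanged while opening the door to interaction and mixed-effects terms later.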

tschwarzl commented:

> > > as it asks a slightly different question: is a gene changing across the different conditions.
> >
> > so basically the ANOVA case?
>
> yes exactly!

Those use cases would be a very powerful addition. However, I would like to highlight that the handling and interpretation of log2 fold changes need careful examination. Typically, the significance estimate (p-value) is used for analysis in these cases. To ensure interpretable log2 fold changes, it is often necessary to perform an additional pairwise test. This consideration applies to use cases modelled with both full and reduced models, including the example mentioned above.

grst (Member, Author) commented Dec 20, 2024

@nschcolnicov @alanmmobbs93, to move this forward over the Christmas break:

grst mentioned this issue Jan 21, 2025
Status: ToDo - high priority
5 participants