Model validation #371

grst · 2024-12-03T13:20:06Z

Description of feature

This issue is to follow up on the request in #362 to implement model validation. The goal is to catch any errors in the model definition/contrast specification as early as possible for a fast feedback loop.

List of things to check (feel free to add items):

columns from design matrix are present in sample metadata
design matrix is full rank
contrasts are valid (i.e. specified coefficients occur as columns in the design matrix)
no NAs in any of the columns used in the design
warn about continuous covariates (I remember a case where numeric patient IDs were considered as a continuous covariate)
values of specified columns don't contain any special characters that the pipeline can't handle (e.g. PR/CR as a factor level fails downstream in clusterProfiler because of the /).
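Several of the checks listed above can be sketched in a few lines. The following is only an illustration in Python, not the pipeline's implementation: the function name `check_design_columns`, the `SAFE_VALUE` pattern, and the treatment of `""`/`"NA"` as missing are all assumptions made for the example.

```python
import re

# Assumed definition of a "safe" value: letters, digits, underscore, dot.
SAFE_VALUE = re.compile(r"^[A-Za-z0-9_.]+$")
# Assumed encodings of a missing value in the sample sheet.
MISSING = {"", "NA"}


def _is_number(text):
    """True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_design_columns(rows, design_columns):
    """Hypothetical helper: return (errors, warnings) for the sample metadata.

    rows: list of dicts, one per sample (e.g. from csv.DictReader).
    design_columns: names of the columns used in the design.
    """
    errors, warnings = [], []
    header = set(rows[0]) if rows else set()
    for col in design_columns:
        if col not in header:
            errors.append(f"column '{col}' is missing from the sample metadata")
            continue
        values = [(row[col] or "").strip() for row in rows]
        present = [v for v in values if v not in MISSING]
        if len(present) < len(values):
            errors.append(f"column '{col}' contains missing values")
        bad = sorted({v for v in present if not SAFE_VALUE.match(v)})
        if bad:
            errors.append(f"column '{col}' has unsupported characters in: {bad}")
        # Numeric-looking columns (e.g. patient IDs) only trigger a warning.
        if present and all(_is_number(v) for v in present):
            warnings.append(f"column '{col}' looks numeric and may be treated as continuous")
    return errors, warnings
```

Returning errors and warnings separately keeps hard failures (missing columns, NAs, unsupported characters) apart from hints such as the numeric patient ID case.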

Implementation

It is probably convenient to do this in R, but could we even do it in Groovy directly? Then errors would be instant and we wouldn't need to wait for a process to be fired up.

CC @apeltzer @tschwarzl @atrigila @alanmmobbs93 @nschcolnicov


alanmmobbs93 · 2024-12-18T14:45:20Z

POC
Create a validation step that cross-checks the YML (model and contrast information) against the sample sheet.

Features:

Inputs:

YML
Samplesheet
sample_id_col

Outputs:

Validated phenotypic table that contains only the columns that were required from the YML file
models.json that contains info about full rank models
Warnings JSON file, in case we'd like to report it in the workflow.

Functionalities:

All variables included in the YML file must be present in the sample sheet as columns. Variables are currently taken from the first component of the contrast definitions; a better solution would be to detect them from the formula, if we decide to keep it in the YML.
Blocking factors are also checked for existence.
All levels declared for the variables (extracted from contrast field) must be present in the column. If there are more levels in the samplesheet, they are reported with a warning.
Check for special characters.
Check for missing values.
Models are constructed with base R functions, and contrasts are generated to check whether the models are full rank. If they are not, warnings are generated. We can decide whether to report this at the Nextflow level or not.
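To illustrate the level check and the full-rank check described above, here is a minimal sketch. It is written in Python rather than R and is not the POC script: the function names are hypothetical, the design matrix uses plain treatment coding with the first sorted level as reference (an assumption), and the rank is computed by exact Gaussian elimination.

```python
from fractions import Fraction


def check_levels(values, declared_levels):
    """Return (missing, extra): declared levels absent from the column,
    and observed levels that were not declared."""
    observed = set(values)
    declared = set(declared_levels)
    return sorted(declared - observed), sorted(observed - declared)


def design_matrix(rows, factors):
    """Treatment-coded matrix: intercept plus one indicator per non-reference level."""
    columns = ["(Intercept)"]
    encoders = []
    for name in factors:
        levels = sorted({row[name] for row in rows})
        for level in levels[1:]:  # assumption: first sorted level is the reference
            columns.append(f"{name}{level}")
            encoders.append((name, level))
    matrix = [[1] + [int(row[n] == lv) for n, lv in encoders] for row in rows]
    return columns, matrix


def matrix_rank(matrix):
    """Rank via exact Gaussian elimination on fractions."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            scale = m[r][col] / m[rank][col]
            if scale:
                m[r] = [a - scale * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def is_full_rank(rows, factors):
    """True if the design matrix has as many independent columns as coefficients."""
    columns, matrix = design_matrix(rows, factors)
    return matrix_rank(matrix) == len(columns)
```

A design where treatment and batch are perfectly confounded fails this check, while the same data with treatment alone passes.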

Perspectives:

Easy to include more fields if they are added to the yml.

Testing

I'd like to test it with real cases, and I invite everyone to do the same, to check how flexibly it reads variables from the YML file and whether it finds real errors.
Some errors should already be caught by the YML validation that runs before this step.
A local module and a basic nf-test were also added to make future changes and comparisons during development easier.

Test

The following example files are part of the nf-test that can be executed as declared below. They were obtained from the pipeline's test profile.

nf-test test modules/local/validatemodel/tests/main.nf.test --debug --profile docker

Example YML
This mock YML file was derived from the reference contrast file. It is temporarily located in the tests/ folder of the module.

models:
  - formula: "~ treatment"
    contrasts:
      - id: "treatment_mCherry_hND6"
        comparison: ["treatment", "mCherry", "hND6"]

      - id: "treatment_mCherry_hND6_sample_number"
        comparison: ["treatment", "mCherry", "hND6"]
        blocking_factors: ["sample_number"]

      - id: "treatment234"
        comparison: ["treatment", "mCherry", "hND6"]

Note that I added the "formula" field, which is not in @nschcolnicov's POC for the YML validation. The script iterates over it and adds the blocking factors when required, but this can be adjusted if we want to remove the field. If we decide to keep the formula, it will simplify the comparison field by removing the first part.

Example sample sheet
Matching sample sheet can be found in:

https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/mus_musculus/rnaseq_expression/SRP254919.samplesheet.csv

Run script
Make the script executable and run it:

validate_model.R \
        --yml path/to/yml \
        --samplesheet path/to/samplesheet \
        --sample_id_col 'sample'

grst · 2024-12-19T10:44:13Z

> If we decide to keep the formula, it will simplify the comparison field by removing the first part.

Blocking factors and the formula are mutually exclusive. Ultimately, we want to get rid of the blocking factors and always specify the formula explicitly. Keeping the blocking factors was meant as an intermediate step to keep the changes to the pipeline atomic.
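The mutual exclusivity could be enforced with a small check like the one below. This is a sketch only: it assumes each contrast entry is parsed into a dict that may carry `formula` and `blocking_factors` keys, and the function name `check_exclusive_fields` is made up for the example.

```python
def check_exclusive_fields(contrast):
    """Hypothetical check: return an error message if both fields are set, else None.

    Assumes a contrast entry parsed from YAML into a dict with optional
    'formula' and 'blocking_factors' keys.
    """
    has_formula = bool(contrast.get("formula"))
    has_blocking = bool(contrast.get("blocking_factors"))
    if has_formula and has_blocking:
        return (
            f"contrast '{contrast.get('id')}': specify either 'formula' "
            "or 'blocking_factors', not both"
        )
    return None
```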

alanmmobbs93 · 2024-12-19T17:05:13Z

@grst @nschcolnicov The code has already been updated for the simpler YML format:

contrasts:
  - id: "treatment_mCherry_hND6"
    comparison: ["treatment", "mCherry", "hND6"]

  - id: "treatment_mCherry_hND6_sample_number"
    comparison: ["treatment", "mCherry", "hND6"]
    blocking_factors: ["sample_number"]

  - id: "treatment234"
    comparison: ["treatment", "mCherry", "hND6"]

However, I have now noticed that it will always evaluate simple additive linear models (~ treatment + treatment2). We can't specify that we want to check for interaction terms between variables (~ treatment * treatment2), for example.
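The limitation is easier to see with a sketch of how a formula might be derived from a contrast entry when no formula is given. This is an assumption about the approach, not the POC code, and both function names are hypothetical: joining the contrast variable and the blocking factors with `+` can only ever yield additive terms, whereas an interaction needs an extra design column that is the elementwise product of two indicator columns.

```python
def additive_formula(contrast):
    """Build '~ var + blocking...' from a contrast dict (hypothetical helper)."""
    variable = contrast["comparison"][0]
    terms = [variable] + list(contrast.get("blocking_factors", []))
    return "~ " + " + ".join(terms)


def interaction_column(indicator_a, indicator_b):
    """Elementwise product of two 0/1 indicator columns, i.e. the extra
    column that an interaction term would add to the design matrix."""
    return [a * b for a, b in zip(indicator_a, indicator_b)]
```

With only `comparison` and `blocking_factors` available, there is no field from which the second function's column could be requested, which is why an explicit formula is needed to express interactions.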

grst added the enhancement New feature or request label Dec 3, 2024

grst mentioned this issue Dec 3, 2024

More flexible model and contrast definition #362

Open

grst added this to differentialabundance Dec 3, 2024

grst moved this to ToDo - high priority in differentialabundance Dec 3, 2024

alanmmobbs93 self-assigned this Dec 4, 2024

alanmmobbs93 mentioned this issue Dec 18, 2024

New Feature POC: VALIDATE_MODEL #404

Merged

11 tasks

nschcolnicov mentioned this issue Dec 18, 2024

YAML-based contrast definition #370

Open

alanmmobbs93 closed this as completed Dec 23, 2024

github-project-automation bot moved this from ToDo - high priority to Done in differentialabundance Dec 23, 2024
