Model validation #371

Closed
6 tasks done
grst opened this issue Dec 3, 2024 · 3 comments

grst commented Dec 3, 2024

Description of feature

This issue is to follow up on the request in #362 to implement model validation. The goal is to catch any errors in the model definition/contrast specification as early as possible for a fast feedback loop.

List of things to check (feel free to add items):

  • columns from design matrix are present in sample metadata
  • design matrix is full rank
  • contrasts are valid (i.e. specified coefficients occur as columns in the design matrix)
  • no NAs in any of the columns used in the design
  • warn about continuous covariates (I remember a case where numeric patient IDs were considered as a continuous covariate)
  • values of specified columns don't contain any special characters that the pipeline can't handle (e.g. PR/CR as a factor level fails downstream in clusterProfiler because of the /).
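
As a rough illustration of the metadata checks in this list, here is a minimal Python sketch. It is not the pipeline's code: the sample rows, the `MISSING` spellings, the `SAFE` character set and the function names are all assumptions made up for this example.

```python
import re

# Hypothetical sample metadata, one dict per sample (stands in for a parsed samplesheet).
samples = [
    {"sample": "s1", "treatment": "mCherry", "response": "PR/CR", "patient": "1"},
    {"sample": "s2", "treatment": "hND6", "response": "SD", "patient": "2"},
    {"sample": "s3", "treatment": "", "response": "SD", "patient": "3"},
]

MISSING = {"", "NA", "NaN", "nan"}      # assumed spellings of missing values
SAFE = re.compile(r"^[A-Za-z0-9_.]+$")  # assumed set of characters the pipeline can handle


def is_number(value):
    """True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def check_metadata(samples, design_columns):
    """Return (errors, warnings) for the columns used in the design."""
    errors, warnings = [], []
    available = set(samples[0]) if samples else set()
    for col in design_columns:
        if col not in available:
            errors.append(f"column '{col}' not found in sample metadata")
            continue
        values = [row[col] for row in samples]
        if any(v in MISSING for v in values):
            errors.append(f"column '{col}' contains missing values")
        present = [v for v in values if v not in MISSING]
        bad = sorted({v for v in present if not SAFE.match(v)})
        if bad:
            errors.append(f"column '{col}' has unsafe values: {bad}")
        if present and all(is_number(v) for v in present):
            warnings.append(f"column '{col}' is numeric and would be treated as continuous")
    return errors, warnings


errors, warnings = check_metadata(samples, ["treatment", "response", "patient", "batch"])
```

In this made-up example, `treatment` is flagged for a missing value, `response` for the `/` in `PR/CR`, `batch` for not existing, and `patient` only produces a warning because numeric IDs would be read as a continuous covariate.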

Implementation

It is probably most convenient to do this in R, but could we do it directly in Groovy? Then errors would be instant and we wouldn't need to wait for a process to be fired up.

CC @apeltzer @tschwarzl @atrigila @alanmmobbs93 @nschcolnicov

@grst grst added the enhancement New feature or request label Dec 3, 2024
@grst grst moved this to ToDo - high priority in differentialabundance Dec 3, 2024
@alanmmobbs93 alanmmobbs93 self-assigned this Dec 4, 2024

alanmmobbs93 commented Dec 18, 2024

POC
Create a validation step that cross-checks the YML (model and contrast information) against the sample sheet.

Features:

Inputs:

  • YML
  • Samplesheet
  • sample_id_col

Outputs:

  • Validated phenotypic table that contains only the columns that were required from the YML file
  • models.json that contains info about full rank models
  • Warnings JSON file, in case we'd like to report it in the workflow.

Functionalities:

  • All variables included in the YML file must be present in the sample sheet as columns. Variables are currently taken from the first component of the contrast definitions; a better solution would be to detect them from the formula if we decide to keep it in the YML.
  • Blocking factors are also checked for existence.
  • All levels declared for the variables (extracted from the contrast field) must be present in the column. If the sample sheet contains additional levels, they are reported with a warning.
  • Check for special characters.
  • Check for missing values.
  • Models are constructed with base R functions, and contrasts are generated to check whether the models are full rank. If they are not, warnings are generated. We can decide whether to report this at the Nextflow level or not.
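
To make the full-rank check concrete, here is a minimal Python sketch of the idea. The POC itself uses base R; this is only an illustration under assumptions: treatment (dummy) coding with the alphabetically first level as reference, purely categorical factors, and made-up sample rows and function names.

```python
from fractions import Fraction


def design_matrix(samples, factors):
    """Intercept plus one dummy column per non-reference level of each factor."""
    columns = [[1] * len(samples)]
    for factor in factors:
        levels = sorted({row[factor] for row in samples})
        for level in levels[1:]:  # first level (alphabetical) is the assumed reference
            columns.append([1 if row[factor] == level else 0 for row in samples])
    return [list(r) for r in zip(*columns)]  # transpose columns into rows


def matrix_rank(rows):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(x) for x in r] for r in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        pivot = next((i for i in range(rank, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, len(m)):
            scale = m[i][col] / m[rank][col]
            m[i] = [a - scale * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank


def is_full_rank(samples, factors):
    """A design is full rank when its rank equals its number of columns."""
    x = design_matrix(samples, factors)
    return matrix_rank(x) == len(x[0])


# Hypothetical samples in which treatment and batch are completely confounded.
samples = [
    {"treatment": "mCherry", "batch": "b1"},
    {"treatment": "mCherry", "batch": "b1"},
    {"treatment": "hND6", "batch": "b2"},
    {"treatment": "hND6", "batch": "b2"},
]

ok_single = is_full_rank(samples, ["treatment"])
ok_confounded = is_full_rank(samples, ["treatment", "batch"])
```

With these rows, the design with `treatment` alone is full rank, while adding the confounded `batch` column makes the intercept a sum of the two dummy columns, so the second design is rank deficient.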

Perspectives:

  • Easy to include more fields if they are added to the yml.

Testing

  • I'd like to test it with real cases, and invite everyone to do the same, to check how flexibly it reads variables from the YML file and whether it finds real errors.
  • Some errors should already be caught by the YML validation that runs before this step.
  • A local module and a basic nf-test were also added to support future changes and easy comparison during development.

Test

The following example files are part of the nf-test that can be executed as declared below. They were obtained from the pipeline's test profile.

nf-test test modules/local/validatemodel/tests/main.nf.test --debug --profile docker

Example YML
This mock YML file was generated from the reference contrast file. It is (temporarily) located in the tests/ folder of the module.

models:
  - formula: "~ treatment"
    contrasts:
      - id: "treatment_mCherry_hND6"
        comparison: ["treatment", "mCherry", "hND6"]

      - id: "treatment_mCherry_hND6_sample_number"
        comparison: ["treatment", "mCherry", "hND6"]
        blocking_factors: ["sample_number"]

      - id: "treatment234"
        comparison: ["treatment", "mCherry", "hND6"]

Note: compared to @nschcolnicov's POC for the YML validation, I added the "formula" field. The script iterates over it and adds the blocking factors when required, but this can be adjusted if we want to remove it. If we decide to keep the formula, it will simplify the comparison field by removing the first part.
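
For illustration, the cross-check between this YML structure and the sample sheet could look roughly like the Python sketch below. It is not the R script from the POC. The dict mirrors the example YML (first two contrasts plus the third), while the sample rows, the extra `GFP` level and the function name are hypothetical. The two levels in `comparison` are treated symmetrically here, since the sketch does not need to know which one is the reference.

```python
# Dict mirroring the example YML above (as it would look after parsing).
models = {
    "models": [
        {
            "formula": "~ treatment",
            "contrasts": [
                {"id": "treatment_mCherry_hND6",
                 "comparison": ["treatment", "mCherry", "hND6"]},
                {"id": "treatment_mCherry_hND6_sample_number",
                 "comparison": ["treatment", "mCherry", "hND6"],
                 "blocking_factors": ["sample_number"]},
                {"id": "treatment234",
                 "comparison": ["treatment", "mCherry", "hND6"]},
            ],
        }
    ]
}

# Hypothetical sample sheet rows: no sample_number column, one undeclared level.
samples = [
    {"sample": "s1", "treatment": "mCherry"},
    {"sample": "s2", "treatment": "hND6"},
    {"sample": "s3", "treatment": "GFP"},
]


def check_contrasts(models, samples):
    """Return (errors, warnings) from comparing contrasts with the sample sheet."""
    errors, warnings = [], []
    columns = set(samples[0]) if samples else set()
    for model in models["models"]:
        for contrast in model["contrasts"]:
            cid = contrast["id"]
            variable, *levels = contrast["comparison"]
            if variable not in columns:
                errors.append(f"{cid}: variable '{variable}' not in samplesheet")
                continue
            observed = {row[variable] for row in samples}
            for level in levels:
                if level not in observed:
                    errors.append(f"{cid}: level '{level}' not found in column '{variable}'")
            extra = sorted(observed - set(levels))
            if extra:
                warnings.append(f"{cid}: column '{variable}' has extra levels {extra}")
            for factor in contrast.get("blocking_factors", []):
                if factor not in columns:
                    errors.append(f"{cid}: blocking factor '{factor}' not in samplesheet")
    return errors, warnings


errors, warnings = check_contrasts(models, samples)
```

Against these made-up rows, the only error is the missing `sample_number` blocking factor, and each contrast produces a warning about the undeclared `GFP` level.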

Example sample sheet
Matching sample sheet can be found in:

https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/mus_musculus/rnaseq_expression/SRP254919.samplesheet.csv

Run script
Make the script executable and run it:

validate_model.R \
        --yml path/to/yml \
        --samplesheet path/to/samplesheet \
        --sample_id_col 'sample'


grst commented Dec 19, 2024

> If we decide to keep the formula, it will simplify the comparison field by removing the first part.

Blocking factors and the formula are mutually exclusive. Ultimately, we want to get rid of the blocking factors and always specify the formula explicitly. Keeping the blocking factors was meant as an intermediate step to keep the changes to the pipeline atomic.
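
If the validation should enforce this, a check along the following lines might be enough. This is a Python sketch under assumptions: the input is the parsed YML as a dict shaped like the earlier example (formula at model level, blocking factors at contrast level), and the function name is hypothetical.

```python
def find_formula_blocking_conflicts(models):
    """Return ids of contrasts that set blocking_factors inside a model that also has a formula."""
    conflicts = []
    for model in models.get("models", []):
        if not model.get("formula"):
            continue  # no formula, so blocking factors are allowed
        for contrast in model.get("contrasts", []):
            if contrast.get("blocking_factors"):
                conflicts.append(contrast["id"])
    return conflicts


# Parsed form of the example YML posted earlier in this thread.
example = {
    "models": [
        {
            "formula": "~ treatment",
            "contrasts": [
                {"id": "treatment_mCherry_hND6",
                 "comparison": ["treatment", "mCherry", "hND6"]},
                {"id": "treatment_mCherry_hND6_sample_number",
                 "comparison": ["treatment", "mCherry", "hND6"],
                 "blocking_factors": ["sample_number"]},
            ],
        }
    ]
}

conflicts = find_formula_blocking_conflicts(example)
```

For the example YML, this would flag `treatment_mCherry_hND6_sample_number`, the one contrast that combines a formula with a blocking factor.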


alanmmobbs93 commented Dec 19, 2024

@grst @nschcolnicov The code has already been updated for the simpler YML format:

contrasts:
  - id: "treatment_mCherry_hND6"
    comparison: ["treatment", "mCherry", "hND6"]

  - id: "treatment_mCherry_hND6_sample_number"
    comparison: ["treatment", "mCherry", "hND6"]
    blocking_factors: ["sample_number"]

  - id: "treatment234"
    comparison: ["treatment", "mCherry", "hND6"]

However, I just noticed that it will always evaluate simple additive models (~ treatment + treatment2). We can't specify that we want to check interaction terms between variables (~ treatment * treatment2), for example.
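
To show what the validation currently misses, here is a small Python sketch contrasting the two designs. It assumes two-level factors with treatment (dummy) coding and the alphabetically first level as reference; the levels `ctrl`/`drug` and `low`/`high` and the function names are made up for the example.

```python
def dummy(samples, factor):
    """0/1 indicator for the non-reference level of a two-level factor."""
    levels = sorted({row[factor] for row in samples})
    if len(levels) != 2:
        raise ValueError(f"'{factor}' must have exactly two levels in this sketch")
    return [1 if row[factor] == levels[1] else 0 for row in samples]


def additive_design(samples, a, b):
    """Rows of the design matrix for ~ a + b: intercept, a, b."""
    return [[1, x, y] for x, y in zip(dummy(samples, a), dummy(samples, b))]


def interaction_design(samples, a, b):
    """Rows of the design matrix for ~ a * b: intercept, a, b, a:b."""
    return [[1, x, y, x * y] for x, y in zip(dummy(samples, a), dummy(samples, b))]


# Hypothetical samples covering every combination of the two factors.
samples = [
    {"treatment": "ctrl", "treatment2": "low"},
    {"treatment": "ctrl", "treatment2": "high"},
    {"treatment": "drug", "treatment2": "low"},
    {"treatment": "drug", "treatment2": "high"},
]

additive = additive_design(samples, "treatment", "treatment2")
interaction = interaction_design(samples, "treatment", "treatment2")
```

The interaction design has one extra column (the product of the two indicators), so a model that is full rank in its additive form can still be rank deficient once the interaction term is included, for example when one factor combination has no samples.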

@github-project-automation github-project-automation bot moved this from ToDo - high priority to Done in differentialabundance Dec 23, 2024