Selection of model examples #106

Open
lionelkusch opened this issue Dec 30, 2024 · 12 comments

Comments

@lionelkusch
Collaborator

One aspect is that we won't create one example for each function of the library. Examples are here to guide users toward the main information. They should not be exhaustive.
Please also remember that structure is here to help by providing guidelines and a common understanding. What matters first is functionality, clarity of the material and making maintenance easy.

Originally posted by @bthirion in #104 (comment)

@lionelkusch added the examples label Dec 30, 2024
@lionelkusch
Collaborator Author

What are the criteria for creating an example associated with a specific model?

@bthirion
Contributor

A combination of usefulness and advocating the methods with good empirical behavior (error control, power, ...).

@lionelkusch
Collaborator Author

Usefulness is quite difficult to estimate, and it's subjective.

For empirical behaviour, not all models can theoretically provide a bound for error control and power.
Moreover, I think this empirical behaviour will depend on the data used (linear/non-linear, correlated or not, ...).
I don't think there are good metrics to evaluate a model, because evaluation is data-dependent and we don't have a reference set of datasets to evaluate the different models on.

@bthirion
Contributor

bthirion commented Jan 2, 2025

You're right to some extent. Still, we now understand why some methods fail. For instance, basic permutation importance does not estimate a proper variable importance measure. We can include it for historical reference, but clearly, we should not advocate it.
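
For illustration, a minimal sketch of this failure mode (assuming scikit-learn and toy simulated data; an illustrative example only): with two strongly correlated features, marginal permutation importance gives credit to a feature that is conditionally irrelevant.

```python
# Sketch: basic (marginal) permutation importance with two strongly
# correlated features. Only x0 drives the response, but x1 is an almost
# exact copy of x0, so the model can lean on either column: x0's measured
# importance shrinks and the irrelevant copy x1 gets nonzero importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 1000
x0 = rng.normal(size=n)
x1 = x0 + 0.05 * rng.normal(size=n)   # near-duplicate of x0
x2 = rng.normal(size=n)               # pure noise
X = np.column_stack([x0, x1, x2])
y = 2.0 * x0 + rng.normal(size=n)     # only x0 matters conditionally

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=20, random_state=0)
print(result.importances_mean)  # credit is shared between x0 and x1
```

Conditional schemes such as CPI, mentioned further down, are designed to avoid exactly this behaviour.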

@lionelkusch
Collaborator Author

Whether a method fails is highly dependent on the context.
I may be wrong, but permutation importance works perfectly if the data are uncorrelated and follow a normal law. This type of data is quite rare in reality. Consequently, if you want a method that works empirically, it is not a good method to select.

I have two problems with this formulation.
My first problem is the assumption that empirical data share some characteristics of their distribution. To me, this is false, and the most common way to avoid this assumption is to invoke the central limit theorem, but I don't think that approximation applies in this context.
My second problem is that there will not be a general method for every dataset, or such a method would be too complex or too flexible to be useful. Moreover, every method will fail in some context.

To avoid these problems, and if we still want to base our evaluation on empirical behaviour, I need to know what the context is for this library, i.e. the properties of the data (characteristics of the distribution, number of features, number of samples, linearity, correlation, ...).

@bthirion
Contributor

bthirion commented Jan 2, 2025

We don't want to point out particular datasets, but classes of problems:

  • low-dim problems
  • high dim problems
  • very high dim problems with structure.

Some assumptions are always unreasonable because they're too restrictive: e.g. independence of the columns of X and Gaussianity. They were introduced historically for mathematical convenience, but nobody wants to rely on them.
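
For illustration, a sketch of a simulated design whose columns are neither independent nor Gaussian (an illustrative example using numpy/scipy; the exact covariance and transform choices are arbitrary):

```python
# Sketch: a design matrix with correlated, heavy-tailed columns,
# violating both the column-independence and the Gaussianity assumptions.
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)
n_samples, n_features, rho = 200, 50, 0.7

# Toeplitz covariance: corr(x_i, x_j) = rho ** |i - j|
cov = toeplitz(rho ** np.arange(n_features))
X_gauss = rng.multivariate_normal(np.zeros(n_features), cov, size=n_samples)

# A monotone transform pushes the marginals away from Gaussian (heavier
# tails) while keeping the dependence structure between columns.
X = np.sinh(X_gauss)

print(np.corrcoef(X, rowvar=False)[0, :5])  # neighbouring columns stay correlated
```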

@lionelkusch
Collaborator Author

I don't know this classification.
Can you give me a definition of these different classes of problems?

@bthirion
Contributor

bthirion commented Jan 3, 2025

Ha ha, there is no formal definition.

  • low-dim means that you have a handful of variables, maybe a few tens. Much of the literature deals with that, e.g. https://christophm.github.io/interpretable-ml-book/
  • high-dim means that the number of features is much larger, typically of the same order as the number of samples. In this kind of problem, you rather want to control the rate of false detections (the FDR, not the FPR nor the FWER). This is where knockoffs become useful. CPI and LOCO are often used in such contexts (but they can also be used in low-dim contexts; see the LOCO sketch below). Quite often, you want to group the features for statistical power and computational efficiency.
  • very high-dim problems have a number of features larger than the number of samples. You must use data reductions, and thus cannot guarantee exact inference on each feature. You rather want to localize the spots of importance in the feature space. The feature space often has an intrinsic structure (image, linkage disequilibrium, ...) that should be leveraged for dimension reduction.
    HTH
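
Since CPI and LOCO come up here, a bare-bones sketch of the LOCO (leave-one-covariate-out) idea under simple assumptions (scikit-learn, a single train/test split, squared error); an illustration only, not the library's implementation:

```python
# Sketch of LOCO: for each feature j, refit the model without column j and
# compare its held-out error to the error of the full model.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def loco_importance(X, y, make_estimator, random_state=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=random_state
    )
    full_err = mean_squared_error(
        y_te, make_estimator().fit(X_tr, y_tr).predict(X_te)
    )
    importances = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        err_j = mean_squared_error(
            y_te,
            make_estimator()
            .fit(np.delete(X_tr, j, axis=1), y_tr)
            .predict(np.delete(X_te, j, axis=1)),
        )
        importances[j] = err_j - full_err  # > 0: removing feature j hurts
    return importances

# Toy usage: features 0 and 3 carry the signal
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=300)
print(loco_importance(X, y, RidgeCV))
```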

@lionelkusch
Collaborator Author

If I understand correctly, we can broadly summarise that:

  • some models handle only the case where the number of features is lower than the number of samples;
  • other models can handle data where the number of features is of the same order as the number of samples;
  • other models can handle data where the number of features is larger than the number of samples, using the intrinsic structure to reduce the dimension (see the sketch below).
    Link to the issue Dataset folder #92. Do you have a dataset for each case?
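
As a rough sketch of that third case (illustrative only, using scikit-learn's FeatureAgglomeration rather than the library's own tools): cluster correlated features into groups and work on the reduced, group-level representation.

```python
# Sketch: very high-dim setting (n_features >> n_samples). Cluster the
# features into groups and run inference on the group-level representation.
import numpy as np
from sklearn.cluster import FeatureAgglomeration

rng = np.random.default_rng(0)
n_samples, n_features, n_clusters = 100, 5000, 50

X = rng.normal(size=(n_samples, n_features))

agglo = FeatureAgglomeration(n_clusters=n_clusters)
X_reduced = agglo.fit_transform(X)  # shape: (n_samples, n_clusters)

# Any importance / inference method is then applied to X_reduced, and a
# conclusion about a cluster maps back to its members via agglo.labels_.
print(X_reduced.shape, np.bincount(agglo.labels_)[:5])
```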

For a starting example, if you had to choose one model per class, which model would it be?

If I understand correctly, the evaluation of these three types of problems will be different.
Do you have metrics for these three classes of problems that can be used to compare models within the same class?
This will be linked to the benchmark.
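
For simulated data where the truly important features are known, the usual metrics are the false discovery proportion and the power; a small illustrative sketch:

```python
# Sketch: empirical error-control and power metrics on simulated data,
# where the set of truly important features (the "support") is known.
import numpy as np

def fdp_and_power(selected, true_support):
    """Both arguments are boolean arrays of length n_features."""
    selected = np.asarray(selected, dtype=bool)
    true_support = np.asarray(true_support, dtype=bool)
    false_discoveries = np.sum(selected & ~true_support)
    true_discoveries = np.sum(selected & true_support)
    fdp = false_discoveries / max(selected.sum(), 1)        # averaged over runs -> FDR
    power = true_discoveries / max(true_support.sum(), 1)   # fraction of support recovered
    return fdp, power

# Toy usage: 100 features, the first 10 truly important, 12 selected
true_support = np.zeros(100, dtype=bool)
true_support[:10] = True
selected = np.zeros(100, dtype=bool)
selected[:8] = True      # 8 true positives
selected[50:54] = True   # 4 false positives
print(fdp_and_power(selected, true_support))  # (0.333..., 0.8)
```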

@bthirion
Contributor

bthirion commented Jan 6, 2025

@lionelkusch
Collaborator Author

Do you want to use all these datasets?

For a starting example, if you had to choose one model per class, which model would it be?

Do you have metrics for these three classes of problems that can be used to compare models within the same class?

@bthirion
Contributor

bthirion commented Jan 6, 2025

Probably not, which means that we have to try and see. What matters is that we can showcase the methods properly.
