Selection of model examples #106

Open
lionelkusch opened this issue Dec 30, 2024 · 12 comments

Comments

@lionelkusch
Collaborator

One aspect is that we won't create one example for each function of the library. Examples are here to guide users toward the main information. They should not be exhaustive.
Please also remember that structure is here to help by providing guidelines and a common understanding. What matters first is functionality, clarity of the material and making maintenance easy.

Originally posted by @bthirion in #104 (comment)

@lionelkusch added the examples label Dec 30, 2024
@lionelkusch
Collaborator Author

What are the criteria for creating an example associated with a specific model?

@bthirion
Contributor

A combination of usefulness and advocating the methods with good empirical behavior (error control, power, ...).

@lionelkusch
Collaborator Author

Usefulness is quite difficult to estimate, and it's subjective.

For empirical behaviour, not all models can theoretically provide a bound for error control and power.
Moreover, I think this empirical behaviour will depend on the data used (linear/non-linear, correlated or not, ...).
I don't think there are good metrics to evaluate a model, because evaluation is data-dependent and we don't have a reference set of datasets to evaluate the different models on.

@bthirion
Contributor

bthirion commented Jan 2, 2025

You're right to some extent. Still, we now understand why some methods fail. For instance, basic permutation importance does not estimate a proper variable importance measure. We can include it for historical reference, but clearly, we should not advocate it.
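
For illustration, a minimal sketch of this failure mode (assuming scikit-learn and toy simulated data; an illustrative example only): with two strongly correlated features, marginal permutation importance gives credit to a feature that is conditionally irrelevant.

```python
# Sketch: basic (marginal) permutation importance with two strongly
# correlated features. Only x0 drives the response, but x1 is an almost
# exact copy of x0, so the model can lean on either column: x0's measured
# importance shrinks and the irrelevant copy x1 gets nonzero importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 1000
x0 = rng.normal(size=n)
x1 = x0 + 0.05 * rng.normal(size=n)   # near-duplicate of x0
x2 = rng.normal(size=n)               # pure noise
X = np.column_stack([x0, x1, x2])
y = 2.0 * x0 + rng.normal(size=n)     # only x0 matters conditionally

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=20, random_state=0)
print(result.importances_mean)  # credit is shared between x0 and x1
```

Conditional schemes such as CPI, mentioned further down, are designed to avoid exactly this behaviour.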

@lionelkusch
Collaborator Author

Whether a method fails is highly dependent on the context.
I may be wrong, but permutation importance works perfectly if the data are uncorrelated and follow a normal law. This type of data is quite rare in reality. Consequently, if you want a method that works empirically, it is not a good method to select.

I have two problems with this formulation.
My first problem is the assumption that empirical data share some characteristics of their distribution. To me, this is false, and the most common way to avoid this assumption is to invoke the central limit theorem, but I don't think that approximation applies in this context.
My second problem is that there will not be a general method for every dataset, or such a method would be too complex or too flexible to be useful. Moreover, every method will fail in some context.

To avoid these problems, and if we still want to base our evaluation on empirical behaviour, I need to know what the context is for this library, i.e. the properties of the data (characteristics of the distribution, number of features, number of samples, linearity, correlation, ...).

@bthirion
Contributor

bthirion commented Jan 2, 2025

We don't want to point out particular datasets, but classes of problems:

  • low-dim problems
  • high dim problems
  • very high dim problems with structure.

Some assumptions are always unreasonable because they're too restrictive: e.g. independence of the columns of X and Gaussianity. They were introduced historically for mathematical convenience, but nobody wants to rely on them.
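
For illustration, a sketch of a simulated design whose columns are neither independent nor Gaussian (an illustrative example using numpy/scipy; the exact covariance and transform choices are arbitrary):

```python
# Sketch: a design matrix with correlated, heavy-tailed columns,
# violating both the column-independence and the Gaussianity assumptions.
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)
n_samples, n_features, rho = 200, 50, 0.7

# Toeplitz covariance: corr(x_i, x_j) = rho ** |i - j|
cov = toeplitz(rho ** np.arange(n_features))
X_gauss = rng.multivariate_normal(np.zeros(n_features), cov, size=n_samples)

# A monotone transform pushes the marginals away from Gaussian (heavier
# tails) while keeping the dependence structure between columns.
X = np.sinh(X_gauss)

print(np.corrcoef(X, rowvar=False)[0, :5])  # neighbouring columns stay correlated
```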

@lionelkusch
Collaborator Author

I don't know this classification.
Can you give me a definition of these different classes of problems?

@bthirion
Contributor

bthirion commented Jan 3, 2025

Ha ha, there is no formal definition.

  • low-dim means that you have a handful of variables, maybe a few tens. Much of the literature deals with that, e.g. https://christophm.github.io/interpretable-ml-book/
  • high-dim means that the number of features is much larger, typically of the same order as the number of samples. In this kind of problem, you rather want to control the rate of false detections (the FDR, not the FPR nor the FWER). This is where knockoffs become useful. CPI and LOCO are often used in such contexts (but they can also be used in low-dim contexts; see the LOCO sketch below). Quite often, you want to group the features for statistical power and computational efficiency.
  • very high-dim problems have a number of features larger than the number of samples. You must use data reductions, and thus cannot guarantee exact inference on each feature. You rather want to localize the spots of importance in the feature space. The feature space often has an intrinsic structure (image, linkage disequilibrium, ...) that should be leveraged for dimension reduction.
    HTH
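
Since CPI and LOCO come up here, a bare-bones sketch of the LOCO (leave-one-covariate-out) idea under simple assumptions (scikit-learn, a single train/test split, squared error); an illustration only, not the library's implementation:

```python
# Sketch of LOCO: for each feature j, refit the model without column j and
# compare its held-out error to the error of the full model.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def loco_importance(X, y, make_estimator, random_state=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=random_state
    )
    full_err = mean_squared_error(
        y_te, make_estimator().fit(X_tr, y_tr).predict(X_te)
    )
    importances = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        err_j = mean_squared_error(
            y_te,
            make_estimator()
            .fit(np.delete(X_tr, j, axis=1), y_tr)
            .predict(np.delete(X_te, j, axis=1)),
        )
        importances[j] = err_j - full_err  # > 0: removing feature j hurts
    return importances

# Toy usage: features 0 and 3 carry the signal
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=300)
print(loco_importance(X, y, RidgeCV))
```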

@lionelkusch
Collaborator Author

If I understand correctly, we can broadly summarise that:

  • some models handle only the case where the number of features is lower than the number of samples;
  • other models can handle data where the number of features is of the same order as the number of samples;
  • other models can handle data where the number of features is larger than the number of samples, using the intrinsic structure to reduce the dimension (see the sketch below).
    Link to the issue Dataset folder #92. Do you have a dataset for each case?
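
As a rough sketch of that third case (illustrative only, using scikit-learn's FeatureAgglomeration rather than the library's own tools): cluster correlated features into groups and work on the reduced, group-level representation.

```python
# Sketch: very high-dim setting (n_features >> n_samples). Cluster the
# features into groups and run inference on the group-level representation.
import numpy as np
from sklearn.cluster import FeatureAgglomeration

rng = np.random.default_rng(0)
n_samples, n_features, n_clusters = 100, 5000, 50

X = rng.normal(size=(n_samples, n_features))

agglo = FeatureAgglomeration(n_clusters=n_clusters)
X_reduced = agglo.fit_transform(X)  # shape: (n_samples, n_clusters)

# Any importance / inference method is then applied to X_reduced, and a
# conclusion about a cluster maps back to its members via agglo.labels_.
print(X_reduced.shape, np.bincount(agglo.labels_)[:5])
```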

For a starting example, if you had to choose one model per class, which model would it be?

If I understand correctly, the evaluation of these three types of problems will be different.
Do you have metrics for these three classes of problems that can be used to compare models within the same class?
This will be linked to the benchmark.
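
For simulated data where the truly important features are known, the usual metrics are the false discovery proportion and the power; a small illustrative sketch:

```python
# Sketch: empirical error-control and power metrics on simulated data,
# where the set of truly important features (the "support") is known.
import numpy as np

def fdp_and_power(selected, true_support):
    """Both arguments are boolean arrays of length n_features."""
    selected = np.asarray(selected, dtype=bool)
    true_support = np.asarray(true_support, dtype=bool)
    false_discoveries = np.sum(selected & ~true_support)
    true_discoveries = np.sum(selected & true_support)
    fdp = false_discoveries / max(selected.sum(), 1)        # averaged over runs -> FDR
    power = true_discoveries / max(true_support.sum(), 1)   # fraction of support recovered
    return fdp, power

# Toy usage: 100 features, the first 10 truly important, 12 selected
true_support = np.zeros(100, dtype=bool)
true_support[:10] = True
selected = np.zeros(100, dtype=bool)
selected[:8] = True      # 8 true positives
selected[50:54] = True   # 4 false positives
print(fdp_and_power(selected, true_support))  # (0.333..., 0.8)
```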

@bthirion
Contributor

bthirion commented Jan 6, 2025

@lionelkusch
Collaborator Author

Do you want to use all these datasets?

For a starting example, if you had to choose one model per class, which model would it be?

Do you have metrics for these three classes of problems that can be used to compare models within the same class?

@bthirion
Contributor

bthirion commented Jan 6, 2025

Probably not, which means that we have to try and see. What matters is that we can showcase the methods properly.
