ADA-SVR (3/4) PR example of models #100
base: main
Conversation
Codecov Report
Attention: Patch coverage is

Coverage Diff:
##             main     #100      +/-   ##
==========================================
- Coverage   81.70%   81.66%   -0.04%
==========================================
  Files          43       43
  Lines        2312     2318       +6
==========================================
+ Hits         1889     1893       +4
- Misses        423      425       +2

☔ View full report in Codecov by Sentry.
Thanks for opening this. I would try to minimize the communication about this method, as it is not rigorous.
examples/_utils/plot_dataset.py (outdated)
    vmax=1.0,
):
    """
    Plot for the confidence in the hipothse that the variables are important.
Suggested change:
-    Plot for the confidence in the hipothse that the variables are important.
+    Plot the variable importance p-values
examples/_utils/plot_dataset.py (outdated)
plt.subplots_adjust(top=1.0, bottom=0.2)


def plot_pvalue_H0(
Maybe better: plot_p_values
examples/_utils/plot_dataset.py (outdated)
    - vmax: Maximum value of the colorbar (float)

    Returns:
    - a figure with 3 subplots
The function actually returns None.
examples/methods/ada_svr.py (outdated)
#
# **Advantages**:
#
# - The method is fast because it uses the central limit theorem to estimate
It is actually fast because it uses linear regression to get the uncertainty estimate.
The CLT is simply there to strengthen the Gaussian assumption; it is here for the sake of theory only.
We can say it's fast because it uses a broad estimate of the weight distribution.
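For context, here is a minimal sketch of the kind of test being discussed, assuming beta_hat holds the estimated coefficients and scale their estimated standard deviations (the names mirror the snippet quoted later in this thread; the actual library API may differ):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical coefficient estimates and their estimated standard deviations,
# as an ADA-SVR-like procedure would produce them.
beta_hat = np.array([2.1, -0.3, 0.05, -1.8])
scale = np.array([0.5, 0.4, 0.45, 0.6])

# Under the Gaussian assumption (motivated by the CLT), each coefficient is
# tested against a zero-centred normal; a two-sided p-value comes from the
# survival function of the standard normal distribution.
pvalue = 2 * norm.sf(np.abs(beta_hat) / scale)
print(pvalue.round(4))
```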
Fix typo errors. Co-authored-by: bthirion <[email protected]>
… into PR_example_ADA_SVR
examples/methods/ada_svr.py (outdated)
# - The method assumes that the distribution of the coefficients of SVR is normal, centred around zero.
# - The method is not valid for small sample sizes.
# - The method has all the disadvantages of linear models: only for linear
#   relationships, not good predicting performance, unintuitive.
"unintuitive": following Molnar's explanation, the interpretation of the weights can be counter-intuitive when the dimension is higher than 4 (see the last paragraph of the linear model chapter).
So you mean that the model is "non-sparse", making it harder to figure out.
No, my understanding of Molnar's argument is that when the dimension is higher than 4, it is quite impossible to visualize the hyperplane defined by the weights (a small numerical illustration follows the quote below).
From Molnar's book : "The interpretation of a weight can be unintuitive because it depends on all other features. A feature with high positive correlation with the outcome y and another feature might get a negative weight in the linear model, because, given the other correlated feature, it is negatively correlated with y in the high-dimensional space. Completely correlated features make it even impossible to find a unique solution for the linear equation. An example: You have a model to predict the value of a house and have features like number of rooms and size of the house. House size and number of rooms are highly correlated: the bigger a house is, the more rooms it has. If you take both features into a linear model, it might happen, that the size of the house is the better predictor and gets a large positive weight. The number of rooms might end up getting a negative weight, because, given that a house has the same size, increasing the number of rooms could make it less valuable or the linear equation becomes less stable, when the correlation is too strong."
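To make Molnar's house example concrete, here is a small self-contained sketch (all numbers are invented for illustration) in which a feature that is strongly positively correlated with the outcome still receives a negative linear-model weight:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two highly correlated features: house size and number of rooms.
rng = np.random.default_rng(0)
size = rng.normal(100.0, 20.0, 1000)
rooms = size / 20.0 + rng.normal(0.0, 0.3, 1000)

# Price: given the size, extra rooms slightly *reduce* the value.
price = 3.0 * size - 1.0 * rooms + rng.normal(0.0, 1.0, 1000)

# The number of rooms is strongly positively correlated with the price...
print(np.corrcoef(rooms, price)[0, 1])   # close to +1

# ...yet its weight in the linear model is negative, as in Molnar's example.
model = LinearRegression().fit(np.column_stack([size, rooms]), price)
print(model.coef_)                        # approximately [3.0, -1.0]
```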
Sure, but there are two ways to consider the problem:
- If your data is high-dimensional (dimension > 4), you cannot reliably interpret the weights, which is true or false depending on the correlation between the features. But then, interpretability is not related to the model, only to the data...
- If your model is sparse, i.e. it can explain y with very few features, it becomes interpretable again. A Lasso regression or a regression tree is "sparse" in that sense. However, this is not a good idea, because the sparse solution reached by the model may not be significantly better than other sparse solutions.
Overall, I think that this is a bad argument. In any case, it is not related to ada-svr.
I used it only as an example for creating a template for all the other methods.
pvalue, pvalue_corrected, one_minus_pvalue, one_minus_pvalue_correlation = (
    ada_svr_pvalue(beta_hat, scale)
)
plot_pvalue_H0(
I don't think it's a good idea to have plotting functions as black boxes. See examples from sklearn and nilearn, where this pattern is avoided.
Scikit-learn and nilearn have two different policies for plotting.
Nilearn has a submodule "plotting" where most of the plotting functions are implemented. In general, the only plots that do not use these functions are either simple ones, such as bar plots or line plots, or too specific ones, such as plots for a particular dataset.
Scikit-learn does not have a specific submodule for plotting results. As a consequence, most of the plotting is not in a black box; the few exceptions are mainly for complex plots.
I have difficulty striking a trade-off between the readability of the notebook and the use of a black-box function.
The main difference that I identify is that, instead of creating a full figure, the function creates a plot on a specific axis.
In my opinion:
- Having predefined functions makes it harder for users to tweak the plotting script to adapt the figure to their constraints (publication template, presentation...).
- The plotting functionalities are currently not too heavy; it wouldn't reduce the readability of the notebook to include them (see the sketch below).
- There is redundancy in plotting coefficients, p-values, and corrected p-values, unless you want to highlight what one of these quantities adds on top of the others.
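To make the suggestion concrete, here is a minimal sketch of what inlining the plotting could look like; the p-value arrays are invented stand-ins for the outputs of ada_svr_pvalue, and the Bonferroni correction is only there to keep the sketch self-contained:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical p-values, standing in for the outputs of ada_svr_pvalue.
pvalue = np.array([0.001, 0.20, 0.03, 0.75])
pvalue_corrected = np.minimum(pvalue * pvalue.size, 1.0)  # Bonferroni, for the sketch

# Plot -log10 p-values directly in the notebook instead of calling a helper.
fig, ax = plt.subplots(figsize=(6, 3))
x = np.arange(pvalue.size)
ax.bar(x - 0.2, -np.log10(pvalue), width=0.4, label="p-value")
ax.bar(x + 0.2, -np.log10(pvalue_corrected), width=0.4, label="corrected")
ax.axhline(-np.log10(0.05), color="red", linestyle="--", label="alpha = 0.05")
ax.set_xlabel("variable index")
ax.set_ylabel("-log10(p)")
ax.legend()
plt.show()
```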
Do you propose to only plot corrected p-values?
As a general question: is using a different template for this method than for LOCO / PI / CPI desirable?
A key difference is that it does not use data splits (train/test), which makes .fit and .score unnecessary, but also makes the syntax inconsistent across methods.
I would suggest at least maximizing the consistency between the methods that do not require data splitting (ada-svr, knockoffs ...), for instance along the lines sketched below.
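As an illustration of what that consistency could mean, here is a hypothetical sketch; none of these function names come from the library, they only stand in for methods that take the full (X, y) without a train/test split:

```python
import numpy as np

# Hypothetical stand-ins for split-free methods; the real procedures would
# return importance scores or p-values computed from the full data.
def importance_ada_svr(X, y):
    return np.abs(np.linalg.lstsq(X, y, rcond=None)[0])

def importance_knockoffs(X, y):
    return np.abs(np.corrcoef(X.T, y)[-1, :-1])

# A shared calling convention lets every example iterate over the methods
# uniformly, without method-specific .fit / .score boilerplate.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.0]) + rng.normal(size=100)
for method in (importance_ada_svr, importance_knockoffs):
    print(method.__name__, method(X, y).round(2))
```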
plt.subplots_adjust(top=1.0, hspace=0.3)


def plot_pvalue_H1(
If plotting functions are kept, I think this one should be merged with plot_pvalue_H0. Also, one_minus_pvalue should not be a required argument.
What do you mean by "one_minus_pvalue should not be a required argument"?
plot_pvalue_H1 takes as arguments (pvalue, pvalue_corrected, one_minus_pvalue, one_minus_pvalue_corrected). I think one_minus_pvalue could simply be computed inside the function to simplify function calls.
I was keeping it because it was the output of the existing function in stat_tools.
Looking at it in more detail, I think we can remove one_minus_pvalue and one_minus_pvalue_corrected from these functions.
Sorry, why should we do that? These values were here for numerical reasons...
We don't need to keep them in memory if we provide a function which deals with this numerical error.
From the code point of view, the computation of 1 - pvalue is always the same, so it can be seen as a duplicated line of code. I prefer to propose a function rather than the precomputed value. (A short illustration of the numerical issue follows.)
We can discuss it in issue #107.
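For reference, here is the floating-point issue at stake: in double precision, a p-value extremely close to 1 rounds to exactly 1.0, so computing 1 - pvalue after the fact destroys the tail mass, whereas computing the complement directly from the survival function preserves it:

```python
from scipy.stats import norm

z = 10.0
pvalue = norm.cdf(z)            # rounds to exactly 1.0 in float64
one_minus_pvalue = norm.sf(z)   # the tail mass, computed directly

print(pvalue == 1.0)            # True: the information is already gone
print(1.0 - pvalue)             # 0.0: recomputing the complement loses it
print(one_minus_pvalue)         # ~7.6e-24: preserved by the direct computation
```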
See issue #104 for the discussion around this topic.
Note that Ada-SVR is different from Knockoffs and CPI: it is a variable importance estimator tied to a given model (essentially a linear regression). In some sense, it is comparable to the desparsified Lasso. Again, let me reiterate that Ada-SVR is not something we want to put forward. I think that we should not base our API discussions on it.
Fix typo. Co-authored-by: Joseph Paillard <[email protected]>
I want to propose this PR in order to have a discussion around documenting the models through an example.
The main file of this pull request is examples/methods/ada_svr.py.
It is composed of multiple sections that I would like to generalise to all the models in the library.
Please also consider the organisation of the files as well as the plots.
Would you prefer to open an issue to talk about this, or can the discussion on this topic stay in this PR?