Folder organisations #86

lionelkusch · 2024-12-20T09:00:19Z

Following PR #73, we need a shared vision for the organization of the library.

lionelkusch · 2024-12-20T10:01:47Z

My detailed vision of it is:

.
+--ReadMe.txt
+--pyproject.toml
+--LICENCE
+--codecov.yml
+--.gitignore
+--doc_conf 
|  +--conf.py (configuration file)
|  +--Makefile (make file for creating the library)
|  +--documentation (folder for additional documentation)
|       +--api.rst (index of the API)
|       +--index.rst (home page of the documentation)
|       +--... (specific pages such as how to contribute, ....)
|       +--references.bib (bibliography reference)
|  +--docs (folder where to build the documentation)
+--examples (folder with examples)
|  +--__init__.py
|  +--figures (folder to save all the generated figures and additional figures)
|      +--generates (folder which contains the figures generated by the examples)
|         +--get_started
|              +--find_importance_variable_1.png
|              +--...
|         +--models
|              +--ada_svr_1.png (illustrate the data of the toy_dataset)
|              +--ada_svr_2.png (illustrate the results on the toy_dataset)
|              +--...
|         +--estimators 
|              +--dnn_learner_1.png 
|              +--...
|         +--comparison_models
|              +--geometric_2D_1.png
|              +--benchmarks_1D_dataset_1.png
|              +--...
|      +--external (folder which contains figures to illustrate the examples)
|         +--get_started
|              +--...
|         +--models
|              +--ada_svr_1.py (figure 1 of Gaonkar et al. 2012)
|              +--...
|         +--comparison_models
|              +--...
|  +--_utils (folder with the function for examples)
|      +--__init__.py
|      +--plot_function.py (functions for plotting datasets or results)
|      +--....
|  +--get_started (folder for tutorials or basic examples of how to use Hidimstat)
|      +--find_importance_variable.py (apply one generic method to a dataset)
|      +--....
|  +--models (folder which shows how to use the models on a toy dataset,
                        followed by a description of the methods
                        and of their advantages, disadvantages and assumptions)
|      +--ada_svr.py 
|      +--permutation_test.py
|      +--.....
|  +--estimator (folder which shows how to use an estimator on a toy dataset,
                        followed by a description of the estimator
                        and of its advantages, disadvantages and assumptions)
|      +--dnn_learner.py 
|      +--.....
|  +--comparison_models (folder where the models are compared
                                               on toy datasets with specific characteristics
                                               (linear/non-linear, geometric, ...)
                                               and benchmarked)
|      +--geometric_2D.py
|      +--....
|      +--benchmarks (study the speed of each model)
|          +--1D_dataset.py
|          +--scalability.py
|          +--performance p~n.py
|          +--....
+--hidimstat (folder with code)
|  +--__init__.py
|  +--models (folder with models for estimating the importance of variables)
|      +--__init__.py
|      +--_utils (folder with functions shared between functions of the sub-packages)
|          +--__init__.py
|          +--scikit-learn_estimator.py (generic function for using the estimator API)
|          +--....
|      +--tests (folder for testing the models)
|          +--__init__.py
|          +--_utils (folder with functions shared between tests)
|              +--__init__.py
|              +--....
|          +--test_ada_svr.py
|          +--test_permutation_test.py
|          +--....
|      +--ada_svr.py
|      +--permutation_test.py
|      +--....
|  +--estimator (folder with specific estimators)
|      +--__init__.py
|      +--_utils (folder with functions shared between functions of the sub-packages)
|          +--__init__.py
|          +--....
|      +--tests (folder for testing the estimators)
|          +--__init__.py
|          +--_utils (folder with functions shared between tests)
|              +--__init__.py
|              +--....
|          +--test_dnn_learner.py
|          +--....
|      +--dnn_learner.py
|      +--....
|  +--extra (folder for generation of toy_dataset and statistics methods)
|      +--__init__.py
|      +--toy_data
|          +--__init__.py
|          +--_utils (folder with functions shared between functions of the subsub-packages)
|              +--__init__.py
|              +--....
|          +--tests (folder for testing the generated function)
|              +--__init__.py
|              +--_utils (folder with functions shared between tests)
|                  +--__init__.py
|                  +--....
|              +--test_1d_dataset.py
|              +--....
|          +--1d_dataset.py
|          +--....
|      +--stat_tools (folder with methods for calculating p-values and error rates)
|          +--__init__.py
|          +--_utils (folder with functions shared between functions of the subsub-packages)
|              +--__init__.py
|              +--....
|          +--tests (folder for testing the generated function)
|              +--__init__.py
|              +--_utils (folder with functions shared between tests)
|                  +--__init__.py
|                  +--....
|              +--test_pval.py
|              +--....
|          +--pval.py
|          +--....

lionelkusch · 2024-12-20T10:07:37Z

It will be impossible to discuss this very detailed vision in full.
To summarise:

doc_conf (contains all the documentation)
examples (contains all the examples)
hidimstat/models (contains all the models)
hidimstat/estimators (contains all the estimators)
hidimstat/extra (contains functions for the generation of the dataset and statistics methods (p-value computation, computation of type I and type II errors, ...))

lionelkusch · 2024-12-20T10:14:50Z

The doc_conf is composed of:

configuration files and makefile
additional documentation pages

For the moment, there is no specific organisation for the non-autogenerated documentation pages.

lionelkusch · 2024-12-20T10:19:42Z

The examples folder is composed of 5 folders:

figures: stores the figures generated for the documentation and the figures that help to understand the examples
get_started: very basic usage of the library
models: examples of how to use the models, with some explanation of them
estimators: examples of how to use the estimators, with some explanation of them
comparison_models: examples comparing the models, and benchmarks

lionelkusch · 2024-12-20T10:26:01Z

hidimstat is composed of 3 sub-packages:

estimators
models
extra/toy_data (functions for the generation of toy_dataset)
extra/stat_tools (functions for statistical tools)

Each sub-package is composed of a _utils folder (functions shared within the sub-package), a tests folder (for the tests) and the functions themselves.
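
To make the sub-package layout concrete, here is a minimal sketch that scaffolds one sub-package in a temporary directory. The helper name `scaffold_subpackage` is hypothetical and the layout is simplified (the `tests/_utils` level of the tree above is left out); it is only an illustration of the structure, not a tool from the project.

```python
import tempfile
from pathlib import Path


def scaffold_subpackage(root: Path, name: str, modules: list[str]) -> list[str]:
    """Create <root>/<name> with _utils/, tests/ and one file per module.

    Hypothetical helper: returns the created files as sorted POSIX paths
    relative to root.
    """
    package = root / name
    files = [
        package / "__init__.py",
        package / "_utils" / "__init__.py",  # private helpers of the sub-package
        package / "tests" / "__init__.py",   # tests live next to the code here
    ]
    for module in modules:
        files.append(package / f"{module}.py")
        files.append(package / "tests" / f"test_{module}.py")
    for file in files:
        file.parent.mkdir(parents=True, exist_ok=True)
        file.touch()
    return sorted(f.relative_to(root).as_posix() for f in files)


with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_subpackage(Path(tmp), "models", ["ada_svr", "permutation_test"])
    for path in created:
        print(path)
```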

lionelkusch · 2024-12-20T10:28:06Z

@bthirion @Remi-Gau @jpaillard @man-shu
If you have any suggestions on how to organize the files, I would love to hear from you.

Remi-Gau · 2024-12-20T10:58:44Z

quick things:

adding __init__.py and utils in examples: won't that make the examples hard to run for users?
since the library is not huge (yet) consider moving to a source layout:
- https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/
- that usually goes along with putting the tests outside of the source code: https://docs.pytest.org/en/stable/explanation/goodpractices.html#tests-outside-application-code

Remi-Gau · 2024-12-20T11:02:40Z

See also: https://learn.scientific-python.org/development/guides/pytest/#tests-should-test-the-installed-version-not-the-local-version
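
A small sketch of why the source layout helps with testing the installed version: with a flat layout the package is found as soon as the repository root is on the import path, while with a src layout it is not. The package name `demo_pkg` and both helper functions are made up for the illustration; only the standard library is used.

```python
import importlib
import tempfile
from importlib.machinery import PathFinder
from pathlib import Path


def make_project(root: Path, layout: str, package: str) -> None:
    """Create an empty package under root, in a flat or src layout."""
    base = root / "src" if layout == "src" else root
    (base / package).mkdir(parents=True)
    (base / package / "__init__.py").touch()


def importable_from_root(root: Path, package: str) -> bool:
    """Return True if `package` is found when only `root` is searched."""
    importlib.invalidate_caches()
    return PathFinder.find_spec(package, [str(root)]) is not None


with tempfile.TemporaryDirectory() as tmp:
    flat_root = Path(tmp) / "flat"
    src_root = Path(tmp) / "src_layout"
    make_project(flat_root, "flat", "demo_pkg")
    make_project(src_root, "src", "demo_pkg")
    # Simulates running Python (or pytest) from the repository root.
    flat_found = importable_from_root(flat_root, "demo_pkg")
    src_found = importable_from_root(src_root, "demo_pkg")

print("flat layout importable from repo root:", flat_found)
print("src layout importable from repo root:", src_found)
```

With the src layout the local copy is not picked up by accident, so the tests have to run against whatever was installed.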

jpaillard · 2024-12-20T11:24:46Z

I have a few suggestions:

The names 'models' and 'estimator' can be confusing since they are sometimes used interchangeably. A more explicit name like 'stat_models' and 'prediction_models' could be more transparent.
For the toy datasets, having a datasets folder would be clearer; it's organized this way in scikit-learn and Nilearn.
That could avoid having the 'extra' folder, whose purpose is not so obvious. Instead, we would have 'datasets' and 'stat_tools'.

lionelkusch · 2024-12-20T11:29:38Z

The toy_dataset won't contain data; it will contain only functions for generating data, in my view.
I don't think it's a good idea to have a dataset for the moment.

man-shu · 2024-12-20T11:55:12Z

My suggestions:

I would keep the names of submodules short (without _). I agree that names models and estimators are confusing. I could suggest alternative names for models and estimators, but I don't understand the difference between the two in the context of the package. Maybe you could point me to some code?
I think it would be good to keep toy_datasets and stat_tools separate as they don't seem to be related and also rename to datasets and stats. Correct me if I am wrong, but I don't think you would provide actual datasets with the package, they would still be the code to fetch some real datasets, so datasets seems appropriate to me.

Remi-Gau · 2024-12-20T13:12:54Z

> I would keep the names of submodules short (without _)

except for the _utils folder, which makes sense when you want to keep things private to a subpackage

2 tools I have used in other projects:

https://pypi.org/project/flake8-private-name-import helps make sure that you don't import private things where they should not be imported
https://import-linter.readthedocs.io/en/stable/readme.html#overview lets you keep a strict architecture in your package by establishing rules on what each module can import from; it can help establish which modules are more 'low level' than others
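
To give an idea of the kind of check these tools automate, here is a rough standard-library sketch that flags imports going through an underscore-prefixed name. It is not how either tool is implemented, and the imported name `helper` in the sample source is hypothetical.

```python
import ast


def find_private_imports(source: str) -> list[str]:
    """Return imported dotted names that contain a private (underscore-prefixed) part.

    Dunder names such as __future__ are not treated as private.
    """
    def is_private(dotted: str) -> bool:
        return any(
            part.startswith("_") and not (part.startswith("__") and part.endswith("__"))
            for part in dotted.split(".")
        )

    offenders = []
    # The source is only parsed, never executed or imported.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            names = [f"{module}.{alias.name}".lstrip(".") for alias in node.names]
        else:
            continue
        offenders.extend(name for name in names if is_private(name))
    return offenders


# Sample source; `helper` is a made-up name used only for illustration.
example = """
from __future__ import annotations
import numpy
from hidimstat.models._utils import helper
from hidimstat.models import ada_svr
import hidimstat.estimator._utils
"""
print(find_private_imports(example))
```

A real setup would also need to allow a subpackage to import from its own _utils, which is the kind of rule the tools above let you express.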

bthirion · 2024-12-20T21:32:01Z

> The examples folder is composed of 5 folders:
>
> * figures: stores the figures generated for the documentation and the figures that help to understand the examples
> * get_started: very basic usage of the library
> * models: examples of how to use the models, with some explanation of them
> * estimators: examples of how to use the estimators, with some explanation of them
> * comparison_models: examples comparing the models, and benchmarks

I think I'd like to start with something simpler while we have very few examples, and reorganize a posteriori depending on the examples we have.

bthirion · 2024-12-20T21:38:00Z

> hidimstat is composed of 3 sub-packages:
>
> * estimators
> * models
> * extra/toy_data (functions for the generation of toy_dataset)
> * extra/stat_tools (functions for statistical tools)
>
> Each sub-package is composed of a _utils folder (functions shared within the sub-package), a tests folder (for the tests) and the functions themselves.

I would not have too many levels. extra/toy_data and extra/stat_tools should rather be something like a utils module (extra is not a good name, I'm afraid).

bthirion · 2024-12-20T21:42:14Z

> The toy_dataset won't contain data; it will contain only functions for generating data, in my view. I don't think it's a good idea to have a dataset for the moment.

Why?

bthirion · 2024-12-20T21:43:26Z

> It will be impossible to discuss this very detailed vision in full. To summarise:
>
> * doc_conf (contains all the documentation)
> * examples (contains all the examples)
> * hidimstat/models (contains all the models)
> * hidimstat/estimators (contains all the estimators)
> * hidimstat/extra (contains functions for the generation of the dataset and statistics methods (p-value computation, computation of type I and type II errors, ...))

Thanks for bringing up this discussion!

lionelkusch · 2024-12-23T09:36:18Z

> > The toy_dataset won't contain data; it will contain only functions for generating data, in my view. I don't think it's a good idea to have a dataset for the moment.
>
> Why?

At the moment, we don't have any specific dataset whose data need to be stored in the library.
The datasets used in the examples and tests are either generated or come from other libraries (mne, nilearn or scikit-learn).
For the datasets from other libraries, functions to fetch the data already exist, and I want to prioritise datasets from scikit-learn to avoid depending on other libraries. We only need a function that generates datasets from a random generator.
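
As a rough idea of what such a generation function could look like, here is a sketch of a seeded sparse linear toy dataset. The name `make_toy_regression` and its parameters are hypothetical, and it uses only the standard `random` module to stay self-contained; an actual implementation would more likely rely on NumPy.

```python
import random


def make_toy_regression(n_samples, n_features, n_informative, noise=1.0, seed=0):
    """Generate (X, y, support) for a sparse linear model y = X @ beta + noise.

    Hypothetical helper for illustration, not the hidimstat API.
    """
    rng = random.Random(seed)  # dedicated generator: same seed, same data
    # Indices of the features that really influence y.
    support = sorted(rng.sample(range(n_features), n_informative))
    beta = [1.0 if j in support else 0.0 for j in range(n_features)]
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [
        sum(x_j * b_j for x_j, b_j in zip(row, beta)) + rng.gauss(0.0, noise)
        for row in X
    ]
    return X, y, support


X, y, support = make_toy_regression(n_samples=5, n_features=10, n_informative=3, seed=42)
print(len(X), len(X[0]), len(y), support)
```

Returning the support alongside the data means a test or an example can compare the variables a method selects with the ones that were actually used to build y.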

lionelkusch · 2024-12-23T09:40:24Z

> > hidimstat is composed of 3 sub-packages:
> >
> > * estimators
> > * models
> > * extra/toy_data (functions for the generation of toy_dataset)
> > * extra/stat_tools (functions for statistical tools)
> >
> > Each sub-package is composed of a _utils folder (functions shared within the sub-package), a tests folder (for the tests) and the functions themselves.
>
> I would not have too many levels. extra/toy_data and extra/stat_tools should rather be something like a utils module (extra is not a good name, I'm afraid).

I didn't put them in _utils because these functions are required by the examples and the tests. They shouldn't be private functions. We need to distinguish between public side functions and private ones.

lionelkusch · 2024-12-23T09:51:45Z

> I think I'd like to start with something simpler while we have very few examples, and reorganize a posteriori depending on the examples we have.

In my opinion, there are missing examples; that's why I want to add them.

The examples will be there to answer 2 questions for users:

Which methods to use?
What are the methods?

I propose this because I am not aware of a book or a review that lists the different methods and their domains of application. The only exception is the introduction of Ahmad's thesis, but I don't think it's the best format for popularising this information.

Remi-Gau · 2024-12-23T09:56:31Z

random thought (feel free to ignore): it may be easier to have some rules or guidelines that the project should follow regarding folder structure, and try to slowly implement them, rather than trying to find the 'right' structure.

Obviously easier said than done.

lionelkusch · 2024-12-23T10:07:21Z

I don't plan a brutal refactoring of the project. It's more about having a direction to move in.
I plan to refactor the models one by one and change the structure little by little at the same time.

bthirion · 2024-12-23T15:46:30Z

> > > The toy_dataset won't contain data; it will contain only functions for generating data, in my view. I don't think it's a good idea to have a dataset for the moment.
> >
> > Why?
>
> At the moment, we don't have any specific dataset whose data need to be stored in the library. The datasets used in the examples and tests are either generated or come from other libraries (mne, nilearn or scikit-learn). For the datasets from other libraries, functions to fetch the data already exist, and I want to prioritise datasets from scikit-learn to avoid depending on other libraries. We only need a function that generates datasets from a random generator.

But I'd like to reuse, as much as possible, public datasets, because they are known to users. Generating data means that you "invent" (at least come up with) the problem together with the solution, which is not great. I'd really like to confine generated data to situations where there is no other possibility.

bthirion · 2024-12-23T15:48:42Z

> > I think I'd like to start with something simpler while we have very few examples, and reorganize a posteriori depending on the examples we have.
>
> In my opinion, there are missing examples; that's why I want to add them.
>
> The examples will be there to answer 2 questions for users:
>
> * Which methods to use?
> * What are the methods?
>
> I propose this because I am not aware of a book or a review that lists the different methods and their domains of application. The only exception is the introduction of Ahmad's thesis, but I don't think it's the best format for popularising this information.

We should start with https://christophm.github.io/interpretable-ml-book/ and https://shap.readthedocs.io

lionelkusch · 2024-12-24T09:01:58Z

I have split the different discussions into separate issues to go into more detail:

If I missed a point or you have a new one, you can open an issue or add a comment here.

lionelkusch · 2024-12-26T15:08:21Z

Based on issue #93, there won't be a separate folder for "side functions".

lionelkusch added the "management of project" (question regarding the policy of the project) and "coding style" (question regarding formatting and declaration of functions) labels Dec 20, 2024

lionelkusch mentioned this issue Dec 20, 2024

ADA-SVR (2/4) add comments and documentation of the functions and test #73 (Open)

This was referenced Dec 24, 2024

Source and Tests layout #90 (Closed)
Name of subpackages #91 (Open)
Dataset folder #92 (Open)
Separate folder for side function? #93 (Closed)

lionelkusch mentioned this issue Dec 24, 2024

Name _utils #94 (Open)

lionelkusch added the "file organisation" (the organisation of the different files) label Dec 24, 2024

lionelkusch mentioned this issue Dec 31, 2024

Examples organisation #95 (Open)
