Folder organisations #86

Open
lionelkusch opened this issue Dec 20, 2024 · 25 comments
Labels
coding style question regarding formatting and declaration of functions file organisation the organisation of the different files management of project question regarding the policy of the project

Comments

@lionelkusch
Collaborator

Following PR #73, we need a shared vision for the organization of the library.

@lionelkusch lionelkusch added management of project question regarding the policy of the project coding style question regarding formatting and declaration of functions labels Dec 20, 2024
@lionelkusch
Collaborator Author

lionelkusch commented Dec 20, 2024

My detailed vision of it is:

.
+--ReadMe.txt
+--pyproject.toml
+--LICENCE
+--codecov.yml
+--.gitignore
+--doc_conf 
|  +--conf.py (configuration file)
|  +--Makefile (make file for creating the library)
|  +--documentation (folder for additional documentation)
|       +--api.rst (index of the API)
|       +--index.rst (home page of the documentation)
|       +--... (specific pages such as how to contribute, ....)
|       +--references.bib (bibliography reference)
|  +--docs (folder where to build the documentation)
+--examples (folder with examples)
|  +--__init__.py
|  +--figures (folder to save all the generated figures and additional figures)
|      +--generates (folder which contains the figures generated by the examples)
|         +--get_started
|              +--find_importance_variable_1.png
|              +--...
|         +--models
|              +--ada_svr_1.png (illustrate the data of the toy_dataset)
|              +--ada_svr_2.png (illustrate the results on the toy_dataset)
|              +--...
|         +--estimators 
|              +--dnn_learner_1.png 
|              +--...
|         +--comparison_models
|              +--geometric_2D_1.png
|              +--benchmarks_1D_dataset_1.png
|              +--...
|      +--external (folder which contains figures to illustrate the examples)
|         +--get_started
|              +--...
|         +--models
|              +--ada_svr_1.py (figure 1 of Gaonkar et al. 2012)
|              +--...
|         +--comparison_models
|              +--...
|  +--_utils (folder with the function for examples)
|      +--__init__.py
|      +--plot_function.py (functions for plotting datasets or results)
|      +--....
|  +--get_started (folder for having tutorials or basic examples of how to use Hidimstat)
|      +--find_importance_variable.py (apply one generic method to a dataset)
|      +--....
|  +--models (folder which shows how to use each model on a toy dataset,
                        followed by a description of the method
                        and of its advantages, disadvantages and assumptions)
|      +--ada_svr.py 
|      +--permutation_test.py
|      +--.....
|  +--estimator (folder which shows how to use each estimator on a toy dataset,
                        followed by a description of the estimator
                        and of its advantages, disadvantages and assumptions)
|      +--dnn_learner.py 
|      +--.....
|  +--comparison_models (folder where the models are compared
                                               on toy datasets with a specific characteristic
                                               (linear/non-linear, geometric, ...)
                                               and benchmarked)
|      +--geometric_2D.py
|      +--....
|      +--benchmarks (study the speed of each model)
|          +--1D_dataset.py
|          +--scalability.py
|          +--performance p~n.py
|          +--....
+--hidimstat (folder with code)
|  +--__init__.py
|  +--models (folder with models for estimating variable importance)
|      +--__init__.py
|      +--_utils (folder with functions shared between functions of the sub-packages)
|          +--__init__.py
|          +--scikit-learn_estimator.py (generic function for using the estimator API)
|          +--....
|      +--tests (folder for testing the models)
|          +--__init__.py
|          +--_utils (folder with functions shared between tests)
|              +--__init__.py
|              +--....
|          +--test_ada_svr.py
|          +--test_permutation_test.py
|          +--....
|      +--ada_svr.py
|      +--permutation_test.py
|      +--....
|  +--estimator (folder with specific estimators)
|      +--__init__.py
|      +--_utils (folder with functions shared between functions of the sub-packages)
|          +--__init__.py
|          +--....
|      +--tests (folder for testing the estimators)
|          +--__init__.py
|          +--_utils (folder with functions shared between tests)
|              +--__init__.py
|              +--....
|          +--test_dnn_learner.py
|          +--....
|      +--dnn_learner.py
|      +--....
|  +--extra (folder for generation of toy_dataset and statistics methods)
|      +--__init__.py
|      +--toy_data
|          +--__init__.py
|          +--_utils (folder with functions shared between functions of the subsub-packages)
|              +--__init__.py
|              +--....
|          +--tests (folder for testing the generated function)
|              +--__init__.py
|              +--_utils (folder with functions shared between tests)
|                  +--__init__.py
|                  +--....
|              +--test_1d_dataset.py
|              +--....
|          +--1d_dataset.py
|          +--....
|      +--stat_tools (the folder contains methods for calculating pvalue and calculating error rate)
|          +--__init__.py
|          +--_utils (folder with functions shared between functions of the subsub-packages)
|              +--__init__.py
|              +--....
|          +--tests (folder for testing the generated function)
|              +--__init__.py
|              +--_utils (folder with functions shared between tests)
|                  +--__init__.py
|                  +--....
|              +--test_pval.py
|              +--....
|          +--pval.py
|          +--....

@lionelkusch
Collaborator Author

lionelkusch commented Dec 20, 2024

It would be impractical to discuss this vision in such detail.
To summarise:

  • doc_conf (contains all the documentation)
  • examples (contains all the examples)
  • hidimstat/models (contains all the models)
  • hidimstat/estimators (contains all the estimators)
  • hidimstat/extra (contains functions for dataset generation and statistical methods: p-value computation, type I and type II error rates, ...)
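
As a minimal sketch of this summary (not a script that exists in the repository), the top-level layout could be created in a scratch directory as follows. The `create_skeleton` helper and the `LAYOUT` mapping are hypothetical names used only for illustration, and the choice to add `__init__.py` only under `hidimstat` is an assumption of this sketch.

```python
import tempfile
from pathlib import Path

# Folder names come from the summary above; the notes are paraphrased.
LAYOUT = {
    "doc_conf": "documentation",
    "examples": "examples",
    "hidimstat/models": "models",
    "hidimstat/estimators": "estimators",
    "hidimstat/extra": "dataset generation and statistical methods",
}


def create_skeleton(root: Path) -> list[Path]:
    """Create the summarized folders under `root` and return them."""
    created = []
    for relative in LAYOUT:
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)
        # Assumption of this sketch: only folders under hidimstat are packages.
        if relative.startswith("hidimstat"):
            (root / "hidimstat" / "__init__.py").touch()
            (folder / "__init__.py").touch()
        created.append(folder)
    return created


# Build the skeleton in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    folders = create_skeleton(Path(tmp))
    relative_names = sorted(f.relative_to(tmp).as_posix() for f in folders)
    has_init = (Path(tmp) / "hidimstat" / "models" / "__init__.py").exists()

print(relative_names)
```

Running it against a temporary directory keeps the experiment separate from the real repository, which fits the idea of agreeing on a direction before moving any file.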

@lionelkusch
Collaborator Author

The doc_conf is composed of:

  • configuration files and makefile
  • additional documentation pages

For the moment, there is no specific organisation for the documentation pages that are not auto-generated.

@lionelkusch
Collaborator Author

lionelkusch commented Dec 20, 2024

The examples folder is composed of 5 folders:

  • figures: holds the figures generated for the documentation and stores additional figures that help explain the examples
  • get_started: very basic usage of the library
  • models: examples of how to use the models, with some explanation of each
  • estimators: examples of how to use the estimators, with some explanation of each
  • comparison_models: examples comparing models, plus benchmarks

@lionelkusch
Collaborator Author

hidimstat is composed of 3 sub-packages (estimators, models and extra; extra is split into toy_data and stat_tools):

  • estimators
  • models
  • extra/toy_data (function for the generation of toy_dataset)
  • extra/stat_tools (functions for statistical tools)

Each sub-package is composed of a _utils folder (functions shared within the sub-package), a tests folder (for the tests) and the functions themselves.
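
To make this convention concrete, here is a small hedged sketch of a checker that reports which of the conventional pieces are absent from a sub-package. The function name `missing_parts` and the constant `REQUIRED_FOLDERS` are hypothetical; the folder names `_utils` and `tests` follow the detailed tree above.

```python
import tempfile
from pathlib import Path

# Conventional pieces of a sub-package, as named in the detailed tree.
REQUIRED_FOLDERS = ("_utils", "tests")


def missing_parts(subpackage: Path) -> list[str]:
    """Return the conventional pieces that are absent from a sub-package."""
    missing = [
        name for name in REQUIRED_FOLDERS if not (subpackage / name).is_dir()
    ]
    if not (subpackage / "__init__.py").is_file():
        missing.append("__init__.py")
    return missing


# Demo: a sub-package with _utils and __init__.py but no tests folder.
with tempfile.TemporaryDirectory() as tmp:
    models = Path(tmp) / "hidimstat" / "models"
    (models / "_utils").mkdir(parents=True)
    (models / "__init__.py").touch()
    report = missing_parts(models)

print(report)
```

A check like this could run while the structure is changed little by little, so that partially migrated sub-packages are easy to spot.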

@lionelkusch
Collaborator Author

lionelkusch commented Dec 20, 2024

@bthirion @Remi-Gau @jpaillard @man-shu
If you have any suggestions on how to organize the files, I would love to hear from you.

@Remi-Gau
Collaborator

quick things:

@jpaillard
Collaborator

I have a few suggestions:

  • The names 'models' and 'estimator' can be confusing since they are sometimes used interchangeably. A more explicit name like 'stat_models' and 'prediction_models' could be more transparent.
  • For the toy datasets, having a datasets folder would be clearer; it's organized this way in Sklearn and Nilearn
  • That could avoid having the 'extra' folder, which is not so straightforward. Instead, we have 'datasets' and 'stat_tools.'

@lionelkusch
Collaborator Author

The toy_dataset won't contain data; it will contain only functions for generating data, in my view.
I don't think it's a good idea to have a dataset for the moment.

@man-shu

man-shu commented Dec 20, 2024

My suggestions:

  • I would keep the names of submodules short (without _). I agree that the names models and estimators are confusing. I could suggest alternative names for them, but I don't understand the difference between the two in the context of the package. Maybe you could point me to some code?
  • I think it would be good to keep toy_datasets and stat_tools separate as they don't seem to be related and also rename to datasets and stats. Correct me if I am wrong, but I don't think you would provide actual datasets with the package, they would still be the code to fetch some real datasets, so datasets seems appropriate to me.

@Remi-Gau
Collaborator

  • I would keep the names of submodules short (without _)

except for the _utils folder, which makes sense when you want to keep things private to a subpackage

2 tools I have used in other projects:

@bthirion
Contributor

> The examples folder is composed of 5 folders:
>
> * figures: holds the figures generated for the documentation and stores additional figures that help explain the examples
> * get_started: very basic usage of the library
> * models: examples of how to use the models, with some explanation of each
> * estimators: examples of how to use the estimators, with some explanation of each
> * comparison_models: examples comparing models, plus benchmarks

I think I'd like to start with something simpler while we have very few examples, and reorganize a posteriori depending on the examples we have.

@bthirion
Contributor

> hidimstat is composed of 3 sub-packages (estimators, models and extra; extra is split into toy_data and stat_tools):
>
> * estimators
> * models
> * extra/toy_data (function for the generation of toy_dataset)
> * extra/stat_tools (functions for statistical tools)
>
> Each sub-package is composed of a _utils folder (functions shared within the sub-package), a tests folder (for the tests) and the functions themselves.

I would not have too many levels. extra/toy_data and extra/stat_tools should rather be something like a utils module. (extra is not a good name I'm afraid).

@bthirion
Contributor

> The toy_dataset won't contain data; it will contain only functions for generating data, in my view. I don't think it's a good idea to have a dataset for the moment.

Why?

@bthirion
Contributor

> It would be impractical to discuss this vision in such detail. To summarise:
>
> * doc_conf (contains all the documentation)
> * examples (contains all the examples)
> * hidimstat/models (contains all the models)
> * hidimstat/estimators (contains all the estimators)
> * hidimstat/extra (contains functions for dataset generation and statistical methods: p-value computation, type I and type II error rates, ...)

Thx for bringing up this discussion!

@lionelkusch
Collaborator Author

> > The toy_dataset won't contain data; it will contain only functions for generating data, in my view. I don't think it's a good idea to have a dataset for the moment.
>
> Why?

At the moment, we don't have any specific dataset whose data need to be stored in the package.
The datasets used in examples or tests are either generated or come from other libraries (mne, nilearn or scikit-learn).
For the datasets from other libraries, functions to get the data already exist, and I want to prioritise datasets from scikit-learn to avoid depending on other libraries. We only need functions that generate datasets from a random generator.
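
As an illustration of what such a generating function could look like, here is a minimal sketch. It is not the hidimstat implementation: the name `make_toy_regression` and its parameters are hypothetical, and it uses only the standard library `random` module so that it runs without dependencies (real code would more likely rely on NumPy).

```python
import random


def make_toy_regression(n_samples, n_features, n_informative, noise=0.1, seed=0):
    """Generate a linear toy dataset where only the first features matter.

    Hypothetical helper: the target is the sum of the informative features
    plus Gaussian noise, so the true important variables are known.
    """
    if not 0 <= n_informative <= n_features:
        raise ValueError("n_informative must be between 0 and n_features")
    rng = random.Random(seed)  # seeded generator for reproducibility
    # Ground truth: weight 1 for informative features, 0 for the others.
    weights = [1.0] * n_informative + [0.0] * (n_features - n_informative)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [
        sum(w * x for w, x in zip(weights, row)) + rng.gauss(0.0, noise)
        for row in X
    ]
    return X, y, weights


X, y, weights = make_toy_regression(n_samples=5, n_features=4, n_informative=2, seed=42)
# The same seed gives the same data, which keeps tests and examples stable.
X_again, y_again, _ = make_toy_regression(5, 4, 2, seed=42)
```

Returning the ground-truth weights alongside the data is a design choice of this sketch: it lets examples and tests compare the estimated importance against the known important variables.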

@lionelkusch
Collaborator Author

> > hidimstat is composed of 3 sub-packages (estimators, models and extra; extra is split into toy_data and stat_tools):
> >
> > * estimators
> > * models
> > * extra/toy_data (function for the generation of toy_dataset)
> > * extra/stat_tools (functions for statistical tools)
> >
> > Each sub-package is composed of a _utils folder (functions shared within the sub-package), a tests folder (for the tests) and the functions themselves.
>
> I would not have too many levels. extra/toy_data and extra/stat_tools should rather be something like a utils module. (extra is not a good name I'm afraid).

I didn't add them to _utils because these functions are required by the examples and the tests, so they shouldn't be private. We need to distinguish between public side functions and private side functions.
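
A hedged sketch of this public/private distinction, using Python's usual leading-underscore convention: a throwaway package is written to a temporary directory, with a private helper in `_utils.py` and a public function re-exported in `__init__.py`. The names `demo_pkg`, `_check_positive` and `make_constant_data` are hypothetical and only serve the illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Write a tiny package with one private helper and one public function."""
    pkg = root / "demo_pkg"
    pkg.mkdir()
    # Private side function: lives in _utils and is not re-exported.
    (pkg / "_utils.py").write_text(
        "def _check_positive(n):\n"
        "    if n <= 0:\n"
        "        raise ValueError('n must be positive')\n"
        "    return n\n"
    )
    # Public side function: usable from examples and tests.
    (pkg / "toy_data.py").write_text(
        "from ._utils import _check_positive\n\n"
        "def make_constant_data(n):\n"
        "    return [0.0] * _check_positive(n)\n"
    )
    # The package __init__ declares what belongs to the public API.
    (pkg / "__init__.py").write_text(
        "from .toy_data import make_constant_data\n\n"
        "__all__ = ['make_constant_data']\n"
    )


with tempfile.TemporaryDirectory() as tmp:
    build_demo_package(Path(tmp))
    sys.path.insert(0, tmp)
    try:
        demo_pkg = importlib.import_module("demo_pkg")
        public_api = list(demo_pkg.__all__)
        data = demo_pkg.make_constant_data(3)
    finally:
        sys.path.remove(tmp)

print(public_api, data)
```

Examples and tests would import only what `__all__` lists, while the helper in `_utils` can change without breaking users.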

@lionelkusch
Collaborator Author

> I think I'd like to start with something simpler while we have very few examples, and reorganize a posteriori depending on the examples we have.

In my opinion, some examples are missing; that's why I want to add them.

The examples are there to answer 2 questions for users:

  • Which methods to use?
  • What are the methods?

I propose this because I am not aware of a book or a review which lists the different methods and their domains of application. The only exception is the introduction of Ahmad's thesis, but I don't think it's the best format for popularising this information.

@Remi-Gau
Collaborator

random thought (feel free to ignore): it may be easier to agree on some rules or guidelines that the project should follow regarding folder structure and try to implement them slowly, rather than trying to find the 'right' structure.

Obviously easier said than done.

@lionelkusch
Collaborator Author

I don't plan a drastic refactoring of the project. The point is rather to agree on a direction to move in.
I plan to refactor one model at a time and change the structure little by little along the way.

@bthirion
Contributor

> > > The toy_dataset won't contain data; it will contain only functions for generating data, in my view. I don't think it's a good idea to have a dataset for the moment.
> >
> > Why?
>
> At the moment, we don't have any specific dataset whose data need to be stored in the package. The datasets used in examples or tests are either generated or come from other libraries (mne, nilearn or scikit-learn). For the datasets from other libraries, functions to get the data already exist, and I want to prioritise datasets from scikit-learn to avoid depending on other libraries. We only need functions that generate datasets from a random generator.

But I'd like to reuse, as much as possible, public datasets, because they are known to users. Generating data means that you "invent" (at least come up with) the problem together with the solution, which is not great. I'd really like to confine generated data to situations where there is no other possibility.

@bthirion
Contributor

> > I think I'd like to start with something simpler while we have very few examples, and reorganize a posteriori depending on the examples we have.
>
> In my opinion, some examples are missing; that's why I want to add them.
>
> The examples are there to answer 2 questions for users:
>
> * Which methods to use?
> * What are the methods?
>
> I propose this because I am not aware of a book or a review which lists the different methods and their domains of application. The only exception is the introduction of Ahmad's thesis, but I don't think it's the best format for popularising this information.

We should start with https://christophm.github.io/interpretable-ml-book/ and https://shap.readthedocs.io

@lionelkusch
Collaborator Author

I have split the different discussions into separate issues to go into more detail:

If I missed a point or you have a new one, you can open an issue or add a comment here.

@lionelkusch lionelkusch added the file organisation the organisation of the different files label Dec 24, 2024
@lionelkusch
Collaborator Author

Based on issue #93, there won't be a separate folder for "side functions".

5 participants