Dataset folder #92

Open
lionelkusch opened this issue Dec 24, 2024 · 7 comments
Labels
dataset, file organisation (the organisation of the different files)

Comments

@lionelkusch
Collaborator

lionelkusch commented Dec 24, 2024

Based on the issue #86 :

For the toy datasets, having a datasets folder would be clearer; it's organized this way in Sklearn and Nilearn

Originally posted by @jpaillard in #86 (comment)

@lionelkusch
Collaborator Author

lionelkusch commented Dec 24, 2024

The toy_dataset won't contain data; it will contain only functions for generating data, in my view.
I don't think it's a good idea to have a dataset for the moment.

Originally posted by @lionelkusch in #86 (comment)

@lionelkusch
Collaborator Author

lionelkusch commented Dec 24, 2024

Correct me if I am wrong, but I don't think you would provide actual datasets with the package; it would still be code to fetch some real datasets, so datasets seems appropriate to me.

Originally posted by @man-shu in #86 (comment)

@lionelkusch
Collaborator Author

lionelkusch commented Dec 24, 2024

The toy_dataset won't contain data; it will contain only functions for generating data, in my view. I don't think it's a good idea to have a dataset for the moment.

Why ?

Originally posted by @bthirion in #86 (comment)

@lionelkusch
Collaborator Author

lionelkusch commented Dec 24, 2024

The toy_dataset won't contain data; it will contain only functions for generating data, in my view. I don't think it's a good idea to have a dataset for the moment.

Why ?

We don't currently have a specific dataset that needs to be stored in the package. The datasets used in examples or tests are either generated or come from other libraries (mne, nilearn or scikit-learn). For the datasets from other libraries, functions to fetch the data already exist, and I want to prioritise datasets from scikit-learn only, to avoid depending on other libraries. We only need a function that generates a dataset from a random generator.

But I'd like to reuse, as much as possible, public datasets, because they are known to users. Generating data means that you "invent" (at least come up with) the problem together with the solution, which is not great. I'd really like to confine generated data to situations where there is no other possibility.

Originally posted by @lionelkusch in #86 (comment)

@lionelkusch added the "dataset" and "file organisation" (the organisation of the different files) labels on Dec 24, 2024
@lionelkusch
Collaborator Author

The advantage of creating toy datasets is that you have total control over the noise and statistical properties.
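As a rough illustration of that control (a minimal sketch, not code from the project): the function name `make_toy_regression` and all of its parameters are hypothetical. It only assumes NumPy and shows the noise level, the number of informative features and the seed being set explicitly.

```python
import numpy as np


def make_toy_regression(n_samples=200, n_features=10, n_informative=3,
                        noise_std=1.0, seed=0):
    """Hypothetical toy generator: linear signal plus Gaussian noise.

    Only the first `n_informative` features carry signal, so the ground
    truth (`beta`) is known exactly and can be compared with the output
    of an inference method.
    """
    rng = np.random.default_rng(seed)            # seeded random generator
    X = rng.standard_normal((n_samples, n_features))
    beta = np.zeros(n_features)
    beta[:n_informative] = 1.0                   # known support
    y = X @ beta + noise_std * rng.standard_normal(n_samples)
    return X, y, beta


X, y, beta = make_toy_regression(noise_std=0.5, seed=42)
print(X.shape, y.shape, int(beta.sum()))
```

Changing `noise_std` or `n_informative` changes the difficulty of the problem while the true support stays known, which is what a real dataset cannot offer.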

Additionally, I recently looked for a dataset and found few suggestions.

In light of this experience, I can think of only two options for obtaining a dataset:
- Using the datasets offered by sklearn. Its dataset functions can replace many of the library's own data generation functions and also provide real datasets.
- Using the datasets from Molnar's book. This solution requires storing the datasets and providing access functions.
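For the first option, a minimal sketch of what relying on scikit-learn alone could look like. The choice of `load_diabetes` and the parameter values are only illustrative assumptions, not a proposal for specific datasets:

```python
from sklearn.datasets import load_diabetes, make_regression

# Real dataset bundled with scikit-learn (no download, no extra dependency).
X_real, y_real = load_diabetes(return_X_y=True)

# Synthetic dataset from scikit-learn's own generator; with coef=True the
# ground-truth coefficients are returned as well.
X_sim, y_sim, coef = make_regression(
    n_samples=200,
    n_features=10,
    n_informative=3,
    noise=1.0,
    coef=True,
    random_state=0,
)

print(X_real.shape, X_sim.shape)
```

Both calls return plain NumPy arrays, so examples and tests could switch between real and generated data without a dedicated storage mechanism.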

@bthirion @jpaillard @AngelReyero Which option would you prefer?

@bthirion
Contributor

I don't think that we'll be happy with a single dataset. A given dataset can be good for showcasing some effects, but not others.
I'd start with the simplest approach and check locally whether the examples render better when run on alternative datasets. If we become convinced that this is worth the cost, we add a dataset.
Ultimately, we will also consider generated datasets, but only if there is no other reasonable possibility. The core point is that the reader/user has to understand the logic of the data generation, which is often tricky. In the end, you are likely to discourage most users by doing that.

@lionelkusch
Collaborator Author

Do you have a dataset in mind?
