Dataset folder #92

lionelkusch · 2024-12-24T08:49:42Z

Based on issue #86:

For the toy datasets, having a datasets folder would be clearer; it's organized this way in Sklearn and Nilearn

Originally posted by @jpaillard in #86 (comment)

lionelkusch · 2024-12-24T08:51:12Z

The toy_dataset won't contain data; it will contain only functions for generating data, in my view.
I don't think it's a good idea to have a dataset for the moment.

Originally posted by @lionelkusch in #86 (comment)

lionelkusch · 2024-12-24T08:53:24Z

Correct me if I am wrong, but I don't think you would ship actual datasets with the package; it would still be code to fetch real datasets, so datasets seems appropriate to me.

Originally posted by @man-shu in #86 (comment)

lionelkusch · 2024-12-24T08:56:36Z

The toy_dataset won't contain data; it will contain only functions for generating data, in my view. I don't think it's a good idea to have a dataset for the moment.

Why?

Originally posted by @bthirion in #86 (comment)

lionelkusch · 2024-12-24T08:58:54Z

The toy_dataset won't contain data; it will contain only functions for generating data, in my view. I don't think it's a good idea to have a dataset for the moment.

Why?

At the moment, we don't have any specific dataset whose data needs to be stored in the package. The datasets used in examples or tests are either generated or come from other libraries (mne, nilearn or scikit-learn). For the datasets from other libraries, functions to get the data already exist, and I want to prioritise datasets from scikit-learn only, to avoid depending on other libraries. We only need a function that generates datasets from a random generator.

But I'd like to reuse, as much as possible, public datasets, because they are known to users. Generating data means that you "invent" (at least come up with) the problem together with the solution, which is not great. I'd really like to confine generated data to situations where there is no other possibility.

Originally posted by @lionelkusch in #86 (comment)

lionelkusch · 2024-12-26T08:34:56Z

The advantage of creating toy datasets is that you have total control over the noise and statistical properties.

Additionally, I recently looked for a dataset and found few suitable candidates.

In light of this experience, I can think of only two options for obtaining a dataset:
- Using the datasets offered by sklearn. Its dataset generators could replace many of our library's data generation functions, and it also provides real datasets.
- Using the datasets from Molnar’s book. This solution requires storing the datasets and providing access functions.
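
As a rough sketch of the first option (an illustration, not code from the repository), scikit-learn covers both cases: `make_regression` generates data with controlled noise and known ground-truth coefficients, and `load_diabetes` returns a real dataset bundled with the library. The parameter values below are arbitrary assumptions chosen for the example.

```python
from sklearn.datasets import load_diabetes, make_regression

# Generated dataset: full control over noise and sparsity.
# All parameter values here are illustrative assumptions.
X_toy, y_toy, coef = make_regression(
    n_samples=200,
    n_features=10,
    n_informative=3,  # only 3 features actually drive the target
    noise=1.0,        # standard deviation of the Gaussian noise
    coef=True,        # also return the ground-truth coefficients
    random_state=0,   # fixed seed for reproducibility
)

# Real dataset shipped with scikit-learn (no download, no extra dependency).
X_real, y_real = load_diabetes(return_X_y=True)

print("toy:", X_toy.shape, "informative features:", int((coef != 0).sum()))
print("real:", X_real.shape)
```

Neither call requires storing data files in the package, which is the main difference from the second option.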

@bthirion @jpaillard @AngelReyero Which option do you prefer?

bthirion · 2024-12-26T11:24:00Z

I don't think that we'll be happy with a single dataset. A given dataset can be good for illustrating some effects, but not others.
I'd start with the simplest approach and check locally whether the examples render better when run on alternative datasets. If we become convinced that this is worth the cost, we add a dataset.
Ultimately, we will also consider generated datasets, but only if there is no other reasonable possibility. The core point is that the reader/user has to understand the logic of the data generation, which is often tricky. In the end, you are likely to discourage most users by doing that.

lionelkusch · 2024-12-31T11:24:12Z

Do you have a dataset in mind?

lionelkusch mentioned this issue Dec 24, 2024

Folder organisations #86

Open

lionelkusch added the file organisation (the organisation of the different files) and dataset labels Dec 24, 2024

lionelkusch mentioned this issue Jan 6, 2025

Selection of model examples #106

Open
