As the number of datasets grows, it is important to establish a streamlined process for handling requests from AI researchers and for acquiring, storing, and managing datasets efficiently. The proposed process is as follows:
- AI researchers can submit dataset requests through a designated channel, such as a shared email, form, or ticketing system.
- The request should include details such as the dataset name, source (e.g., Hugging Face), size, and any specific instructions (an example request is sketched after this list).
- The data team reviews the request to ensure that it aligns with project goals and that the necessary resources are available.
- Once approved, the data team determines how to download, partition, and store the dataset.
- If a general class such as HuggingFaceParquetDataset is available, use it to initialize the metadata of the new dataset. Otherwise, define a dedicated dataset class and add it to pygestor/datasets (see the class skeleton after this list).
- The dataset is downloaded using pygestor, and subsets and partitions are organized automatically (see the download sketch after this list).
- Verify that the dataset has been downloaded and stored correctly (see the verification sketch after this list).
- AI researchers can load the dataset using pygestor as an access point on a local or cloud machine connected to the designated NFS.
- Data-loading code snippets can be generated by the WebUI, allowing for quick access.
- The dataset can be loaded fully or partially. Batched loading is supported and recommended during model training to optimize memory efficiency (see the batched-loading sketch after this list).
- Metadata is maintained for each dataset, including details such as availability, source, size, acquisition date, and notes (an example record is shown after this list).
- Use the WebUI to review stored datasets regularly to ensure they remain relevant and up-to-date. Remove irrelevant datasets to free up storage.
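
The sketches below illustrate, in Python, what several of the steps above might look like in practice. First, a dataset request: the exact submission format is up to the team, and every field name here is only a suggestion.

```python
# Illustrative dataset request; field names are suggestions, not a fixed schema.
request = {
    "dataset_name": "wikitext",            # name as published by the source
    "source": "huggingface",               # e.g., the Hugging Face Hub
    "subset": "wikitext-103-v1",           # optional: only part of the dataset
    "estimated_size_gb": 1.2,              # helps the data team plan storage
    "instructions": "Only the training split is needed.",
    "requested_by": "researcher@example.com",
}
```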
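Next, a dedicated dataset class for sources that no general class covers. The layout below is an assumption for illustration only; the actual interface should be copied from an existing class in pygestor/datasets such as HuggingFaceParquetDataset.

```python
# Hypothetical skeleton for a dedicated class under pygestor/datasets.
# The attribute and method names are assumptions; mirror an existing class
# such as HuggingFaceParquetDataset for the real interface.
class MyCustomDataset:
    name = "my_custom_dataset"             # unique identifier for the dataset
    source = "https://example.com/data"    # where the raw files come from

    def download(self, subset: str, target_dir: str) -> None:
        """Fetch the raw files of one subset into target_dir."""
        raise NotImplementedError

    def partition(self, subset: str) -> None:
        """Split the subset into partitions for efficient partial loading."""
        raise NotImplementedError
```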
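pygestor handles the download itself, so its API is not reproduced here. As a rough illustration of what the acquisition step does for a Hugging Face source, the parquet files of a dataset can be fetched with huggingface_hub (the dataset id and NFS path are illustrative):

```python
from huggingface_hub import snapshot_download

# Fetch only the parquet files of a dataset from the Hugging Face Hub into
# the NFS dataset root; pygestor automates this step and the partitioning.
local_path = snapshot_download(
    repo_id="wikitext",                        # illustrative dataset id
    repo_type="dataset",
    allow_patterns=["*.parquet"],              # data files only, no scripts
    local_dir="/mnt/nfs/datasets/wikitext",    # assumed NFS mount point
)
print("downloaded to", local_path)
```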
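Verification can be as lightweight as confirming that the expected partition files landed on the NFS, are non-empty, and add up to roughly the size recorded in the metadata. A minimal sketch, assuming parquet partitions under an illustrative NFS path:

```python
from pathlib import Path

def verify_download(dataset_dir: str, min_files: int = 1) -> bool:
    """Sanity-check a downloaded dataset: files exist and are non-empty."""
    files = list(Path(dataset_dir).rglob("*.parquet"))
    if len(files) < min_files:
        print(f"expected >= {min_files} parquet file(s), found {len(files)}")
        return False
    empty = [f for f in files if f.stat().st_size == 0]
    if empty:
        print(f"{len(empty)} zero-byte file(s), e.g. {empty[0]}")
        return False
    total_gb = sum(f.stat().st_size for f in files) / 1e9
    print(f"OK: {len(files)} file(s), {total_gb:.2f} GB total")
    return True

verify_download("/mnt/nfs/datasets/wikitext")   # assumed NFS path
```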
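The loading snippets themselves come from the WebUI, so pygestor's loader API is not reproduced here. To show why batched loading keeps memory bounded during training, here is how partitioned parquet data can be streamed batch by batch with pyarrow rather than materialized all at once (the path is illustrative):

```python
from pathlib import Path
import pyarrow.parquet as pq

DATASET_DIR = Path("/mnt/nfs/datasets/wikitext")    # assumed NFS mount

def iter_rows(batch_size: int = 1024):
    """Stream rows partition by partition; memory is bounded by batch_size."""
    for part in sorted(DATASET_DIR.rglob("*.parquet")):
        for batch in pq.ParquetFile(part).iter_batches(batch_size=batch_size):
            yield from batch.to_pylist()            # row dicts for this batch

# Peek at the first few rows without loading the full dataset.
for i, row in enumerate(iter_rows()):
    if i >= 3:
        break
    print(row)
```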
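Finally, the per-dataset metadata. The record below is illustrative only; the fields pygestor actually tracks should be taken from the tool itself.

```python
# Illustrative metadata record; pygestor's actual schema may differ.
metadata = {
    "name": "wikitext",
    "source": "huggingface",
    "available": True,                  # downloaded and verified
    "size_gb": 1.2,
    "acquired_on": "2024-05-01",
    "notes": "Training split only; used for LM pretraining experiments.",
}
```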