As the number of datasets grows, it is important to establish a streamlined process for handling requests from AI researchers and for acquiring, storing, and managing datasets efficiently. The proposed process is as follows:
- AI researchers can submit dataset requests through a designated channel, such as a shared email, form, or ticketing system.
- The request should include details such as the dataset name, source (e.g., Hugging Face), size, and any specific instructions (an example request is sketched after this list).
- The data team reviews the request to ensure that it aligns with project goals and that the necessary resources are available.
- Once approved, the data team determines how to download, partition, and store the dataset.
- If a general class such as HuggingFaceParquetDataset is available, use it to initialize the metadata of the new dataset. Otherwise, define a dedicated dataset class and add it to pygestor/datasets (see the class skeleton after this list).
- The dataset is downloaded using pygestor, and subsets and partitions are organized automatically (see the download sketch after this list).
- Verify that the dataset has been downloaded and stored correctly (see the verification sketch after this list).
- AI researchers can load the dataset using pygestor as an access point on a local or cloud machine connected to the designated NFS.
- Data-loading code snippets can be generated by the WebUI, allowing for quick access.
- The dataset can be loaded fully or partially. Batched loading is supported and recommended during model training to optimize memory efficiency (see the batched-loading sketch after this list).
- Metadata is maintained for each dataset, including details such as availability, source, size, acquisition date, and notes (an example record is shown after this list).
- Use the WebUI to review stored datasets regularly to ensure they remain relevant and up-to-date. Remove irrelevant datasets to free up storage.
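
The sketches below illustrate, in Python, what several of the steps above might look like in practice. First, a dataset request: the exact submission format is up to the team, and every field name here is only a suggestion.

```python
# Illustrative dataset request; field names are suggestions, not a fixed schema.
request = {
    "dataset_name": "wikitext",            # name as published by the source
    "source": "huggingface",               # e.g., the Hugging Face Hub
    "subset": "wikitext-103-v1",           # optional: only part of the dataset
    "estimated_size_gb": 1.2,              # helps the data team plan storage
    "instructions": "Only the training split is needed.",
    "requested_by": "researcher@example.com",
}
```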
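Next, a dedicated dataset class for sources that no general class covers. The layout below is an assumption for illustration only; the actual interface should be copied from an existing class in pygestor/datasets such as HuggingFaceParquetDataset.

```python
# Hypothetical skeleton for a dedicated class under pygestor/datasets.
# The attribute and method names are assumptions; mirror an existing class
# such as HuggingFaceParquetDataset for the real interface.
class MyCustomDataset:
    name = "my_custom_dataset"             # unique identifier for the dataset
    source = "https://example.com/data"    # where the raw files come from

    def download(self, subset: str, target_dir: str) -> None:
        """Fetch the raw files of one subset into target_dir."""
        raise NotImplementedError

    def partition(self, subset: str) -> None:
        """Split the subset into partitions for efficient partial loading."""
        raise NotImplementedError
```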
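pygestor handles the download itself, so its API is not reproduced here. As a rough illustration of what the acquisition step does for a Hugging Face source, the parquet files of a dataset can be fetched with huggingface_hub (the dataset id and NFS path are illustrative):

```python
from huggingface_hub import snapshot_download

# Fetch only the parquet files of a dataset from the Hugging Face Hub into
# the NFS dataset root; pygestor automates this step and the partitioning.
local_path = snapshot_download(
    repo_id="wikitext",                        # illustrative dataset id
    repo_type="dataset",
    allow_patterns=["*.parquet"],              # data files only, no scripts
    local_dir="/mnt/nfs/datasets/wikitext",    # assumed NFS mount point
)
print("downloaded to", local_path)
```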
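Verification can be as lightweight as confirming that the expected partition files landed on the NFS, are non-empty, and add up to roughly the size recorded in the metadata. A minimal sketch, assuming parquet partitions under an illustrative NFS path:

```python
from pathlib import Path

def verify_download(dataset_dir: str, min_files: int = 1) -> bool:
    """Sanity-check a downloaded dataset: files exist and are non-empty."""
    files = list(Path(dataset_dir).rglob("*.parquet"))
    if len(files) < min_files:
        print(f"expected >= {min_files} parquet file(s), found {len(files)}")
        return False
    empty = [f for f in files if f.stat().st_size == 0]
    if empty:
        print(f"{len(empty)} zero-byte file(s), e.g. {empty[0]}")
        return False
    total_gb = sum(f.stat().st_size for f in files) / 1e9
    print(f"OK: {len(files)} file(s), {total_gb:.2f} GB total")
    return True

verify_download("/mnt/nfs/datasets/wikitext")   # assumed NFS path
```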
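The loading snippets themselves come from the WebUI, so pygestor's loader API is not reproduced here. To show why batched loading keeps memory bounded during training, here is how partitioned parquet data can be streamed batch by batch with pyarrow rather than materialized all at once (the path is illustrative):

```python
from pathlib import Path
import pyarrow.parquet as pq

DATASET_DIR = Path("/mnt/nfs/datasets/wikitext")    # assumed NFS mount

def iter_rows(batch_size: int = 1024):
    """Stream rows partition by partition; memory is bounded by batch_size."""
    for part in sorted(DATASET_DIR.rglob("*.parquet")):
        for batch in pq.ParquetFile(part).iter_batches(batch_size=batch_size):
            yield from batch.to_pylist()            # row dicts for this batch

# Peek at the first few rows without loading the full dataset.
for i, row in enumerate(iter_rows()):
    if i >= 3:
        break
    print(row)
```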
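Finally, the per-dataset metadata. The record below is illustrative only; the fields pygestor actually tracks should be taken from the tool itself.

```python
# Illustrative metadata record; pygestor's actual schema may differ.
metadata = {
    "name": "wikitext",
    "source": "huggingface",
    "available": True,                  # downloaded and verified
    "size_gb": 1.2,
    "acquired_on": "2024-05-01",
    "notes": "Training split only; used for LM pretraining experiments.",
}
```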