# Proposed Management Process for Future Datasets

As the number of datasets increases, it is important to establish a streamlined process for handling requests from AI researchers, acquiring datasets, storing them, and managing them efficiently. Below is the proposed process:

## 1. Dataset Requests and Review

- AI researchers submit dataset requests through a designated channel, such as a shared email, form, or ticketing system.
- A request should include details such as the dataset name, source (e.g., Hugging Face), size, and any specific instructions (see the example request after this list).
- The data team reviews the request to ensure it aligns with project goals and that the necessary resources are available.
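
For illustration, a request could be captured as a simple structured record mirroring the details listed above. The field names here are hypothetical, not part of any pygestor schema:

```python
# Hypothetical dataset request; field names are illustrative only.
dataset_request = {
    "name": "wikimedia/wikipedia",            # dataset identifier (example)
    "source": "huggingface",                  # where to acquire it
    "approx_size": "72 GB",                   # expected storage footprint
    "instructions": "English snapshot only; needed by end of the month.",
    "requested_by": "researcher@example.com",
}
```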

## 2. Acquiring Datasets

- Once approved, the data team determines how to download, partition, and store the dataset.
- If a general class such as `HuggingFaceParquetDataset` is available, use it to initialize the new dataset's metadata; otherwise, define a dedicated dataset class and add it to `pygestor/datasets` (see the sketch after this list).
- The dataset is downloaded using pygestor, and its subsets and partitions are organized automatically.
- Verify that the dataset has been downloaded and stored correctly.
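
As a rough sketch of the second point, a dedicated dataset class might subclass the generic Hugging Face parquet handler and pin the new dataset's metadata. The import path, base-class interface, and attribute names below are assumptions for illustration, not the actual pygestor API:

```python
# Sketch only: the import path, base-class interface, and attribute names
# are assumptions for illustration, not the actual pygestor API.
from pygestor.datasets import HuggingFaceParquetDataset  # hypothetical import

class WikipediaDataset(HuggingFaceParquetDataset):
    """Dedicated handler for a new Hugging Face dataset."""
    name = "wikimedia/wikipedia"   # identifier on the source hub (example)
    source = "huggingface"
    description = "Wikipedia dumps, partitioned by snapshot and language."
```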

## 3. Utilizing Datasets

- AI researchers can load a dataset through pygestor on any local or cloud machine connected to the designated NFS.
- Data-loading code snippets can be generated by the WebUI for quick access.
- A dataset can be loaded fully or partially. Batched loading is supported and recommended during model training to keep memory usage low (see the sketch after this list).
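
As a rough illustration of partial and batched loading, the snippet below shows the intended usage pattern; the function and parameter names are assumptions, not the actual pygestor API:

```python
# Sketch only: function and parameter names are assumed for illustration,
# not taken from the actual pygestor API.
import pygestor

# Load a single subset rather than the full dataset (partial loading).
subset = pygestor.load("wikimedia/wikipedia", subset="20231101.en")  # hypothetical call

# During training, iterate in fixed-size batches to bound memory usage.
for batch in pygestor.iter_batches(          # hypothetical call
        "wikimedia/wikipedia",
        subset="20231101.en",
        batch_size=1024):
    train_step(batch)  # placeholder for the user's training step
```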

## 4. Managing Datasets

- Metadata is maintained for each dataset, including availability, source, size, acquisition date, and notes (an example record is sketched after this list).
- Review stored datasets regularly via the WebUI to ensure they remain relevant and up to date, and remove obsolete datasets to free up storage.
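
For illustration, a per-dataset metadata record could carry the fields listed above; the field names and values here are hypothetical, not pygestor's actual stored schema:

```python
# Hypothetical metadata record; field names and values are illustrative,
# not pygestor's actual stored schema.
metadata = {
    "name": "wikimedia/wikipedia",
    "available": True,               # downloaded and verified on the NFS
    "source": "huggingface",
    "size_bytes": 72_000_000_000,    # approximate on-disk footprint
    "acquired": "2024-05-01",
    "notes": "English snapshot only; requested for pretraining experiments.",
}
```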