Details about Data Preprocessing #13

BolinLai · 2024-06-10T18:39:58Z

Thank you for your awesome work and code! I'd like to ask more about the data preprocessing needed to train the model. Could you let me know:

Where can I download the data for instruction tuning?
How should the raw data be preprocessed? Is there any code for this?
What should the structure of the preprocessed data look like in order to run the training code?

geyuying · 2024-07-21T06:16:57Z

Hi, we currently support the following dataloaders, each with its own expected data structure.

For the "build_llava_jsonl_datapipes" dataloader, each folder stores a number of jsonl files. Each jsonl file contains 10K records, with an example record as follows:

{"image": "coco/train2017/000000033471.jpg", "data": ["What are the colors of the bus in the image?", "The bus in the image is white and red.", "What feature can be seen on the back of the bus?", "The back of the bus features an advertisement.", "Is the bus driving down the street or pulled off to the side?", "The bus is driving down the street, which is crowded with people and other vehicles."]}
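As a rough illustration of how such a record could be consumed, here is a minimal sketch. It assumes the "data" list alternates question and answer strings, which is inferred from the example above rather than confirmed by the thread; the helper name `parse_llava_record` is hypothetical and is not part of the repository.

```python
import json


def parse_llava_record(line: str):
    """Parse one JSONL line into (image_path, [(question, answer), ...]).

    Assumption: "data" alternates question, answer, question, answer, ...
    """
    record = json.loads(line)
    turns = record["data"]
    if len(turns) % 2 != 0:
        raise ValueError("expected an even number of entries (question/answer pairs)")
    # Even indices are questions, odd indices are the matching answers.
    pairs = list(zip(turns[0::2], turns[1::2]))
    return record["image"], pairs


example_line = json.dumps({
    "image": "coco/train2017/000000033471.jpg",
    "data": [
        "What are the colors of the bus in the image?",
        "The bus in the image is white and red.",
        "What feature can be seen on the back of the bus?",
        "The back of the bus features an advertisement.",
    ],
})

image, pairs = parse_llava_record(example_line)
for question, answer in pairs:
    print(f"Q: {question}\nA: {answer}")
```

The "image" value looks like a path relative to some image root directory; which root the dataloader uses is not stated here.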

For the "build_caption_datapipes_with_pixels" dataloader, each folder stores a number of .tar files, from which image-text pairs are read in webdataset format.
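To make the webdataset layout concrete: by convention, files inside a shard that share a basename form one sample. The sketch below uses only the standard library to write and read such a shard. The `.jpg`/`.txt` extensions, the key `000001`, and the helper names are assumptions for illustration; the exact member names this dataloader expects are not stated in this thread.

```python
import io
import os
import tarfile
import tempfile
from collections import defaultdict


def write_shard(path, samples):
    """Write (key, image_bytes, caption) samples as <key>.jpg and <key>.txt members."""
    with tarfile.open(path, "w") as tar:
        for key, image_bytes, caption in samples:
            for ext, payload in ((".jpg", image_bytes), (".txt", caption.encode("utf-8"))):
                info = tarfile.TarInfo(name=key + ext)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(path):
    """Group tar members by basename, mirroring how webdataset forms samples.

    Simplification: splits at the first dot of the member name, so it
    assumes flat member names without dots in directory components.
    """
    grouped = defaultdict(dict)
    with tarfile.open(path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            grouped[key][ext] = tar.extractfile(member).read()
    return dict(grouped)


with tempfile.TemporaryDirectory() as tmp:
    shard_path = os.path.join(tmp, "00000.tar")
    # Placeholder bytes stand in for a real JPEG.
    write_shard(shard_path, [("000001", b"fake-jpeg-bytes", "a white and red bus")])
    samples = read_shard(shard_path)

print(sorted(samples["000001"]))
```

In practice the `webdataset` library handles the grouping and decoding; this sketch only shows the on-disk structure.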

For the "build_single_turn_edit_datapipes" dataloader, each folder stores a number of jsonl files. Each jsonl file contains 10K records, with an example record as follows:

{"source_image": "source_images/f6f4d0669694df5b.jpg", "target_image": "target_images/f6f4d0669694df5b.jpg", "instruction": "Erase the car that is parked in front of the Roebuck building."}
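If you are building this structure from your own editing data, a minimal sketch for writing records into jsonl shards might look like the following. The shard file naming (`part-00000.jsonl`) and the helper name are hypothetical; only the three record fields and the 10K-records-per-file layout come from the description above.

```python
import json
import os
import tempfile


def write_jsonl_shards(records, out_dir, shard_size=10_000):
    """Write records to out_dir as jsonl files holding shard_size records each."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(records), shard_size):
        # Hypothetical naming scheme; the thread does not specify file names.
        path = os.path.join(out_dir, f"part-{start // shard_size:05d}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for record in records[start:start + shard_size]:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
        paths.append(path)
    return paths


# Synthetic records with the same three fields as the example above.
records = [
    {
        "source_image": f"source_images/{i:08d}.jpg",
        "target_image": f"target_images/{i:08d}.jpg",
        "instruction": f"Placeholder edit instruction {i}.",
    }
    for i in range(25)
]

with tempfile.TemporaryDirectory() as tmp:
    # A small shard_size keeps the demo fast; the thread describes 10K per file.
    paths = write_jsonl_shards(records, tmp, shard_size=10)
    line_counts = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            line_counts.append(sum(1 for _ in f))

print(line_counts)
```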

arpitbansal297 · 2024-08-01T23:00:10Z

Hi! Thanks for this awesome work.
Can you let me know where to download the data from?
https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit is empty.
The instructions given in
https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit-Part1-Openimages say: "After downloading the data, you first need to reassemble the split files back into the original .tar.gz file as below, and then unzip the files." Could you please give more detailed instructions on where to download this data from and how to process it in order to use the Hugging Face dataset?

Thanks in advance.
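For reference, the reassembly step quoted above generally amounts to concatenating the split files in order and then extracting the resulting archive. The following is a minimal sketch under that assumption; the part naming (`images.tar.gz.part-00`, ...) and the archive name are hypothetical, and the actual file names on the dataset page may differ. The demo builds a tiny archive and splits it so the sketch is self-contained.

```python
import glob
import io
import os
import shutil
import tarfile
import tempfile


def reassemble(part_pattern, archive_path):
    """Concatenate split files (sorted by name) back into a single archive."""
    parts = sorted(glob.glob(part_pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {part_pattern}")
    with open(archive_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return parts


def extract(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)


with tempfile.TemporaryDirectory() as tmp:
    # Build a small .tar.gz in memory with one placeholder file.
    payload = b"placeholder image bytes"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo("images/example.jpg")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    data = buf.getvalue()

    # Split it into three parts using the hypothetical naming scheme.
    chunk = len(data) // 3 + 1
    for i in range(0, len(data), chunk):
        part_path = os.path.join(tmp, f"images.tar.gz.part-{i // chunk:02d}")
        with open(part_path, "wb") as f:
            f.write(data[i:i + chunk])

    # Reassemble, then extract.
    archive = os.path.join(tmp, "images.tar.gz")
    parts = reassemble(os.path.join(tmp, "images.tar.gz.part-*"), archive)
    extract(archive, os.path.join(tmp, "extracted"))
    with open(os.path.join(tmp, "extracted", "images", "example.jpg"), "rb") as f:
        restored = f.read()

print(len(parts), restored == payload)
```

Sorting by name only restores the correct order when the part suffixes are zero-padded or otherwise sort lexicographically.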
