Questions about generating a dataset using make_dataset.py #2

Open
zzwei1 opened this issue Aug 10, 2021 · 2 comments

Comments


zzwei1 commented Aug 10, 2021

Hi, nice work!

I'm trying to explore the SEVIR dataset, and I want to generate a dataset with make_dataset.py.

I modified make_dataset.py so that each event is split into 2 training samples.

I ran make_dataset.py and got files named nowcast_training_000.h5, nowcast_testing_000.h5, ..., nowcast_training_008.h5, nowcast_testing_008.h5, plus their corresponding xxx_META.csv files. (I left the parameter "n_chunks" at its default value of 8.)

However, I don't understand how these files relate to each other, and I have the following questions:

  1. Is the data in xxx_000.h5 the same as the data in xxx_001.h5 and the other chunks, just in a different order, or does each chunk contain different data?

  2. Should I use one of the file pairs for training and testing (such as nowcast_training_000.h5 for training and nowcast_testing_000.h5 for testing), should I use all of the files, or should I set the parameter "append" to "True" so that the 8 chunks are written into 1 training file and 1 testing file?

Thanks in advance!

markveillette (Collaborator) commented

Thanks!

The data in each file is not the same. Each file contains a distinct collection of samples drawn from the full SEVIR dataset using the load_batches method. Spreading the data across multiple files prevents the script from writing one giant file; use the append option if you are okay with one large file.
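For example, a quick way to convince yourself that the chunks differ is to print the dataset shapes in each file. This is just a sketch using h5py; it assumes the generated files store their arrays under "IN"/"OUT" keys, so adjust the keys to whatever your files actually contain:

```python
# Print the contents of each training chunk to confirm they hold
# different (and possibly differently-sized) collections of samples.
import h5py

for i in range(8):  # n_chunks = 8
    fname = f"nowcast_training_{i:03d}.h5"
    with h5py.File(fname, "r") as f:
        print(fname, {key: f[key].shape for key in f.keys()})
```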

How you use the file pairs depends on your setup. If you have limited RAM but want to use all the data, you could load one file at a time and process the files sequentially. Or, if you have access to multiple GPUs, you could assign each file to a different GPU.
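If you go the sequential route, the loop could look something like the sketch below. It's only an illustration, not code from this repo: it again assumes the "IN"/"OUT" keys, and `model` stands in for any Keras-style model you have already built and compiled:

```python
# Train on one chunk at a time so that only a single file's worth
# of data is resident in RAM at any point.
import glob
import h5py

NUM_EPOCHS = 10   # illustrative values, tune for your task
BATCH_SIZE = 32

for epoch in range(NUM_EPOCHS):
    for fname in sorted(glob.glob("nowcast_training_*.h5")):
        with h5py.File(fname, "r") as f:
            X = f["IN"][:]    # this chunk's inputs
            y = f["OUT"][:]   # this chunk's targets
        for i in range(0, len(X), BATCH_SIZE):
            model.train_on_batch(X[i:i + BATCH_SIZE], y[i:i + BATCH_SIZE])
```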

Hopefully that clears things up.

zzwei1 (Author) commented Aug 13, 2021


Thanks very much! I understand now, thanks to your clear explanation.
