Questions about generating a dataset using make_dataset.py #2

Open
zzwei1 opened this issue Aug 10, 2021 · 2 comments

Comments


zzwei1 commented Aug 10, 2021

Hi, nice work!

I'm trying to explore the SEVIR dataset, and I want to generate a dataset with make_dataset.py.

I modified make_dataset.py so that each event is split into 2 training samples.

I ran make_dataset.py and got files named nowcast_training_000.h5, nowcast_testing_000.h5, ..., nowcast_training_008.h5, nowcast_testing_008.h5, plus their corresponding xxx_META.csv files. (I left the parameter "n_chunks" at its default value of 8.)

However, I don't understand how these files relate to each other, and I have the following questions:

  1. Is the data in xxx_000.h5 the same as the data in xxx_001.h5 and the other chunks, just in a different order, or does each chunk contain different data?

  2. Should I use one of the file pairs for training and testing (such as nowcast_training_000.h5 for training and nowcast_testing_000.h5 for testing), should I use all of the files, or should I set the parameter "append" to "True" so that the 8 chunks are written into 1 training file and 1 testing file?

Thanks in advance!

markveillette (Collaborator) commented

Thanks!

The data in each file is not the same. Each file contains a distinct collection of samples drawn from the full SEVIR dataset using the load_batches method. Spreading the data across multiple files prevents the script from writing one giant file; use the append option if you are okay with one large file.
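For example, a quick way to convince yourself that the chunks differ is to print the dataset shapes in each file. This is just a sketch using h5py; it assumes the generated files store their arrays under "IN"/"OUT" keys, so adjust the keys to whatever your files actually contain:

```python
# Print the contents of each training chunk to confirm they hold
# different (and possibly differently-sized) collections of samples.
import h5py

for i in range(8):  # n_chunks = 8
    fname = f"nowcast_training_{i:03d}.h5"
    with h5py.File(fname, "r") as f:
        print(fname, {key: f[key].shape for key in f.keys()})
```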

How you use the file pairs depends on your setup. If you have limited RAM but want to use all the data, you could load one file at a time and process the files sequentially. Or, if you have access to multiple GPUs, you could assign each file to a different GPU.
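If you go the sequential route, the loop could look something like the sketch below. It's only an illustration, not code from this repo: it again assumes the "IN"/"OUT" keys, and `model` stands in for any Keras-style model you have already built and compiled:

```python
# Train on one chunk at a time so that only a single file's worth
# of data is resident in RAM at any point.
import glob
import h5py

NUM_EPOCHS = 10   # illustrative values, tune for your task
BATCH_SIZE = 32

for epoch in range(NUM_EPOCHS):
    for fname in sorted(glob.glob("nowcast_training_*.h5")):
        with h5py.File(fname, "r") as f:
            X = f["IN"][:]    # this chunk's inputs
            y = f["OUT"][:]   # this chunk's targets
        for i in range(0, len(X), BATCH_SIZE):
            model.train_on_batch(X[i:i + BATCH_SIZE], y[i:i + BATCH_SIZE])
```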

Hopefully that clears things up.

zzwei1 (Author) commented Aug 13, 2021


Thanks very much! I understand now, thanks to your clear explanation.
