# Create datasets
The dataset files in the `pbdl` package are essentially `.hdf5` files that follow specific conventions. An `.hdf5` file is a hierarchically structured collection of arrays (see the official documentation). For `pbdl` datasets, the hierarchy is kept simple: there is a single group, `sims/`, which contains all simulations (NumPy arrays). Arrays must be named `sim` concatenated with an incremental index (`sim0`, `sim1`, ...). Each simulation array has the shape `(frame, field, spatial dims...)`.
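For example, a 2D simulation with 100 frames and three fields on a 64 x 32 grid (all sizes chosen arbitrarily for illustration) would be stored as an array of the following shape:

```python
import numpy as np

# hypothetical simulation: 100 frames, 3 fields, 64 x 32 grid
sim0 = np.random.random((100, 3, 64, 32))

frames, fields = sim0.shape[0], sim0.shape[1]
spatial = sim0.shape[2:]  # the remaining axes are the spatial dims
```

The first two axes are always frame and field; everything after that is interpreted as spatial resolution, so 1D, 2D, and 3D simulations differ only in the number of trailing axes.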
Metadata is attached to the `sims/` group and can contain any key-value pairs that can be serialized as JSON. The following attributes are required as a minimum:
- `PDE`: Which type of PDE is represented by the dataset?
- `Fields Scheme`: Encodes the types of the physical fields as a string, such as `VVdp`. Consecutive identical letters indicate that the corresponding field indices belong to one physical field (e.g., velocity x and velocity y form a vector field due to two consecutive `V`s, while density and pressure are scalar fields). This information determines how normalization is applied: for vector fields, the vector norm is taken before calculating the standard deviation.
- `Fields`: List of field identifiers.
- `Constants`: List of constant identifiers.
- `Dt`: Time delta between simulation frames.
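The grouping implied by consecutive identical letters can be sketched as follows. This is an illustration of the convention, not `pbdl`'s actual normalization code, and the function names are hypothetical:

```python
from itertools import groupby

import numpy as np

def field_groups(scheme):
    """Split a scheme string into runs of identical letters; each run
    lists the field indices that belong to one physical field."""
    groups, i = [], 0
    for _, run in groupby(scheme):
        n = len(list(run))
        groups.append(list(range(i, i + n)))
        i += n
    return groups

def field_stds(data, scheme):
    """Per-field standard deviation for data of shape (frame, field, x, y).
    For multi-component (vector) fields, the vector norm over the
    components is taken before the standard deviation is computed."""
    stds = []
    for idx in field_groups(scheme):
        comp = data[:, idx]  # (frame, components, x, y)
        if len(idx) > 1:
            comp = np.linalg.norm(comp, axis=1)  # vector magnitude
        stds.append(float(comp.std()))
    return stds

print(field_groups("VVdp"))  # → [[0, 1], [2], [3]]
```

Here `VVdp` yields three physical fields: one two-component vector field (indices 0 and 1) and two scalar fields.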
Metadata attributes that are not required but recommended are:

- `Field Desc` and `Const Desc`: Description of the fields/constants as a short sentence. This information is used for the presentation in the dataset gallery.
Constants are attached to the corresponding simulation array (e.g. `sims/sim0`). For each constant listed in the metadata attribute `Constants`, there must be a corresponding attribute attached to each simulation.
The following example creates a minimal dataset with all simulations in a single file:

```python
import os

import h5py
import numpy as np

meta_all = {
    "PDE": "The Everything Formula",
    "Fields Scheme": "aBBc",
    "Fields": ["Field1", "Field2a", "Field2b", "Field3"],
    "Constants": ["Const1"],
    "Dt": 0.01,
}

DATASET_NAME = "random"
NUM_SIMS = 3

# make sure the dataset directory exists
os.makedirs(DATASET_NAME, exist_ok=True)

with h5py.File(f"{DATASET_NAME}/random.hdf5", "w") as f:
    # create `NUM_SIMS` simulations
    for i in range(NUM_SIMS):
        # random array with 1000 frames, 4 fields, and a 128 x 64 grid
        data = np.random.random((1000, 4, 128, 64))
        # create hdf5 dataset
        sim = f.create_dataset("sims/sim" + str(i), data=data)
        # attach constant to simulation
        sim.attrs["Const1"] = np.random.random()
    # attach metadata to the sims/ group
    for key, value in meta_all.items():
        f["sims/"].attrs[key] = value
```
Larger datasets can be split across multiple `.hdf5` files. In that case, the shared metadata is stored in a separate `meta_all.json` file next to them:

```python
import json
import os

import h5py
import numpy as np

meta_all = {
    "PDE": "The Everything Formula",
    "Fields Scheme": "aBBc",
    "Fields": ["Field1", "Field2a", "Field2b", "Field3"],
    "Constants": ["Const1"],
    "Dt": 0.01,
}

DATASET_NAME = "random"
NUM_FILES = 3
SIMS_PER_FILE = 10

# group all hdf5 files in a directory
os.makedirs(DATASET_NAME, exist_ok=True)

# there are `NUM_FILES` hdf5 files (each contains multiple sims)
for file_idx in range(NUM_FILES):
    # the file name specifies the sims it contains, e.g. sim10-19.hdf5
    first = file_idx * SIMS_PER_FILE
    last = (file_idx + 1) * SIMS_PER_FILE - 1
    with h5py.File(f"{DATASET_NAME}/sim{first}-{last}.hdf5", "w") as f:
        # each file contains `SIMS_PER_FILE` sims
        for sim_idx in range(SIMS_PER_FILE):
            # random array with 1000 frames, 4 fields, and a 128 x 64 grid
            data = np.random.random((1000, 4, 128, 64))
            sim = f.create_dataset(f"sims/sim{first + sim_idx}", data=data)
            # attach constant to simulation
            sim.attrs["Const1"] = np.random.random()

# store metadata for all sims in a separate file
with open(f"{DATASET_NAME}/meta_all.json", "w") as f:
    json.dump(meta_all, f, indent=2)
```
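A dataset written this way can be sanity-checked by reading it back with `h5py`. The following sketch builds a tiny example in a temporary directory (two sims, small arrays; all names and sizes are illustrative) and then verifies the layout:

```python
import json
import os
import tempfile

import h5py
import numpy as np

# build a tiny two-sim dataset mirroring the multi-file layout above
root = tempfile.mkdtemp()
with h5py.File(os.path.join(root, "sim0-1.hdf5"), "w") as f:
    for i in range(2):
        sim = f.create_dataset(f"sims/sim{i}",
                               data=np.random.random((4, 2, 8, 8)))
        sim.attrs["Const1"] = 0.5
with open(os.path.join(root, "meta_all.json"), "w") as f:
    json.dump({"Dt": 0.01, "Constants": ["Const1"]}, f)

# read it back: shared metadata from the JSON file, sims and their
# per-simulation constants from the .hdf5 file
with open(os.path.join(root, "meta_all.json")) as f:
    meta = json.load(f)
with h5py.File(os.path.join(root, "sim0-1.hdf5"), "r") as f:
    sim_names = sorted(f["sims"].keys())       # ["sim0", "sim1"]
    shape = f["sims/sim0"].shape               # (frame, field, x, y)
    const = f["sims/sim0"].attrs["Const1"]
```

Checking that every `sims/simN` array has the expected shape and that each constant in `Constants` appears as an attribute on every simulation catches the most common layout mistakes before the dataset is used.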