Create datasets

The dataset files in the pbdl package are essentially .hdf5 files that follow specific conventions. An .hdf5 file is a hierarchically structured collection of arrays (see the official documentation). For pbdl datasets, the hierarchy is kept simple: there is a single group, sims/, which contains all simulations as arrays. Each array must be named sim followed by an incremental index (sim0, sim1, ...).

Each simulation array has the shape (frame, field, spatial dims...).
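For illustration, here is a minimal sketch of inspecting such a file with h5py; the file name random.hdf5 is a placeholder for a file that already follows these conventions:

import h5py

with h5py.File("random.hdf5", "r") as f:
    for name in f["sims"]:  # sim0, sim1, ...
        # each entry has shape (frame, field, spatial dims...)
        print(name, f["sims"][name].shape)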

Metadata

Metadata is attached to the sims/ group and may contain any key-value pairs that can be represented in JSON. The following attributes are required as a minimum:

  • PDE: Which type of PDE is represented by the dataset?
  • Fields Scheme: Encodes the types of the physical fields as a string, such as VVdp. Consecutive identical letters indicate that the corresponding field indices together form one physical field (e.g., velocity x and velocity y form a vector field due to the two consecutive Vs, while density and pressure are scalar fields). This information determines how normalization is applied: for vector fields, the vector norm is taken before the standard deviation is computed (see the sketch after this list).
  • Fields: List of field identifiers.
  • Constants: List of constant identifiers.
  • Dt: Time delta between simulation frames.
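For illustration, the following sketch shows how such a scheme could drive per-field normalization statistics. The helper field_group_std is hypothetical and not part of the pbdl API; it only demonstrates the grouping rule described above.

import itertools

import numpy as np

# hypothetical helper: compute one standard deviation per physical field,
# grouping consecutive identical scheme letters into vector fields
def field_group_std(data, scheme):
    # data has shape (frame, field, spatial dims...); scheme is e.g. "VVdp"
    stds = {}
    idx = 0
    for letter, group in itertools.groupby(scheme):
        n = len(list(group))
        components = data[:, idx:idx + n]
        if n > 1:
            # vector field: take the vector norm over the components first
            stds[letter] = float(np.linalg.norm(components, axis=1).std())
        else:
            # scalar field: plain standard deviation
            stds[letter] = float(components.std())
        idx += n
    return stds

data = np.random.random((1000, 4, 128, 64))
print(field_group_std(data, "VVdp"))  # one entry each for V, d, p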

Metadata attributes that are not required but recommended are:

  • Field Desc and Const Desc: Description of each field/constant as a short sentence. This information is used for the presentation in the dataset gallery.

Constants

Constants are attached to the corresponding simulation (e.g., sims/sim0). For each constant listed in the metadata attribute Constants, there must be a corresponding attribute attached to every simulation.
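A minimal sketch of reading the constants back, assuming a file laid out as in the examples below (the path random/random.hdf5 is a placeholder):

import h5py

with h5py.File("random/random.hdf5", "r") as f:
    constant_names = f["sims"].attrs["Constants"]  # e.g. ["Const1"]
    sim0 = f["sims/sim0"]
    for name in constant_names:
        print(name, "=", sim0.attrs[name])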

Example: single-file dataset

import numpy as np
import h5py
import os

meta_all = {
    "PDE": "The Everything Formula",
    "Fields Scheme": "aBBc",
    "Fields": ["Field1", "Field2a", "Field2b", "Field3"],
    "Constants": ["Const1"],
    "Dt": 0.01,
}

DATASET_NAME = "random"
NUM_SIMS = 3

with h5py.File(f"{DATASET_NAME}/random.hdf5", "w") as f:

    # create `NUM_SIMS` simulations
    for i in range(NUM_SIMS):
        # create a random array with 1000 frames, 4 fields, and a 128 x 64 spatial resolution
        data = np.random.random((1000, 4, 128, 64))

        # create hdf5 dataset
        sim = f.create_dataset("sims/sim" + str(i), data=data)

        # attach constant to simulation
        sim.attrs["Const1"] = np.random.random()

    # attach metadata to group
    for key, value in meta_all.items():
        f["sims/"].attrs[key] = value

Example: partitioned dataset (multiple sims per file)

import numpy as np
import h5py
import os
import json

meta_all = {
    "PDE": "The Everything Formula",
    "Fields Scheme": "aBBc",
    "Fields": ["Field1", "Field2a", "Field2b", "Field3"],
    "Constants": ["Const1"],
    "Dt": 0.01,
}

DATASET_NAME = "random"
NUM_FILES = 3
SIMS_PER_FILE = 10

# group all hdf5 files in a directory
os.makedirs(DATASET_NAME, exist_ok=True)

# there are `NUM_FILES` hdf5 files (each contains multiple sims)
for file_idx in range(NUM_FILES):

    # the file name specifies which sims it contains, e.g. sim0-9.hdf5
    with h5py.File(f"{DATASET_NAME}/sim{file_idx * SIMS_PER_FILE}-{(file_idx + 1) * SIMS_PER_FILE - 1}.hdf5", "w") as f:

        # each file contains `SIMS_PER_FILE` sims
        for sim_idx in range(SIMS_PER_FILE):
            # create a random array with 1000 frames, 4 fields, and a 128 x 64 spatial resolution
            data = np.random.random((1000, 4, 128, 64))

            sim = f.create_dataset("sims/sim" + str(file_idx * SIMS_PER_FILE + sim_idx), data=data)

            # attach constant to simulation
            sim.attrs["Const1"] = np.random.random()

# store the metadata for all sims in a separate file
with open(f"{DATASET_NAME}/meta_all.json", "w") as f:
    json.dump(meta_all, f, indent=2)
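To check the result, here is a minimal sketch of reading the partitioned dataset back with plain h5py and json. This is illustrative only, not how the pbdl loader itself works; note that the lexicographic sort below only matches the numeric sim order for small file counts.

import glob
import json

import h5py

# metadata comes from the shared JSON file
with open("random/meta_all.json") as f:
    meta_all = json.load(f)
print(meta_all["Fields Scheme"])

# simulations come from all hdf5 files in the dataset directory
for path in sorted(glob.glob("random/*.hdf5")):
    with h5py.File(path, "r") as f:
        for name in f["sims"]:
            print(path, name, f["sims"][name].shape)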