We observed that high-dimensional datasets are much slower to read when they are virtual (versioned) datasets:
```python
In [12]: shape = (19, 36, 26, 1)

In [14]: a = np.random.rand(*shape)
    ...: with TempDirCtx() as d:
    ...:     with h5py.File(d / 'foo.h5', 'w') as f:
    ...:         vf = VersionedHDF5File(f)
    ...:         with vf.stage_version('v0') as sv:
    ...:             sv.create_dataset('bar', data=a, chunks=a.shape)
    ...:     with h5py.File(d / 'foo.h5', 'r') as f:
    ...:         vf = VersionedHDF5File(f)
    ...:         cv = vf[vf.current_version]
    ...:         bar = cv['bar']
    ...:         %time _ = [bar[:] for _ in range(1000)]
    ...:
CPU times: user 2.95 s, sys: 61.8 ms, total: 3.01 s
Wall time: 3.01 s
```
```python
In [15]: a = np.random.rand(*shape)
    ...: with TempDirCtx() as d:
    ...:     with h5py.File(d / 'foo.h5', 'w') as f:
    ...:         f.create_dataset('bar', data=a, chunks=a.shape)
    ...:     with h5py.File(d / 'foo.h5', 'r') as f:
    ...:         bar = f['bar']
    ...:         %time _ = [bar[:] for _ in range(1000)]
    ...:
CPU times: user 37.3 ms, sys: 60.2 ms, total: 97.5 ms
Wall time: 97.2 ms
```
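For reference, the plain-h5py baseline above can be reproduced as a standalone script. This is a sketch, not the original benchmark: `tempfile.TemporaryDirectory` stands in for the `TempDirCtx` helper, `time.perf_counter` replaces `%time`, and the loop count is reduced to 100 reads.

```python
import tempfile
import time
from pathlib import Path

import h5py
import numpy as np

# Same shape as in the report above.
shape = (19, 36, 26, 1)
a = np.random.rand(*shape)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "foo.h5"
    # Write a plain (non-virtual) dataset in a single chunk.
    with h5py.File(path, "w") as f:
        f.create_dataset("bar", data=a, chunks=a.shape)
    # Read it back repeatedly and time the reads.
    with h5py.File(path, "r") as f:
        bar = f["bar"]
        t0 = time.perf_counter()
        reads = [bar[:] for _ in range(100)]
        elapsed = time.perf_counter() - t0

print(f"100 full reads in {elapsed:.3f} s")
```

The equivalent versioned benchmark only differs by staging the dataset through `VersionedHDF5File` before reading it back via `vf[vf.current_version]`.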
A little bit of profiling points to `H5S__hyper_project_intersection` being an expensive function.
I went back quite a few versions, but I could not reproduce this.
In all tests:

- h5py 3.12.1
- libhdf5 1.14.3 or 1.10
- ndindex 1.9.2

plain h5py:

```
CPU times: user 17.6 ms, sys: 853 μs, total: 18.4 ms
Wall time: 18.4 ms
```

versioned-hdf5 1.7.0:

```
CPU times: user 15.6 ms, sys: 74.7 ms, total: 90.3 ms
Wall time: 89.9 ms
```

versioned-hdf5 2.0.0:

```
CPU times: user 12.8 ms, sys: 59.2 ms, total: 72 ms
Wall time: 79.0 ms
```
Note that `current_version` is always a virtual HDF5 dataset (it's not a `StagedChangesArray`), and the versioned-hdf5 data layout on disk hasn't changed recently. So this issue should depend exclusively on libhdf5.
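To illustrate the distinction, h5py exposes `Dataset.is_virtual`, which reports whether reads go through libhdf5's virtual-dataset (VDS) machinery. The following is a minimal, self-contained sketch (not taken from versioned-hdf5; the file and dataset names are made up) that builds a VDS by hand and checks the flag:

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    data = np.arange(10.0)

    # A plain source dataset in its own file.
    with h5py.File(d / "src.h5", "w") as f:
        f.create_dataset("data", data=data)

    # A virtual dataset that maps 1:1 onto the source.
    layout = h5py.VirtualLayout(shape=(10,), dtype="f8")
    layout[:] = h5py.VirtualSource(str(d / "src.h5"), "data", shape=(10,))
    with h5py.File(d / "vds.h5", "w") as f:
        f.create_virtual_dataset("data", layout)

    # Reads of the VDS resolve through H5S hyperslab projection code,
    # which is where the profiling above points.
    with h5py.File(d / "vds.h5", "r") as f:
        is_virtual = f["data"].is_virtual
        read_back = f["data"][:]

print(is_virtual)
```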
> A little bit of profiling points to `H5S__hyper_project_intersection` being an expensive function.

Is it possible to speed up this function?