Writing virtual datasets seems to be pretty slow because of the calls to `deepcopy` in `VirtualSource.__getitem__`:
In [26]: %%time
    ...: with TempDirCtx() as d:
    ...:     with h5py.File(d / 'foo.h5', 'w') as f:
    ...:         a = np.random.rand(1, 36, 26, 19)
    ...:         f.create_dataset('bar', data=a, chunks=a.shape, maxshape=(None, None, None, None))
    ...:     for i in range(1, 101):
    ...:         with h5py.File(d / 'foo.h5', 'r+') as f:
    ...:             f['bar'].resize((i + 1, 36, 26, 19))
    ...:             a = np.random.rand(1, 36, 26, 19)
    ...:             f['bar'][i, :, :, :] = a
    ...:
CPU times: user 129 ms, sys: 8.01 ms, total: 137 ms
Wall time: 137 ms
In [27]: %%time
    ...: with TempDirCtx() as d:
    ...:     with h5py.File(d / 'foo.h5', 'w') as f:
    ...:         vf = VersionedHDF5File(f)
    ...:         with vf.stage_version('v0') as sv:
    ...:             a = np.random.rand(1, 36, 26, 19)
    ...:             sv.create_dataset('bar', data=a, chunks=a.shape, maxshape=(None, None, None, None))
    ...:     for i in range(1, 101):
    ...:         with h5py.File(d / 'foo.h5', 'r+') as f:
    ...:             vf = VersionedHDF5File(f)
    ...:             with vf.stage_version('v{i}'.format(i=i)) as sv:
    ...:                 sv['bar'].resize((i + 1, 36, 26, 19))
    ...:                 a = np.random.rand(1, 36, 26, 19)
    ...:                 sv['bar'][[i], ...] = a
    ...:
CPU times: user 2.65 s, sys: 49.3 ms, total: 2.7 s
Wall time: 2.7 s
Looking at the code, it seems there was a performance optimization there that was broken by h5py version 3.3: h5py/h5py#1905
Is it possible to work around this performance degradation?
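To see why the per-chunk slicing hurts: every mapping operation slices the `VirtualSource`, and each slice clones the full object graph. A stdlib-only sketch of the cost gap between deep-copying a template and constructing a fresh object (`FakeSource` is a toy stand-in, not the h5py implementation):

```python
import copy
import timeit

class FakeSource:
    """Toy stand-in for VirtualSource: a few cheap identifying fields plus
    heavyweight nested state that deepcopy must walk on every clone."""
    def __init__(self, path, name, shape):
        self.path, self.name, self.shape = path, name, shape
        # mimics the nested selection/id state carried by the real object
        self.state = {i: list(range(100)) for i in range(100)}

template = FakeSource('.', 'bar', (101, 36, 26, 19))

# deepcopy visits all ~10,000 nested elements; construction runs C-level loops
t_deepcopy = timeit.timeit(lambda: copy.deepcopy(template), number=200)
t_fresh = timeit.timeit(lambda: FakeSource('.', 'bar', (101, 36, 26, 19)), number=200)

print(t_deepcopy > t_fresh)  # → True
```

Done once this is negligible, but the staging loop above triggers it for every chunk of every version, which is where the 20x slowdown accumulates.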
I ran into the same problem in `_recreate_virtual_dataset`, where the deepcopy likewise dominates the runtime.
We can avoid the deepcopy by replacing

layout[c.raw] = vs[idx.raw]

with

vs = VirtualSource('.', name=raw_data.name, shape=raw_data.shape, dtype=raw_data.dtype)
key = idx.raw
vs.sel = select(vs.shape, key, dataset=None)
_convert_space_for_key(vs.sel.id, key)
layout[c.raw] = vs
which does seem to be faster. In the case I am currently debugging, the time drops from 749 s to 507 s. I think there is still plenty of room for improvement, though.
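The same idea, stripped of the h5py specifics (toy classes and hypothetical names, not the real API): instead of slicing a template object, which deep-copies it wholesale, rebuild only the cheap identifying fields and attach the selection directly:

```python
import copy

class Source:
    """Toy stand-in for VirtualSource: cheap to construct, expensive to deepcopy."""
    def __init__(self, path, name, shape):
        self.path, self.name, self.shape = path, name, shape
        self.sel = None
        # mimics heavyweight internal state dragged along by deepcopy
        self._state = [[0] * 100 for _ in range(100)]

    def __getitem__(self, key):
        # slicing clones everything, then narrows the selection (the slow path)
        new = copy.deepcopy(self)
        new.sel = key
        return new

def fast_slice(src, key):
    # workaround: rebuild only the cheap fields, then attach the selection
    new = Source(src.path, src.name, src.shape)
    new.sel = key
    return new

src = Source('.', 'bar', (101, 36, 26, 19))
key = (0, slice(None))
slow = src[key]
fast = fast_slice(src, key)
print(slow.sel == fast.sel and fast.name == src.name)  # → True
```

Both paths yield an object with the same selection and identity fields; the fast path simply skips copying state that the fresh constructor recreates anyway.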