Writing virtual datasets seems to be pretty slow because of the calls to `deepcopy` in `VirtualSource.__getitem__`:
In [26]: %%time
    ...: with TempDirCtx() as d:
    ...:     with h5py.File(d / 'foo.h5', 'w') as f:
    ...:         a = np.random.rand(1, 36, 26, 19)
    ...:         f.create_dataset('bar', data=a, chunks=a.shape, maxshape=(None, None, None, None))
    ...:     for i in range(1, 101):
    ...:         with h5py.File(d / 'foo.h5', 'r+') as f:
    ...:             f['bar'].resize((i + 1, 36, 26, 19))
    ...:             a = np.random.rand(1, 36, 26, 19)
    ...:             f['bar'][i, :, :, :] = a
    ...:
CPU times: user 129 ms, sys: 8.01 ms, total: 137 ms
Wall time: 137 ms
In [27]: %%time
    ...: with TempDirCtx() as d:
    ...:     with h5py.File(d / 'foo.h5', 'w') as f:
    ...:         vf = VersionedHDF5File(f)
    ...:         with vf.stage_version('v0') as sv:
    ...:             a = np.random.rand(1, 36, 26, 19)
    ...:             sv.create_dataset('bar', data=a, chunks=a.shape, maxshape=(None, None, None, None))
    ...:     for i in range(1, 101):
    ...:         with h5py.File(d / 'foo.h5', 'r+') as f:
    ...:             vf = VersionedHDF5File(f)
    ...:             with vf.stage_version('v{i}'.format(i=i)) as sv:
    ...:                 sv['bar'].resize((i + 1, 36, 26, 19))
    ...:                 a = np.random.rand(1, 36, 26, 19)
    ...:                 sv['bar'][[i], ...] = a
    ...:
CPU times: user 2.65 s, sys: 49.3 ms, total: 2.7 s
Wall time: 2.7 s
Looking at the code, it seems there was a performance optimization there that was broken by h5py version 3.3: h5py/h5py#1905
Is it possible to work around this performance degradation?
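To see why the per-chunk slicing hurts: every mapping operation slices the `VirtualSource`, and each slice clones the full object graph. A stdlib-only sketch of the cost gap between deep-copying a template and constructing a fresh object (`FakeSource` is a toy stand-in, not the h5py implementation):

```python
import copy
import timeit

class FakeSource:
    """Toy stand-in for VirtualSource: a few cheap identifying fields plus
    heavyweight nested state that deepcopy must walk on every clone."""
    def __init__(self, path, name, shape):
        self.path, self.name, self.shape = path, name, shape
        # mimics the nested selection/id state carried by the real object
        self.state = {i: list(range(100)) for i in range(100)}

template = FakeSource('.', 'bar', (101, 36, 26, 19))

# deepcopy visits all ~10,000 nested elements; construction runs C-level loops
t_deepcopy = timeit.timeit(lambda: copy.deepcopy(template), number=200)
t_fresh = timeit.timeit(lambda: FakeSource('.', 'bar', (101, 36, 26, 19)), number=200)

print(t_deepcopy > t_fresh)  # → True
```

Done once this is negligible, but the staging loop above triggers it for every chunk of every version, which is where the 20x slowdown accumulates.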
I ran into the same problem in `_recreate_virtual_dataset`, where the deepcopy likewise dominates the runtime.
We can avoid the deepcopy by replacing

layout[c.raw] = vs[idx.raw]

with

vs = VirtualSource('.', name=raw_data.name, shape=raw_data.shape, dtype=raw_data.dtype)
key = idx.raw
vs.sel = select(vs.shape, key, dataset=None)
_convert_space_for_key(vs.sel.id, key)
layout[c.raw] = vs
which does seem to be faster. In the case I am currently debugging, the time drops from 749 s to 507 s. I think there is still plenty of room for improvement, though.
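The same idea, stripped of the h5py specifics (toy classes and hypothetical names, not the real API): instead of slicing a template object, which deep-copies it wholesale, rebuild only the cheap identifying fields and attach the selection directly:

```python
import copy

class Source:
    """Toy stand-in for VirtualSource: cheap to construct, expensive to deepcopy."""
    def __init__(self, path, name, shape):
        self.path, self.name, self.shape = path, name, shape
        self.sel = None
        # mimics heavyweight internal state dragged along by deepcopy
        self._state = [[0] * 100 for _ in range(100)]

    def __getitem__(self, key):
        # slicing clones everything, then narrows the selection (the slow path)
        new = copy.deepcopy(self)
        new.sel = key
        return new

def fast_slice(src, key):
    # workaround: rebuild only the cheap fields, then attach the selection
    new = Source(src.path, src.name, src.shape)
    new.sel = key
    return new

src = Source('.', 'bar', (101, 36, 26, 19))
key = (0, slice(None))
slow = src[key]
fast = fast_slice(src, key)
print(slow.sel == fast.sel and fast.name == src.name)  # → True
```

Both paths yield an object with the same selection and identity fields; the fast path simply skips copying state that the fresh constructor recreates anyway.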