I am working on a file system that loves few huge files and hates many small files. To this end, I would simply set size_limit=None when creating a dataset using a JointWriter. However, shards are only flushed (data written to disk and freed from RAM) once the size_limit is reached. This means I cannot create shards greater than my RAM (because the data in RAM keeps growing and is never flushed). This becomes especially apparent when I write using multiple processes on the same node.
I'd love it if, even with an unlimited shard size (size_limit=None), shard files could be written partially, so that I can create shards larger than RAM. I would personally be fine with only MDSWriter and a limited set of compression options supporting this. It seems like its encode_joint_to_shard implementation could support this.
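To illustrate what I mean by partial writes, here is a minimal sketch. It does not use the Streaming API or the MDS layout: `IncrementalShardWriter`, `read_shard`, and the footer format are hypothetical names and choices made up for this example. Each sample is appended to the file as soon as it arrives, only the per-sample sizes stay in RAM, and the index is written as a footer once the shard is finished.

```python
import os
import struct
from typing import List


class IncrementalShardWriter:
    """Hypothetical writer that appends each sample to disk immediately.

    Only the list of sample sizes is kept in memory, so the shard can
    grow beyond the available RAM.
    """

    def __init__(self, path: str) -> None:
        self.path = path
        self.sizes: List[int] = []
        self.file = open(path, "wb")

    def write(self, sample: bytes) -> None:
        # The payload goes straight to the file instead of a RAM buffer.
        self.file.write(sample)
        self.sizes.append(len(sample))

    def finish(self) -> None:
        # Footer (assumed layout): one little-endian uint64 per sample
        # size, followed by a uint64 holding the sample count.
        self.file.write(struct.pack(f"<{len(self.sizes)}Q", *self.sizes))
        self.file.write(struct.pack("<Q", len(self.sizes)))
        self.file.close()


def read_shard(path: str) -> List[bytes]:
    """Read all samples back by locating the footer at the end of the file."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        (count,) = struct.unpack("<Q", f.read(8))
        f.seek(-8 - 8 * count, os.SEEK_END)
        sizes = struct.unpack(f"<{count}Q", f.read(8 * count))
        f.seek(0)
        return [f.read(size) for size in sizes]
```

The sketch only covers uncompressed data. Whole-shard compression would need a compressor that can consume input incrementally, which is why I would be fine with limiting the supported compression options.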
Is this a feature you would accept contributions for or would it create too much maintenance workload with regard to various settings (compression etc.)?
It's a GPFS. Thanks for the idea to maybe subclass the MDSWriter! Probably the simplest solution in both implementation and maintenance. :)
Cheers for the openness to contributions!