- Forward compatibility
- Feature flags (feature strings)
- Wiki with use cases:
  - Perl releases
  - Videos with shared streams
  - Backups of audio files
  - Compression of filesystem images for forensic purposes
- Mounting lots of images with shared cache?
inode -> list of fragments
categorizer returns list of fragments OR single category
fragment -> [offset, length, category]
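
A minimal sketch of this data model (the names `fragment` and
`categorizer_result` are illustrative, not the actual dwarfs types):

    #include <cstdint>
    #include <optional>
    #include <string>
    #include <vector>

    // a fragment is a contiguous byte range of an inode, tagged with
    // the category the categorizer assigned to it
    struct fragment {
      uint64_t offset;
      uint64_t length;
      std::string category; // e.g. "pcmaudio", "binary/x86", "uncategorized"
    };

    // a categorizer either assigns a single category to the whole inode
    // or splits it into a list of fragments
    struct categorizer_result {
      std::optional<std::string> single_category; // set if not fragmented
      std::vector<fragment> fragments;            // otherwise, per-range
    };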
TODO: with PCM audio signals, we need to categorize by:
  - # of channels
  - bit depth
  - endian-ness
  - signed-ness
  - sample rate (not necessarily required)

we also need to keep track of endian-ness and signed-ness
outside of the FLAC stream, as this information isn't
stored in FLAC; we could always get this information from
the metadata of the original file, but that would be a lot
more complicated than just storing two extra bits in the
block itself; we'll have to store the decoded length as
well, so this won't make a real difference
- how do we make multiple parallel block chains (categories)
  reproducible?
  => we could re-write the blocks and update the metadata;
     however, data added to "uncategorized" blocks would
     still be random
  => oooh, easy: we can order all chunks before we start
     writing them to blocks, so only the block order is
     random after all, which we can fix by rewriting.
     the tricky bit is (probably) going to be updating
     the chunks in the metadata, as they won't be written
     to the individual blocks sequentially; pretty sure
     that's a solvable problem.
  => indeed solvable; chunks aren't finalized until after
     all blocks have been written. We simply need to subdivide
     `inode_`s into smaller parts (as provided by the categorizer)
     and then add chunks to the individual parts, assembling them
     only after everything has been written
  => as for re-writing the blocks: we need to keep track of
     the category of each block, so that we can later sort
     the blocks by category; we can do this in a separate
     section, which can later be deleted; or we can even keep
     it if we want to know later on what category a block
     belonged to
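
A sketch of how the final re-ordering could work, assuming we record each
block's category while writing (hypothetical names, not actual dwarfs code):

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <string>
    #include <vector>

    // after all blocks have been written, derive a deterministic order:
    // sort block numbers by category; std::stable_sort keeps the
    // sequential per-category write order intact, so the result no
    // longer depends on thread timing
    std::vector<size_t>
    reproducible_order(std::vector<std::string> const& block_category) {
      std::vector<size_t> order(block_category.size());
      std::iota(order.begin(), order.end(), size_t{0});
      std::stable_sort(order.begin(), order.end(), [&](size_t a, size_t b) {
        return block_category[a] < block_category[b];
      });
      return order; // order[new_block_no] == old_block_no; invert this
                    // to remap the block numbers stored in the chunks
    }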
- configuration ideas:
    --order FILETYPE::...
    -B FILETYPE::...
    -W FILETYPE::...
    -w FILETYPE::...
    -C FILETYPE::...
    -B pcmaudio::64 -W pcmaudio::16 -C pcmaudio::flac:level=8
    -C binary/x86::zstd:filter=x86
    -C mime:application/x-archive::null
    --filetype pcmaudio::mime:audio/x-wav,mime:audio/x-w64
    --categorize=pcmaudio,incompressible,binary,libmagic
    --libmagic-types=application/x-archive
- different scenarios for categorized files / chunks:
  - Video files
    - just store without compression, but perform segmentation, nothing special
    - keep in lookback buffer for longer, as it doesn't cost memory
    - needs parallelized segmenter (see below)
  - PCM audio (see also github #95)
    - segment first in case of e.g. different trims or other types of overlap
    - split into chunks (for parallel decompression)
    - compress each chunk as flac
    - headers to be saved separately
    - need to store original size and other information
    This is actually quite easy:
    - Identify PCM audio files (libmagic?)
    - Use libsndfile for parsing (see the sketch after this list)
    - Nilsimsa similarity works surprisingly well
    - We can potentially switch to larger window size for segmentation and use
      larger lookback
    - Group by format (# of channels, resolution, endian-ness, signedness, sample rate)
    - Run segmentation as usual
    - Compress each block using FLAC (hopefully we can configure how much header data
      and/or seek points etc. gets stored) or maybe even WavPack if we don't need perf
    - I *think* this can be done even with the current metadata format without any
      additions
    - The features needed should be largely orthogonal to the features needed for the
      scenarios below
  - Executables, Shared Libs, ...
    - run filter over blocks that contain a certain kind of binary data before
      compression
    - a single binary may contain machine code for different architectures,
      so we may have to store different parts of the binary in different blocks
    - actually quite similar to audio files above, except for the additional
      filter used during compression/decompression
  - JPG
    - just store a recompressed version
    - need to store original size
    - no need for segmentation except for exact duplicates
  - PDF
    - decompress contents
    - then segment
    - then compress along with other PDFs (or documents in general)
  - Other compressed formats (gz, xz, ...)
    - decompress
    - segment
    - compress
    - essentially like PDF
    - maybe only do this for small files? (option for size limit?)
- It should be possible to treat individual chunks differently, e.g. the
  WAV header should be stored independently from the contents; at some
  point, we might look deeper into tar files and compress individual
  contents differently.
- in the metadata, we need to know:
  - the fact that a stored inode is "special" (can be reflected in a single bit)
  - the type of "specialness"
  - the original file size
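
As referenced in the PCM audio scenario above, a sketch of grouping by
format using the real libsndfile API; the grouping key itself is an
assumption about how a pcmaudio categorizer might work:

    #include <sndfile.h>
    #include <string>
    #include <tuple>

    // files with the same key can share segmentation state and be
    // FLAC-compressed together
    std::tuple<int, int, int> pcm_group_key(std::string const& path) {
      SF_INFO info{}; // format must be zeroed before sf_open() for reading
      if (SNDFILE* f = sf_open(path.c_str(), SFM_READ, &info)) {
        sf_close(f);
        // channels, sample rate, and the subtype/endian bits of the format
        // field (i.e. bit depth, signedness, endian-ness)
        return {info.channels, info.samplerate,
                info.format & (SF_FORMAT_SUBMASK | SF_FORMAT_ENDMASK)};
      }
      return {-1, -1, -1}; // not parseable as audio -> uncategorized
    }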
- multi-threaded pre-matcher (for -Bn with n > 0)
  - pre-compute matches/cyclic hashes for completed blocks; these don't
    change, so we can do this with very little synchronization
  - there are two possible strategies:
    - split the input stream into chunks and then process each chunk in
      a separate thread, checking all n blocks
    - process the input stream in each thread and only check a
      subset of past blocks (this seems more wasteful, but each thread
      would only operate on a few instead of all bloom filters, which
      could be better from a cache locality pov)
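
A rough sketch of the first strategy (toy rolling hash and hypothetical
types; the real implementation would reuse the segmenter's cyclic hash
and bloom filters):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <future>
    #include <span>
    #include <unordered_map>
    #include <vector>

    constexpr size_t W = 4096; // match window size (illustrative)

    // stand-in for the segmenter's cyclic hash; a real matcher would
    // roll this incrementally instead of rehashing every window
    static uint64_t hash_window(uint8_t const* p) {
      uint64_t h = 0;
      for (size_t i = 0; i < W; ++i) h = h * 131 + p[i];
      return h;
    }

    // immutable per-block index, built once when a block is completed;
    // afterwards it is shared read-only, so threads need no locking
    struct block_index {
      std::unordered_multimap<uint64_t, size_t> pos; // hash -> block offset
      explicit block_index(std::span<uint8_t const> block) {
        for (size_t off = 0; off + W <= block.size(); off += W)
          pos.emplace(hash_window(block.data() + off), off);
      }
    };

    struct match { size_t input_off, block_no, block_off; };

    // strategy 1: split the input into chunks; each thread checks its
    // chunk against the indexes of *all* completed blocks
    // (note: a real implementation would overlap chunks by W-1 bytes so
    // matches spanning chunk boundaries aren't missed)
    std::vector<match>
    prematch(std::span<uint8_t const> input,
             std::vector<block_index> const& blocks, unsigned nthreads) {
      std::vector<std::future<std::vector<match>>> futs;
      size_t const step = (input.size() + nthreads - 1) / nthreads;
      for (unsigned t = 0; t < nthreads; ++t) {
        futs.push_back(std::async(std::launch::async, [&, t] {
          std::vector<match> out;
          size_t const end = std::min(input.size(), (t + 1) * step);
          for (size_t i = t * step; i + W <= end; ++i) {
            uint64_t const h = hash_window(input.data() + i);
            for (size_t b = 0; b < blocks.size(); ++b) {
              auto [lo, hi] = blocks[b].pos.equal_range(h);
              for (auto it = lo; it != hi; ++it)
                out.push_back({i, b, it->second});
            }
          }
          return out;
        }));
      }
      std::vector<match> all;
      for (auto& f : futs)
        for (auto const& m : f.get()) all.push_back(m);
      return all;
    }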
- per-file progress for large files?
- throughput indicator
- similarity size limit to avoid similarity computation for huge files
  - store files without similarity hash first, sorted descending by size
- allow ordering by *reverse* path
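
  (presumably: compare paths from the last character backwards, so files
  with the same name/extension sort next to each other; a minimal sketch
  of such a comparator:)

    #include <algorithm>
    #include <string_view>

    // "a/lib/libfoo.so" and "b/lib/libfoo.so" end up adjacent
    bool reverse_path_less(std::string_view a, std::string_view b) {
      return std::lexicographical_compare(a.rbegin(), a.rend(),
                                          b.rbegin(), b.rend());
    }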
- use streaming interface for zstd decompressor
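
  for reference, zstd's streaming API would be used roughly like this
  (a minimal sketch, error handling trimmed):

    #include <zstd.h>
    #include <cstddef>
    #include <vector>

    // decompress incrementally instead of requiring the full
    // decompressed block to be allocated up front
    std::vector<char> stream_decompress(void const* src, size_t src_size) {
      ZSTD_DStream* ds = ZSTD_createDStream();
      ZSTD_initDStream(ds);
      std::vector<char> out;
      std::vector<char> buf(ZSTD_DStreamOutSize());
      ZSTD_inBuffer in{src, src_size, 0};
      while (in.pos < in.size) {
        ZSTD_outBuffer ob{buf.data(), buf.size(), 0};
        size_t const ret = ZSTD_decompressStream(ds, &ob, &in);
        if (ZSTD_isError(ret)) break; // real code: ZSTD_getErrorName(ret)
        out.insert(out.end(), buf.data(), buf.data() + ob.pos);
      }
      ZSTD_freeDStream(ds);
      return out;
    }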
- json metadata recovery
- add --chmod, --chown
- add some simple filter rules?
- handle sparse files?
- try to be more resilient to modifications of the input while creating fs
- dwarfsck:
  - show which entries a block references
  - show partial metadata dumps at lower detail levels
  - make dwarfsck more usable
- cleanup TODOs
- folly: dynamic should support string_view
- frozen: ViewBase.getPosition() should be const
- docs, moar tests
- extended attributes:
  - number of blocks
  - number of chunks
  - number of times opened?
- per-file "hotness" (how often was a file opened); dump to file upon umount
- --unpack option
- readahead?
- remove multiple blockhash window sizes, one is enough apparently?
- window-increment-shift seems silly to configure?
- identify blocks that contain mostly binary data and adjust compressor?
- metadata stripping (i.e. re-write metadata without owner/time info)
- metadata repacking (e.g. just recompress/decompress the metadata block)
/*
scanner:
bhw= - 388.3s 13.07 GiB
bhw= 8 812.9s 7.559 GiB
bhw= 9 693.1s 7.565 GiB
bhw=10 651.8s 7.617 GiB
bhw=11 618.7s 7.313 GiB
bhw=12 603.6s 7.625 GiB
bhw=13 591.2s 7.858 GiB
bhw=14 574.1s 8.306 GiB
bhw=15 553.8s 8.869 GiB
bhw=16 541.9s 9.529 GiB
lz4:
<---- 1m29.535s / 9m31.212s
lz4hc:
1 - 20.94s - 2546 MiB
2 - 21.67s - 2441 MiB
3 - 24.19s - 2377 MiB
4 - 27.29s - 2337 MiB
5 - 31.49s - 2311 MiB
6 - 36.39s - 2294 MiB
7 - 42.04s - 2284 MiB
8 - 48.67s - 2277 MiB
9 - 56.94s - 2273 MiB <---- 1m27.979s / 9m20.637s
10 - 68.03s - 2271 MiB
11 - 79.54s - 2269 MiB
12 - 94.84s - 2268 MiB
zstd:
1 - 11.42s - 1667 MiB
2 - 12.95s - 1591 MiB <---- 2m8.351s / 15m25.752s
3 - 22.03s - 1454 MiB
4 - 25.64s - 1398 MiB
5 - 32.34s - 1383 MiB
6 - 41.45s - 1118 MiB <---- 2m4.258s / 14m28.627s
7 - 46.26s - 1104 MiB
8 - 53.34s - 1077 MiB
9 - 59.99s - 1066 MiB
10 - 63.3s - 1066 MiB
11 - 66.97s - 956 MiB <---- 2m3.496s / 14m17.862s
12 - 79.89s - 953 MiB
13 - 89.8s - 943 MiB
14 - 118.1s - 941 MiB
15 - 230s - 951 MiB
16 - 247.4s - 863 MiB <---- 2m11.202s / 14m57.245s
17 - 294.5s - 854 MiB
18 - 634s - 806 MiB
19 - 762.5s - 780 MiB
20 - 776.8s - 718 MiB <---- 2m16.448s / 15m43.923s
21 - 990.4s - 716 MiB
22 - 984.3s - 715 MiB <---- 2m18.133s / 15m55.263s
lzma:
level=6:dict_size=21 921.9s - 838.8 MiB <---- 5m11.219s / 37m36.002s
*/
Perl:
542 versions of perl
found/scanned: 152809/152809 dirs, 0/0 links, 1325098/1325098 files
original size: 32.03 GiB, saved: 19.01 GiB by deduplication (1133032 duplicate files), 5.835 GiB by segmenting
filesystem size: 7.183 GiB in 460 blocks (499389 chunks, 192066/192066 inodes), 460 blocks/662.3 MiB written
bench
build real user
-----------------------------------------------------------------------------------------------------
-rw-r--r-- 1 mhx users 14G Jul 27 23:11 perl-install-0.dwarfs 8:05 0:38 0:45
-rw-r--r-- 1 mhx users 4.8G Jul 27 23:18 perl-install-1.dwarfs 6:34 0:14 1:24
-rw-r--r-- 1 mhx users 3.8G Jul 27 23:26 perl-install-2.dwarfs 7:31 0:17 1:11
-rw-r--r-- 1 mhx users 3.2G Jul 27 23:36 perl-install-3.dwarfs 10:11 0:11 0:59
-rw-r--r-- 1 mhx users 1.8G Jul 27 23:47 perl-install-4.dwarfs 11:05 0:14 1:24
-rw-r--r-- 1 mhx users 1.2G Jul 27 23:59 perl-install-5.dwarfs 11:53 0:13 1:15
-rw-r--r-- 1 mhx users 901M Jul 28 00:16 perl-install-6.dwarfs 17:42 0:14 1:25
-rw-r--r-- 1 mhx users 704M Jul 28 00:37 perl-install-7.dwarfs 20:52 0:20 2:14
-rw-r--r-- 1 mhx users 663M Jul 28 04:04 perl-install-8.dwarfs 24:13 0:50 6:02
-rw-r--r-- 1 mhx users 615M Jul 28 02:50 perl-install-9.dwarfs 34:40 0:51 5:50
-rw-r--r-- 1 mhx users 3.6G Jul 28 09:13 perl-install-defaults.squashfs 17:20
-rw-r--r-- 1 mhx users 2.4G Jul 28 10:42 perl-install-opt.squashfs 71:49
soak:
-7 (cache=1g)
Passed with 542 of 542 combinations.
real 75m21.191s
user 68m3.903s
sys 6m21.020s
-9 (cache=1g)
Passed with 542 of 542 combinations.
real 118m48.371s
user 107m35.685s
sys 7m16.438s
squashfs-opt
real 81m36.957s
user 62m37.369s
sys 20m52.367s
-1 (cache=2g)
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none
real 2m19.927s
user 0m16.813s
sys 2m4.293s
-7 (cache=2g)
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none
real 2m24.346s
user 0m17.007s
sys 1m59.823s
squash-default
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none
real 8m41.594s
user 1m25.346s
sys 19m12.036s
squash-opt
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none
real 141m41.092s
user 1m12.650s
sys 59m18.194s