- Forward compatibility
- Feature flags (feature strings)
- Wiki with use cases:
  - Perl releases
  - Videos with shared streams
  - Backups of audio files
  - Compression of filesystem images for forensic purposes
- Mounting lots of images with shared cache?
inode -> list of fragments
categorizer returns list of fragments OR single category
fragment -> [offset, length, category]
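
A minimal sketch of this data model (the names `fragment` and
`categorizer_result` are illustrative, not the actual dwarfs types):

    #include <cstdint>
    #include <optional>
    #include <string>
    #include <vector>

    // a fragment is a contiguous byte range of an inode, tagged with
    // the category the categorizer assigned to it
    struct fragment {
      uint64_t offset;
      uint64_t length;
      std::string category; // e.g. "pcmaudio", "binary/x86", "uncategorized"
    };

    // a categorizer either assigns a single category to the whole inode
    // or splits it into a list of fragments
    struct categorizer_result {
      std::optional<std::string> single_category; // set if not fragmented
      std::vector<fragment> fragments;            // otherwise, per-range
    };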
TODO: with PCM audio signals, we need to categorize by:
  - # of channels
  - bit depth
  - endian-ness
  - signed-ness
  - sample rate (not necessarily required)

we also need to keep track of endian-ness and signed-ness
outside of the FLAC stream, as this information isn't
stored in FLAC; we could always get this information from
the metadata of the original file, but that would be a lot
more complicated than just storing two extra bits in the
block itself; we'll have to store the decoded length as
well, so this won't make a real difference
- how do we make multiple parallel block chains (categories)
  reproducible?
  => we could re-write the blocks and update the metadata;
     however, data added to "uncategorized" blocks would
     still be random
  => oooh, easy: we can order all chunks before we start
     writing them to blocks, so only the block order is
     random after all, which we can fix by rewriting.
     the tricky bit is (probably) going to be updating
     the chunks in the metadata, as they won't be written
     to the individual blocks sequentially; pretty sure
     that's a solvable problem.
  => indeed solvable; chunks aren't finalized until after
     all blocks have been written. We simply need to subdivide
     `inode_`s into smaller parts (as provided by the categorizer)
     and then add chunks to the individual parts, assembling them
     only after everything has been written
  => as for re-writing the blocks: we need to keep track of
     the category of each block, so that we can later sort
     the blocks by category; we can do this in a separate
     section, which can later be deleted; or we can even keep
     it if we want to know later on what category a block
     belonged to
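
A sketch of how the final re-ordering could work, assuming we record each
block's category while writing (hypothetical names, not actual dwarfs code):

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <string>
    #include <vector>

    // after all blocks have been written, derive a deterministic order:
    // sort block numbers by category; std::stable_sort keeps the
    // sequential per-category write order intact, so the result no
    // longer depends on thread timing
    std::vector<size_t>
    reproducible_order(std::vector<std::string> const& block_category) {
      std::vector<size_t> order(block_category.size());
      std::iota(order.begin(), order.end(), size_t{0});
      std::stable_sort(order.begin(), order.end(), [&](size_t a, size_t b) {
        return block_category[a] < block_category[b];
      });
      return order; // order[new_block_no] == old_block_no; invert this
                    // to remap the block numbers stored in the chunks
    }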
- configuration ideas:
    --order FILETYPE::...
    -B FILETYPE::...
    -W FILETYPE::...
    -w FILETYPE::...
    -C FILETYPE::...
    -B pcmaudio::64 -W pcmaudio::16 -C pcmaudio::flac:level=8
    -C binary/x86::zstd:filter=x86
    -C mime:application/x-archive::null
    --filetype pcmaudio::mime:audio/x-wav,mime:audio/x-w64
    --categorize=pcmaudio,incompressible,binary,libmagic
    --libmagic-types=application/x-archive
- different scenarios for categorized files / chunks:
  - Video files
    - just store without compression, but perform segmentation, nothing special
    - keep in lookback buffer for longer, as it doesn't cost memory
    - needs parallelized segmenter (see below)
  - PCM audio (see also github #95)
    - segment first in case of e.g. different trims or other types of overlap
    - split into chunks (for parallel decompression)
    - compress each chunk as flac
    - headers to be saved separately
    - need to store original size and other information
    This is actually quite easy:
    - Identify PCM audio files (libmagic?)
    - Use libsndfile for parsing (see the sketch after this list)
    - Nilsimsa similarity works surprisingly well
    - We can potentially switch to larger window size for segmentation and use
      larger lookback
    - Group by format (# of channels, resolution, endian-ness, signedness, sample rate)
    - Run segmentation as usual
    - Compress each block using FLAC (hopefully we can configure how much header data
      and/or seek points etc. gets stored) or maybe even WavPack if we don't need perf
    - I *think* this can be done even with the current metadata format without any
      additions
    - The features needed should be largely orthogonal to the features needed for the
      scenarios below
  - Executables, Shared Libs, ...
    - run filter over blocks that contain a certain kind of binary data before
      compression
    - a single binary may contain machine code for different architectures,
      so we may have to store different parts of the binary in different blocks
    - actually quite similar to audio files above, except for the additional
      filter used during compression/decompression
  - JPG
    - just store a recompressed version
    - need to store original size
    - no need for segmentation except for exact duplicates
  - PDF
    - decompress contents
    - then segment
    - then compress along with other PDFs (or documents in general)
  - Other compressed formats (gz, xz, ...)
    - decompress
    - segment
    - compress
    - essentially like PDF
    - maybe only do this for small files? (option for size limit?)
- It should be possible to treat individual chunks differently, e.g. the
  WAV header should be stored independently from the contents; at some
  point, we might look deeper into tar files and compress individual
  contents differently.
- in the metadata, we need to know:
  - the fact that a stored inode is "special" (can be reflected in a single bit)
  - the type of "specialness"
  - the original file size
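
As referenced in the PCM audio scenario above, a sketch of grouping by
format using the real libsndfile API; the grouping key itself is an
assumption about how a pcmaudio categorizer might work:

    #include <sndfile.h>
    #include <string>
    #include <tuple>

    // files with the same key can share segmentation state and be
    // FLAC-compressed together
    std::tuple<int, int, int> pcm_group_key(std::string const& path) {
      SF_INFO info{}; // format must be zeroed before sf_open() for reading
      if (SNDFILE* f = sf_open(path.c_str(), SFM_READ, &info)) {
        sf_close(f);
        // channels, sample rate, and the subtype/endian bits of the format
        // field (i.e. bit depth, signedness, endian-ness)
        return {info.channels, info.samplerate,
                info.format & (SF_FORMAT_SUBMASK | SF_FORMAT_ENDMASK)};
      }
      return {-1, -1, -1}; // not parseable as audio -> uncategorized
    }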
- multi-threaded pre-matcher (for -Bn with n > 0)
  - pre-compute matches/cyclic hashes for completed blocks; these don't
    change, so we can do this with very little synchronization
  - there are two possible strategies:
    - split the input stream into chunks and then process each chunk in
      a separate thread, checking all n blocks
    - process the input stream in each thread and only check a
      subset of past blocks (this seems more wasteful, but each thread
      would only operate on a few instead of all bloom filters, which
      could be better from a cache locality pov)
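
A rough sketch of the first strategy (toy rolling hash and hypothetical
types; the real implementation would reuse the segmenter's cyclic hash
and bloom filters):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <future>
    #include <span>
    #include <unordered_map>
    #include <vector>

    constexpr size_t W = 4096; // match window size (illustrative)

    // stand-in for the segmenter's cyclic hash; a real matcher would
    // roll this incrementally instead of rehashing every window
    static uint64_t hash_window(uint8_t const* p) {
      uint64_t h = 0;
      for (size_t i = 0; i < W; ++i) h = h * 131 + p[i];
      return h;
    }

    // immutable per-block index, built once when a block is completed;
    // afterwards it is shared read-only, so threads need no locking
    struct block_index {
      std::unordered_multimap<uint64_t, size_t> pos; // hash -> block offset
      explicit block_index(std::span<uint8_t const> block) {
        for (size_t off = 0; off + W <= block.size(); off += W)
          pos.emplace(hash_window(block.data() + off), off);
      }
    };

    struct match { size_t input_off, block_no, block_off; };

    // strategy 1: split the input into chunks; each thread checks its
    // chunk against the indexes of *all* completed blocks
    // (note: a real implementation would overlap chunks by W-1 bytes so
    // matches spanning chunk boundaries aren't missed)
    std::vector<match>
    prematch(std::span<uint8_t const> input,
             std::vector<block_index> const& blocks, unsigned nthreads) {
      std::vector<std::future<std::vector<match>>> futs;
      size_t const step = (input.size() + nthreads - 1) / nthreads;
      for (unsigned t = 0; t < nthreads; ++t) {
        futs.push_back(std::async(std::launch::async, [&, t] {
          std::vector<match> out;
          size_t const end = std::min(input.size(), (t + 1) * step);
          for (size_t i = t * step; i + W <= end; ++i) {
            uint64_t const h = hash_window(input.data() + i);
            for (size_t b = 0; b < blocks.size(); ++b) {
              auto [lo, hi] = blocks[b].pos.equal_range(h);
              for (auto it = lo; it != hi; ++it)
                out.push_back({i, b, it->second});
            }
          }
          return out;
        }));
      }
      std::vector<match> all;
      for (auto& f : futs)
        for (auto const& m : f.get()) all.push_back(m);
      return all;
    }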
- per-file progress for large files?
- throughput indicator
- similarity size limit to avoid similarity computation for huge files
  - store files without similarity hash first, sorted descending by size
- allow ordering by *reverse* path
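
  (presumably: compare paths from the last character backwards, so files
  with the same name/extension sort next to each other; a minimal sketch
  of such a comparator:)

    #include <algorithm>
    #include <string_view>

    // "a/lib/libfoo.so" and "b/lib/libfoo.so" end up adjacent
    bool reverse_path_less(std::string_view a, std::string_view b) {
      return std::lexicographical_compare(a.rbegin(), a.rend(),
                                          b.rbegin(), b.rend());
    }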
- use streaming interface for zstd decompressor
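
  for reference, zstd's streaming API would be used roughly like this
  (a minimal sketch, error handling trimmed):

    #include <zstd.h>
    #include <cstddef>
    #include <vector>

    // decompress incrementally instead of requiring the full
    // decompressed block to be allocated up front
    std::vector<char> stream_decompress(void const* src, size_t src_size) {
      ZSTD_DStream* ds = ZSTD_createDStream();
      ZSTD_initDStream(ds);
      std::vector<char> out;
      std::vector<char> buf(ZSTD_DStreamOutSize());
      ZSTD_inBuffer in{src, src_size, 0};
      while (in.pos < in.size) {
        ZSTD_outBuffer ob{buf.data(), buf.size(), 0};
        size_t const ret = ZSTD_decompressStream(ds, &ob, &in);
        if (ZSTD_isError(ret)) break; // real code: ZSTD_getErrorName(ret)
        out.insert(out.end(), buf.data(), buf.data() + ob.pos);
      }
      ZSTD_freeDStream(ds);
      return out;
    }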
- json metadata recovery
- add --chmod, --chown
- add some simple filter rules?
- handle sparse files?
- try to be more resilient to modifications of the input while creating fs
- dwarfsck:
  - show which entries a block references
  - show partial metadata dumps at lower detail levels
  - make dwarfsck more usable
- cleanup TODOs
- folly: dynamic should support string_view
- frozen: ViewBase.getPosition() should be const
- docs, moar tests
- extended attributes:
  - number of blocks
  - number of chunks
  - number of times opened?
- per-file "hotness" (how often was a file opened); dump to file upon umount
- --unpack option
- readahead?
- remove multiple blockhash window sizes, one is enough apparently?
- window-increment-shift seems silly to configure?
- identify blocks that contain mostly binary data and adjust compressor?
- metadata stripping (i.e. re-write metadata without owner/time info)
- metadata repacking (e.g. just recompress/decompress the metadata block)
/*
scanner:
bhw= - 388.3s 13.07 GiB
bhw= 8 812.9s 7.559 GiB
bhw= 9 693.1s 7.565 GiB
bhw=10 651.8s 7.617 GiB
bhw=11 618.7s 7.313 GiB
bhw=12 603.6s 7.625 GiB
bhw=13 591.2s 7.858 GiB
bhw=14 574.1s 8.306 GiB
bhw=15 553.8s 8.869 GiB
bhw=16 541.9s 9.529 GiB
lz4:
<---- 1m29.535s / 9m31.212s
lz4hc:
1 - 20.94s - 2546 MiB
2 - 21.67s - 2441 MiB
3 - 24.19s - 2377 MiB
4 - 27.29s - 2337 MiB
5 - 31.49s - 2311 MiB
6 - 36.39s - 2294 MiB
7 - 42.04s - 2284 MiB
8 - 48.67s - 2277 MiB
9 - 56.94s - 2273 MiB <---- 1m27.979s / 9m20.637s
10 - 68.03s - 2271 MiB
11 - 79.54s - 2269 MiB
12 - 94.84s - 2268 MiB
zstd:
1 - 11.42s - 1667 MiB
2 - 12.95s - 1591 MiB <---- 2m8.351s / 15m25.752s
3 - 22.03s - 1454 MiB
4 - 25.64s - 1398 MiB
5 - 32.34s - 1383 MiB
6 - 41.45s - 1118 MiB <---- 2m4.258s / 14m28.627s
7 - 46.26s - 1104 MiB
8 - 53.34s - 1077 MiB
9 - 59.99s - 1066 MiB
10 - 63.3s - 1066 MiB
11 - 66.97s - 956 MiB <---- 2m3.496s / 14m17.862s
12 - 79.89s - 953 MiB
13 - 89.8s - 943 MiB
14 - 118.1s - 941 MiB
15 - 230s - 951 MiB
16 - 247.4s - 863 MiB <---- 2m11.202s / 14m57.245s
17 - 294.5s - 854 MiB
18 - 634s - 806 MiB
19 - 762.5s - 780 MiB
20 - 776.8s - 718 MiB <---- 2m16.448s / 15m43.923s
21 - 990.4s - 716 MiB
22 - 984.3s - 715 MiB <---- 2m18.133s / 15m55.263s
lzma:
level=6:dict_size=21 921.9s - 838.8 MiB <---- 5m11.219s / 37m36.002s
*/
Perl:
542 versions of perl
found/scanned: 152809/152809 dirs, 0/0 links, 1325098/1325098 files
original size: 32.03 GiB, saved: 19.01 GiB by deduplication (1133032 duplicate files), 5.835 GiB by segmenting
filesystem size: 7.183 GiB in 460 blocks (499389 chunks, 192066/192066 inodes), 460 blocks/662.3 MiB written
bench
build real user
-----------------------------------------------------------------------------------------------------
-rw-r--r-- 1 mhx users 14G Jul 27 23:11 perl-install-0.dwarfs 8:05 0:38 0:45
-rw-r--r-- 1 mhx users 4.8G Jul 27 23:18 perl-install-1.dwarfs 6:34 0:14 1:24
-rw-r--r-- 1 mhx users 3.8G Jul 27 23:26 perl-install-2.dwarfs 7:31 0:17 1:11
-rw-r--r-- 1 mhx users 3.2G Jul 27 23:36 perl-install-3.dwarfs 10:11 0:11 0:59
-rw-r--r-- 1 mhx users 1.8G Jul 27 23:47 perl-install-4.dwarfs 11:05 0:14 1:24
-rw-r--r-- 1 mhx users 1.2G Jul 27 23:59 perl-install-5.dwarfs 11:53 0:13 1:15
-rw-r--r-- 1 mhx users 901M Jul 28 00:16 perl-install-6.dwarfs 17:42 0:14 1:25
-rw-r--r-- 1 mhx users 704M Jul 28 00:37 perl-install-7.dwarfs 20:52 0:20 2:14
-rw-r--r-- 1 mhx users 663M Jul 28 04:04 perl-install-8.dwarfs 24:13 0:50 6:02
-rw-r--r-- 1 mhx users 615M Jul 28 02:50 perl-install-9.dwarfs 34:40 0:51 5:50
-rw-r--r-- 1 mhx users 3.6G Jul 28 09:13 perl-install-defaults.squashfs 17:20
-rw-r--r-- 1 mhx users 2.4G Jul 28 10:42 perl-install-opt.squashfs 71:49
soak:
-7 (cache=1g)
Passed with 542 of 542 combinations.
real 75m21.191s
user 68m3.903s
sys 6m21.020s
-9 (cache=1g)
Passed with 542 of 542 combinations.
real 118m48.371s
user 107m35.685s
sys 7m16.438s
squashfs-opt
real 81m36.957s
user 62m37.369s
sys 20m52.367s
-1 (cache=2g)
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none
real 2m19.927s
user 0m16.813s
sys 2m4.293s
-7 (cache=2g)
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none
real 2m24.346s
user 0m17.007s
sys 1m59.823s
squash-default
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none
real 8m41.594s
user 1m25.346s
sys 19m12.036s
squash-opt
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none
real 141m41.092s
user 1m12.650s
sys 59m18.194s