Commit

Minor doc edits
Bulat-Ziganshin committed Jun 18, 2016
1 parent d28523d commit 1d362aa
Showing 3 changed files with 24 additions and 16 deletions.
20 changes: 13 additions & 7 deletions algo_lzp/README.md
@@ -1,3 +1,7 @@
[lzp-cpu-bsc.cpp]: lzp-cpu-bsc.cpp
[lzp-cpu-bsc-mod.cpp]: lzp-cpu-bsc-mod.cpp
[lzp-cpu-rollhash.cpp]: lzp-cpu-rollhash.cpp


This directory contains my experiments on fast LZP coder. Current results on enwik9 with i7-4770:
```
@@ -7,19 +11,21 @@ lzp_cpu_rollhash : 1,000,000,000 => 855,369,315 (85.54%) 476 MiB/s, 20
lzp_cpu_rollhash (OpenMP): 1520 MiB/s, 627.267 ms
```

### [lzp-cpu-bsc.cpp](lzp-cpu-bsc.cpp)
### [lzp-cpu-bsc.cpp]

Original BSC implementation.

### [lzp-cpu-bsc-mod.cpp](lzp-cpu-bsc-mod.cpp)
### [lzp-cpu-bsc-mod.cpp]

Original BSC implementation slightly optimized with low-level x86 tricks.
BSC implementation slightly optimized with low-level x86 tricks.
Output format is incompatible with [lzp-cpu-bsc.cpp] due to use of faster hash function.

### [lzp-cpu-rollhash.cpp](lzp-cpu-rollhash.cpp)
### [lzp-cpu-rollhash.cpp]

Each hash-table entry stores (in addition to a pointer) checksum of minLen bytes it points to.
This allows to spend most of time inside inner loop - in >99% cases, when we are going out
of the inner loop, we have real match. The checksum saved is multiplicative rolling hash
of these bytes.
This allows to spend most of time inside innermost branch-less loop - in >99% cases, when we are going out
of the innermost loop, we have a real match. The checksum saved is multiplicative rolling hash
of those minLen bytes. The output format is compatible with [lzp-cpu-bsc-mod.cpp]. Current implementation
employs 64-bit operations, making it suboptimal on 32-bit platforms.

Plus the same x86 tricks.
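
To illustrate the checksum idea described for [lzp-cpu-rollhash.cpp], here is a minimal sketch of a multiplicative rolling hash over a window of `min_len` bytes. It is not code from the repository: the multiplier `PRIME`, the function names, and the polynomial layout are assumptions chosen for the example, and 64-bit wraparound is emulated with a mask.

```python
MASK = (1 << 64) - 1          # emulate 64-bit unsigned wraparound
PRIME = 0x9E3779B97F4A7C15    # hypothetical odd multiplier, not taken from the repository


def direct_hash(data, pos, min_len):
    """Hash of data[pos:pos+min_len], computed from scratch."""
    h = 0
    for b in data[pos:pos + min_len]:
        h = (h * PRIME + b) & MASK
    return h


def rolling_hashes(data, min_len):
    """Hashes of every min_len-byte window, each derived from the previous one."""
    if len(data) < min_len:
        return []
    top = pow(PRIME, min_len - 1, 1 << 64)   # weight of the byte leaving the window
    h = direct_hash(data, 0, min_len)
    out = [h]
    for pos in range(1, len(data) - min_len + 1):
        # drop the outgoing byte, shift the window, append the incoming byte
        h = ((h - data[pos - 1] * top) * PRIME + data[pos + min_len - 1]) & MASK
        out.append(h)
    return out
```

Each step costs one subtraction, two multiplications, and one addition regardless of `min_len`. Equal windows always produce equal checksums, so comparing a stored checksum with the current one can reject most non-matching table entries without touching the bytes they point to.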
2 changes: 1 addition & 1 deletion algo_mtf/README.md
@@ -24,7 +24,7 @@ Further optimizations:

Combined algo:
- check for first 32 ranks using MTF queue in AVX2 register or two SSE2 registers, going into shelwien cycle only for rare ranks>32
- in order to provide sufficient ILP to deal with delays of PCMPEQB+PMOVMSKB+TZCNT+Jxx, interleave processing of 2 symbols from each of 2 blocks
- in order to provide sufficient ILP to deal with latency of PCMPEQB+PMOVMSKB+TZCNT+Jxx, interleave processing of 2 symbols from each of 2 blocks


### CUDA implementations
18 changes: 10 additions & 8 deletions app_bslab/README.md
@@ -3,9 +3,9 @@
[BWT]: ../algo_bwt
[MTF]: ../algo_mtf
[EC]: ../algo_ec
[results-cpu.txt]: (results-cpu.txt)
[results-cuda.txt]: (results-cuda.txt)
[profile.txt]: (profile.txt)
[results-cpu.txt]: results-cpu.txt
[results-cuda.txt]: results-cuda.txt
[profile.txt]: profile.txt


BSLab stands for the block-sorting laboratory.
@@ -19,7 +19,7 @@ It sequentially applies to input data all algorithms employed in real compressor
On every stage (except for RLE) we have a choice of algorithms, including those implemented in BSC as the baseline.
Output of the last algorithm completed on every stage (except for OpenMP LZP) goes as the input to the next stage.
Individual algorithms can be selected with options like -mtf1,3-4, or you may disable some algos with option like -mtf-1,3-4.
You can also completely disable some stages by optiions like -nolzp, and control LZP stage parameters with options -h and -l.
You can also completely disable some stages by options like -nolzp, and control LZP stage parameters with options -h and -l.
Blocksize is controlled by -b option, you may need to reduce it if program exits with memory allocation error.

We have tested the following compilers:
@@ -51,16 +51,18 @@ rle: 95,006,102 => 36,703,518 (38.63% / 36.70%) >255: 15,171, rank+len: 46,60
This means that data were compressed from 100,000,000 to 95,006,102 bytes at LZP stage,
and then to 36,703,797 bytes at RLE stage, while BWT and MTF stages are 1:1 mappings.
36,703,797 bytes is 38.63% of RLE stage input (95,025,330 bytes) and
36.70% of original data (100,000,000 bytes). Also, we see that
- 15,164 run-length counts (out of total 36,703,797) produced by RLE stage were higher than 255
- 46,604,415 values represent ranks and lengths>1
- 52,271,085 values represent ranks and lengths in 1/2 encoding
36.70% of original data (100,000,000 bytes).

In the speed department, we see that mtf_cpu_bsc algo was finished in 526.415 ms, that's
66.5 MiB/s compared to its input (36,703,797 bytes) and
181 MiB/s compared to original data (100,000,000 bytes).
The alternative MTF algorithm, namely mtf_cpu_shelwien, was finished in just 250.876 ms.

Also, we see that
- 15,164 run-length counts (out of total 36,703,797) produced by RLE stage were higher than 255
- 46,604,415 values represent ranks and lengths>1
- 52,271,085 values represent ranks and lengths in 1/2 encoding
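
The speed and ratio figures quoted in this hunk can be re-derived from the byte counts and the elapsed time. The sketch below assumes that 1 MiB = 2^20 bytes and that the time is wall-clock milliseconds; the helper names are made up for the example.

```python
def mib_per_s(num_bytes, elapsed_ms):
    """Throughput in MiB/s for num_bytes processed in elapsed_ms milliseconds."""
    return num_bytes / (elapsed_ms / 1000.0) / 2**20


def percent(part, whole):
    """Size of part as a percentage of whole."""
    return 100.0 * part / whole


stage_input = 36_703_797      # bytes fed to the MTF stage
original = 100_000_000        # bytes of original data
elapsed_ms = 526.415          # mtf_cpu_bsc time from the text

print(f"{mib_per_s(stage_input, elapsed_ms):.1f} MiB/s relative to stage input")
print(f"{mib_per_s(original, elapsed_ms):.0f} MiB/s relative to original data")
print(f"{percent(stage_input, original):.2f}% of original data")
```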


### Full results
