Minor doc edits

Bulat-Ziganshin · Jun 18, 2016 · 1d362aa
1 parent d28523d
commit 1d362aa

Showing 3 changed files with 24 additions and 16 deletions.
diff --git a/algo_lzp/README.md b/algo_lzp/README.md
@@ -1,3 +1,7 @@
+[lzp-cpu-bsc.cpp]:        lzp-cpu-bsc.cpp
+[lzp-cpu-bsc-mod.cpp]:    lzp-cpu-bsc-mod.cpp
+[lzp-cpu-rollhash.cpp]:   lzp-cpu-rollhash.cpp
+
 
 This directory contains my experiments on fast LZP coder. Current results on enwik9 with i7-4770:
 ```
@@ -7,19 +11,21 @@ lzp_cpu_rollhash         : 1,000,000,000 => 855,369,315 (85.54%)  476 MiB/s,  20
 lzp_cpu_rollhash (OpenMP):                                       1520 MiB/s,   627.267 ms
 ```
 
-### [lzp-cpu-bsc.cpp](lzp-cpu-bsc.cpp)
+### [lzp-cpu-bsc.cpp]
 
 Original BSC implementation.
 
-### [lzp-cpu-bsc-mod.cpp](lzp-cpu-bsc-mod.cpp)
+### [lzp-cpu-bsc-mod.cpp]
 
-Original BSC implementation slightly optimized with low-level x86 tricks.
+BSC implementation slightly optimized with low-level x86 tricks.
+The output format is incompatible with [lzp-cpu-bsc.cpp] due to the use of a faster hash function.
 
-### [lzp-cpu-rollhash.cpp](lzp-cpu-rollhash.cpp)
+### [lzp-cpu-rollhash.cpp]
 
 Each hash-table entry stores (in addition to a pointer) checksum of minLen bytes it points to.
-This allows to spend most of time inside inner loop - in >99% cases, when we are going out
-of the inner loop, we have real match. The checksum saved is multiplicative rolling hash
-of these bytes.
+This allows the coder to spend most of its time inside the innermost branch-less loop: in >99% of cases, when we leave
+the innermost loop, we have a real match. The checksum saved is a multiplicative rolling hash
+of those minLen bytes. The output format is compatible with [lzp-cpu-bsc-mod.cpp]. The current implementation
+employs 64-bit operations, making it suboptimal on 32-bit platforms.
 
 Plus the same x86 tricks.
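
The checksum idea described in this README hunk can be illustrated with a minimal sketch. It is not taken from lzp-cpu-rollhash.cpp: the multiplier `kMul`, the `Entry` layout, and all function names are hypothetical, and the real coder may combine the hash and the table differently. The sketch only shows how a multiplicative rolling hash over a window of minLen bytes can be updated in O(1) per byte with 64-bit arithmetic, and how a stored checksum lets a candidate be rejected without touching the bytes it points to.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical odd 64-bit multiplier; the actual coder may use another constant.
constexpr std::uint64_t kMul = 0x9E3779B97F4A7C15ULL;

// Direct hash of the n bytes at p: sum of p[i] * kMul^(n-1-i), wrapping mod 2^64.
std::uint64_t hash_direct(const unsigned char* p, std::size_t n) {
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < n; ++i) h = h * kMul + p[i];
    return h;
}

// kMul^(n-1): the weight of the oldest byte in a window of n bytes.
std::uint64_t top_power(std::size_t n) {
    assert(n > 0);
    std::uint64_t pw = 1;
    for (std::size_t i = 1; i < n; ++i) pw *= kMul;
    return pw;
}

// Slide the window one byte forward: drop `out`, append `in`.
std::uint64_t hash_roll(std::uint64_t h, unsigned char out, unsigned char in,
                        std::uint64_t top) {
    return (h - out * top) * kMul + in;
}

// Hypothetical table entry: a position plus a checksum of the minLen bytes there.
struct Entry {
    std::uint32_t pos;
    std::uint32_t checksum;
};

// Use the upper 32 bits of the rolling hash as the stored checksum.
std::uint32_t checksum_of(std::uint64_t h) {
    return static_cast<std::uint32_t>(h >> 32);
}

// Cheap pre-check: only when the checksums agree is a byte comparison worthwhile.
bool probable_match(const Entry& e, std::uint64_t h) {
    return e.checksum == checksum_of(h);
}
```

With such a layout, equal minLen-byte windows always produce equal checksums, while different windows collide only rarely, which is consistent with the ">99%" remark in the README text; the exact rate depends on the hash actually used.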
diff --git a/algo_mtf/README.md b/algo_mtf/README.md
@@ -24,7 +24,7 @@ Further optimizations:
 
 Combined algo:
 - check for first 32 ranks using MTF queue in AVX2 register or two SSE2 registers, going into shelwien cycle only for rare ranks>32
-- in order to provide sufficient ILP to deal with delays of PCMPEQB+PMOVMSKB+TZCNT+Jxx, interleave processing of 2 symbols from each of 2 blocks
+- in order to provide sufficient ILP to deal with latency of PCMPEQB+PMOVMSKB+TZCNT+Jxx, interleave processing of 2 symbols from each of 2 blocks
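
The rank check named in the list above (PCMPEQB+PMOVMSKB+TZCNT+Jxx) can be sketched as follows. This is a simplified illustration rather than the repository's code: it covers only the first 16 ranks in a single SSE2 register instead of 32, it does not interleave symbols from several blocks, and the queue update goes through memory instead of staying in registers. The names `find_rank16` and `move_to_front16` are hypothetical, and a scalar fallback is used where SSE2 is unavailable.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Index of the lowest set bit of a non-zero mask (what TZCNT/BSF compute).
inline int lowest_set_bit(unsigned mask) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_ctz(mask);
#else
    int i = 0;
    while ((mask & 1u) == 0) { mask >>= 1; ++i; }
    return i;
#endif
}

// Rank (0..15) of `sym` among the first 16 MTF queue slots, or -1 if it is not there.
int find_rank16(const std::uint8_t queue[16], std::uint8_t sym) {
#ifdef SKETCH_HAVE_SSE2
    __m128i q  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(queue));
    __m128i eq = _mm_cmpeq_epi8(q, _mm_set1_epi8(static_cast<char>(sym))); // PCMPEQB
    unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(eq));          // PMOVMSKB
    if (mask == 0) return -1;       // Jxx: rare case, fall back to a slower search
    return lowest_set_bit(mask);    // TZCNT
#else
    for (int i = 0; i < 16; ++i)
        if (queue[i] == sym) return i;
    return -1;
#endif
}

// Move the symbol found at `rank` to the front, shifting the earlier slots down by one.
void move_to_front16(std::uint8_t queue[16], int rank) {
    assert(rank >= 0 && rank < 16);
    std::uint8_t sym = queue[rank];
    std::memmove(queue + 1, queue, static_cast<std::size_t>(rank));
    queue[0] = sym;
}
```

Each lookup is a dependent chain of compare, mask extraction, bit scan and branch, which is why the README suggests interleaving independent symbols to hide that latency.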
 
 
 ### CUDA implementations

diff --git a/app_bslab/README.md b/app_bslab/README.md
@@ -3,9 +3,9 @@
 [BWT]:   ../algo_bwt
 [MTF]:   ../algo_mtf
 [EC]:    ../algo_ec
-[results-cpu.txt]:   (results-cpu.txt)
-[results-cuda.txt]:  (results-cuda.txt)
-[profile.txt]:       (profile.txt)
+[results-cpu.txt]:   results-cpu.txt
+[results-cuda.txt]:  results-cuda.txt
+[profile.txt]:       profile.txt
 
 
 BSLab stands for the block-sorting laboratory.
@@ -19,7 +19,7 @@ It sequentially applies to input data all algorithms employed in real compressor
 On every stage (except for RLE) we have a choice of algorithms, including those implemented in BSC as the baseline.
 Output of the last algorithm completed on every stage (except for OpenMP LZP) goes as the input to the next stage.
 Individual algorithms can be selected with options like -mtf1,3-4, or you may disable some algos with option like -mtf-1,3-4.
-You can also completely disable some stages by optiions like -nolzp, and control LZP stage parameters with options -h and -l.
+You can also completely disable some stages by options like -nolzp, and control LZP stage parameters with options -h and -l.
 Blocksize is controlled by -b option, you may need to reduce it if program exits with memory allocation error.
 
 We have tested the following compilers:
@@ -51,16 +51,18 @@ rle: 95,006,102 => 36,703,518 (38.63% / 36.70%)   >255: 15,171,  rank+len: 46,60
 This means that data were compressed from 100,000,000 to 95,006,102 bytes at LZP stage, 
 and then to 36,703,797 bytes at RLE stage, while BWT and MTF stages are 1:1 mappings.
 36,703,797 bytes is 38.63% of RLE stage input (95,025,330 bytes) and 
-36.70% of original data (100,000,000 bytes). Also, we see that 
-- 15,164 run-length counts (out of total 36,703,797) produced by RLE stage were higher than 255
-- 46,604,415 values represent ranks and lengths>1
-- 52,271,085 values represent ranks and lengths in 1/2 encoding
+36.70% of original data (100,000,000 bytes).
 
 In the speed department, we see that mtf_cpu_bsc algo was finished in 526.415 ms, that's 
 66.5 MiB/s compared to its input (36,703,797 bytes) and
 181 MiB/s compared to original data (100,000,000 bytes).
 The alternative MTF algorithm, namely mtf_cpu_shelwien, was finished in just 250.876 ms.
 
+Also, we see that 
+- 15,164 run-length counts (out of total 36,703,797) produced by RLE stage were higher than 255
+- 46,604,415 values represent ranks and lengths>1
+- 52,271,085 values represent ranks and lengths in 1/2 encoding
+
 
 ### Full results