diff --git a/algo_st/README.md b/algo_st/README.md
index c94b65f..eb7c522 100644
--- a/algo_st/README.md
+++ b/algo_st/README.md
@@ -9,8 +9,8 @@ Further optimizations:
 Use some combination of the following ideas to shave off remaining times over 65 ms
 - overload pre/post-sorting procedures and RLE compression with memcpy
-- process only 4-byte elements at last sorting stages, and simultaneously copy-in next block to process -
-4-byte sorting should also be faster than sorting of 4+4 (key+value) bytes (43 ms)
+- process only 4-byte elements at last sorting stages, and simultaneously copy-in next block to process -
+4-byte sorting should also be faster than sorting of 4+4 (key+value) bytes (43 ms total instead of 65 ms)
 - use zero-copy memory instead of copy in/out
 So, after all optimizations, ST4 should become more than 3x faster!
diff --git a/app_bslab/README.md b/app_bslab/README.md
index 9041960..5b29b45 100644
--- a/app_bslab/README.md
+++ b/app_bslab/README.md
@@ -6,6 +6,7 @@
 [results-cpu.txt]: results-cpu.txt
 [results-cuda.txt]: results-cuda.txt
 [profile.txt]: profile.txt
+[bench.cmd]: bench.cmd
 BSLab stands for the block-sorting laboratory.
@@ -70,6 +71,8 @@ Also, we see that
 - [results-cuda.txt] are my CUDA GPU results
 - [profile.txt] are my CUDA GPU profiling report (only MTF kernels are included)
+See [bench.cmd] for benchmarking/profiling cmdlines.
+
 ### x64: enwik9 results on Haswell i7-4770
 ```
diff --git a/app_bslab/bench.cmd b/app_bslab/bench.cmd
index b04a5ba..1dc50bc 100644
--- a/app_bslab/bench.cmd
+++ b/app_bslab/bench.cmd
@@ -1,2 +1,3 @@
 for %a in (boost e8 e9 100m 1g 1g.tor3) do for %x in (-x64-avx2.exe -x64.exe -avx2.exe .exe) do for %c in (icl clang gcc msvc) do bslab-%c%x z:\%a
 for %a in (boost e8 e9 100m 1g 1g.tor3) do for %e in (bslab-cuda-x64.exe bslab-cuda.exe) do %e -nogpu z:\%a
+"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvprof.exe" --events all --metrics all --log-file profile --replay-mode application bslab-cuda-x64.exe -bwt11 -lzp3 -mtf-1-4 z:\e8
diff --git a/app_bslab/bslab.cpp b/app_bslab/bslab.cpp
index 833c566..175eba7 100644
--- a/app_bslab/bslab.cpp
+++ b/app_bslab/bslab.cpp
@@ -127,7 +127,7 @@ int main (int argc, char **argv)
 bufsize <<= 20; // if value is small enough, consider it as mebibytes
 if (!(argc==2 || argc==3) || error) {
- printf ("BSL: the block-sorting lab. Part of https://github.com/Bulat-Ziganshin/Compression-Research\n"
+ printf ("BSL: the block-sorting lab 1.0 (June 18 2016). Part of https://github.com/Bulat-Ziganshin/Compression-Research\n"
 "Usage: bsl [options] infile [outfile]\n"
 " -bN buffer N (mebi)bytes (default %d MiB - reduce if program fails)\n"
 " -nogpu skip GPU name output\n"
diff --git a/app_radix_sort/README.md b/app_radix_sort/README.md
index 7982d83..08085d5 100644
--- a/app_radix_sort/README.md
+++ b/app_radix_sort/README.md
@@ -10,52 +10,53 @@ First column has the format "N/K" for sorting K-byte keys by N bytes.
 It has format "N/K+V" for sorting with extra V-byte values attached to the keys.
-### Current x86 results with CUDA 7.5 and CUB 1.5.2
+### x64 results with CUDA 8.0RC and CUB 1.5.2
-(x64 version is a few percents slower due to need to manage larger pointers)
+(x64 version is a few percents slower than x86 due to need to manage larger pointers).
+For full results see [results.txt](results.txt).
``` GeForce GTX 560 Ti, CC 2.1. VRAM 1.0 GB, 2004 MHz * 256-bit = 128 GB/s. 8 SM * 48 alu * 1800 MHz * 2 = 1.38 TFLOPS Sorting 16M elements: -1/4 : Throughput = 3630.096 MElements/s, Time = 4.622 ms -2/4 : Throughput = 1807.721 MElements/s, Time = 9.281 ms -3/4 : Throughput = 1325.778 MElements/s, Time = 12.655 ms -4/4 : Throughput = 941.682 MElements/s, Time = 17.816 ms - -1/8 : Throughput = 2033.248 MElements/s, Time = 8.251 ms -2/8 : Throughput = 1013.995 MElements/s, Time = 16.546 ms -3/8 : Throughput = 729.117 MElements/s, Time = 23.010 ms -4/8 : Throughput = 525.525 MElements/s, Time = 31.925 ms -5/8 : Throughput = 442.132 MElements/s, Time = 37.946 ms -6/8 : Throughput = 361.177 MElements/s, Time = 46.452 ms -7/8 : Throughput = 305.574 MElements/s, Time = 54.904 ms -8/8 : Throughput = 271.861 MElements/s, Time = 61.712 ms - -1/4+4: Throughput = 2345.812 MElements/s, Time = 7.152 ms -2/4+4: Throughput = 1173.353 MElements/s, Time = 14.299 ms -3/4+4: Throughput = 874.986 MElements/s, Time = 19.174 ms -4/4+4: Throughput = 609.576 MElements/s, Time = 27.523 ms - -1/4+8: Throughput = 1737.907 MElements/s, Time = 9.654 ms -2/4+8: Throughput = 869.434 MElements/s, Time = 19.297 ms -3/4+8: Throughput = 577.189 MElements/s, Time = 29.067 ms -4/4+8: Throughput = 428.730 MElements/s, Time = 39.132 ms - -1/8+4: Throughput = 1483.201 MElements/s, Time = 11.311 ms -2/8+4: Throughput = 743.381 MElements/s, Time = 22.569 ms -3/8+4: Throughput = 517.357 MElements/s, Time = 32.429 ms -4/8+4: Throughput = 378.284 MElements/s, Time = 44.351 ms -5/8+4: Throughput = 312.690 MElements/s, Time = 53.654 ms -6/8+4: Throughput = 258.132 MElements/s, Time = 64.995 ms -7/8+4: Throughput = 220.360 MElements/s, Time = 76.135 ms -8/8+4: Throughput = 194.696 MElements/s, Time = 86.171 ms - -1/8+8: Throughput = 1261.976 MElements/s, Time = 13.294 ms -2/8+8: Throughput = 630.888 MElements/s, Time = 26.593 ms -3/8+8: Throughput = 421.866 MElements/s, Time = 39.769 ms -4/8+8: Throughput = 
315.385 MElements/s, Time = 53.196 ms -5/8+8: Throughput = 256.326 MElements/s, Time = 65.453 ms -6/8+8: Throughput = 213.615 MElements/s, Time = 78.539 ms -7/8+8: Throughput = 183.989 MElements/s, Time = 91.186 ms -8/8+8: Throughput = 161.552 MElements/s, Time = 103.851 ms +1/4 : Throughput = 3532.966 MElements/s, Time = 4.749 ms +2/4 : Throughput = 1765.983 MElements/s, Time = 9.500 ms +3/4 : Throughput = 1298.415 MElements/s, Time = 12.921 ms +4/4 : Throughput = 921.279 MElements/s, Time = 18.211 ms + +1/8 : Throughput = 1976.709 MElements/s, Time = 8.487 ms +2/8 : Throughput = 988.398 MElements/s, Time = 16.974 ms +3/8 : Throughput = 715.334 MElements/s, Time = 23.454 ms +4/8 : Throughput = 515.346 MElements/s, Time = 32.555 ms +5/8 : Throughput = 434.126 MElements/s, Time = 38.646 ms +6/8 : Throughput = 354.241 MElements/s, Time = 47.361 ms +7/8 : Throughput = 299.286 MElements/s, Time = 56.057 ms +8/8 : Throughput = 266.688 MElements/s, Time = 62.909 ms + +1/4+4: Throughput = 2346.395 MElements/s, Time = 7.150 ms +2/4+4: Throughput = 1170.342 MElements/s, Time = 14.335 ms +3/4+4: Throughput = 868.724 MElements/s, Time = 19.312 ms +4/4+4: Throughput = 606.442 MElements/s, Time = 27.665 ms + +1/4+8: Throughput = 1731.703 MElements/s, Time = 9.688 ms +2/4+8: Throughput = 868.007 MElements/s, Time = 19.328 ms +3/4+8: Throughput = 574.224 MElements/s, Time = 29.217 ms +4/4+8: Throughput = 425.807 MElements/s, Time = 39.401 ms + +1/8+4: Throughput = 1447.258 MElements/s, Time = 11.592 ms +2/8+4: Throughput = 725.238 MElements/s, Time = 23.133 ms +3/8+4: Throughput = 506.463 MElements/s, Time = 33.126 ms +4/8+4: Throughput = 370.368 MElements/s, Time = 45.299 ms +5/8+4: Throughput = 306.365 MElements/s, Time = 54.762 ms +6/8+4: Throughput = 252.914 MElements/s, Time = 66.336 ms +7/8+4: Throughput = 215.716 MElements/s, Time = 77.774 ms +8/8+4: Throughput = 190.831 MElements/s, Time = 87.917 ms + +1/8+8: Throughput = 1255.114 MElements/s, Time = 13.367 ms +2/8+8: 
Throughput = 627.455 MElements/s, Time = 26.739 ms +3/8+8: Throughput = 418.790 MElements/s, Time = 40.061 ms +4/8+8: Throughput = 312.513 MElements/s, Time = 53.685 ms +5/8+8: Throughput = 254.506 MElements/s, Time = 65.921 ms +6/8+8: Throughput = 212.192 MElements/s, Time = 79.066 ms +7/8+8: Throughput = 183.198 MElements/s, Time = 91.579 ms +8/8+8: Throughput = 160.342 MElements/s, Time = 104.634 ms ``` diff --git a/app_radix_sort/results.txt b/app_radix_sort/results.txt index f2e550c..6ba8e8f 100644 --- a/app_radix_sort/results.txt +++ b/app_radix_sort/results.txt @@ -1,98 +1,201 @@ -Full results (which proves that keys/values of 1-2 bytes need the same time as 4-byte ones): - -GeForce GTX 560 Ti, CC 2.1. VRAM 1.0 GB, 2004 MHz * 256-bit = 128 GB/s. 8 SM * 48 alu * 1800 MHz * 2 = 1.38 TFLOPS -Sorting 16M elements: -1/1 : Throughput = 3601.912 MElements/s, Time = 4.658 ms - -1/2 : Throughput = 4002.030 MElements/s, Time = 4.192 ms -2/2 : Throughput = 1997.847 MElements/s, Time = 8.398 ms - -1/4 : Throughput = 3630.318 MElements/s, Time = 4.621 ms -2/4 : Throughput = 1809.568 MElements/s, Time = 9.271 ms -3/4 : Throughput = 1315.988 MElements/s, Time = 12.749 ms -4/4 : Throughput = 941.813 MElements/s, Time = 17.814 ms - -1/8 : Throughput = 2033.593 MElements/s, Time = 8.250 ms -2/8 : Throughput = 1004.249 MElements/s, Time = 16.706 ms -3/8 : Throughput = 726.594 MElements/s, Time = 23.090 ms -4/8 : Throughput = 525.735 MElements/s, Time = 31.912 ms -5/8 : Throughput = 440.026 MElements/s, Time = 38.128 ms -6/8 : Throughput = 362.201 MElements/s, Time = 46.320 ms -7/8 : Throughput = 304.209 MElements/s, Time = 55.150 ms -8/8 : Throughput = 268.156 MElements/s, Time = 62.565 ms - -1/1+1: Throughput = 2371.609 MElements/s, Time = 7.074 ms - -1/1+2: Throughput = 2430.702 MElements/s, Time = 6.902 ms - -1/1+4: Throughput = 2404.523 MElements/s, Time = 6.977 ms - -1/1+8: Throughput = 1743.225 MElements/s, Time = 9.624 ms - -1/2+1: Throughput = 2554.303 MElements/s, 
Time = 6.568 ms -2/2+1: Throughput = 1272.882 MElements/s, Time = 13.180 ms - -1/2+2: Throughput = 2604.823 MElements/s, Time = 6.441 ms -2/2+2: Throughput = 1298.151 MElements/s, Time = 12.924 ms - -1/2+4: Throughput = 2547.039 MElements/s, Time = 6.587 ms -2/2+4: Throughput = 1261.872 MElements/s, Time = 13.296 ms - -1/2+8: Throughput = 1835.478 MElements/s, Time = 9.141 ms -2/2+8: Throughput = 913.174 MElements/s, Time = 18.372 ms - -1/4+1: Throughput = 2362.364 MElements/s, Time = 7.102 ms -2/4+1: Throughput = 1188.253 MElements/s, Time = 14.119 ms -3/4+1: Throughput = 866.168 MElements/s, Time = 19.369 ms -4/4+1: Throughput = 614.175 MElements/s, Time = 27.317 ms - -1/4+2: Throughput = 2442.646 MElements/s, Time = 6.868 ms -2/4+2: Throughput = 1206.889 MElements/s, Time = 13.901 ms -3/4+2: Throughput = 903.598 MElements/s, Time = 18.567 ms -4/4+2: Throughput = 628.817 MElements/s, Time = 26.681 ms - -1/4+4: Throughput = 2346.267 MElements/s, Time = 7.151 ms -2/4+4: Throughput = 1172.982 MElements/s, Time = 14.303 ms -3/4+4: Throughput = 868.528 MElements/s, Time = 19.317 ms -4/4+4: Throughput = 610.944 MElements/s, Time = 27.461 ms - -1/4+8: Throughput = 1730.178 MElements/s, Time = 9.697 ms -2/4+8: Throughput = 864.963 MElements/s, Time = 19.396 ms -3/4+8: Throughput = 577.441 MElements/s, Time = 29.054 ms -4/4+8: Throughput = 426.736 MElements/s, Time = 39.315 ms - -1/8+1: Throughput = 1449.449 MElements/s, Time = 11.575 ms -2/8+1: Throughput = 739.456 MElements/s, Time = 22.689 ms -3/8+1: Throughput = 510.996 MElements/s, Time = 32.832 ms -4/8+1: Throughput = 374.603 MElements/s, Time = 44.787 ms -5/8+1: Throughput = 309.080 MElements/s, Time = 54.281 ms -6/8+1: Throughput = 256.512 MElements/s, Time = 65.405 ms -7/8+1: Throughput = 217.692 MElements/s, Time = 77.069 ms -8/8+1: Throughput = 193.225 MElements/s, Time = 86.827 ms - -1/8+2: Throughput = 1514.916 MElements/s, Time = 11.075 ms -2/8+2: Throughput = 755.609 MElements/s, Time = 22.204 ms -3/8+2: 
Throughput = 523.234 MElements/s, Time = 32.064 ms -4/8+2: Throughput = 386.215 MElements/s, Time = 43.440 ms -5/8+2: Throughput = 317.681 MElements/s, Time = 52.811 ms -6/8+2: Throughput = 262.326 MElements/s, Time = 63.956 ms -7/8+2: Throughput = 224.332 MElements/s, Time = 74.787 ms -8/8+2: Throughput = 198.986 MElements/s, Time = 84.314 ms - -1/8+4: Throughput = 1486.800 MElements/s, Time = 11.284 ms -2/8+4: Throughput = 741.701 MElements/s, Time = 22.620 ms -3/8+4: Throughput = 516.033 MElements/s, Time = 32.512 ms -4/8+4: Throughput = 376.603 MElements/s, Time = 44.549 ms -5/8+4: Throughput = 312.734 MElements/s, Time = 53.647 ms -6/8+4: Throughput = 258.123 MElements/s, Time = 64.997 ms -7/8+4: Throughput = 220.913 MElements/s, Time = 75.945 ms -8/8+4: Throughput = 194.207 MElements/s, Time = 86.388 ms - -1/8+8: Throughput = 1255.632 MElements/s, Time = 13.362 ms -2/8+8: Throughput = 629.547 MElements/s, Time = 26.650 ms -3/8+8: Throughput = 421.853 MElements/s, Time = 39.770 ms -4/8+8: Throughput = 314.744 MElements/s, Time = 53.304 ms -5/8+8: Throughput = 256.440 MElements/s, Time = 65.424 ms -6/8+8: Throughput = 211.182 MElements/s, Time = 79.444 ms -7/8+8: Throughput = 183.147 MElements/s, Time = 91.605 ms -8/8+8: Throughput = 160.791 MElements/s, Time = 104.342 ms +Full results with CUDA 8.0RC and CUB 1.5.2, in particular proving that keys/values of 1-2 bytes are sorted in the same time as 4-byte ones. + +x64 binary: +``` +GeForce GTX 560 Ti, CC 2.1. VRAM 1.0 GB, 2004 MHz * 256-bit = 128 GB/s. 
8 SM * 48 alu * 1800 MHz * 2 = 1.38 TFLOPS +Sorting 16M elements: +1/1 : Throughput = 3387.520 MElements/s, Time = 4.953 ms + +1/2 : Throughput = 3699.231 MElements/s, Time = 4.535 ms +2/2 : Throughput = 1847.639 MElements/s, Time = 9.080 ms + +1/4 : Throughput = 3526.162 MElements/s, Time = 4.758 ms +2/4 : Throughput = 1764.991 MElements/s, Time = 9.506 ms +3/4 : Throughput = 1298.442 MElements/s, Time = 12.921 ms +4/4 : Throughput = 918.664 MElements/s, Time = 18.263 ms + +1/8 : Throughput = 1976.347 MElements/s, Time = 8.489 ms +2/8 : Throughput = 987.337 MElements/s, Time = 16.992 ms +3/8 : Throughput = 716.559 MElements/s, Time = 23.414 ms +4/8 : Throughput = 516.430 MElements/s, Time = 32.487 ms +5/8 : Throughput = 433.722 MElements/s, Time = 38.682 ms +6/8 : Throughput = 355.000 MElements/s, Time = 47.260 ms +7/8 : Throughput = 299.172 MElements/s, Time = 56.079 ms +8/8 : Throughput = 265.991 MElements/s, Time = 63.074 ms + +1/1+1: Throughput = 2351.473 MElements/s, Time = 7.135 ms + +1/1+2: Throughput = 2390.535 MElements/s, Time = 7.018 ms + +1/1+4: Throughput = 2349.824 MElements/s, Time = 7.140 ms + +1/1+8: Throughput = 1707.799 MElements/s, Time = 9.824 ms + +1/2+1: Throughput = 2486.926 MElements/s, Time = 6.746 ms +2/2+1: Throughput = 1242.892 MElements/s, Time = 13.499 ms + +1/2+2: Throughput = 2588.670 MElements/s, Time = 6.481 ms +2/2+2: Throughput = 1293.616 MElements/s, Time = 12.969 ms + +1/2+4: Throughput = 2494.559 MElements/s, Time = 6.726 ms +2/2+4: Throughput = 1247.816 MElements/s, Time = 13.445 ms + +1/2+8: Throughput = 1820.565 MElements/s, Time = 9.215 ms +2/2+8: Throughput = 907.708 MElements/s, Time = 18.483 ms + +1/4+1: Throughput = 2362.225 MElements/s, Time = 7.102 ms +2/4+1: Throughput = 1178.828 MElements/s, Time = 14.232 ms +3/4+1: Throughput = 857.112 MElements/s, Time = 19.574 ms +4/4+1: Throughput = 611.379 MElements/s, Time = 27.442 ms + +1/4+2: Throughput = 2405.528 MElements/s, Time = 6.974 ms +2/4+2: Throughput = 1198.387 
MElements/s, Time = 14.000 ms +3/4+2: Throughput = 889.536 MElements/s, Time = 18.861 ms +4/4+2: Throughput = 620.639 MElements/s, Time = 27.032 ms + +1/4+4: Throughput = 2352.615 MElements/s, Time = 7.131 ms +2/4+4: Throughput = 1172.716 MElements/s, Time = 14.306 ms +3/4+4: Throughput = 868.375 MElements/s, Time = 19.320 ms +4/4+4: Throughput = 607.483 MElements/s, Time = 27.618 ms + +1/4+8: Throughput = 1736.340 MElements/s, Time = 9.662 ms +2/4+8: Throughput = 865.921 MElements/s, Time = 19.375 ms +3/4+8: Throughput = 575.488 MElements/s, Time = 29.153 ms +4/4+8: Throughput = 425.810 MElements/s, Time = 39.401 ms + +1/8+1: Throughput = 1422.756 MElements/s, Time = 11.792 ms +2/8+1: Throughput = 710.909 MElements/s, Time = 23.600 ms +3/8+1: Throughput = 497.326 MElements/s, Time = 33.735 ms +4/8+1: Throughput = 365.840 MElements/s, Time = 45.859 ms +5/8+1: Throughput = 301.847 MElements/s, Time = 55.582 ms +6/8+1: Throughput = 249.042 MElements/s, Time = 67.367 ms +7/8+1: Throughput = 212.418 MElements/s, Time = 78.982 ms +8/8+1: Throughput = 188.121 MElements/s, Time = 89.183 ms + +1/8+2: Throughput = 1477.218 MElements/s, Time = 11.357 ms +2/8+2: Throughput = 737.024 MElements/s, Time = 22.763 ms +3/8+2: Throughput = 513.750 MElements/s, Time = 32.656 ms +4/8+2: Throughput = 378.793 MElements/s, Time = 44.291 ms +5/8+2: Throughput = 312.290 MElements/s, Time = 53.723 ms +6/8+2: Throughput = 256.412 MElements/s, Time = 65.431 ms +7/8+2: Throughput = 220.132 MElements/s, Time = 76.214 ms +8/8+2: Throughput = 195.436 MElements/s, Time = 85.845 ms + +1/8+4: Throughput = 1449.801 MElements/s, Time = 11.572 ms +2/8+4: Throughput = 724.099 MElements/s, Time = 23.170 ms +3/8+4: Throughput = 504.848 MElements/s, Time = 33.232 ms +4/8+4: Throughput = 370.575 MElements/s, Time = 45.273 ms +5/8+4: Throughput = 305.673 MElements/s, Time = 54.886 ms +6/8+4: Throughput = 252.235 MElements/s, Time = 66.514 ms +7/8+4: Throughput = 216.072 MElements/s, Time = 77.647 ms +8/8+4: 
Throughput = 191.243 MElements/s, Time = 87.727 ms + +1/8+8: Throughput = 1253.567 MElements/s, Time = 13.384 ms +2/8+8: Throughput = 626.159 MElements/s, Time = 26.794 ms +3/8+8: Throughput = 414.683 MElements/s, Time = 40.458 ms +4/8+8: Throughput = 312.213 MElements/s, Time = 53.736 ms +5/8+8: Throughput = 253.829 MElements/s, Time = 66.097 ms +6/8+8: Throughput = 212.494 MElements/s, Time = 78.954 ms +7/8+8: Throughput = 182.390 MElements/s, Time = 91.985 ms +8/8+8: Throughput = 159.832 MElements/s, Time = 104.968 ms +``` + +x86 binary: +``` +GeForce GTX 560 Ti, CC 2.1. VRAM 1.0 GB, 2004 MHz * 256-bit = 128 GB/s. 8 SM * 48 alu * 1800 MHz * 2 = 1.38 TFLOPS +Sorting 16M elements: +1/1 : Throughput = 3581.361 MElements/s, Time = 4.685 ms + +1/2 : Throughput = 4001.573 MElements/s, Time = 4.193 ms +2/2 : Throughput = 1998.918 MElements/s, Time = 8.393 ms + +1/4 : Throughput = 3616.995 MElements/s, Time = 4.638 ms +2/4 : Throughput = 1807.320 MElements/s, Time = 9.283 ms +3/4 : Throughput = 1324.084 MElements/s, Time = 12.671 ms +4/4 : Throughput = 936.786 MElements/s, Time = 17.909 ms + +1/8 : Throughput = 2030.784 MElements/s, Time = 8.261 ms +2/8 : Throughput = 1012.451 MElements/s, Time = 16.571 ms +3/8 : Throughput = 728.643 MElements/s, Time = 23.025 ms +4/8 : Throughput = 526.610 MElements/s, Time = 31.859 ms +5/8 : Throughput = 441.815 MElements/s, Time = 37.973 ms +6/8 : Throughput = 361.032 MElements/s, Time = 46.470 ms +7/8 : Throughput = 305.362 MElements/s, Time = 54.942 ms +8/8 : Throughput = 271.851 MElements/s, Time = 61.715 ms + +1/1+1: Throughput = 2386.740 MElements/s, Time = 7.029 ms + +1/1+2: Throughput = 2443.667 MElements/s, Time = 6.866 ms + +1/1+4: Throughput = 2394.756 MElements/s, Time = 7.006 ms + +1/1+8: Throughput = 1745.824 MElements/s, Time = 9.610 ms + +1/2+1: Throughput = 2522.134 MElements/s, Time = 6.652 ms +2/2+1: Throughput = 1257.418 MElements/s, Time = 13.343 ms + +1/2+2: Throughput = 2610.856 MElements/s, Time = 6.426 ms 
+2/2+2: Throughput = 1301.668 MElements/s, Time = 12.889 ms + +1/2+4: Throughput = 2544.332 MElements/s, Time = 6.594 ms +2/2+4: Throughput = 1269.823 MElements/s, Time = 13.212 ms + +1/2+8: Throughput = 1830.231 MElements/s, Time = 9.167 ms +2/2+8: Throughput = 915.159 MElements/s, Time = 18.333 ms + +1/4+1: Throughput = 2377.789 MElements/s, Time = 7.056 ms +2/4+1: Throughput = 1183.672 MElements/s, Time = 14.174 ms +3/4+1: Throughput = 858.436 MElements/s, Time = 19.544 ms +4/4+1: Throughput = 610.738 MElements/s, Time = 27.470 ms + +1/4+2: Throughput = 2434.079 MElements/s, Time = 6.893 ms +2/4+2: Throughput = 1218.695 MElements/s, Time = 13.767 ms +3/4+2: Throughput = 901.848 MElements/s, Time = 18.603 ms +4/4+2: Throughput = 627.888 MElements/s, Time = 26.720 ms + +1/4+4: Throughput = 2365.656 MElements/s, Time = 7.092 ms +2/4+4: Throughput = 1182.943 MElements/s, Time = 14.183 ms +3/4+4: Throughput = 876.734 MElements/s, Time = 19.136 ms +4/4+4: Throughput = 612.206 MElements/s, Time = 27.405 ms + +1/4+8: Throughput = 1739.089 MElements/s, Time = 9.647 ms +2/4+8: Throughput = 867.370 MElements/s, Time = 19.343 ms +3/4+8: Throughput = 575.998 MElements/s, Time = 29.127 ms +4/4+8: Throughput = 427.041 MElements/s, Time = 39.287 ms + +1/8+1: Throughput = 1457.489 MElements/s, Time = 11.511 ms +2/8+1: Throughput = 727.020 MElements/s, Time = 23.077 ms +3/8+1: Throughput = 510.872 MElements/s, Time = 32.840 ms +4/8+1: Throughput = 375.776 MElements/s, Time = 44.647 ms +5/8+1: Throughput = 310.543 MElements/s, Time = 54.025 ms +6/8+1: Throughput = 255.848 MElements/s, Time = 65.575 ms +7/8+1: Throughput = 218.177 MElements/s, Time = 76.897 ms +8/8+1: Throughput = 193.450 MElements/s, Time = 86.726 ms + +1/8+2: Throughput = 1505.371 MElements/s, Time = 11.145 ms +2/8+2: Throughput = 753.818 MElements/s, Time = 22.256 ms +3/8+2: Throughput = 523.077 MElements/s, Time = 32.074 ms +4/8+2: Throughput = 385.895 MElements/s, Time = 43.476 ms +5/8+2: Throughput = 317.692 
MElements/s, Time = 52.810 ms +6/8+2: Throughput = 262.334 MElements/s, Time = 63.954 ms +7/8+2: Throughput = 224.281 MElements/s, Time = 74.804 ms +8/8+2: Throughput = 197.818 MElements/s, Time = 84.811 ms + +1/8+4: Throughput = 1433.343 MElements/s, Time = 11.705 ms +2/8+4: Throughput = 742.258 MElements/s, Time = 22.603 ms +3/8+4: Throughput = 513.536 MElements/s, Time = 32.670 ms +4/8+4: Throughput = 375.606 MElements/s, Time = 44.667 ms +5/8+4: Throughput = 309.324 MElements/s, Time = 54.238 ms +6/8+4: Throughput = 256.144 MElements/s, Time = 65.499 ms +7/8+4: Throughput = 219.808 MElements/s, Time = 76.327 ms +8/8+4: Throughput = 194.048 MElements/s, Time = 86.459 ms + +1/8+8: Throughput = 1264.290 MElements/s, Time = 13.270 ms +2/8+8: Throughput = 632.041 MElements/s, Time = 26.544 ms +3/8+8: Throughput = 421.752 MElements/s, Time = 39.780 ms +4/8+8: Throughput = 313.890 MElements/s, Time = 53.449 ms +5/8+8: Throughput = 255.259 MElements/s, Time = 65.726 ms +6/8+8: Throughput = 213.459 MElements/s, Time = 78.597 ms +7/8+8: Throughput = 183.026 MElements/s, Time = 91.666 ms +8/8+8: Throughput = 159.962 MElements/s, Time = 104.883 ms +```
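
The two columns in the result listings above look redundant: the throughput seems to be the element count divided by the time. The sketch below is one way to check that and to quantify the "a few percents slower" remark for x64 versus x86. It is not part of the repository. The helper names (`parse_result`, `throughput_from_time`, `slowdown_percent`) are hypothetical, and it assumes that "16M elements" means 16 * 2^20 elements and that "MElements/s" means 10^6 elements per second.

```python
import re

# Assumption: "16M elements" = 16 * 2**20, and 1 MElement = 10**6 elements.
ELEMENTS = 16 * 2**20

# Matches lines such as "1/4 : Throughput = 3526.162 MElements/s, Time = 4.758 ms"
# and "8/8+8: Throughput = 159.832 MElements/s, Time = 104.968 ms".
LINE_RE = re.compile(
    r"^(?P<label>\d/\d(?:\+\d)?)\s*:\s*Throughput\s*=\s*(?P<tput>[\d.]+)\s*MElements/s,"
    r"\s*Time\s*=\s*(?P<ms>[\d.]+)\s*ms$"
)


def parse_result(line):
    """Return (label, throughput in MElements/s, time in ms), or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m.group("label"), float(m.group("tput")), float(m.group("ms"))


def throughput_from_time(ms, elements=ELEMENTS):
    """Recompute throughput (MElements/s) from a time given in milliseconds."""
    return elements / (ms / 1000.0) / 1e6


def slowdown_percent(x64_ms, x86_ms):
    """How much slower (in percent) the x64 run is relative to the x86 run."""
    return (x64_ms / x86_ms - 1.0) * 100.0


if __name__ == "__main__":
    # Last line of the x64 table in the README hunk.
    label, tput, ms = parse_result(
        "8/8+8: Throughput = 160.342 MElements/s, Time = 104.634 ms"
    )
    print(label, "reported:", tput, "recomputed:", round(throughput_from_time(ms), 3))

    # "4/4" rows from results.txt: 18.263 ms (x64) versus 17.909 ms (x86).
    print("x64 slowdown for 4/4: %.2f%%" % slowdown_percent(18.263, 17.909))
```

For short runs the recomputed throughput can differ from the printed one in the last digits, because the printed time is rounded to microseconds.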