Skip to content

Latest commit

 

History

History
16 lines (12 loc) · 927 Bytes

README.md

File metadata and controls

16 lines (12 loc) · 927 Bytes

This directory contains BSC implementations of Sort Transform on CPU & GPU.

Now CUDA ST4 sorting of 30 MB block requires (on GF560Ti) about 150 ms, of those only 65 ms spent in CUB Radix sort, and copying data in/out spends 12 ms.

Further optimizations:

  • perform cudaMalloc/cudaFree only once - this should double the performance
  • keep data in pinned memory - this shoud double the cudaMemcpy speed and give another 5-10% performance boost

Use some combination of the following ideas to shave off remaining times over 65 ms

  • overload pre/post-sorting procedures and RLE compression with memcpy
  • process only 4-byte elements at last sorting stages, and simultaneously copy-in next block to process - 4-byte sorting should also be faster than sorting of 4+4 (key+value) bytes (43 ms total instead of 65 ms)
  • use zero-copy memory instead of copy in/out

So, after all optimizations, ST4 should become more than 3x faster!