This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

CUB 1.1.1

@brycelelbach brycelelbach released this 19 May 07:32

Summary

CUB 1.1.1 introduces texture and cache modifier iterators, descending sorting, cub::DeviceSelect, cub::DevicePartition, cub::Shuffle*, and cub::MaxSmOccupancy. Additionally, scan and sort performance for older GPUs has been improved and many bugs have been fixed.

Breaking Changes

  • Refactored block-wide I/O (cub::BlockLoad and cub::BlockStore), removing cache-modifiers from their interfaces. cub::CacheModifiedInputIterator and cub::CacheModifiedOutputIterator should now be used with cub::BlockLoad and cub::BlockStore to effect that behavior.

New Features

  • cub::TexObjInputIterator, cub::TexRefInputIterator, cub::CacheModifiedInputIterator, and cub::CacheModifiedOutputIterator types for loading & storing arbitrary types through the cache hierarchy. They are compatible with Thrust.
  • Descending sorting for cub::DeviceRadixSort and cub::BlockRadixSort.
  • Min, max, arg-min, and arg-max operators for cub::DeviceReduce.
  • cub::DeviceSelect (select-unique, select-if, and select-flagged).
  • cub::DevicePartition (partition-if, partition-flagged).
  • Generic cub::ShuffleUp, cub::ShuffleDown, and cub::ShuffleIndex for warp-wide communication of arbitrary data types (SM3x and up).
  • cub::MaxSmOccupancy for accurately determining SM occupancy for any given kernel function pointer.
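The select and partition variants listed above can be illustrated with a small host-only C++ reference model. This is a sketch of the semantics only, not the CUB device-side API: the function names (select_if, select_flagged, select_unique, partition_if) are hypothetical helpers introduced for this example. The reverse ordering of rejected items in partition_if follows the behaviour described in later CUB documentation and is an assumption for 1.1.1.

```cpp
#include <cstddef>
#include <vector>

// CPU reference models only: they mirror the semantics named in the
// release notes, not the CUB device-side interface or its performance.

// select-if: keep the items that satisfy pred, preserving input order.
template <typename T, typename Pred>
std::vector<T> select_if(const std::vector<T>& in, Pred pred) {
    std::vector<T> out;
    for (const T& x : in)
        if (pred(x)) out.push_back(x);
    return out;
}

// select-flagged: keep in[i] wherever flags[i] is nonzero.
template <typename T>
std::vector<T> select_flagged(const std::vector<T>& in,
                              const std::vector<char>& flags) {
    std::vector<T> out;
    for (std::size_t i = 0; i < in.size() && i < flags.size(); ++i)
        if (flags[i]) out.push_back(in[i]);
    return out;
}

// select-unique: keep the first item of each run of consecutive equal items.
template <typename T>
std::vector<T> select_unique(const std::vector<T>& in) {
    std::vector<T> out;
    for (const T& x : in)
        if (out.empty() || !(out.back() == x)) out.push_back(x);
    return out;
}

// partition-if: selected items go to the front in input order; rejected
// items fill the rear in reverse order (assumed, see the note above).
// Returns the number of selected items.
template <typename T, typename Pred>
std::size_t partition_if(const std::vector<T>& in, std::vector<T>& out,
                         Pred pred) {
    out.assign(in.size(), T());
    std::size_t front = 0;
    std::size_t back = in.size();
    for (const T& x : in) {
        if (pred(x)) out[front++] = x;
        else out[--back] = x;
    }
    return front;
}
```

For the input {1, 1, 2, 3, 3, 3, 4} and an "is even" predicate, select_if yields {2, 4}, select_unique yields {1, 2, 3, 4}, and partition_if writes {2, 4, 3, 3, 3, 1, 1} and returns 2.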

Other Enhancements

  • Improved cub::DeviceScan and cub::DeviceRadixSort performance for older GPUs (SM1x to SM3x).
  • Renamed device-wide stream_synchronous param to debug_synchronous to avoid confusion about usage.
  • Documentation improvements:
    • Added simple examples of device-wide methods.
    • Improved doxygen documentation and example snippets.
  • Improved test coverage to include up to 21,000 kernel variants and 851,000 unit tests (per architecture, per platform).

Bug Fixes

  • Fixed miscellaneous cub::DeviceScan, cub::BlockScan, cub::DeviceReduce, and cub::BlockReduce bugs when operating on non-primitive types on older architectures (SM1x).
  • SHFL-based scans and reductions produced incorrect results for multi-word types (size > 4B) on Linux.
  • For cub::WarpScan-based scans, not all threads in the first warp were entering the prefix callback functor.
  • cub::DeviceRadixSort had a race condition with key-value pairs for pre-SM35 architectures.
  • cub::DeviceRadixSort bitfield-extract behavior with long keys on 64-bit Linux was incorrect.
  • cub::BlockDiscontinuity failed to compile for types other than int32_t/uint32_t.
  • CUDA Dynamic Parallelism (CDP, i.e. device-callable) versions of device-wide methods now report the same temporary storage allocation size requirement as their host-callable counterparts.