This repository has been archived by the owner on Mar 21, 2024. It is now read-only.
CUB 1.1.1
Summary
CUB 1.1.1 introduces texture and cache modifier iterators, descending sorting, `cub::DeviceSelect`, `cub::DevicePartition`, `cub::Shuffle*`, and `cub::MaxSmOccupancy`. Additionally, scan and sort performance for older GPUs has been improved and many bugs have been fixed.
Breaking Changes
- Refactored block-wide I/O (`cub::BlockLoad` and `cub::BlockStore`), removing cache-modifiers from their interfaces. `cub::CacheModifiedInputIterator` and `cub::CacheModifiedOutputIterator` should now be used with `cub::BlockLoad` and `cub::BlockStore` to effect that behavior.
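The change above can be illustrated with a minimal sketch in which the cache behavior is selected by wrapping the raw pointers in cache-modified iterators rather than by a `cub::BlockLoad`/`cub::BlockStore` parameter. It is written against the interface of recent CUB releases (where the first template parameter of `cub::BlockLoad` is the item type); the exact template parameters in 1.1.1 may differ. The kernel name `copy_tiles`, the helper `copy_through_cache`, the tile sizes, and the choice of `cub::LOAD_LDG` and `cub::STORE_CG` are assumptions made for illustration, and CUDA error checking is omitted for brevity.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK_THREADS    = 128;
constexpr int ITEMS_PER_THREAD = 4;
constexpr int TILE_ITEMS       = BLOCK_THREADS * ITEMS_PER_THREAD;

// Hypothetical kernel: each thread block copies one tile from d_in to d_out.
__global__ void copy_tiles(const int* d_in, int* d_out, int num_items)
{
    using BlockLoadT  = cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD>;
    using BlockStoreT = cub::BlockStore<int, BLOCK_THREADS, ITEMS_PER_THREAD>;

    __shared__ BlockLoadT::TempStorage  load_storage;
    __shared__ BlockStoreT::TempStorage store_storage;

    const int tile_offset = blockIdx.x * TILE_ITEMS;
    const int valid_items = min(TILE_ITEMS, num_items - tile_offset);

    // The cache modifier lives in the iterator type, not in BlockLoad/BlockStore.
    cub::CacheModifiedInputIterator<cub::LOAD_LDG, int>  in(d_in);
    cub::CacheModifiedOutputIterator<cub::STORE_CG, int> out(d_out);

    // Guarded load/store so that a partial last tile is handled safely.
    int items[ITEMS_PER_THREAD];
    BlockLoadT(load_storage).Load(in + tile_offset, items, valid_items);
    BlockStoreT(store_storage).Store(out + tile_offset, items, valid_items);
}

// Hypothetical host helper: round-trips a vector through the kernel above.
std::vector<int> copy_through_cache(const std::vector<int>& h_in)
{
    const int num_items = static_cast<int>(h_in.size());
    std::vector<int> h_out(num_items);
    if (num_items == 0) return h_out;

    const size_t bytes = num_items * sizeof(int);
    int* d_in  = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int num_blocks = (num_items + TILE_ITEMS - 1) / TILE_ITEMS;
    copy_tiles<<<num_blocks, BLOCK_THREADS>>>(d_in, d_out, num_items);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

On a machine with a CUDA-capable GPU, `copy_through_cache(v)` should return a vector equal to `v`; swapping the modifier enumerators changes how the loads and stores are issued without touching the `Load`/`Store` calls.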
New Features
- `cub::TexObjInputIterator`, `cub::TexRefInputIterator`, `cub::CacheModifiedInputIterator`, and `cub::CacheModifiedOutputIterator` types for loading & storing arbitrary types through the cache hierarchy. They are compatible with Thrust.
- Descending sorting for `cub::DeviceRadixSort` and `cub::BlockRadixSort`.
- Min, max, arg-min, and arg-max operators for `cub::DeviceReduce`.
- `cub::DeviceSelect` (select-unique, select-if, and select-flagged).
- `cub::DevicePartition` (partition-if, partition-flagged).
- Generic `cub::ShuffleUp`, `cub::ShuffleDown`, and `cub::ShuffleIndex` for warp-wide communication of arbitrary data types (SM3x and up).
- `cub::MaxSmOccupancy` for accurately determining SM occupancy for any given kernel function pointer.
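As an illustration of two of the device-wide features listed above, the following sketch calls `cub::DeviceRadixSort::SortKeysDescending` and `cub::DeviceSelect::If` using CUB's usual two-phase pattern: a first call with a null temporary-storage pointer queries the required size, and a second call does the work. It is written against the interface of recent CUB releases, so details may differ from 1.1.1. The helper names `sort_descending` and `select_even` and the predicate `IsEven` are hypothetical, and CUDA error checking is omitted for brevity.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical predicate for select-if: keeps even values.
struct IsEven
{
    __host__ __device__ bool operator()(const unsigned int& x) const
    {
        return (x % 2u) == 0u;
    }
};

// Hypothetical helper: sorts keys from largest to smallest on the device.
std::vector<unsigned int> sort_descending(const std::vector<unsigned int>& h_keys)
{
    const int num_items = static_cast<int>(h_keys.size());
    std::vector<unsigned int> h_out(num_items);
    if (num_items == 0) return h_out;

    const size_t bytes = num_items * sizeof(unsigned int);
    unsigned int* d_in  = nullptr;
    unsigned int* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_keys.data(), bytes, cudaMemcpyHostToDevice);

    // Phase 1: query temporary storage size. Phase 2: sort.
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortKeysDescending(d_temp, temp_bytes, d_in, d_out, num_items);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceRadixSort::SortKeysDescending(d_temp, temp_bytes, d_in, d_out, num_items);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_temp);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

// Hypothetical helper: compacts the even values of h_in, keeping their order.
std::vector<unsigned int> select_even(const std::vector<unsigned int>& h_in)
{
    const int num_items = static_cast<int>(h_in.size());
    if (num_items == 0) return {};

    const size_t bytes = num_items * sizeof(unsigned int);
    unsigned int* d_in           = nullptr;
    unsigned int* d_out          = nullptr;
    int*          d_num_selected = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMalloc(&d_num_selected, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceSelect::If(d_temp, temp_bytes, d_in, d_out, d_num_selected, num_items, IsEven());
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceSelect::If(d_temp, temp_bytes, d_in, d_out, d_num_selected, num_items, IsEven());

    // Read back the count first, then only the selected prefix of d_out.
    int num_selected = 0;
    cudaMemcpy(&num_selected, d_num_selected, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<unsigned int> h_out(num_selected);
    cudaMemcpy(h_out.data(), d_out, num_selected * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaFree(d_temp);
    cudaFree(d_num_selected);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

On a machine with a CUDA-capable GPU, `sort_descending({3, 1, 4, 1, 5})` should yield 5, 4, 3, 1, 1, and `select_even({1, 2, 3, 4, 6})` should yield 2, 4, 6, since selection keeps the relative order of the input.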
Other Enhancements
- Improved `cub::DeviceScan` and `cub::DeviceRadixSort` performance for older GPUs (SM1x to SM3x).
- Renamed device-wide `stream_synchronous` param to `debug_synchronous` to avoid confusion about usage.
- Documentation improvements:
  - Added simple examples of device-wide methods.
  - Improved doxygen documentation and example snippets.
- Improved test coverage to include up to 21,000 kernel variants and 851,000 unit tests (per architecture, per platform).
Bug Fixes
- Fixed miscellaneous `cub::DeviceScan`, `cub::BlockScan`, `cub::DeviceReduce`, and `cub::BlockReduce` bugs when operating on non-primitive types on older architectures (SM1x).
- SHFL-based scans and reductions produced incorrect results for multi-word types (size > 4B) on Linux.
- For `cub::WarpScan`-based scans, not all threads in the first warp were entering the prefix callback functor.
- `cub::DeviceRadixSort` had a race condition with key-value pairs for pre-SM35 architectures.
- `cub::DeviceRadixSort` bitfield-extract behavior with long keys on 64-bit Linux was incorrect.
- `cub::BlockDiscontinuity` failed to compile for types other than `int32_t`/`uint32_t`.
- CUDA Dynamic Parallelism (CDP, i.e. device-callable) versions of device-wide methods now report the same temporary storage allocation size requirement as their host-callable counterparts.