Releases: NVIDIA/cub
CUB 1.8.0
Summary
CUB 1.8.0 introduces changes to the cub::Shuffle* interfaces.
Breaking Changes
- The interfaces of cub::ShuffleIndex, cub::ShuffleUp, and cub::ShuffleDown have been changed to allow for better computation of the PTX SHFL control constant for logical warps smaller than 32 threads.
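To make the "SHFL control constant" concrete, here is a minimal host-side sketch of how that operand is packed for a logical warp, following the layout described for the `c` operand of the PTX `shfl` instruction (segment mask in bits 12..8, clamp value in bits 4..0). The helper names `SegMask`, `ShflControl`, and `ShflIdxSourceLane` are hypothetical and are not part of CUB; this models the encoding only and is not CUB's implementation.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not CUB APIs.
constexpr unsigned kWarpThreads = 32;

// Segment mask for a logical warp whose size is a power of two <= 32.
// The set bits select the part of the lane ID that identifies the segment.
constexpr unsigned SegMask(unsigned logical_warp_threads) {
    return (kWarpThreads - logical_warp_threads) & 0x1Fu;
}

// Pack the control operand: segment mask in bits 12..8, clamp in bits 4..0.
// Indexed and "down" shuffles typically clamp at 0x1F, "up" shuffles at 0.
constexpr std::uint32_t ShflControl(unsigned logical_warp_threads, unsigned clamp) {
    return (SegMask(logical_warp_threads) << 8) | (clamp & 0x1Fu);
}

// Lane that an indexed shuffle reads from: the caller's segment base
// combined with the requested index taken modulo the logical warp size.
constexpr unsigned ShflIdxSourceLane(unsigned lane, unsigned src_index,
                                     unsigned logical_warp_threads) {
    return (lane & SegMask(logical_warp_threads)) |
           (src_index & ~SegMask(logical_warp_threads) & 0x1Fu);
}
```

For example, with 8-thread logical warps, lane 13 asking for index 3 reads from lane 11, the fourth lane of its own segment, rather than from lane 3.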
Bug Fixes
- #112: Fix cub::WarpScan's broadcast of warp-wide aggregate for logical warps smaller than 32 threads.
CUB 1.7.5
Summary
CUB 1.7.5 adds support for radix sorting __half keys and improves sorting performance for 1-byte keys. It was incorporated into Thrust 1.9.2.
Enhancements
- Radix sort support for __half keys.
- Radix sort tuning policy updates to improve 1-byte key performance.
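Radix sorting floating-point keys such as __half generally relies on remapping the raw bit pattern so that unsigned integer order matches numeric order. The sketch below shows that standard transformation on 16-bit IEEE half-precision patterns; the names `TwiddleIn` and `TwiddleOut` are used here for illustration, and the code is an assumption about the general technique rather than a copy of CUB's key traits.

```cpp
#include <cstdint>

constexpr std::uint16_t kSignBit = 0x8000;

// Map a half-precision bit pattern to an unsigned key that sorts in
// numeric order: negative values have all bits flipped, non-negative
// values have only the sign bit flipped.
constexpr std::uint16_t TwiddleIn(std::uint16_t bits) {
    return static_cast<std::uint16_t>(
        bits ^ ((bits & kSignBit) ? 0xFFFFu : kSignBit));
}

// Inverse mapping, applied after sorting to recover the original bits.
constexpr std::uint16_t TwiddleOut(std::uint16_t key) {
    return static_cast<std::uint16_t>(
        key ^ ((key & kSignBit) ? kSignBit : 0xFFFFu));
}
```

With this mapping, the patterns for -2.0 (0xC000), -1.0 (0xBC00), +0.0 (0x0000), and 1.0 (0x3C00) become keys in ascending unsigned order, so an ordinary unsigned radix sort orders them numerically.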
Bug Fixes
CUB 1.7.4
CUB 1.7.3
CUB 1.7.2
CUB 1.7.1
CUB 1.7.0
Summary
CUB 1.7.0 brings support for CUDA 9.0 and SM7x (Volta) GPUs. It is compatible with independent thread scheduling. It was incorporated into Thrust 1.9.2.
Breaking Changes
- Remove cub::WarpAll and cub::WarpAny. These functions served to emulate __all and __any functionality for SM1x devices, which did not have those operations. However, SM1x devices are now deprecated in CUDA, and the interfaces of these two functions lack the lane mask needed for collectives to run on SM7x and newer GPUs, which have independent thread scheduling.
Other Enhancements
- Remove any assumptions of implicit warp synchronization to be compatible with SM7x's (Volta) independent thread scheduling.
Bug Fixes
- #86: Incorrect results with reduce-by-key.
CUB 1.6.4
Summary
CUB 1.6.4 improves radix sorting performance for SM5x (Maxwell) and SM6x (Pascal) GPUs.
Enhancements
- Radix sort tuning policies updated for SM5x (Maxwell) and SM6x (Pascal): 3.5B and 3.4B 32-bit keys/s on TitanX and GTX 1080, respectively.
Bug Fixes
- Restore fence work-around for scan (reduce-by-key, etc.) hangs in CUDA 8.5.
- #65: cub::DeviceSegmentedRadixSort should allow inputs to have pointer-to-const type.
- Mollify Clang device-side warnings.
- Remove out-dated MSVC project files.
CUB 1.6.3
Summary
CUB 1.6.3 improves support for Windows, changes the cub::BlockLoad/cub::BlockStore interface to take the local data type, and enhances radix sort performance for SM6x (Pascal) GPUs.
Breaking Changes
- cub::BlockLoad and cub::BlockStore are now templated by the local data type, instead of the Iterator type. This allows for output iterators having void as their value_type (e.g. discard iterators).
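The motivation can be sketched in plain C++. If a store helper derives its item type from the iterator's value_type, an iterator whose value_type is void yields an ill-formed array-of-void parameter; templating on the local data type avoids that. `DiscardIterator` and `StoreByValue` below are hypothetical stand-ins written for this illustration and do not reproduce CUB's actual class templates.

```cpp
#include <cstddef>

// Hypothetical stand-in for a discard iterator: value_type is void and
// every write is swallowed.
struct DiscardIterator {
    using value_type = void;
    struct Sink {
        template <typename U>
        void operator=(const U&) {}  // ignore the assigned value
    };
    Sink operator[](std::ptrdiff_t) const { return Sink{}; }
};

// Deriving the item type from the iterator would look like
//   typename OutputIt::value_type (&items)[ITEMS]
// which cannot be formed when value_type is void.

// Templating on the local data type T keeps the item array well-typed
// regardless of what the output iterator reports as its value_type.
template <typename T, int ITEMS>
struct StoreByValue {
    template <typename OutputIt>
    static int Store(OutputIt out, const T (&items)[ITEMS]) {
        for (int i = 0; i < ITEMS; ++i) {
            out[i] = items[i];
        }
        return ITEMS;  // number of items written
    }
};
```

The same `StoreByValue<int, 4>` can then write to a raw `int*` buffer or to a `DiscardIterator` without any change to its template arguments.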
Other Enhancements
- Radix sort tuning policies updated for SM6x (Pascal) GPUs: 6.2B 4-byte keys/s on GP100.
- Improved support for Windows (warnings, alignment, etc).
Bug Fixes
- #74: cub::WarpReduce executes reduction operator for out-of-bounds items.
- #72: cub::InequalityWrapper::operator should be non-const.
- #71: cub::KeyValuePair won't work if Key has non-trivial constructor.
- #69: cub::BlockStore::Store doesn't compile if OutputIteratorT::value_type isn't T.
- #68: cub::TilePrefixCallbackOp::WarpReduce doesn't permit PTX arch specialization.
CUB 1.6.2 (previously 1.5.5)
Summary
CUB 1.6.2 (previously 1.5.5) improves radix sort performance for SM6x (Pascal) GPUs.
Enhancements
- Radix sort tuning policies updated for SM6x (Pascal) GPUs.
Bug Fixes
- Fix AArch64 compilation of cub::CachingDeviceAllocator.