CUB 1.15.0
Summary
CUB 1.15.0 includes a new cub::DeviceSegmentedSort algorithm, which demonstrates up to 5000x speedup compared to cub::DeviceSegmentedRadixSort when sorting a large number of small segments. A new cub::FutureValue<T> helper allows the cub::DeviceScan algorithms to lazily load the initial_value from a pointer. cub::DeviceScan also gains ScanByKey functionality.
The new DeviceSegmentedSort algorithm partitions segments into size groups. Each group is processed with specialized kernels using a variety of sorting algorithms. This approach varies the number of threads allocated for sorting each segment and utilizes the GPU more efficiently.
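The grouping idea can be illustrated on the CPU. The following is a minimal host-side sketch, not CUB's implementation: the `SizeGroup` classes, the `classify`, `group_segments`, and `segmented_sort` names, and both size cutoffs are hypothetical, and `std::sort` stands in for the specialized GPU kernels.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical size classes; the cutoffs below are illustrative assumptions,
// not the values CUB uses.
enum class SizeGroup { Small = 0, Medium = 1, Large = 2 };

SizeGroup classify(std::size_t segment_size) {
  if (segment_size <= 16) return SizeGroup::Small;     // assumed cutoff
  if (segment_size <= 1024) return SizeGroup::Medium;  // assumed cutoff
  return SizeGroup::Large;
}

// Groups segment indices by size class. `offsets` has num_segments + 1
// entries; segment i spans [offsets[i], offsets[i + 1]).
std::vector<std::vector<std::size_t>> group_segments(
    const std::vector<std::size_t>& offsets) {
  std::vector<std::vector<std::size_t>> groups(3);
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    const std::size_t size = offsets[i + 1] - offsets[i];
    groups[static_cast<std::size_t>(classify(size))].push_back(i);
  }
  return groups;
}

// Sorts every segment in place, one group at a time. On a GPU each group
// would be handled by its own kernel; here std::sort stands in for all of them.
void segmented_sort(std::vector<int>& keys,
                    const std::vector<std::size_t>& offsets) {
  for (const auto& group : group_segments(offsets)) {
    for (std::size_t seg : group) {
      std::sort(keys.begin() + static_cast<std::ptrdiff_t>(offsets[seg]),
                keys.begin() + static_cast<std::ptrdiff_t>(offsets[seg + 1]));
    }
  }
}
```

Grouping first means each size class can be given a different amount of parallel work per segment, which is the load-balancing benefit described above.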
cub::FutureValue<T> provides the ability to use the result of a previous kernel as a scalar input to a CUB device-scope algorithm without unnecessary synchronization:
    int *d_intermediate_result = ...;
    intermediate_kernel<<<blocks, threads>>>(d_intermediate_result, // output
                                             arg1,                  // input
                                             arg2);                 // input

    // Wrap the intermediate pointer in a FutureValue -- no need to explicitly
    // sync when both kernels are stream-ordered. The pointer is read after
    // the ExclusiveScan kernel starts executing.
    cub::FutureValue<int> init_value(d_intermediate_result);

    cub::DeviceScan::ExclusiveScan(d_temp_storage,
                                   temp_storage_bytes,
                                   d_in,
                                   d_out,
                                   cub::Sum(),
                                   init_value,
                                   num_items);
Previously, an explicit synchronization would have been necessary to obtain the intermediate result, which was passed by value into ExclusiveScan. This new feature enables better performance in workflows that use cub::DeviceScan.
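The difference between passing the initial value by value and reading it lazily can be modeled on the host. This is only an analogy under stated assumptions: `DeferredValue`, `EagerValue`, and `enqueue_exclusive_sum` are hypothetical names, and the returned `std::function` stands in for work queued on a stream that runs later.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for the idea behind cub::FutureValue<T>: store a
// pointer now and read the value only when the consumer actually runs.
template <typename T>
struct DeferredValue {
  const T* ptr;
  T get() const { return *ptr; }
};

// By-value variant: the initial value is fixed when the work is queued.
template <typename T>
struct EagerValue {
  T value;
  T get() const { return value; }
};

// Returns a task that performs an exclusive sum when invoked. The task models
// stream-ordered work: it is queued first and executed later.
template <typename Init>
std::function<std::vector<int>()> enqueue_exclusive_sum(
    const std::vector<int>& in, Init init) {
  return [in, init]() {
    std::vector<int> out(in.size());
    int running = init.get();  // the initial value is read at execution time
    for (std::size_t i = 0; i < in.size(); ++i) {
      out[i] = running;
      running += in[i];
    }
    return out;
  };
}
```

If a producer writes the intermediate result after both tasks are queued but before they run, only the `DeferredValue` task sees it. The `EagerValue` task would need the producer to finish before queuing, which is the role the explicit synchronization used to play.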
Deprecation Notices
A future version of CUB will change the debug_synchronous behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).
This will only affect calls to CUB device-scope algorithms launched from device-side code with debug_synchronous = true. These algorithms will continue to print extra debugging information, but they will no longer synchronize after kernel launches.
Breaking Changes
- #305: The template parameters of cub::DispatchScan have changed to support the new cub::FutureValue helper. More details under "New Features".
- #377: Remove broken operator->() from cub::TransformInputIterator, since this cannot be implemented without returning a temporary object's address. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
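The problem behind #377 can be seen in a stripped-down transform iterator. The sketch below is an illustration only: `TransformIterSketch`, `Point`, and `SwapXY` are hypothetical and do not mirror the actual cub::TransformInputIterator definition.

```cpp
#include <cstddef>

struct Point {
  int x;
  int y;
};

// Hypothetical unary operation used for the illustration.
struct SwapXY {
  Point operator()(const Point& p) const { return Point{p.y, p.x}; }
};

// Hypothetical minimal transform iterator: applies `op` on every dereference.
template <typename Op, typename Base>
struct TransformIterSketch {
  Base base;
  Op op;

  // Returns by value: the transformed result exists only as a temporary.
  auto operator*() const { return op(*base); }

  // Deliberately no operator->(): it would have to return a pointer to the
  // temporary produced above, and that pointer would dangle as soon as the
  // function returned.
};
```

Code that previously wrote `it->x` on such an iterator can write `(*it).x` instead, which keeps the temporary alive for the duration of the full expression.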
New Features
- #305: Add overloads to cub::DeviceScan algorithms that allow the output of a previous kernel to be used as initial_value without explicit synchronization. See the new cub::FutureValue helper for details. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
- #354: Add cub::BlockRunLengthDecode algorithm. Thanks to Elias Stehle (@elstehle) for this contribution.
- #357: Add cub::DeviceSegmentedSort, an optimized alternative to cub::DeviceSegmentedRadixSort with improved load balancing and small array performance.
- #376: Add "by key" overloads to cub::DeviceScan. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
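For readers unfamiliar with "by key" scans, the following host-side reference sketches the semantics of an exclusive sum by key: the running total restarts whenever the key differs from the previous element's key. It is a CPU model for illustration, not the CUB implementation, and the function name `exclusive_sum_by_key` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for an exclusive sum by key. Each run of consecutive equal
// keys is scanned independently, starting from zero.
std::vector<int> exclusive_sum_by_key(const std::vector<int>& keys,
                                      const std::vector<int>& values) {
  std::vector<int> out(values.size());
  int running = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (i == 0 || keys[i] != keys[i - 1]) {
      running = 0;  // a new run of keys starts here
    }
    out[i] = running;
    running += values[i];
  }
  return out;
}
```

With keys `{0, 0, 1, 1, 1, 0}` and values `{1, 2, 3, 4, 5, 6}`, this produces `{0, 1, 0, 3, 7, 0}`. The final `0` key starts a fresh run because only consecutive equal keys are grouped.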
Bug Fixes
- #349: Doxygen and unused variable fixes.
- #363: Maintenance updates for the new cub::DeviceMergeSort algorithms.
- #382: Fix several -Wconversion warnings. Thanks to Matt Stack (@matt-stack) for this contribution.
- #388: Fix debug assertion on MSVC when using cub::CachingDeviceAllocator.
- #395: Support building with __CUDA_NO_HALF_CONVERSIONS__. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.