This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

CUB 1.3.0

@brycelelbach brycelelbach released this 19 May 08:27

Summary

CUB 1.3.0 improves how thread blocks are expressed in block- and warp-wide primitives and adds an enhanced version of cub::WarpScan.

Breaking Changes

  • CUB's collective (block-wide, warp-wide) primitives underwent a minor interface refactoring:
    • To properly support multidimensional thread blocks, the interfaces for collective classes are now template-parameterized by X, Y, and Z block dimensions. BLOCK_DIM_X replaces BLOCK_THREADS, and BLOCK_DIM_Y and BLOCK_DIM_Z are optional. Furthermore, the constructors that accept remapped linear thread identifiers have been removed: all primitives now assume a row-major thread ranking for multidimensional thread blocks.
    • To allow the host program (compiled by the host-pass) to accurately determine the device-specific storage requirements for a given collective (compiled for each device-pass), the interfaces for collective classes are now (optionally) template-parameterized by the desired PTX compute capability. This is useful when aliasing collective storage to shared memory that has been allocated dynamically by the host at the kernel call site.
    • Most CUB programs with typical 1D usage should not require any changes to accommodate these updates.
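
As a rough illustration of the row-major thread ranking mentioned above, the following host-side C++ sketch computes the linear rank of a thread from its (x, y, z) coordinates, with x varying fastest. The helper name `row_major_rank` is hypothetical and is not part of CUB; in a kernel, the coordinates would come from `threadIdx` and the dimensions from the block-dimension template parameters.

```cpp
#include <cassert>

// Hypothetical helper (not a CUB function): linear rank of thread (x, y, z)
// in a dim_x * dim_y * dim_z thread block under row-major ordering.
// x varies fastest, then y, then z. dim_z defaults to 1 for 2D blocks.
inline unsigned row_major_rank(unsigned x, unsigned y, unsigned z,
                               unsigned dim_x, unsigned dim_y,
                               unsigned dim_z = 1)
{
    // Coordinates must lie inside the block.
    assert(x < dim_x && y < dim_y && z < dim_z);
    return (z * dim_y + y) * dim_x + x;
}
```

For a 16x8 block, for example, thread (3, 2, 0) would have rank 2 * 16 + 3 = 35 under this ordering. For a 1D block the rank reduces to x, which is consistent with typical 1D programs needing no changes.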

New Features

  • Added "combination" cub::WarpScan methods for efficiently computing both inclusive and exclusive prefix scans (and sums).
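
To make the idea of a "combination" scan concrete, here is a sequential C++ reference sketch that produces both the inclusive and the exclusive prefix sum in a single pass. It only illustrates the results such a method would return; it is not the cub::WarpScan interface, and the names `ScanPair` and `combined_prefix_sum` are assumptions made for this example.

```cpp
#include <vector>

// Hypothetical result type (not part of CUB): both scan outputs together.
struct ScanPair {
    std::vector<int> inclusive;  // inclusive[i] = input[0] + ... + input[i]
    std::vector<int> exclusive;  // exclusive[i] = input[0] + ... + input[i-1]
};

// Sequential reference: compute both prefix sums in one pass over the input.
inline ScanPair combined_prefix_sum(const std::vector<int>& input)
{
    ScanPair out;
    out.inclusive.reserve(input.size());
    out.exclusive.reserve(input.size());

    int running = 0;  // sum of all elements seen so far
    for (int v : input) {
        out.exclusive.push_back(running);  // excludes the current element
        running += v;
        out.inclusive.push_back(running);  // includes the current element
    }
    return out;
}
```

Both outputs share the same running total, which is why computing them together can be cheaper than running two separate scans.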

Bug Fixes

  • Fix for a bug in cub::WarpScan (which also affected cub::BlockScan and cub::DeviceScan) where incorrect results (e.g., NaN) would often be returned when parameterized for floating-point types (fp32, fp64).
  • Workaround for a ptxas error when compiling with the -G flag on Linux (for debug instrumentation).
  • Fixes for certain scan scenarios using custom scan operators where code compiled for SM1x is run on newer GPUs of higher compute capability: the compiler could not tell which memory space was being used by collective operations and was mistakenly using global ops instead of shared ops.