CUTLASS 2.9.1 is finally tagged. Compared with 2.8, this release adds many new features, all of which are listed in CHANGELOG.md. We discuss them in a little more detail below.
We expand our presence in HPC by supporting a series of BLAS3 kernels: SYRK, HERK, SYR2K, HER2K, SYMM, and TRMM. The new but simple algorithm we use is up to 7x faster than the previous state of the art. These kernels come in all kinds of data types: f32, cf32, f64, cf64, tf32x3, and complex tf32x3, and they are all supported in the profiler.
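As a refresher on the semantics these kernels implement, a rank-k update (SYRK) computes C = alpha * A * A^T + beta * C and only touches one triangle of the symmetric output. Below is a minimal CPU sketch of that math (illustration only, not the CUTLASS device API):

```cpp
#include <vector>

// Reference semantics of SYRK: C = alpha * A * A^T + beta * C,
// updating only the lower triangle of the N x N symmetric matrix C.
// A is N x K, row-major. Illustration only; the CUTLASS kernels do this
// on the GPU with Tensor Cores.
void syrk_reference(int N, int K, float alpha, const std::vector<float>& A,
                    float beta, std::vector<float>& C) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j <= i; ++j) {           // lower triangle only
      float acc = 0.f;
      for (int k = 0; k < K; ++k) {
        acc += A[i * K + k] * A[j * K + k];  // (A * A^T)(i, j)
      }
      C[i * N + j] = alpha * acc + beta * C[i * N + j];
    }
  }
}
```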
Small-alignment implicit GEMM support for Fprop/Dgrad/Wgrad means that padding the channel count to a multiple of 128 bits is no longer required to use Tensor Cores. It was developed by @mengchihe from the community.
We added first-layer convolution kernels, which are useful when the input channel count is tiny. In these cases, the new kernels are faster than using the small-alignment kernels described above. There are two variants: fixed channels requires the alignment to equal the input channel count; few channels requires the input channel count to be a multiple of the alignment. These kernels are supported in the profiler too; just look for fixed_channels or few_channels in the kernel name.
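To make the two eligibility rules concrete, here is a small sketch (the helper names are ours, not CUTLASS identifiers); the alignment is the kernel's vectorized access width in elements, e.g. 8 elements for fp16 with 128-bit accesses:

```cpp
// Illustration of when each first-layer conv variant applies, given the
// input channel count C and the kernel's access alignment in elements.
// Helper names are illustrative, not CUTLASS identifiers.
inline int alignment_in_elements(int element_bits) {
  return 128 / element_bits;             // 128-bit vectorized access width
}

inline bool fixed_channels_applies(int C, int alignment) {
  return C == alignment;                 // alignment must equal the channel count
}

inline bool few_channels_applies(int C, int alignment) {
  return C % alignment == 0;             // channel count must be a multiple of the alignment
}
```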
We added initial Python support in an SDK example, which includes a Python runtime and JIT compilation. So far, we support GEMM. We will improve the interface and add convolution soon.
A GEMM + Softmax example is added, which is essential for Transformers. It fuses the partial max computation into the epilogue of the preceding GEMM; two separate kernels are used to do the remaining work. We will keep improving this example.
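The reason the partial max is worth fusing: a numerically stable softmax subtracts the row maximum before exponentiating, so producing row maxima as a by-product of the GEMM epilogue saves a full extra pass over the output. A scalar sketch of the remaining math, assuming the row max is already available:

```cpp
#include <cmath>

// Numerically stable softmax over one row of length n, assuming the row
// maximum was already produced (e.g. as a by-product of the GEMM epilogue).
// Roughly the exp/sum/normalize work that the follow-up kernels do on the GPU.
void softmax_row(const float* x, float row_max, int n, float* y) {
  float sum = 0.f;
  for (int i = 0; i < n; ++i) {
    y[i] = std::exp(x[i] - row_max);   // shift by the max for stability
    sum += y[i];
  }
  for (int i = 0; i < n; ++i) {
    y[i] /= sum;
  }
}
```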
Gather and Scatter Fusion with GEMM can gather inputs and scatter outputs based on index vectors within the same GEMM kernel. It can be considered a fused kernel or a new way to do sparse GEMM (see the sketch after this list). Specifically:
- It can select random rows in a row-major matrix.
- It can select random columns in a column-major matrix.
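Here is a minimal reference sketch of the fused semantics for the row-gather/row-scatter case (the index vector names are illustrative, not the CUTLASS argument names):

```cpp
#include <vector>

// Reference semantics of the gather/scatter GEMM fusion (row-major A, B, D).
// gather_rows selects which rows of A participate; scatter_rows says where
// each computed result row is written in D. Illustration only.
void gather_gemm_scatter(int M, int N, int K,
                         const std::vector<int>& gather_rows,   // size M
                         const std::vector<int>& scatter_rows,  // size M
                         const std::vector<float>& A,           // source rows x K
                         const std::vector<float>& B,           // K x N
                         std::vector<float>& D) {               // destination rows x N
  for (int i = 0; i < M; ++i) {
    const float* a_row = &A[gather_rows[i] * K];   // gathered input row
    float* d_row = &D[scatter_rows[i] * N];        // scattered output row
    for (int j = 0; j < N; ++j) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) {
        acc += a_row[k] * B[k * N + j];
      }
      d_row[j] = acc;
    }
  }
}
```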
Back-to-back GEMM/CONV now fully supports buffering the first GEMM/CONV's results in shared memory for the second one to use. This can eliminate register spilling when the tile size is big, and it gives users more tile sizes to choose from. Additionally, bias vector add is supported in the first GEMM/CONV (a sketch of the fused math follows the list below). These two enhancements are:
- Supported kernels: GEMM and CONV.
- Supported data types: fp16 and int8.
- Supported architectures: Turing and Ampere.
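Conceptually, the fusion computes D0 = epilogue(A * W0 + bias0) followed by D1 = D0 * W1 in a single kernel, with D0 held on chip (in registers or, with this release, in shared memory) instead of round-tripping through global memory. A scalar reference of that math, with ReLU standing in as a placeholder epilogue (an assumption for illustration, not a statement of the supported epilogues):

```cpp
#include <algorithm>
#include <vector>

// Reference math for the back-to-back GEMM fusion:
//   D0 = relu(A * W0 + bias0)   (bias add in the first GEMM's epilogue)
//   D1 = D0 * W1
// In the fused kernel, D0 never leaves the chip; with this release it can be
// staged in shared memory instead of registers. ReLU is only a placeholder
// epilogue here. Row-major layouts assumed.
void b2b_gemm_reference(int M, int N0, int K0, int N1,
                        const std::vector<float>& A,      // M x K0
                        const std::vector<float>& W0,     // K0 x N0
                        const std::vector<float>& bias0,  // N0
                        const std::vector<float>& W1,     // N0 x N1
                        std::vector<float>& D1) {         // M x N1
  std::vector<float> D0(M * N0);
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N0; ++j) {
      float acc = bias0[j];
      for (int k = 0; k < K0; ++k) acc += A[i * K0 + k] * W0[k * N0 + j];
      D0[i * N0 + j] = std::max(acc, 0.f);   // placeholder ReLU epilogue
    }
  }
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N1; ++j) {
      float acc = 0.f;
      for (int k = 0; k < N0; ++k) acc += D0[i * N0 + k] * W1[k * N1 + j];
      D1[i * N1 + j] = acc;
    }
  }
}
```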
Transposed convolution (a.k.a. deconvolution) is supported, reusing the Dgrad implementation. This was developed by @masahi from the TVM community. Keep in mind that CUTLASS strided Dgrad is very fast.
Utility functions are added that can pad NHWC tensors and convert between NCHW and NHWC layouts.
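The layout conversion itself is just an index permutation; here is a minimal sketch of NCHW to NHWC (the library utilities perform the equivalent on the device):

```cpp
#include <vector>

// Reference index mapping for NCHW -> NHWC conversion. The CUTLASS utilities
// perform the same permutation (plus optional channel padding) on the device.
void nchw_to_nhwc_reference(int N, int C, int H, int W,
                            const std::vector<float>& nchw,
                            std::vector<float>& nhwc) {
  for (int n = 0; n < N; ++n)
    for (int h = 0; h < H; ++h)
      for (int w = 0; w < W; ++w)
        for (int c = 0; c < C; ++c)
          nhwc[((n * H + h) * W + w) * C + c] =
              nchw[((n * C + c) * H + h) * W + w];
}
```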
Grouped GEMM has a bug fix so that it now calculates the threadblock count correctly to fully saturate the GPU. The fix affects kernels whose occupancy is larger than 2 threadblocks/SM, and we observe up to 30% speedup in those cases. So far, grouped GEMM has seen great success in NLP, ranking, HPC, and GNN workloads.
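To illustrate the idea behind the fix: a persistent-kernel launch saturates the GPU when it launches (active blocks per SM) x (number of SMs) threadblocks. A sketch using the CUDA occupancy API, with the kernel, block size, and shared memory size as placeholders (this is not the CUTLASS scheduler code itself):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for the grouped GEMM kernel.
__global__ void my_grouped_gemm_kernel() {}

// Size a persistent-kernel launch to fill every SM:
// (active blocks per SM at this occupancy) * (number of SMs).
int persistent_threadblock_count(int block_size, size_t dynamic_smem_bytes) {
  int device = 0;
  cudaGetDevice(&device);

  int sm_count = 0;
  cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);

  int blocks_per_sm = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocks_per_sm, my_grouped_gemm_kernel, block_size, dynamic_smem_bytes);

  return blocks_per_sm * sm_count;  // enough threadblocks to saturate the GPU
}
```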