CUTLASS 2.9.1 is finally tagged. Compared with 2.8, this release adds many new features, all of which are listed in CHANGELOG.md. We discuss them in a little more detail below.
We expand our presence in HPC by supporting a series of BLAS3 kernels: SYRK, HERK, SYR2K, HER2K, SYMM, and TRMM. The new but simple algorithm we use is up to 7x faster than the previous state of the art. These kernels come in all kinds of data types: f32, cf32, f64, cf64, tf32x3, and complex tf32x3, and they are all supported in the profiler.
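As a refresher on the semantics these kernels implement, a rank-k update (SYRK) computes C = alpha * A * A^T + beta * C and only touches one triangle of the symmetric output. Below is a minimal CPU sketch of that math (illustration only, not the CUTLASS device API):

```cpp
#include <vector>

// Reference semantics of SYRK: C = alpha * A * A^T + beta * C,
// updating only the lower triangle of the N x N symmetric matrix C.
// A is N x K, row-major. Illustration only; the CUTLASS kernels do this
// on the GPU with Tensor Cores.
void syrk_reference(int N, int K, float alpha, const std::vector<float>& A,
                    float beta, std::vector<float>& C) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j <= i; ++j) {           // lower triangle only
      float acc = 0.f;
      for (int k = 0; k < K; ++k) {
        acc += A[i * K + k] * A[j * K + k];  // (A * A^T)(i, j)
      }
      C[i * N + j] = alpha * acc + beta * C[i * N + j];
    }
  }
}
```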
Small-alignment implicit GEMM support for Fprop/Dgrad/Wgrad means that padding the channel count to a multiple of 128 bits is no longer required to use Tensor Cores. It was developed by @mengchihe from the community.
We added first-layer convolution kernels, which are useful when the input channel count is tiny. In these cases, the new kernels are faster than using the small-alignment kernels described above. There are two variants: fixed channels requires the alignment to equal the input channel count; few channels requires the input channel count to be a multiple of the alignment. These kernels are supported in the profiler too; just look for fixed_channels or few_channels in the kernel name.
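To make the two eligibility rules concrete, here is a small sketch (the helper names are ours, not CUTLASS identifiers); the alignment is the kernel's vectorized access width in elements, e.g. 8 elements for fp16 with 128-bit accesses:

```cpp
// Illustration of when each first-layer conv variant applies, given the
// input channel count C and the kernel's access alignment in elements.
// Helper names are illustrative, not CUTLASS identifiers.
inline int alignment_in_elements(int element_bits) {
  return 128 / element_bits;             // 128-bit vectorized access width
}

inline bool fixed_channels_applies(int C, int alignment) {
  return C == alignment;                 // alignment must equal the channel count
}

inline bool few_channels_applies(int C, int alignment) {
  return C % alignment == 0;             // channel count must be a multiple of the alignment
}
```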
We added initial Python support in an SDK example, which includes a Python runtime and JIT compilation. So far, we support GEMM. We will improve the interface and add convolution soon.
A GEMM + Softmax example is added, which is essential for Transformers. It fuses the partial max computation into the epilogue of the preceding GEMM; two separate kernels are used to do the remaining work. We will keep improving this example.
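The reason the partial max is worth fusing: a numerically stable softmax subtracts the row maximum before exponentiating, so producing row maxima as a by-product of the GEMM epilogue saves a full extra pass over the output. A scalar sketch of the remaining math, assuming the row max is already available:

```cpp
#include <cmath>

// Numerically stable softmax over one row of length n, assuming the row
// maximum was already produced (e.g. as a by-product of the GEMM epilogue).
// Roughly the exp/sum/normalize work that the follow-up kernels do on the GPU.
void softmax_row(const float* x, float row_max, int n, float* y) {
  float sum = 0.f;
  for (int i = 0; i < n; ++i) {
    y[i] = std::exp(x[i] - row_max);   // shift by the max for stability
    sum += y[i];
  }
  for (int i = 0; i < n; ++i) {
    y[i] /= sum;
  }
}
```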
Gather and Scatter Fusion with GEMM can gather inputs and scatter outputs based on index vectors within the same GEMM kernel. It can be considered a fused kernel or a new way to do sparse GEMM (see the sketch after this list). Specifically:
- It can select random rows in a row-major matrix.
- It can select random columns in a column-major matrix.
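Here is a minimal reference sketch of the fused semantics for the row-gather/row-scatter case (the index vector names are illustrative, not the CUTLASS argument names):

```cpp
#include <vector>

// Reference semantics of the gather/scatter GEMM fusion (row-major A, B, D).
// gather_rows selects which rows of A participate; scatter_rows says where
// each computed result row is written in D. Illustration only.
void gather_gemm_scatter(int M, int N, int K,
                         const std::vector<int>& gather_rows,   // size M
                         const std::vector<int>& scatter_rows,  // size M
                         const std::vector<float>& A,           // source rows x K
                         const std::vector<float>& B,           // K x N
                         std::vector<float>& D) {               // destination rows x N
  for (int i = 0; i < M; ++i) {
    const float* a_row = &A[gather_rows[i] * K];   // gathered input row
    float* d_row = &D[scatter_rows[i] * N];        // scattered output row
    for (int j = 0; j < N; ++j) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) {
        acc += a_row[k] * B[k * N + j];
      }
      d_row[j] = acc;
    }
  }
}
```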
Back-to-back GEMM/CONV now fully supports buffering the first GEMM/CONV's results in shared memory for the second one to use. This can eliminate register spilling when the tile size is big, and it gives users more tile sizes to choose from. Additionally, bias vector add is supported in the first GEMM/CONV (a sketch of the fused math follows the list below). These two enhancements are:
- Supported kernels: GEMM and CONV.
- Supported data types: fp16 and int8.
- Supported architectures: Turing and Ampere.
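Conceptually, the fusion computes D0 = epilogue(A * W0 + bias0) followed by D1 = D0 * W1 in a single kernel, with D0 held on chip (in registers or, with this release, in shared memory) instead of round-tripping through global memory. A scalar reference of that math, with ReLU standing in as a placeholder epilogue (an assumption for illustration, not a statement of the supported epilogues):

```cpp
#include <algorithm>
#include <vector>

// Reference math for the back-to-back GEMM fusion:
//   D0 = relu(A * W0 + bias0)   (bias add in the first GEMM's epilogue)
//   D1 = D0 * W1
// In the fused kernel, D0 never leaves the chip; with this release it can be
// staged in shared memory instead of registers. ReLU is only a placeholder
// epilogue here. Row-major layouts assumed.
void b2b_gemm_reference(int M, int N0, int K0, int N1,
                        const std::vector<float>& A,      // M x K0
                        const std::vector<float>& W0,     // K0 x N0
                        const std::vector<float>& bias0,  // N0
                        const std::vector<float>& W1,     // N0 x N1
                        std::vector<float>& D1) {         // M x N1
  std::vector<float> D0(M * N0);
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N0; ++j) {
      float acc = bias0[j];
      for (int k = 0; k < K0; ++k) acc += A[i * K0 + k] * W0[k * N0 + j];
      D0[i * N0 + j] = std::max(acc, 0.f);   // placeholder ReLU epilogue
    }
  }
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N1; ++j) {
      float acc = 0.f;
      for (int k = 0; k < N0; ++k) acc += D0[i * N0 + k] * W1[k * N1 + j];
      D1[i * N1 + j] = acc;
    }
  }
}
```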
Transposed convolution (a.k.a. deconvolution) is supported, reusing the Dgrad implementation. This was developed by @masahi from the TVM community. Keep in mind that CUTLASS strided Dgrad is very fast.
Utility functions are added that can pad NHWC tensors and convert between NCHW and NHWC layouts.
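The layout conversion itself is just an index permutation; here is a minimal sketch of NCHW to NHWC (the library utilities perform the equivalent on the device):

```cpp
#include <vector>

// Reference index mapping for NCHW -> NHWC conversion. The CUTLASS utilities
// perform the same permutation (plus optional channel padding) on the device.
void nchw_to_nhwc_reference(int N, int C, int H, int W,
                            const std::vector<float>& nchw,
                            std::vector<float>& nhwc) {
  for (int n = 0; n < N; ++n)
    for (int h = 0; h < H; ++h)
      for (int w = 0; w < W; ++w)
        for (int c = 0; c < C; ++c)
          nhwc[((n * H + h) * W + w) * C + c] =
              nchw[((n * C + c) * H + h) * W + w];
}
```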
Grouped GEMM has a bug fix so that it now calculates the threadblock count correctly to fully saturate the GPU. The fix affects kernels whose occupancy is larger than 2 threadblocks/SM, and we observe up to 30% speedup in those cases. So far, grouped GEMM has seen great success in NLP, ranking, HPC, and GNN workloads.
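To illustrate the idea behind the fix: a persistent-kernel launch saturates the GPU when it launches (active blocks per SM) x (number of SMs) threadblocks. A sketch using the CUDA occupancy API, with the kernel, block size, and shared memory size as placeholders (this is not the CUTLASS scheduler code itself):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for the grouped GEMM kernel.
__global__ void my_grouped_gemm_kernel() {}

// Size a persistent-kernel launch to fill every SM:
// (active blocks per SM at this occupancy) * (number of SMs).
int persistent_threadblock_count(int block_size, size_t dynamic_smem_bytes) {
  int device = 0;
  cudaGetDevice(&device);

  int sm_count = 0;
  cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);

  int blocks_per_sm = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocks_per_sm, my_grouped_gemm_kernel, block_size, dynamic_smem_bytes);

  return blocks_per_sm * sm_count;  // enough threadblocks to saturate the GPU
}
```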