Performance gap of same kernel in different cuda version? #850
Unanswered
LeiWang1999
asked this question in
Q&A
Replies: 1 comment 4 replies
-
We work closely with the nvcc team to improve the performance of CUTLASS. Each nvcc release usually improves some types of CUTLASS kernels. For GEMM, 11.3 is the minimum. We recommend using the latest nvcc, since it has the most recent optimizations. nvcc also enables newer hardware features in different versions, and CUTLASS uses those features in its kernels when they are available in nvcc.
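To compare a local toolchain against the 11.3 figure mentioned above, one option is to parse the release line printed by `nvcc --version`. The following is a minimal Python sketch. The helper names `parse_nvcc_release` and `meets_minimum` are hypothetical, and the sample string only mirrors the usual shape of that release line.

```python
import re

def parse_nvcc_release(version_text):
    """Return (major, minor) found after the word 'release', or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def meets_minimum(version_text, minimum=(11, 3)):
    """True if the parsed release is at least `minimum` (assumed threshold)."""
    release = parse_nvcc_release(version_text)
    return release is not None and release >= minimum

# Illustrative input only; real text would come from running `nvcc --version`.
sample = "Cuda compilation tools, release 11.8, V11.8.89"
print(parse_nvcc_release(sample), meets_minimum(sample))
```

In practice the text could be obtained with `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`, assuming nvcc is on the PATH.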
-
Hi all, I'm benchmarking CUTLASS with cutlass_profiler on my 24 GB RTX 3090. However, I found that the best-performing kernel is not the same under CUDA 11.1 and CUDA 11.8.
In CUDA 11.1 the best kernel is cutlass_tensorop_h1688gemm_256x128_32x2_tt_align8, while in CUDA 11.8 the best is cutlass_tensorop_h16816gemm_256x128_32x3_tt_align8. By the way, cutlass_tensorop_h1688gemm_256x128_32x2_tt_align8 performs the same in CUDA 11.8 as it does in CUDA 11.1.
But cutlass_tensorop_h16816gemm_256x128_32x3_tt_align8 performs very badly in CUDA 11.1. From my profile, I found that this kernel does a lot of local memory reads and writes, which may be caused by register spilling. Could CUDA 11.8 perform better because nvcc fixed some spill cases?
I also noticed that, for different CUDA versions, CUTLASS enables some new features, like L2 cache prefetch or grid_constant. The generated code should be no different apart from these features. Is my understanding correct?
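One way to check the register-spill hypothesis is to build the same kernel with each toolkit while passing `-Xptxas -v` to nvcc, then compare the resource summary that ptxas prints per kernel. The sketch below is a minimal Python parser for that summary. The function names are hypothetical, and `SAMPLE_LOG` is a made-up input (invented kernel names and numbers) that only mimics the layout of the ptxas messages. It is not measured data from either CUDA version.

```python
import re

ENTRY_RE = re.compile(r"Compiling entry function '([^']+)'")
SPILL_RE = re.compile(
    r"(\d+) bytes stack frame, (\d+) bytes spill stores, (\d+) bytes spill loads"
)
REGS_RE = re.compile(r"Used (\d+) registers")

def summarize_ptxas(log_text):
    """Map each entry function to its stack, spill, and register figures."""
    kernels = {}
    current = None
    for line in log_text.splitlines():
        entry = ENTRY_RE.search(line)
        if entry:
            current = entry.group(1)
            kernels[current] = {"stack": 0, "spill_stores": 0,
                                "spill_loads": 0, "registers": None}
            continue
        if current is None:
            continue  # ignore lines that appear before the first entry function
        spill = SPILL_RE.search(line)
        if spill:
            stack, stores, loads = (int(g) for g in spill.groups())
            kernels[current].update(stack=stack, spill_stores=stores,
                                    spill_loads=loads)
        regs = REGS_RE.search(line)
        if regs:
            kernels[current]["registers"] = int(regs.group(1))
    return kernels

def spilling_kernels(log_text):
    """Names of kernels that report any spill stores or spill loads."""
    return [name for name, info in summarize_ptxas(log_text).items()
            if info["spill_stores"] or info["spill_loads"]]

# Made-up example input: hypothetical kernel names and numbers.
SAMPLE_LOG = """\
ptxas info    : Compiling entry function 'kernel_a' for 'sm_86'
ptxas info    : Function properties for kernel_a
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 32 registers, 384 bytes cmem[0]
ptxas info    : Compiling entry function 'kernel_b' for 'sm_86'
ptxas info    : Function properties for kernel_b
    512 bytes stack frame, 480 bytes spill stores, 496 bytes spill loads
ptxas info    : Used 255 registers, 384 bytes cmem[0]
"""

print(spilling_kernels(SAMPLE_LOG))
```

Running this on the build logs from CUDA 11.1 and CUDA 11.8 and diffing the two summaries would show whether the h16816 kernel spills under one compiler but not the other. The parser assumes each kernel is a single entry function; separately compiled device functions would need extra handling.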