How to correctly use conv2d split-k parallel #401
-
Hi, I modified cutlass/test/unit/conv/device/conv2d_testbed.h, lines 70 to 81 at commit 1ac4559:
-
I am looking at this now. What is the command to run your test?
-
I just modify the testbed and compile it.
-
Hi @masahi, I synced your branch and tried it. It was a small issue: the split-k mode is not set for Conv::Arguments.

```cpp
typename ImplicitGemm::Arguments arguments{
  problem_size,
  tensor_a.device_ref(),
  tensor_b.device_ref(),
  tensor_c.device_ref(),
  tensor_d.device_ref(),
  {options.alpha, options.beta},
  split_k_mode  // set split_k_mode to parallel; the ctor takes serial as a default argument
};
```

The full diff here:
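For reference, the mode itself is an enum on the conv side; a minimal sketch of setting it (assuming the device-level ImplicitGemmConvolution API used in the examples):

```cpp
#include "cutlass/conv/convolution.h"

// Request parallel split-k; the Arguments ctor otherwise defaults to
// cutlass::conv::SplitKMode::kSerial.
cutlass::conv::SplitKMode split_k_mode = cutlass::conv::SplitKMode::kParallel;
```

Note that the number of slices is carried separately, in the split_k_slices field of the Conv2dProblemSize.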
Example runs after the fix:
Benchmarking runs after the fix:
-
@masahi, there is one thing special about fp16-in, fp32-accum, fp16-out parallel split-k wgrad: the conv kernel needs to use fp32 accumulation and fp32 output, while the reduction kernel reads fp32, accumulates in fp32, and outputs fp16. Otherwise, you will have precision problems.
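A minimal sketch of such a reduction kernel, modeled on the types used in cutlass/test/unit/conv/device/conv2d_testbed.h (the tile shape and epilogue vector width here are assumptions for illustration, not values from this thread):

```cpp
#include "cutlass/matrix_shape.h"
#include "cutlass/numeric_types.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/reduction/thread/reduction_operators.h"
#include "cutlass/reduction/kernel/reduce_split_k.h"
#include "cutlass/reduction/device/reduce_split_k.h"

using ElementOutput      = cutlass::half_t;  // fp16 final output
using ElementAccumulator = float;            // fp32 partial sums in the workspace

// Epilogue: reads fp32 partials, applies alpha/beta in fp32, stores fp16.
using EpilogueOutputOp = cutlass::epilogue::thread::LinearCombination<
    ElementOutput,
    128 / cutlass::sizeof_bits<ElementOutput>::value,  // 8-wide fp16 stores
    ElementAccumulator,
    ElementAccumulator>;

// Element-wise fp32 addition across the split-k slices.
using ReductionOp = cutlass::reduction::thread::ReduceAdd<
    ElementAccumulator,
    typename EpilogueOutputOp::ElementAccumulator,
    EpilogueOutputOp::kCount>;

using ReductionKernel = cutlass::reduction::kernel::ReduceSplitK<
    cutlass::MatrixShape<4, 32 * EpilogueOutputOp::kCount>,  // reduction tile (assumed)
    EpilogueOutputOp,
    ReductionOp>;

using ReductionDevice = cutlass::reduction::device::ReduceSplitK<ReductionKernel>;
```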
-
I think you can get better perf if you change https://github.com/masahi/cutlass/blob/example-wgrad-splitk/examples/26_ampere_wgrad_mainloop_fusion/ampere_wgrad_mainloop_fusion.cu#L72 to:

It will let threads use a 128-bit load for consecutive data. If you use …
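As an aside (my illustration, not the exact change suggested above): for cutlass::half_t, a 128-bit vectorized access covers eight elements, which is where the usual alignment value of 8 comes from:

```cpp
#include "cutlass/numeric_types.h"

// 128-bit memory transactions / 16 bits per fp16 element = 8 elements,
// so an alignment (vector width) of 8 enables the widest loads and stores.
static int const kAlignment = 128 / cutlass::sizeof_bits<cutlass::half_t>::value;  // == 8
```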
-
Hi Masa, for the below data type specification:

I don't think you need two reduction kernels. The above data types mean that Wgrad accumulates in ElementAccumulator = F32 but writes the final output in ElementOutput = F16.

**Without Parallel Split-k**

Output:

The term Wgrad(Dy_F16, Dw_F16) refers to the accumulator registers produced by the Wgrad op on its inputs; those accumulators are in F32.

**With Parallel Split-k**

Thus, you need to call a different Wgrad operator (let us call it WgradForParallelSplitK) and a ParallelReduction kernel:

Summary:
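To make the two-step flow concrete, here is a minimal sketch of the launch sequence, modeled on the parallel split-k path in cutlass/test/unit/conv/device/conv2d_testbed.h; ImplicitGemm, ReductionDevice, arguments, reduction_args, and workspace are placeholders following the types discussed above, not code from this thread:

```cpp
// Step 1: the WgradForParallelSplitK conv kernel writes F32 partial sums,
// one slice per split-k partition, into a device workspace. No F16
// conversion happens in this step.
ImplicitGemm conv_op;
cutlass::Status status = conv_op.initialize(arguments, workspace.get());
if (status == cutlass::Status::kSuccess) {
  status = conv_op();
}

// Step 2: the reduction kernel sums the slices in F32 and applies the
// epilogue: alpha/beta scaling in F32, final store in F16.
// reduction_args carries the implicit-GEMM M x N extent, split_k_slices,
// the workspace pointer, and the C/D tensor refs (see the testbed).
ReductionDevice reduction_op;
status = reduction_op.initialize(reduction_args, nullptr);
if (status == cutlass::Status::kSuccess) {
  status = reduction_op();
}
```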
-
OK, I got code generation for wgrad + parallel split-k working (with hard-coded …). For each workload, I dumped the maximum and mean absolute difference between the cutlass and cudnn results, summarized here: https://gist.github.com/masahi/c75b2dc806167c77c9c6ca1fb160194b. Is having some difference expected, or should they be identical (as in the dgrad case)?
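For reference, a small sketch of how such a comparison can be computed on host-side copies of the two results (my illustration; not the code used to produce the gist):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Computes max and mean absolute difference between two result buffers
// (e.g. cutlass vs. cudnn outputs copied back to the host).
template <typename T>
void abs_diff_stats(T const* a, T const* b, int64_t n,
                    double& max_diff, double& mean_diff) {
  max_diff = 0.0;
  double sum = 0.0;
  for (int64_t i = 0; i < n; ++i) {
    double d = std::abs(double(a[i]) - double(b[i]));
    max_diff = std::max(max_diff, d);
    sum += d;
  }
  mean_diff = (n > 0) ? sum / double(n) : 0.0;
}
```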