How to correctly use conv2d split-k parallel #401
-
Hi, I modified cutlass/test/unit/conv/device/conv2d_testbed.h, lines 70 to 81 at commit 1ac4559:
-
I am looking at this now. What is the command to run your test?
-
I just modify the testbed and compile it.
-
Hi @masahi, I synced your branch and tried it. It was a small issue: the split-k mode is not set for Conv::Arguments.

```cpp
typename ImplicitGemm::Arguments arguments{
  problem_size,
  tensor_a.device_ref(),
  tensor_b.device_ref(),
  tensor_c.device_ref(),
  tensor_d.device_ref(),
  {options.alpha, options.beta},
  split_k_mode  // set split_k_mode to parallel; the ctor takes serial as a default argument
};
```

The full diff here:
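For reference, the mode itself is an enum on the conv side; a minimal sketch of setting it (assuming the device-level ImplicitGemmConvolution API used in the examples):

```cpp
#include "cutlass/conv/convolution.h"

// Request parallel split-k; the Arguments ctor otherwise defaults to
// cutlass::conv::SplitKMode::kSerial.
cutlass::conv::SplitKMode split_k_mode = cutlass::conv::SplitKMode::kParallel;
```

Note that the number of slices is carried separately, in the split_k_slices field of the Conv2dProblemSize.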
Example runs after the fix:
Benchmarking runs after the fix:
-
@masahi, there is one thing special about fp16-in, fp32-accum, fp16-out parallel split-k wgrad: the conv kernel needs to use fp32 accumulation and fp32 output, while the reduction kernel reads fp32, accumulates in fp32, and outputs fp16. Otherwise, you will have precision problems.
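A minimal sketch of such a reduction kernel, modeled on the types used in cutlass/test/unit/conv/device/conv2d_testbed.h (the tile shape and epilogue vector width here are assumptions for illustration, not values from this thread):

```cpp
#include "cutlass/matrix_shape.h"
#include "cutlass/numeric_types.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/reduction/thread/reduction_operators.h"
#include "cutlass/reduction/kernel/reduce_split_k.h"
#include "cutlass/reduction/device/reduce_split_k.h"

using ElementOutput      = cutlass::half_t;  // fp16 final output
using ElementAccumulator = float;            // fp32 partial sums in the workspace

// Epilogue: reads fp32 partials, applies alpha/beta in fp32, stores fp16.
using EpilogueOutputOp = cutlass::epilogue::thread::LinearCombination<
    ElementOutput,
    128 / cutlass::sizeof_bits<ElementOutput>::value,  // 8-wide fp16 stores
    ElementAccumulator,
    ElementAccumulator>;

// Element-wise fp32 addition across the split-k slices.
using ReductionOp = cutlass::reduction::thread::ReduceAdd<
    ElementAccumulator,
    typename EpilogueOutputOp::ElementAccumulator,
    EpilogueOutputOp::kCount>;

using ReductionKernel = cutlass::reduction::kernel::ReduceSplitK<
    cutlass::MatrixShape<4, 32 * EpilogueOutputOp::kCount>,  // reduction tile (assumed)
    EpilogueOutputOp,
    ReductionOp>;

using ReductionDevice = cutlass::reduction::device::ReduceSplitK<ReductionKernel>;
```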
-
I think you can get better perf if you change https://github.com/masahi/cutlass/blob/example-wgrad-splitk/examples/26_ampere_wgrad_mainloop_fusion/ampere_wgrad_mainloop_fusion.cu#L72 to:

It will let threads use a 128-bit load for consecutive data. If you use …
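As an aside (my illustration, not the exact change suggested above): for cutlass::half_t, a 128-bit vectorized access covers eight elements, which is where the usual alignment value of 8 comes from:

```cpp
#include "cutlass/numeric_types.h"

// 128-bit memory transactions / 16 bits per fp16 element = 8 elements,
// so an alignment (vector width) of 8 enables the widest loads and stores.
static int const kAlignment = 128 / cutlass::sizeof_bits<cutlass::half_t>::value;  // == 8
```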
-
Hi Masa, for the below data type specification:

I don't think you need two reduction kernels. The above data types mean that Wgrad accumulates in ElementAccumulator = F32 but writes the final output in ElementOutput = F16.

**Without Parallel Split-k**

Output:

The term Wgrad(Dy_F16, Dw_F16) refers to the accumulator registers produced by the Wgrad op on its inputs; those accumulators are in F32.

**With Parallel Split-k**

Thus, you need to call a different Wgrad operator (let us call it WgradForParallelSplitK) and a ParallelReduction kernel:

Summary:
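To make the two-step flow concrete, here is a minimal sketch of the launch sequence, modeled on the parallel split-k path in cutlass/test/unit/conv/device/conv2d_testbed.h; ImplicitGemm, ReductionDevice, arguments, reduction_args, and workspace are placeholders following the types discussed above, not code from this thread:

```cpp
// Step 1: the WgradForParallelSplitK conv kernel writes F32 partial sums,
// one slice per split-k partition, into a device workspace. No F16
// conversion happens in this step.
ImplicitGemm conv_op;
cutlass::Status status = conv_op.initialize(arguments, workspace.get());
if (status == cutlass::Status::kSuccess) {
  status = conv_op();
}

// Step 2: the reduction kernel sums the slices in F32 and applies the
// epilogue: alpha/beta scaling in F32, final store in F16.
// reduction_args carries the implicit-GEMM M x N extent, split_k_slices,
// the workspace pointer, and the C/D tensor refs (see the testbed).
ReductionDevice reduction_op;
status = reduction_op.initialize(reduction_args, nullptr);
if (status == cutlass::Status::kSuccess) {
  status = reduction_op();
}
```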
-
OK, I got code generation for wgrad + parallel split-k working (with hard-coded …). For each workload, I dumped the maximum and mean absolute difference between the cutlass and cudnn results, summarized here: https://gist.github.com/masahi/c75b2dc806167c77c9c6ca1fb160194b. Is having some difference expected, or should they be identical (as in the dgrad case)?
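For reference, a small sketch of how such a comparison can be computed on host-side copies of the two results (my illustration; not the code used to produce the gist):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Computes max and mean absolute difference between two result buffers
// (e.g. cutlass vs. cudnn outputs copied back to the host).
template <typename T>
void abs_diff_stats(T const* a, T const* b, int64_t n,
                    double& max_diff, double& mean_diff) {
  max_diff = 0.0;
  double sum = 0.0;
  for (int64_t i = 0; i < n; ++i) {
    double d = std::abs(double(a[i]) - double(b[i]));
    max_diff = std::max(max_diff, d);
    sum += d;
  }
  mean_diff = (n > 0) ? sum / double(n) : 0.0;
}
```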