[LLVMGPUVectorDistribute] Add support for inter-subgroup multi_reduction #19596

manupak · 2025-01-03T17:38:00Z

This commit adds support for distribute multi_reductions where the reduction dimension(s) is/are distributed across subgroups.

We perform the existing reduction distribution, however, we are left with partial reductions accross subgroups.

Thereafter, we insert tranfer_write / transfer_read to shared memory to achieve a layout change where
we re-distribute reduction subgroup tiles into element tile. Finally, we do another multi_reduction to complete the reduction.

closes: #19578

manupak · 2025-01-03T17:39:54Z

Im putting this to draft as I have not yet numerically verified these changes.

This PR basically implements approach 3 of https://hackmd.io/bXwT715pQU-8rrS78LKCbw

compiler/src/iree/compiler/Codegen/Common/GPU/GPUVectorDistribution.cpp

This commit adds support for distribute multi_reductions where the reduction dimension(s) is/are distributed across subgroups. We perform the existing reduction distribution, however, we are left with partial reductions accross subgroups. Thereafter, we insert tranfer_write / transfer_read to shared memory to achieve a layout change where we re-distribute reduction subgroup tiles into element tile. Finally, we do another multi_reduction to complete the reduction. Signed-off-by: Manupa Karunaratne <[email protected]>

manupak · 2025-01-07T11:23:39Z

I have verified this now with following two test cases for numerical accuracy:

#tuning = #iree_codegen.compilation_info<
  lowering_config = #iree_gpu.lowering_config<{workgroup = [0],
                                               reduction = [1024],
                                               thread_basis = [[64], [0]],
                                               subgroup_basis = [[4], [0]]
                                               }>,
  translation_info = #iree_codegen.translation_info< pipeline = LLVMGPUVectorDistribute
                                               workgroup_size = [256, 1, 1]
                                               subgroup_size = 64, {}>
>

  hal.executable private @reduce_fp16_dispatch_0 {
    hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {abi = "hip", iree.gpu.target = #iree_gpu.target<arch = "gfx942", features = "", wgp = <compute =  fp64|fp32|fp16|int64|int32|int16|int8, storage =  b64|b32|b16|b8, subgroup =  shuffle|arithmetic, dot =  dp4xi8toi32, mma = [<MFMA_F32_16x16x4_F32>, <MFMA_F32_16x16x16_F16>, <MFMA_F32_32x32x8_F16>, <MFMA_F64_16x16x4_F64>, <MFMA_F32_16x16x16_BF16>, <MFMA_F32_32x32x8_BF16>, <MFMA_F32_16x16x32_F8E5M2FNUZ>, <MFMA_F32_16x16x32_F8E5M2FNUZ_F8E4M3FNUZ>, <MFMA_F32_16x16x32_F8E4M3FNUZ>, <MFMA_F32_16x16x32_F8E4M3FNUZ_F8E5M2FNUZ>, <MFMA_F32_32x32x16_F8E5M2FNUZ>, <MFMA_F32_32x32x16_F8E5M2FNUZ_F8E4M3FNUZ>, <MFMA_F32_32x32x16_F8E4M3FNUZ>, <MFMA_F32_32x32x16_F8E4M3FNUZ_F8E5M2FNUZ>, <MFMA_I32_16x16x32_I8>, <MFMA_I32_32x32x16_I8>], subgroup_size_choices = [64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536, max_workgroup_counts = [2147483647, 2147483647, 2147483647], max_load_instruction_bits = 128, simds_per_wgp = 4, vgpr_space_bits = 16384>>, ukernels = "none"}>) {
      hal.executable.export public @reduce_fp16_dispatch_0_generic_4096_f32 ordinal(0) layout(#hal.pipeline.layout<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) {
      ^bb0(%arg0: !hal.device):
        %x, %y, %z = flow.dispatch.workgroup_count_from_slice 
        hal.return %x, %y, %z : index, index, index
      }
      builtin.module {
        func.func @reduce_fp16_dispatch_0_generic_4096_f32() {
          %cst = arith.constant 0.000000e+00 : f32
          %c0 = arith.constant 0 : index
          %0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<4096xf32>>
          %1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<f32>>
          %2 = flow.dispatch.tensor.load %0, offsets = [0], sizes = [4096], strides = [1] : !flow.dispatch.tensor<readonly:tensor<4096xf32>> -> tensor<4096xf32>
          %3 = tensor.empty() : tensor<f32>
          %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<f32>) -> tensor<f32>
          %5 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>], iterator_types = ["reduction"], compilation_info = #tuning} ins(%2 : tensor<4096xf32>) outs(%4 : tensor<f32>) {
          ^bb0(%in: f32, %out: f32):
            %6 = arith.addf %out, %in : f32
            linalg.yield %6 : f32
          } -> tensor<f32>
          flow.dispatch.tensor.store %5, %1, offsets = [], sizes = [], strides = [] : tensor<f32> -> !flow.dispatch.tensor<writeonly:tensor<f32>>
          return
        }
      }
    }
  }

#tuning = #iree_codegen.compilation_info<
  lowering_config = #iree_gpu.lowering_config<{workgroup = [4, 0],
                                               reduction = [0, 256],
                                               thread_basis = [[4, 16], [0, 1]],
                                               subgroup_basis = [[1, 4], [0, 1]]
                                               }>,
  translation_info = #iree_codegen.translation_info< pipeline = LLVMGPUVectorDistribute
                                               workgroup_size = [256, 1, 1]
                                               subgroup_size = 64, {}>
>

hal.executable private @reduce_fp16_dispatch_0 {
    hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {abi = "hip", iree.gpu.target = #iree_gpu.target<arch = "gfx942", features = "", wgp = <compute =  fp64|fp32|fp16|int64|int32|int16|int8, storage =  b64|b32|b16|b8, subgroup =  shuffle|arithmetic, dot =  dp4xi8toi32, mma = [<MFMA_F32_16x16x4_F32>, <MFMA_F32_16x16x16_F16>, <MFMA_F32_32x32x8_F16>, <MFMA_F64_16x16x4_F64>, <MFMA_F32_16x16x16_BF16>, <MFMA_F32_32x32x8_BF16>, <MFMA_F32_16x16x32_F8E5M2FNUZ>, <MFMA_F32_16x16x32_F8E5M2FNUZ_F8E4M3FNUZ>, <MFMA_F32_16x16x32_F8E4M3FNUZ>, <MFMA_F32_16x16x32_F8E4M3FNUZ_F8E5M2FNUZ>, <MFMA_F32_32x32x16_F8E5M2FNUZ>, <MFMA_F32_32x32x16_F8E5M2FNUZ_F8E4M3FNUZ>, <MFMA_F32_32x32x16_F8E4M3FNUZ>, <MFMA_F32_32x32x16_F8E4M3FNUZ_F8E5M2FNUZ>, <MFMA_I32_16x16x32_I8>, <MFMA_I32_32x32x16_I8>], subgroup_size_choices = [64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536, max_workgroup_counts = [2147483647, 2147483647, 2147483647], max_load_instruction_bits = 128, simds_per_wgp = 4, vgpr_space_bits = 16384>>, ukernels = "none"}>) {
      hal.executable.export public @reduce_fp16_dispatch_0_generic_4x4096_f32 ordinal(0) layout(#hal.pipeline.layout<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) {
      ^bb0(%arg0: !hal.device):
        %x, %y, %z = flow.dispatch.workgroup_count_from_slice 
        hal.return %x, %y, %z : index, index, index
      }
      builtin.module {
        func.func @reduce_fp16_dispatch_0_generic_4x4096_f32() {
          %cst = arith.constant 0.000000e+00 : f32
          %c0 = arith.constant 0 : index
          %0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<4x4096xf32>>
          %1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<4xf32>>
          %2 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [4, 4096], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<4x4096xf32>> -> tensor<4x4096xf32>
          %3 = tensor.empty() : tensor<4xf32>
          %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<4xf32>) -> tensor<4xf32>
          %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"], compilation_info = #tuning} ins(%2 : tensor<4x4096xf32>) outs(%4 : tensor<4xf32>) {
          ^bb0(%in: f32, %out: f32):
            %6 = arith.addf %out, %in : f32
            linalg.yield %6 : f32
          } -> tensor<4xf32>
          flow.dispatch.tensor.store %5, %1, offsets = [0], sizes = [4], strides = [1] : tensor<4xf32> -> !flow.dispatch.tensor<writeonly:tensor<4xf32>>
          return
        }
      }
    }
  }

manupak requested review from antiagainst and qedawkins as code owners January 3, 2025 17:38

manupak marked this pull request as draft January 3, 2025 17:38

manupak requested a review from Groverkss January 3, 2025 17:40

manupak commented Jan 3, 2025

View reviewed changes

compiler/src/iree/compiler/Codegen/Common/GPU/GPUVectorDistribution.cpp Outdated Show resolved Hide resolved

manupak force-pushed the inter-sg-reduce branch 5 times, most recently from 0b704e1 to 993249f Compare January 7, 2025 11:20

manupak marked this pull request as ready for review January 7, 2025 11:23

manupak requested a review from MaheshRavishankar January 7, 2025 11:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LLVMGPUVectorDistribute] Add support for inter-subgroup multi_reduction #19596

[LLVMGPUVectorDistribute] Add support for inter-subgroup multi_reduction #19596

manupak commented Jan 3, 2025

manupak commented Jan 3, 2025

manupak commented Jan 7, 2025

[LLVMGPUVectorDistribute] Add support for inter-subgroup multi_reduction #19596

Are you sure you want to change the base?

[LLVMGPUVectorDistribute] Add support for inter-subgroup multi_reduction #19596

Conversation

manupak commented Jan 3, 2025

manupak commented Jan 3, 2025

manupak commented Jan 7, 2025