[LLVMGPUVectorDistribute] Add support for inter-subgroup multi_reduction #19596

Open
wants to merge 1 commit into main from inter-sg-reduce
Conversation


@manupak manupak commented Jan 3, 2025

This commit adds support for distributing multi_reductions where the reduction dimension(s) is/are distributed across subgroups.

We perform the existing reduction distribution; however, that leaves us with partial reductions across subgroups.

Thereafter, we insert a transfer_write / transfer_read round trip through shared memory to achieve a layout change that re-distributes the reduction subgroup tiles into the element tile. Finally, we do another multi_reduction to complete the reduction.
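
For intuition, here is a rough sketch of the kind of IR this lowering aims at, for a 1-D sum whose reduction dimension is split across four subgroups. The op names follow upstream MLIR, but the value names (%subgroup_tile, %shared, %subgroup_id, %acc0, %acc1, %pad) and the exact types, layouts, masking, and indexing are illustrative only and differ from what the pass actually emits:

  // Hypothetical sketch (not the exact pass output): each subgroup first
  // reduces its own slice, the per-subgroup partials are exchanged through
  // workgroup shared memory, and a second multi_reduction combines them.
  %partial = vector.multi_reduction <add>, %subgroup_tile, %acc0 [1]
      : vector<1x16xf32> to vector<1xf32>
  // One partial result per subgroup is written to shared memory.
  vector.transfer_write %partial, %shared[%subgroup_id]
      : vector<1xf32>, memref<4xf32, #gpu.address_space<workgroup>>
  gpu.barrier
  // Re-read all partials so the former subgroup tile becomes an element tile.
  %partials = vector.transfer_read %shared[%c0], %pad
      : memref<4xf32, #gpu.address_space<workgroup>>, vector<4xf32>
  // A second reduction completes the inter-subgroup part.
  %result = vector.multi_reduction <add>, %partials, %acc1 [0]
      : vector<4xf32> to f32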

closes: #19578

@manupak manupak marked this pull request as draft January 3, 2025 17:38

manupak commented Jan 3, 2025

I'm putting this in draft as I have not yet numerically verified these changes.

This PR basically implements approach 3 of https://hackmd.io/bXwT715pQU-8rrS78LKCbw

@manupak manupak requested a review from Groverkss January 3, 2025 17:40
@manupak manupak force-pushed the inter-sg-reduce branch 5 times, most recently from 0b704e1 to 993249f Compare January 7, 2025 11:20
This commit adds support for distributing multi_reductions
where the reduction dimension(s) is/are distributed across
subgroups.

We perform the existing reduction distribution; however,
we are left with partial reductions across subgroups.

Thereafter, we insert a transfer_write / transfer_read
round trip through shared memory to achieve a layout
change that re-distributes the reduction subgroup tiles
into the element tile. Finally, we do another
multi_reduction to complete the reduction.

Signed-off-by: Manupa Karunaratne <[email protected]>

manupak commented Jan 7, 2025

I have now verified this for numerical accuracy with the following two test cases: a full 1-D sum over tensor<4096xf32> whose reduction dimension is split across four subgroups, and a row-wise sum over tensor<4x4096xf32> with the same inter-subgroup split on the reduction dimension:

#tuning = #iree_codegen.compilation_info<
  lowering_config = #iree_gpu.lowering_config<{workgroup = [0],
                                               reduction = [1024],
                                               thread_basis = [[64], [0]],
                                               subgroup_basis = [[4], [0]]
                                               }>,
  translation_info = #iree_codegen.translation_info< pipeline = LLVMGPUVectorDistribute
                                               workgroup_size = [256, 1, 1]
                                               subgroup_size = 64, {}>
>

  hal.executable private @reduce_fp16_dispatch_0 {
    hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {abi = "hip", iree.gpu.target = #iree_gpu.target<arch = "gfx942", features = "", wgp = <compute =  fp64|fp32|fp16|int64|int32|int16|int8, storage =  b64|b32|b16|b8, subgroup =  shuffle|arithmetic, dot =  dp4xi8toi32, mma = [<MFMA_F32_16x16x4_F32>, <MFMA_F32_16x16x16_F16>, <MFMA_F32_32x32x8_F16>, <MFMA_F64_16x16x4_F64>, <MFMA_F32_16x16x16_BF16>, <MFMA_F32_32x32x8_BF16>, <MFMA_F32_16x16x32_F8E5M2FNUZ>, <MFMA_F32_16x16x32_F8E5M2FNUZ_F8E4M3FNUZ>, <MFMA_F32_16x16x32_F8E4M3FNUZ>, <MFMA_F32_16x16x32_F8E4M3FNUZ_F8E5M2FNUZ>, <MFMA_F32_32x32x16_F8E5M2FNUZ>, <MFMA_F32_32x32x16_F8E5M2FNUZ_F8E4M3FNUZ>, <MFMA_F32_32x32x16_F8E4M3FNUZ>, <MFMA_F32_32x32x16_F8E4M3FNUZ_F8E5M2FNUZ>, <MFMA_I32_16x16x32_I8>, <MFMA_I32_32x32x16_I8>], subgroup_size_choices = [64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536, max_workgroup_counts = [2147483647, 2147483647, 2147483647], max_load_instruction_bits = 128, simds_per_wgp = 4, vgpr_space_bits = 16384>>, ukernels = "none"}>) {
      hal.executable.export public @reduce_fp16_dispatch_0_generic_4096_f32 ordinal(0) layout(#hal.pipeline.layout<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) {
      ^bb0(%arg0: !hal.device):
        %x, %y, %z = flow.dispatch.workgroup_count_from_slice 
        hal.return %x, %y, %z : index, index, index
      }
      builtin.module {
        func.func @reduce_fp16_dispatch_0_generic_4096_f32() {
          %cst = arith.constant 0.000000e+00 : f32
          %c0 = arith.constant 0 : index
          %0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<4096xf32>>
          %1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<f32>>
          %2 = flow.dispatch.tensor.load %0, offsets = [0], sizes = [4096], strides = [1] : !flow.dispatch.tensor<readonly:tensor<4096xf32>> -> tensor<4096xf32>
          %3 = tensor.empty() : tensor<f32>
          %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<f32>) -> tensor<f32>
          %5 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>], iterator_types = ["reduction"], compilation_info = #tuning} ins(%2 : tensor<4096xf32>) outs(%4 : tensor<f32>) {
          ^bb0(%in: f32, %out: f32):
            %6 = arith.addf %out, %in : f32
            linalg.yield %6 : f32
          } -> tensor<f32>
          flow.dispatch.tensor.store %5, %1, offsets = [], sizes = [], strides = [] : tensor<f32> -> !flow.dispatch.tensor<writeonly:tensor<f32>>
          return
        }
      }
    }
  }
#tuning = #iree_codegen.compilation_info<
  lowering_config = #iree_gpu.lowering_config<{workgroup = [4, 0],
                                               reduction = [0, 256],
                                               thread_basis = [[4, 16], [0, 1]],
                                               subgroup_basis = [[1, 4], [0, 1]]
                                               }>,
  translation_info = #iree_codegen.translation_info< pipeline = LLVMGPUVectorDistribute
                                               workgroup_size = [256, 1, 1]
                                               subgroup_size = 64, {}>
>

  hal.executable private @reduce_fp16_dispatch_0 {
    hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {abi = "hip", iree.gpu.target = #iree_gpu.target<arch = "gfx942", features = "", wgp = <compute =  fp64|fp32|fp16|int64|int32|int16|int8, storage =  b64|b32|b16|b8, subgroup =  shuffle|arithmetic, dot =  dp4xi8toi32, mma = [<MFMA_F32_16x16x4_F32>, <MFMA_F32_16x16x16_F16>, <MFMA_F32_32x32x8_F16>, <MFMA_F64_16x16x4_F64>, <MFMA_F32_16x16x16_BF16>, <MFMA_F32_32x32x8_BF16>, <MFMA_F32_16x16x32_F8E5M2FNUZ>, <MFMA_F32_16x16x32_F8E5M2FNUZ_F8E4M3FNUZ>, <MFMA_F32_16x16x32_F8E4M3FNUZ>, <MFMA_F32_16x16x32_F8E4M3FNUZ_F8E5M2FNUZ>, <MFMA_F32_32x32x16_F8E5M2FNUZ>, <MFMA_F32_32x32x16_F8E5M2FNUZ_F8E4M3FNUZ>, <MFMA_F32_32x32x16_F8E4M3FNUZ>, <MFMA_F32_32x32x16_F8E4M3FNUZ_F8E5M2FNUZ>, <MFMA_I32_16x16x32_I8>, <MFMA_I32_32x32x16_I8>], subgroup_size_choices = [64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536, max_workgroup_counts = [2147483647, 2147483647, 2147483647], max_load_instruction_bits = 128, simds_per_wgp = 4, vgpr_space_bits = 16384>>, ukernels = "none"}>) {
      hal.executable.export public @reduce_fp16_dispatch_0_generic_4x4096_f32 ordinal(0) layout(#hal.pipeline.layout<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) {
      ^bb0(%arg0: !hal.device):
        %x, %y, %z = flow.dispatch.workgroup_count_from_slice 
        hal.return %x, %y, %z : index, index, index
      }
      builtin.module {
        func.func @reduce_fp16_dispatch_0_generic_4x4096_f32() {
          %cst = arith.constant 0.000000e+00 : f32
          %c0 = arith.constant 0 : index
          %0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<4x4096xf32>>
          %1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<4xf32>>
          %2 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [4, 4096], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<4x4096xf32>> -> tensor<4x4096xf32>
          %3 = tensor.empty() : tensor<4xf32>
          %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<4xf32>) -> tensor<4xf32>
          %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"], compilation_info = #tuning} ins(%2 : tensor<4x4096xf32>) outs(%4 : tensor<4xf32>) {
          ^bb0(%in: f32, %out: f32):
            %6 = arith.addf %out, %in : f32
            linalg.yield %6 : f32
          } -> tensor<4xf32>
          flow.dispatch.tensor.store %5, %1, offsets = [0], sizes = [4], strides = [1] : tensor<4xf32> -> !flow.dispatch.tensor<writeonly:tensor<4xf32>>
          return
        }
      }
    }
  }

@manupak manupak marked this pull request as ready for review January 7, 2025 11:23
Development

Successfully merging this pull request may close these issues.

[LLVMGPUVectorDistribute] Add inter-subgroup reduction support for MultiReduction