
Add distributed collective op broadcast/allgather/reduce_scatter/barrier #1202

Merged (31 commits) Jan 8, 2025

Conversation

@Chao1Han (Contributor) commented Dec 23, 2024

Motivation:

Implement the collective ops broadcast, allreduce_coalesced, allgather, _allgather_base, allgather_coalesced, allgather_into_tensor_coalesced, reduce_scatter, _reduce_scatter_base, reduce_scatter_tensor_coalesced, and barrier.

},
[](at::xpu::XPUStream&,
c10::intrusive_ptr<ProcessGroupXCCL::WorkXCCL>&) {
ccl::group_end();


I think groupStart/groupEnd wraps ccl::group_start/ccl::group_end. Shouldn't you call the wrapped API here?

Chao1Han (author):

xcclActiveGroupCounter_ affects the batchP2P choice, so let's use the original API here, like NCCL does.

Chao1Han (author):

Added a comment here.
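The thread above is about whether to call the wrapped groupStart()/groupEnd() helpers, which maintain xcclActiveGroupCounter_ (consulted by the batchP2P decision), or the raw ccl::group_start()/ccl::group_end(). A minimal sketch of what such wrappers typically look like, with the oneCCL calls stubbed out for illustration (stub names and insideGroup() are hypothetical, not the PR's exact code):

```cpp
#include <cassert>

// Stubs standing in for ccl::group_start()/ccl::group_end(); the real
// oneCCL calls begin/end a fused group of collective operations.
static int stub_open_groups = 0;
void ccl_group_start_stub() { ++stub_open_groups; }
void ccl_group_end_stub() { --stub_open_groups; }

// Wrappers in the style of ProcessGroupNCCL's groupStart/groupEnd: they
// bump a counter so later code (e.g. the batchP2P choice) can tell
// whether it is running inside an active group.
static int xcclActiveGroupCounter_ = 0;

void groupStart() {
  ccl_group_start_stub();
  ++xcclActiveGroupCounter_;
}

void groupEnd() {
  ccl_group_end_stub();
  --xcclActiveGroupCounter_;
}

// Calling the raw API directly skips the counter update, which is the
// concern raised in the review.
bool insideGroup() { return xcclActiveGroupCounter_ > 0; }
```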

return true;
}

void check_xpu_single_tensor(


Should you follow the same naming format, e.g. checkSingleTensor?

Chao1Han (author):

Modified.

}
}
}

int64_t check_xpu_tensors_same_device(const std::vector<at::Tensor>& tensors) {


Should you follow the same naming format, e.g. checkTensorOnSameDevice?

Chao1Han (author):

Modified.
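For context, a helper like check_xpu_tensors_same_device (or checkTensorOnSameDevice, the naming the reviewer suggests) verifies that every tensor in a list lives on the same device and returns that device's index. A self-contained sketch, with a toy struct standing in for at::Tensor:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Toy stand-in for at::Tensor, carrying only a device index.
struct FakeTensor {
  int64_t device_index;
};

// Sketch of the check: all tensors must share one device; returns its index.
// In the real code this would inspect tensor.device() and raise via TORCH_CHECK.
int64_t checkTensorsOnSameDevice(const std::vector<FakeTensor>& tensors) {
  if (tensors.empty())
    throw std::invalid_argument("tensor list must not be empty");
  int64_t device = tensors[0].device_index;
  for (const auto& t : tensors) {
    if (t.device_index != device)
      throw std::invalid_argument("tensors must all be on the same device");
  }
  return device;
}
```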

@@ -62,6 +109,10 @@ ccl::reduction getXcclReduceOp(const ReduceOp& reduceOp, at::Tensor& input) {
// Map sum to max for bool tensors to avoid overflow issues with sum.
return ccl::reduction::max;
}
// WA because oneCCL does not support AVG
if (reduceOp == ReduceOp::AVG) {


The WA does not mean simply replacing avg with sum; it uses a sum collective plus a division SYCL kernel to simulate avg. Please update your comment.

Chao1Han (author):

Modified.


Please also add comments that oneCCL is expected to support avg in basekit 2025.2 release.

Chao1Han (author):

Done.
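The AVG workaround discussed above runs a SUM collective and then divides by the world size in a separate kernel. A self-contained sketch, with plain C++ loops standing in for the ccl::allreduce call and the SYCL division kernel:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulated allreduce(SUM) across ranks: every rank ends up with the
// element-wise sum of all ranks' buffers.
void allreduce_sum(std::vector<std::vector<float>>& ranks) {
  std::size_t n = ranks[0].size();
  std::vector<float> sum(n, 0.f);
  for (const auto& r : ranks)
    for (std::size_t i = 0; i < n; ++i) sum[i] += r[i];
  for (auto& r : ranks) r = sum;
}

// AVG workaround: SUM collective followed by a division pass. In the
// real code the division runs as a SYCL kernel on the device; here it is
// an ordinary loop for illustration.
void allreduce_avg(std::vector<std::vector<float>>& ranks) {
  allreduce_sum(ranks);
  float world_size = static_cast<float>(ranks.size());
  for (auto& r : ranks)
    for (auto& x : r) x /= world_size;
}
```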

@@ -31,22 +31,69 @@ const std::map<at::ScalarType, ccl::datatype> xcclDatatypes = {
{at::kFloat8_e5m2fnuz, ccl::datatype::uint8},
};

void checkXPUTensor(at::Tensor& tensor) {
bool check_same_size(const std::vector<at::Tensor>& input_tensors) {


Please refine the API name.

Chao1Han (author):

Done.

Base automatically changed from chao/xccl to main January 7, 2025 01:47
TORCH_CHECK(
!isFloat8Type(type) && is_reduction_op,
"Float8 dtypes are not currently supported for XCCL reductions");
if (is_reduction_op)


Why do you need to change the check format?

Chao1Han (author):

Fix the logical error for non-reduction operations; the previous implementation blocked all non-reduction operations.
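The fix described above: Float8 dtypes should be rejected only for reduction ops, while non-reduction ops must pass regardless of dtype. A minimal sketch of the intended predicate (the enum and helper names are illustrative stand-ins, not the PR's exact code):

```cpp
#include <cassert>

// Illustrative stand-in for at::ScalarType.
enum class ScalarType { Float, Float8_e5m2, Float8_e4m3fn };

bool isFloat8Type(ScalarType t) {
  return t == ScalarType::Float8_e5m2 || t == ScalarType::Float8_e4m3fn;
}

// True when the (dtype, op) combination is allowed: Float8 is rejected
// only when the op is a reduction. The earlier check failed for every
// non-reduction op, which is the logical error the author fixed.
bool dtypeAllowed(ScalarType type, bool is_reduction_op) {
  return !(isFloat8Type(type) && is_reduction_op);
}
```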

std::make_shared<xcclComm_t>(std::move(comms[0]));
XCCLComm = std::make_shared<xcclComm_t>(std::move(comms[0]));

RECORD_PARAM_COMMS(


Please check this logic here to record params.

Chao1Han (author):

ccl comm creation should also be recorded, like https://github.com/pytorch/pytorch/blob/168c2cb3f3211e5fc2110b5f1e982793a04437a4/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L2627; seq=0 means the first init of the ccl comm in a standalone collective.

@Chao1Han Chao1Han changed the title [wip] group op Add distributed collective op broadcast/allgather/reduce_scatter/barrier Jan 7, 2025
@zhangxiaoli73 zhangxiaoli73 added this pull request to the merge queue Jan 8, 2025
Merged via the queue into main with commit af8622f Jan 8, 2025
3 of 4 checks passed
@zhangxiaoli73 zhangxiaoli73 deleted the chao/xccl2 branch January 8, 2025 05:58