Add distributed PointToPoint op send/recv/reduce/all2all/gather/scatter #1204

Chao1Han · 2024-12-23T07:23:25Z

Motivation:

Implement the PointToPoint ops send/recv/reduce/all2all/gather/scatter.

zhangxiaoli73 · 2024-12-30T01:44:09Z

src/xccl/ProcessGroupXCCL.cpp

+    work->future_->markCompleted(at::IValue(*work->outputs_));
+    return work;
+  } else {
+    at::xpu::OptionalXPUGuard gpuGuard(device);


Please add some comments explaining when work is constructed in a coalesced collective.

zhangxiaoli73 · 2024-12-30T01:44:45Z

src/xccl/ProcessGroupXCCL.cpp

+  TORCH_CHECK(tensors.size() == 1, MULTI_DEVICE_ERROR_MSG);
+  // @lint-ignore CLANGTIDY
+  auto tensor = tensors.back();
+  check_xpu_single_tensor(tensor, true);


I think the name is already refined in your last PR. Please update.

zhangxiaoli73 · 2024-12-30T01:47:00Z

src/xccl/ProcessGroupXCCL.cpp

+                (outputSplitsEqual ? outLen : outputSplitSizes[i] * outLen);
+          }
+          auto xcclDataType = getXcclDataType(output.scalar_type());
+          ccl::alltoallv(


alltoallv will not exist in the new oneCCL API. Please use send/recv pairs to implement alltoallv.

I will do this as soon as the new API is ready.
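For readers following this thread, here is a minimal sketch of how an alltoallv can be decomposed into pairwise send/recv steps. It is an in-process simulation only: it makes no oneCCL calls and is not the code in this PR. The names `Counts`, `offsets_from_counts`, `simulate_alltoallv`, and the `mailbox` transport stand-in are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical alias: per-peer element counts for one rank.
using Counts = std::vector<std::size_t>;

// Exclusive prefix sum: offsets[p] is where the chunk for peer p begins.
Counts offsets_from_counts(const Counts& counts) {
  Counts offsets(counts.size(), 0);
  for (std::size_t p = 1; p < counts.size(); ++p) {
    offsets[p] = offsets[p - 1] + counts[p - 1];
  }
  return offsets;
}

// Simulates alltoallv across inputs.size() ranks using only pairwise
// "send" and "recv" steps. send_counts[r][p] is the number of elements
// rank r sends to rank p. Returns the output buffer of every rank.
std::vector<std::vector<int>> simulate_alltoallv(
    const std::vector<std::vector<int>>& inputs,
    const std::vector<Counts>& send_counts) {
  const std::size_t world = inputs.size();
  assert(send_counts.size() == world);

  // mailbox[dst][src] stands in for the transport: what src sent to dst.
  std::vector<std::vector<std::vector<int>>> mailbox(
      world, std::vector<std::vector<int>>(world));

  // "Send" phase: each rank slices its input by offset and sends one
  // chunk to every peer (including itself).
  for (std::size_t src = 0; src < world; ++src) {
    assert(send_counts[src].size() == world);
    const Counts off = offsets_from_counts(send_counts[src]);
    assert(off.back() + send_counts[src].back() == inputs[src].size());
    for (std::size_t dst = 0; dst < world; ++dst) {
      const auto first =
          inputs[src].begin() + static_cast<std::ptrdiff_t>(off[dst]);
      const auto last =
          first + static_cast<std::ptrdiff_t>(send_counts[src][dst]);
      mailbox[dst][src].assign(first, last);
    }
  }

  // "Recv" phase: each rank receives one chunk from every peer and stores
  // the chunks in source-rank order, which matches the receive offsets.
  std::vector<std::vector<int>> outputs(world);
  for (std::size_t dst = 0; dst < world; ++dst) {
    for (std::size_t src = 0; src < world; ++src) {
      const auto& chunk = mailbox[dst][src];
      outputs[dst].insert(outputs[dst].end(), chunk.begin(), chunk.end());
    }
  }
  return outputs;
}
```

With two ranks, inputs `{1,2,3}` and `{4,5,6,7}`, and send counts `{1,2}` and `{3,1}`, rank 0 ends up with `{1,4,5,6}` and rank 1 with `{2,3,7}`. In a real backend the two phases would presumably be issued as grouped send/recv calls on the communicator rather than through a shared mailbox.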

zhangxiaoli73 · 2024-12-30T01:47:13Z

src/xccl/ProcessGroupXCCL.cpp

+  auto device = outputTensors[0].device();
+  int64_t total_numel = 0;
+  for (const auto r : c10::irange(outputTensors.size())) {
+    check_xpu_single_tensor(outputTensors[r], true);


Refine the name.

Chao1Han added 30 commits November 20, 2024 02:17

Happy Init

90a52d3

oneccl private for xccl

0d8bb51

update cmake

f01b173

update

405013c

update cmake

7714885

Merge branch 'main' into chao/xccl

58a64a6

update commit and add register

b770640

update

30f6cd2

Merge branch 'main' into chao/xccl

8fff100

imple allreduce and strcture

fb851b1

add non-reduction datatype

b1aee26

add comment

c55b16e

Simply cmake logit

d139548

update

b8e9f30

Merge branch 'main' into chao/xccl

0fe320b

update findxccl logit like mkl

86f09cb

add oneccl path to cmake include

d8c1e97

add deault oneapi path

4b0eba0

rm default find path due to user source oneapi mandatory

72b2687

add simple xccl test

5a40bd4

update find ccl

1989262

Merge branch 'main' into chao/xccl

76d48bd

Add group op

dfb6f3a

add cases

75a58ee

rm ut

a71447e

Merge branch 'chao/xccl' into chao/xccl2

3f0f77b

rm test_case

5904ca5

add p2p op

781f8a8

update

10f5b35

update

2a80dce

zhangxiaoli73 reviewed Dec 30, 2024

Chao1Han added 3 commits December 30, 2024 17:40

Merge branch 'chao/xccl2' into chao/xccl3

cb6ac73

refine api name

342517a

add comments

106adb5

Chao1Han force-pushed the chao/xccl3 branch from 0002a7b to eb7d869 Compare December 31, 2024 02:03

Chao1Han added 3 commits December 31, 2024 17:57

Merge branch 'chao/xccl2' into chao/xccl3

eb7d869

Merge remote-tracking branch 'origin/main' into chao/xccl2

f589549

Merge branch 'chao/xccl2' into chao/xccl3

5cd8b0f

Base automatically changed from chao/xccl2 to main January 8, 2025 05:58

Chao1Han changed the title ~~[wip] add p2p op~~ Add distributed PointToPoint op send/recv/reduce/all2all/gather/scater Jan 8, 2025

Chao1Han changed the title ~~Add distributed PointToPoint op send/recv/reduce/all2all/gather/scater~~ Add distributed PointToPoint op send/recv/reduce/all2all/gather/scatter Jan 8, 2025

Chao1Han added 2 commits January 8, 2025 23:24

Merge remote-tracking branch 'origin/main' into chao/xccl3

32a1d53

Merge branch 'main' into chao/xccl3

bdc8d98
