-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add distributed collective op broadcast/allgather/reduce_scatter/barrier #1202
Conversation
}, | ||
[](at::xpu::XPUStream&, | ||
c10::intrusive_ptr<ProcessGroupXCCL::WorkXCCL>&) { | ||
ccl::group_end(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think groupStart
/groupEnd
wraps ccl::group_start
/ccl::group_end
. Then should you call the wrapped API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xcclActiveGroupCounter_ affect batchP2P choice. lets use origin api like nccl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added comment here
src/xccl/ProcessGroupXCCL.cpp
Outdated
return true; | ||
} | ||
|
||
void check_xpu_single_tensor( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you follow the same naming format like checkSingleTensor
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modified
src/xccl/ProcessGroupXCCL.cpp
Outdated
} | ||
} | ||
} | ||
|
||
int64_t check_xpu_tensors_same_device(const std::vector<at::Tensor>& tensors) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you follow the same naming format like checkTensorOnSameDevice
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modified
@@ -62,6 +109,10 @@ ccl::reduction getXcclReduceOp(const ReduceOp& reduceOp, at::Tensor& input) { | |||
// Map sum to max for bool tensors to avoid overflow issues with sum. | |||
return ccl::reduction::max; | |||
} | |||
// WA due to oneCCL not support AVG | |||
if (reduceOp == ReduceOp::AVG) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The WA does not mean simply replacing avg
with sum
, but using sum
collective and div
SYCL kernel to simulate avg
. Please update your comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also add comments that oneCCL is expected to support avg in basekit 2025.2 release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/xccl/ProcessGroupXCCL.cpp
Outdated
@@ -31,22 +31,69 @@ const std::map<at::ScalarType, ccl::datatype> xcclDatatypes = { | |||
{at::kFloat8_e5m2fnuz, ccl::datatype::uint8}, | |||
}; | |||
|
|||
void checkXPUTensor(at::Tensor& tensor) { | |||
bool check_same_size(const std::vector<at::Tensor>& input_tensors) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please refine the API name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
TORCH_CHECK( | ||
!isFloat8Type(type) && is_reduction_op, | ||
"Float8 dtypes are not currenlty supported for XCCL reductions"); | ||
if (is_reduction_op) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need to change the check format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the logical error for non-reduction operations; the previous implementation blocked all non-reduction operations.
std::make_shared<xcclComm_t>(std::move(comms[0])); | ||
XCCLComm = std::make_shared<xcclComm_t>(std::move(comms[0])); | ||
|
||
RECORD_PARAM_COMMS( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check this logic here to record params.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ccl comm create should also record like https://github.com/pytorch/pytorch/blob/168c2cb3f3211e5fc2110b5f1e982793a04437a4/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L2627, and seq=0 means first init ccl comm in standalone collective.
Motivation:
implement collectives op
broadcast
,allreduce_coalesced
,allgather
,_allgather_base
,allgather_coalesced
,allgather_into_tensor_coalesced
,reduce_scatter
,_reduce_scatter_base
,reduce_scatter_tensor_coalesced
,barrier
.