misc+doc: patch file for pytorch's nccl support #11

Merged
myungjin merged 1 commit into cisco-open:main from torch_nccl on Jun 7, 2024

Conversation

myungjin
Contributor

@myungjin myungjin commented Jun 6, 2024

Description

This patch file updates the ncclResult_t type. NCCL added the ncclRemoteError code in v2.13.4, but PyTorch's code was not updated to handle it. As a result, an "Unconvertible NCCL type" error is raised when a remote worker terminates or crashes.
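To illustrate the failure mode, here is a minimal, self-contained sketch. It is not PyTorch's actual code: `NcclCodeSketch`, `WrappedResult`, `convert_unpatched`, and `convert_patched` are hypothetical names, and the numeric values are assumed to follow nccl.h. It only shows how a conversion switch that predates a new result code falls through to an "Unconvertible NCCL type" error, and how adding the missing case avoids it.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for NCCL's ncclResult_t. The numeric values are assumed to follow
// nccl.h for NCCL >= 2.13.4; they are an assumption of this sketch.
enum NcclCodeSketch {
  kSuccess = 0,
  kUnhandledCudaError = 1,
  kSystemError = 2,
  kInternalError = 3,
  kInvalidArgument = 4,
  kInvalidUsage = 5,
  kRemoteError = 6  // the code the wrapper did not know about
};

// Hypothetical wrapper-side result type (not PyTorch's real enum).
enum class WrappedResult {
  Success,
  UnhandledCudaError,
  SystemError,
  InternalError,
  InvalidArgument,
  InvalidUsage,
  RemoteError
};

// Conversion written before kRemoteError existed: unknown codes throw.
WrappedResult convert_unpatched(NcclCodeSketch code) {
  switch (code) {
    case kSuccess:            return WrappedResult::Success;
    case kUnhandledCudaError: return WrappedResult::UnhandledCudaError;
    case kSystemError:        return WrappedResult::SystemError;
    case kInternalError:      return WrappedResult::InternalError;
    case kInvalidArgument:    return WrappedResult::InvalidArgument;
    case kInvalidUsage:       return WrappedResult::InvalidUsage;
    default:
      throw std::runtime_error("Unconvertible NCCL type");
  }
}

// Conversion with the missing case added; everything else is unchanged.
WrappedResult convert_patched(NcclCodeSketch code) {
  if (code == kRemoteError) {
    return WrappedResult::RemoteError;
  }
  return convert_unpatched(code);
}

// Self-check: the old conversion rejects kRemoteError, the new one maps it.
void check_conversion_sketch() {
  bool threw = false;
  try {
    convert_unpatched(kRemoteError);
  } catch (const std::runtime_error& e) {
    threw = std::string(e.what()) == "Unconvertible NCCL type";
  }
  assert(threw);
  assert(convert_patched(kRemoteError) == WrappedResult::RemoteError);
}
```

Calling `check_conversion_sketch()` exercises both paths. In the sketch, the only difference between the two conversions is the extra case for the remote-error code.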

Also, 'AutoNcclGroup nccl_group_guard' is removed so that 'TORCH_NCCL_ASYNC_ERROR_HANDLING = 2 (CleanUpOnly)' works. With nccl_group_guard in place, a new exception is thrown while another exception is propagating, which terminates the process in C++. Removing the guard makes CleanUpOnly possible.
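The C++ rule behind this can be sketched as follows. This is an illustration only, not the patch itself (the patch simply removes the guard) and not PyTorch's `AutoNcclGroup`. `GuardSketch` and the two helper functions are hypothetical, and the sketch assumes the second exception would come from a scope guard's destructor. It uses C++17's `std::uncaught_exceptions()` to detect that an exception is already in flight, which is the point at which a second throw would call `std::terminate`.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical RAII guard whose cleanup step can fail.
class GuardSketch {
 public:
  GuardSketch(bool end_fails, bool* suppressed)
      : end_fails_(end_fails),
        suppressed_(suppressed),
        entered_with_(std::uncaught_exceptions()) {}

  // noexcept(false) lets the destructor throw on a normal scope exit.
  ~GuardSketch() noexcept(false) {
    if (!end_fails_) {
      return;
    }
    if (std::uncaught_exceptions() > entered_with_) {
      // Another exception is unwinding through this scope. Throwing here
      // would call std::terminate, so the sketch only records the failure.
      *suppressed_ = true;
      return;
    }
    throw std::runtime_error("group end failed");
  }

 private:
  bool end_fails_;
  bool* suppressed_;
  int entered_with_;
};

// A first exception propagates while the guard is alive. Returns true if the
// guard's destructor ran during unwinding and had to hold back its own throw.
bool first_exception_unwinds_through_guard() {
  bool suppressed = false;
  try {
    GuardSketch guard(true, &suppressed);
    throw std::runtime_error("remote worker crashed");
  } catch (const std::runtime_error&) {
    // Reached only because the guard did not throw a second exception.
  }
  return suppressed;
}

// No exception in flight: the guard may throw and the caller can catch it.
bool guard_throws_on_normal_exit() {
  bool suppressed = false;
  try {
    GuardSketch guard(true, &suppressed);
  } catch (const std::runtime_error&) {
    return !suppressed;
  }
  return false;
}

// Self-check of both situations.
void check_guard_sketch() {
  assert(first_exception_unwinds_through_guard());
  assert(guard_throws_on_normal_exit());
}
```

If the destructor threw unconditionally instead, the first function would never reach its `catch` block, because the runtime would terminate the process during unwinding. That leaves no chance for a clean-up-only error handler to run.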

The details on this patch are also documented.

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

@myungjin myungjin force-pushed the torch_nccl branch 6 times, most recently from eba3baa to 45eb90e Compare June 7, 2024 16:45
@myungjin myungjin merged commit c7c440e into cisco-open:main Jun 7, 2024
1 check passed
@myungjin myungjin deleted the torch_nccl branch June 7, 2024 19:04