Enhance torch240 rendezvous to improve fault tolerance ability. #1454

Open
wants to merge 2 commits into base: master


BalaBalaYi
Collaborator

What changes were proposed in this pull request?

Add 'wait' logic in two places:

  1. When rank 0 retrieves all role info.
  2. When the other ranks retrieve their rank info.

Why are the changes needed?

This optimizes the rendezvous logic for torch versions greater than 2.4.0.

If a worker exits while RANK is being assigned during rendezvous, the remaining workers can retrieve empty metadata. With this change, a process that hits this case no longer exits immediately on the resulting deserialization failure. Instead, it waits, which prevents every worker in the rendezvous from hitting the error and exiting at once. If non-empty data still cannot be obtained, the process eventually terminates with an exception raised by the pending timeout.
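A minimal sketch of the wait-on-empty behavior described above. It is not the code in this PR. The names `wait_for_non_empty`, `load_rank_info`, and the `fetch` callable are hypothetical, as are the JSON encoding and the default timeout values; the real rendezvous store and serialization format may differ.

```python
import json
import time
from typing import Any, Callable


def wait_for_non_empty(
    fetch: Callable[[], bytes],
    timeout: float = 5.0,
    interval: float = 0.1,
) -> bytes:
    """Poll fetch() until it returns non-empty data or the timeout elapses.

    `fetch` is a hypothetical stand-in for reading role/rank metadata
    from the rendezvous store.
    """
    deadline = time.monotonic() + timeout
    while True:
        data = fetch()
        if data:
            return data
        if time.monotonic() >= deadline:
            # Give up only after waiting, not on the first empty read.
            raise TimeoutError(
                f"no non-empty metadata received within {timeout}s"
            )
        time.sleep(interval)


def load_rank_info(
    fetch: Callable[[], bytes],
    timeout: float = 5.0,
    interval: float = 0.1,
) -> Any:
    """Deserialize metadata only once it is known to be non-empty."""
    raw = wait_for_non_empty(fetch, timeout, interval)
    # JSON is an assumption made for this sketch.
    return json.loads(raw)
```

The point of the design is that an empty read is treated as "not ready yet" rather than as a deserialization error, so a single worker exit does not cascade into every other worker failing at the same moment.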

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT and training (TODO).

BalaBalaYi added the enhancement (New feature or request) label on Jan 27, 2025
BalaBalaYi added this to the v0.5.0 milestone on Jan 27, 2025
BalaBalaYi self-assigned this on Jan 27, 2025
…rance_when_assign_workers

# Conflicts:
#	dlrover/python/tests/test_elastic_training_agent.py