New precheck procedure to enhance stability. #1453

BalaBalaYi
Collaborator

What changes were proposed in this pull request?

  1. Pre-check operator API definition.
  2. Design doc.
  3. Base implementation of the pre-check procedure, covering both master and worker.
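
For readers unfamiliar with the pattern, the sketch below shows one way an operator-style pre-check could be structured: each operator runs one check and the master stops at the first failure. It is only an illustration; `CheckResult`, `PreCheckOperator`, `MinNodesOperator`, and `run_pre_checks` are hypothetical names and are not taken from this PR's code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    # Hypothetical result type: code 0 means the check passed.
    code: int = 0
    message: str = ""

    def passed(self) -> bool:
        return self.code == 0


class PreCheckOperator(ABC):
    """Hypothetical base class: one operator performs one check."""

    @abstractmethod
    def check(self) -> CheckResult:
        """Run the check and report the outcome."""


class AlwaysPassOperator(PreCheckOperator):
    # Trivial operator, useful as a placeholder or in tests.
    def check(self) -> CheckResult:
        return CheckResult()


class MinNodesOperator(PreCheckOperator):
    # Illustrative check: fail when fewer nodes are ready than required.
    def __init__(self, ready_nodes: int, required_nodes: int) -> None:
        self.ready_nodes = ready_nodes
        self.required_nodes = required_nodes

    def check(self) -> CheckResult:
        if self.ready_nodes >= self.required_nodes:
            return CheckResult()
        return CheckResult(
            code=1,
            message=f"only {self.ready_nodes} of {self.required_nodes} nodes are ready",
        )


def run_pre_checks(operators: List[PreCheckOperator]) -> CheckResult:
    # Run operators in order and return the first failure, if any.
    for operator in operators:
        result = operator.check()
        if not result.passed():
            return result
    return CheckResult()
```

With this shape, adding a new check means adding one operator class, and the caller only needs to inspect a single result before letting training start.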

Why are the changes needed?

For details, please see the design document in the current PR.

Does this PR introduce any user-facing change?

Users can enable or disable the pre-check function through job arguments. For details, please see the development document in the current PR.
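
As a rough illustration of toggling such a feature from job arguments, the sketch below uses Python's standard `argparse`. The flag name `--pre_check`, its default of enabled, and the helper names are assumptions made for this example; the real argument is described in the PR's documentation.

```python
import argparse
from typing import Callable


def str2bool(value: str) -> bool:
    # Accept common spellings of true/false on the command line.
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical job arguments")
    # Hypothetical flag; enabled by default in this sketch only.
    parser.add_argument(
        "--pre_check",
        type=str2bool,
        default=True,
        help="Whether to run the pre-check before training starts.",
    )
    return parser


def maybe_run_pre_check(args: argparse.Namespace, run: Callable[[], str]) -> str:
    # Skip the pre-check entirely when the user disabled it.
    if not args.pre_check:
        return "skipped"
    return run()
```

Passing `--pre_check false` in this sketch bypasses the check, while omitting the flag leaves it on.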

How was this patch tested?

Unit tests and a simple training job.

@BalaBalaYi BalaBalaYi added this to the v0.5.0 milestone Jan 26, 2025
@BalaBalaYi BalaBalaYi self-assigned this Jan 26, 2025
codecov bot commented Jan 26, 2025

Codecov Report

Attention: Patch coverage is 91.32420% with 19 lines in your changes missing coverage. Please review.

Project coverage is 82.21%. Comparing base (09a46f9) to head (cf3fde9).
Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...rover/python/master/diagnosis/diagnosis_manager.py | 89.18% | 4 Missing ⚠️ |
| dlrover/python/elastic_agent/master_client.py | 25.00% | 3 Missing ⚠️ |
| ...rover/python/master/diagnosis/precheck_operator.py | 91.17% | 3 Missing ⚠️ |
| dlrover/python/master/servicer.py | 60.00% | 2 Missing ⚠️ |
| dlrover/trainer/torch/elastic_run.py | 83.33% | 2 Missing ⚠️ |
| dlrover/python/master/args.py | 88.88% | 1 Missing ⚠️ |
| dlrover/python/master/diagnosis/diagnosis.py | 50.00% | 1 Missing ⚠️ |
| dlrover/python/master/main.py | 0.00% | 1 Missing ⚠️ |
| dlrover/python/master/node/job_context.py | 90.00% | 1 Missing ⚠️ |
| dlrover/python/tests/test_pre_check_operator.py | 94.11% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1453      +/-   ##
==========================================
+ Coverage   82.14%   82.21%   +0.06%     
==========================================
  Files         253      255       +2     
  Lines       25288    25498     +210     
==========================================
+ Hits        20774    20964     +190     
- Misses       4514     4534      +20     

BalaBalaYi and others added 12 commits January 26, 2025 18:54
# Conflicts:
#	dlrover/python/common/global_context.py
#	dlrover/python/master/diagnosis/diagnosis_manager.py
#	dlrover/python/tests/test_args.py
#	dlrover/python/tests/test_diagnosis_manager.py
#	docs/deployment/argument.md
# Conflicts:
#	dlrover/python/common/constants.py
Review threads (outdated, resolved):
docs/design/training-pre-check.md (two threads)
dlrover/python/common/global_context.py
Collaborator

@samplise samplise left a comment

lgtm

@BalaBalaYi BalaBalaYi merged commit e33cfb4 into intelligent-machine-learning:master Feb 8, 2025
10 checks passed
2 participants