simpler heartbeat based on health #599

Merged
wojciechsromek merged 6 commits into main from ps/feat/simpler-heartbeat on Oct 29, 2024

philsippl (Contributor) commented Oct 28, 2024

The NCCL heartbeat has always been a source of trouble, and in any case it never tested the health of the NCCL connections that the main thread actually uses.

The heartbeat now just uses the /health endpoint, which exposes a unique ID to the other nodes. If the other nodes notice that the unique ID has changed, they restart.

Note: This requires adding a new config entry that lists the node hostnames (without ports):
SMPC__NODE_HOSTNAMES='["...", "...", "..."]'
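For readers who want the mechanism spelled out, here is a minimal Python sketch of the idea described above. It is not the code from this PR: `INSTANCE_ID`, `parse_node_hostnames`, `PeerMonitor`, and the injected `fetch_id` callable are hypothetical names, and the assumption is that each node serves an ID generated once at startup from /health. How an unreachable peer is treated is left open here; in this sketch any exception from `fetch_id` simply propagates.

```python
import json
import uuid
from typing import Callable, Dict, List, Optional

# Hypothetical: generated once per process start, so a restarted node
# reports a different value from its /health endpoint.
INSTANCE_ID = str(uuid.uuid4())


def parse_node_hostnames(raw: str) -> List[str]:
    """Parse the JSON list stored in SMPC__NODE_HOSTNAMES."""
    hosts = json.loads(raw)
    if not isinstance(hosts, list) or not all(isinstance(h, str) for h in hosts):
        raise ValueError("expected a JSON list of hostname strings")
    for host in hosts:
        # Naive illustration of "without port"; does not handle IPv6 literals.
        if ":" in host:
            raise ValueError(f"hostname must not include a port: {host}")
    return hosts


class PeerMonitor:
    """Remembers the last ID seen per peer and reports when one changes."""

    def __init__(self, hostnames: List[str], fetch_id: Callable[[str], str]):
        # fetch_id(host) is assumed to query the peer's /health endpoint
        # and return the unique ID found in the response.
        self._fetch_id = fetch_id
        self._last_seen: Dict[str, Optional[str]] = {h: None for h in hostnames}

    def poll_once(self) -> bool:
        """Return True if any peer's ID differs from the previous poll."""
        changed = False
        for host, previous in self._last_seen.items():
            current = self._fetch_id(host)
            if previous is not None and current != previous:
                changed = True  # the peer restarted; this node should too
            self._last_seen[host] = current
        return changed
```

A caller would run `poll_once()` on a timer and trigger its own restart when it returns `True`. Injecting `fetch_id` keeps the comparison logic testable without real HTTP traffic.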

@github-actions github-actions bot added the enhancement New feature or request label Oct 28, 2024
wojciechsromek (Collaborator) commented Oct 29, 2024

@philsippl this will be needed beforehand: #601

It would also make sense to include checks for DB connectivity and for queue processing working correctly, because otherwise this is basically a liveness check rather than a health check. What do you think?
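To illustrate the distinction, a rough Python sketch of a health report that aggregates dependency checks could look like the following. The function name `health_report` and the check names are hypothetical and not part of this PR; the individual checks are passed in as plain callables, so nothing here assumes a particular DB or queue client.

```python
from typing import Callable, Dict


def health_report(checks: Dict[str, Callable[[], bool]]) -> Dict[str, object]:
    """Run each named check; the node is healthy only if all of them pass."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as failed rather than crashing
            # the health endpoint itself.
            results[name] = False
    return {"healthy": all(results.values()), "checks": results}
```

A pure liveness check corresponds to calling this with no checks at all, which is always reported as healthy as long as the process can answer.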

@wojciechsromek wojciechsromek merged commit e8668ab into main Oct 29, 2024
11 checks passed
@wojciechsromek wojciechsromek deleted the ps/feat/simpler-heartbeat branch October 29, 2024 15:24