Revise support and worker state callbacks #17
base: master
Conversation
Great stuff! I have some questions about the semantics of these callbacks w.r.t error conditions, but other than that, this all seems solid.
```diff
@@ -1185,7 +1276,7 @@ function deregister_worker(pg, pid)
         # Notify the cluster manager of this workers death
         manage(w.manager, w.id, w.config, :deregister)
         if PGRP.topology !== :all_to_all || isclusterlazy()
-            for rpid in workers()
+            for rpid in filter(!=(myid()), workers())
```
In the line above we do `pg.workers = filter(x -> !(x.id == pid), pg.workers)`, isn't this sufficient to remove our pid (which is supposedly also `pid`) from the results of `workers()`?
No, that only removes `pid` (the ID of the dead worker) from the internal list. `deregister_worker()` is called on the live workers when a worker `pid` goes down, so when we recurse our own ID will still be in `workers()`. This didn't cause issues before because the function can handle being called recursively, but now it would cause the `exited` callbacks to be called twice.
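To make the failure mode concrete, here is a minimal sketch of the loop after the fix. It assumes the loop body forwards the deregistration with `remote_do`, as in the surrounding code (the forwarding call itself is not visible in the hunk above):

```julia
# Sketch only: skip our own ID so that re-entering deregister_worker() locally
# doesn't fire the `exited` callbacks a second time for the same dead worker.
for rpid in filter(!=(myid()), workers())
    remote_do(deregister_worker, rpid, pid)  # assumed forwarding call
end
```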
src/cluster.jl (Outdated)
```julia
new_workers = @lock worker_lock addprocs_locked(manager::ClusterManager; kwargs...)
for worker in new_workers
    for callback in values(worker_added_callbacks)
        callback(worker)
```
If one of these callbacks throws, what should we do? Right now we'll just bail out of `addprocs`, but it might make sense to make it a non-fatal error (printed with `@error`)?
No, I think we should definitely throw the exception so that it's obvious that worker initialization somehow failed; otherwise code that runs later assuming initialization succeeded may cause even more errors. But I did change how we run them in 90f44f6 so that they execute concurrently, to not slow down `addprocs()` too much and so that we can have warnings about slow callbacks.
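A rough sketch of that scheme, not the actual 90f44f6 code; the 10-second threshold is made up and the `worker_added_callbacks` name is taken from the diff above:

```julia
# Sketch: run each worker-added callback in its own task so a slow callback
# doesn't serialise addprocs(), warn if they are slow, and still surface errors.
callback_tasks = Task[]
for worker in new_workers, callback in values(worker_added_callbacks)
    push!(callback_tasks, Threads.@spawn callback(worker))
end

if timedwait(() -> all(istaskdone, callback_tasks), 10.0) === :timed_out
    @warn "Some worker-added callbacks are still running after 10s"
end
foreach(wait, callback_tasks)  # wait() rethrows if any callback threw
```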
```
add_worker_added_callback(f::Base.Callable; key=nothing)

Register a callback to be called on the master process whenever a worker is
added. The callback will be called with the added worker ID,
```
We should document the on-error behavior here.
Good point, fixed in 90f44f6.
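For reference, a hypothetical usage of the `add_worker_added_callback` API being documented above (the `:logger` key and the log message are made up):

```julia
using DistributedNext

# Register a callback under an explicit key so it can be removed later.
add_worker_added_callback(; key=:logger) do pid
    @info "Worker $pid joined the cluster"
end

addprocs(2)                            # the callback fires for each new worker
remove_worker_added_callback(:logger)  # deregister it again
```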
src/cluster.jl (Outdated)
""" | ||
remove_worker_added_callback(key) | ||
|
||
Remove the callback for `key`. |
```diff
-Remove the callback for `key`.
+Remove the callback for `key` that was added via `add_worker_added_callback`.
```
Fixed in 90f44f6.
src/cluster.jl (Outdated)
```
Register a callback to be called on the master process whenever a worker is
added. The callback will be called with the added worker ID,
e.g. `f(w::Int)`. Returns a unique key for the callback.
```
```diff
-e.g. `f(w::Int)`. Returns a unique key for the callback.
+e.g. `f(w::Int)`. Chooses and returns a unique key for the callback
+if `key` is not specified.
```
Fixed in 90f44f6.
src/cluster.jl (Outdated)
```
Register a callback to be called on the master process when a worker has exited
for any reason (i.e. not only because of [`rmprocs()`](@ref) but also the worker
segfaulting etc). The callback will be called with the worker ID,
e.g. `f(w::Int)`. Returns a unique key for the callback.
```
```diff
-e.g. `f(w::Int)`. Returns a unique key for the callback.
+e.g. `f(w::Int)`. Chooses and returns a unique key for the callback
+if `key` is not specified.
```
Fixed in 90f44f6.
src/cluster.jl (Outdated)
""" | ||
remove_worker_exiting_callback(key) | ||
|
||
Remove the callback for `key`. |
```diff
-Remove the callback for `key`.
+Remove the callback for `key` that was added via `add_worker_exiting_callback`.
```
Fixed in 90f44f6.
src/cluster.jl (Outdated)
""" | ||
remove_worker_exited_callback(key) | ||
|
||
Remove the callback for `key`. |
```diff
-Remove the callback for `key`.
+Remove the callback for `key` that was added via `add_worker_exited_callback`.
```
Fixed in 90f44f6.
src/cluster.jl (Outdated)
```julia
end

if timedwait(() -> all(istaskdone.(callback_tasks)), callback_timeout) === :timed_out
    @warn "Some callbacks timed out, continuing to remove workers anyway"
```
@warn "Some callbacks timed out, continuing to remove workers anyway" | |
@warn "Some worker-exiting callbacks have not yet finished, continuing to remove workers anyway" |
Fixed in 90f44f6.
src/cluster.jl (Outdated)
```julia
# Call callbacks on the master
if myid() == 1
    for callback in values(worker_exited_callbacks)
        callback(pid)
```
try/catch -> `@error`?
Yeah that's better, added in 90f44f6.
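Roughly the pattern being suggested, sketched here rather than quoting the committed change; the `worker_exited_callbacks` name is taken from the diff above:

```julia
# Sketch: report a failing worker-exited callback without aborting the rest
# of the deregistration.
if myid() == 1
    for callback in values(worker_exited_callbacks)
        try
            callback(pid)
        catch ex
            @error "Worker-exited callback failed" exception = (ex, catch_backtrace())
        end
    end
end
```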
One additional thought: should we have a way to detect a worker exit reason? If a worker exits due to a segfault or OOM, this might be important to know so that I can do some recovery actions or reporting to the user. It could also be something we add later, as its own set of callbacks, but I figure it's worth thinking about before merging this.
Previously we were not filtering out the current worker when calling `deregister_worker()` on `workers()`.
That is an excellent point 🤔 I think it should go with the current set of callbacks, e.g. what if the signature for the worker-exiting callbacks was
Btw, I think this is really cool stuff @JamesWrigley! @jpsamaroo and I were discussing this in the context of advanced scheduling (i.e. your compute might be growing or shrinking) -- or you're on shared nodes where the sysadmin might kill a worker which is leaking memory. In this context I've been building "nanny workers" that will re-add dead workers, etc. But this has been a pain in the ^%#@. I am happy with this PR as is, but if there are spare cycles, I want to propose additional improvements:
Thanks for taking a look :) About those things:
In general I find it harder to change an API after the fact ;) -- Anyway, the situation you're describing is where one worker is responsible for managing the overall workflow, as is common. However, this is not always the case at scale (e.g. what if we had >10k workers?). In those situations you often lay out your workers on a tree with several "management" nodes (e.g. think fat trees, but for workers and not networks). In this situation you want to build callbacks that notify those manager nodes that the leaves have just changed (or are about to change).
> Slurm can be configured to send a signal
That's neat! I love it -- it can also help with overall workflow tooling. All-in-all I am happy with this PR as is -- my comments are meant to make something that's great even better.
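As an illustration of the "nanny" pattern mentioned earlier, built on the new API, a hypothetical sketch (the `:nanny` key is made up, and real code would unregister the callback before an intentional shutdown):

```julia
using DistributedNext

# Sketch: top the pool back up whenever a worker dies unexpectedly.
add_worker_exited_callback(; key=:nanny) do pid
    @warn "Worker $pid exited, starting a replacement"
    # Re-adding asynchronously avoids blocking whatever is driving the callbacks.
    @async addprocs(1)
end
```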
Alrighty, switched to
On second thoughts I'm unsure about promising logs because I don't know how the

Am I correct in thinking that worker statuses and the worker-exited callbacks are sufficient for the uses you're talking about? The way I think about it is that there are three possible scenarios for exits:
The new `WorkerState_exterminated` state is for indicating that a worker was killed by something other than us.
Some updates:
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master      #17      +/-   ##
==========================================
+ Coverage   88.24%   88.26%   +0.01%
==========================================
  Files          11       12       +1
  Lines        2068     2138      +70
==========================================
+ Hits         1825     1887      +62
- Misses        243      251       +8
```

☔ View full report in Codecov by Sentry.
This should fix an exception seen in CI from the lingering timeout task:
```
Test Summary:                                | Pass  Total  Time
Deserialization error recovery and include() |   11     11  3.9s
      From worker 4:    Unhandled Task ERROR: EOFError: read end of file
      From worker 4:    Stacktrace:
      From worker 4:      [1] wait
      From worker 4:        @ .\asyncevent.jl:159 [inlined]
      From worker 4:      [2] sleep(sec::Float64)
      From worker 4:        @ Base .\asyncevent.jl:265
      From worker 4:      [3] (::DistributedNext.var"#34#37"{DistributedNext.Worker, Float64})()
      From worker 4:        @ DistributedNext D:\a\DistributedNext.jl\DistributedNext.jl\src\cluster.jl:213
```
Alrighty, implemented worker statuses in 64aba00. Now they'll be passed to the worker-exited callbacks. Apologies for how big this PR is becoming 😅 I've tried to keep the commits atomic so you can review them one-by-one.
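For example, something along these lines should now be possible (hypothetical: the two-argument callback signature and the exact state comparison are guesses based on the description above; `WorkerState_exterminated` is the state named in the commit message):

```julia
# Sketch: react differently depending on why the worker went away.
add_worker_exited_callback() do pid, state
    if state == WorkerState_exterminated
        # Killed by something other than us (OOM killer, segfault, sysadmin, ...).
        @error "Worker $pid was killed externally"
    else
        @info "Worker $pid exited" state
    end
end
```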
Hmm, interestingly 0d5aaa3 seems to have almost entirely fixed #6. There are no more timeouts on Linux/OSX and I see only one on Windows. The common problem with these hangs seems to be lingering tasks blocking Julia from exiting. At some point we should probably audit Distributed[Next] to remove all of them.
There are a few changes in here; I would recommend reviewing each commit individually. The big ones are:
Depends on timholy/Revise.jl#871.