Replace a timeout task with timedwait()
According to a stack trace from a hung CI job, the timeout task was causing the
process to hang before exiting:
```julia
InterruptException()
_jl_mutex_unlock at C:/workdir/src\threading.c:1012
jl_mutex_unlock at C:/workdir/src\julia_locks.h:80 [inlined]
ijl_task_get_next at C:/workdir/src\scheduler.c:458
poptask at .\task.jl:1163
wait at .\task.jl:1172
task_done_hook at .\task.jl:839
jfptr_task_done_hook_98752.1 at C:\hostedtoolcache\windows\julia\nightly\x64\lib\julia\sys.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:2233 [inlined]
jl_finish_task at C:/workdir/src\task.c:338
start_task at C:/workdir/src\task.c:1274
      From worker 82:	fatal: error thrown and no exception handler available.Unhandled Task ERROR: InterruptException:
Stacktrace:
 [1] poptask(W::Base.IntrusiveLinkedListSynchronized{Task})
   @ Base .\task.jl:1163
 [2] wait()
   @ Base .\task.jl:1172
 [3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
   @ Base .\condition.jl:141
 [4] wait
   @ .\condition.jl:136 [inlined]
 [5] put_buffered(c::Channel{Any}, v::Int64)
   @ Base .\channels.jl:420
 [6] put!(c::Channel{Any}, v::Int64)
   @ Base .\channels.jl:398
 [7] put!(rv::DistributedNext.RemoteValue, args::Int64)
   @ DistributedNext D:\a\DistributedNext.jl\DistributedNext.jl\src\remotecall.jl:703
 [8] (::DistributedNext.var"#create_worker##11#create_worker##12"{DistributedNext.RemoteValue, Float64})()
   @ DistributedNext D:\a\DistributedNext.jl\DistributedNext.jl\src\cluster.jl:721
```
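
Frames [5] to [7] show the task parked inside `put!` on a `Channel{Any}`, which is what `put_buffered` does while the channel's buffer is full. The sketch below reproduces that state in isolation. It is an illustration only: `ch` and `late` are hypothetical names, and a plain `Channel{Any}(1)` stands in for the channel behind `rr_ntfy_join`.

```julia
# Hypothetical stand-in for the channel behind `rr_ntfy_join` (assumed size 1).
ch = Channel{Any}(1)

# The first notification fills the only buffer slot.
put!(ch, 1)

# A second `put!` from another task cannot complete while the slot is taken.
late = @async put!(ch, 2)

# Give the task time to run; it should still be parked inside `put!`.
sleep(0.2)
blocked = !istaskdone(late)

# Taking a value frees the slot, which lets the parked task finish.
first_value = take!(ch)
wait(late)
second_value = take!(ch)
```

If nothing ever takes from the channel, the second task stays blocked for as long as the process runs.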

Replaced it with a call to `timedwait()`, which has the advantage of being a lot
simpler than an extra task.
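
A minimal sketch of the `timedwait()` pattern, outside of `create_worker`. The helper `wait_for_join` and the channels `joined` and `silent` are hypothetical names used for illustration, and plain `Channel{Any}(1)` values stand in for `rr_ntfy_join`.

```julia
# Poll `isready(ch)` until it returns true or `timeout` seconds have elapsed.
# `timedwait` returns :ok on success and :timed_out otherwise.
function wait_for_join(ch::Channel, timeout::Real)
    if timedwait(() -> isready(ch), timeout) === :timed_out
        error("worker did not connect within $timeout seconds")
    end
    return :joined
end

# Case 1: the notification arrives before the deadline.
joined = Channel{Any}(1)
@async begin
    sleep(0.2)
    put!(joined, :ok)
end
result = wait_for_join(joined, 5.0)

# Case 2: nothing ever arrives, so the wait times out and throws.
silent = Channel{Any}(1)
timed_out = try
    wait_for_join(silent, 0.5)
    false
catch err
    err isa ErrorException
end
```

Because `timedwait()` only polls the callback (every 0.1 seconds by default, adjustable with the `pollint` keyword), no helper task is left behind that could later call `put!` on the channel.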
JamesWrigley committed Dec 6, 2024
1 parent 90aba40 commit 5e98a05
Showing 2 changed files with 7 additions and 10 deletions.
5 changes: 5 additions & 0 deletions docs/src/_changelog.md
@@ -7,6 +7,11 @@ CurrentModule = DistributedNext
This documents notable changes in DistributedNext.jl. The format is based on
[Keep a Changelog](https://keepachangelog.com).

## Unreleased

### Fixed
- Fixed a cause of potential hangs when exiting the process ([#17]).

## [v1.0.0] - 2024-12-02

### Added
12 changes: 2 additions & 10 deletions src/cluster.jl
@@ -712,17 +712,9 @@ function create_worker(manager, wconfig)
send_msg_now(w, MsgHeader(RRID(0,0), ntfy_oid), join_message)

errormonitor(@async manage(w.manager, w.id, w.config, :register))

# wait for rr_ntfy_join with timeout
-timedout = false
-errormonitor(
-    @async begin
-        sleep($timeout)
-        timedout = true
-        put!(rr_ntfy_join, 1)
-    end
-)
-wait(rr_ntfy_join)
-if timedout
+if timedwait(() -> isready(rr_ntfy_join), timeout) === :timed_out
error("worker did not connect within $timeout seconds")
end
lock(client_refs) do
