
allow initiating peer to close stream gracefully #5

Open · wants to merge 7 commits into master
Conversation

@simeonschaub (Contributor)

I have a use case where I would like to remove a worker without killing it. Currently, trying to disconnect from the head node will cause the message handler loop to throw a fatal exception, so this adds a check that the connection is still open when trying to read new messages.

The test case is taken from https://github.com/simeonschaub/PersistentWorkers.jl, which I'd eventually like to register once this is merged.
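
For reference, the core of the change is roughly the following (an illustrative sketch, not the actual DistributedNext source; the stream and handler names are placeholders):

using Serialization

# Before blocking on the next message, exit the handler loop cleanly if the
# peer has closed the stream, instead of letting the read throw a fatal error.
while true
    isopen(r_stream) || break      # graceful exit when the initiating peer disconnects
    msg = deserialize(r_stream)    # placeholder for the real message read
    handle_msg(msg)                # placeholder dispatch
end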

simeonschaub and others added 3 commits October 30, 2024 10:38
@codecov-commenter commented Oct 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 25.73%. Comparing base (76df474) to head (719be61).
Report is 38 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master       #5       +/-   ##
===========================================
- Coverage   86.20%   25.73%   -60.48%     
===========================================
  Files          10       10               
  Lines        1965     1916       -49     
===========================================
- Hits         1694      493     -1201     
- Misses        271     1423     +1152     


@JamesWrigley (Collaborator)

I'm surprised at the seemingly random CI failures 🤔 Does this also happen with Distributed?

@simeonschaub (Contributor, Author)

Hmm, I had hard-to-reproduce failures with these tests before, but I thought I had fixed everything. At least when I use persistent workers interactively I don't encounter any of these issues, so it seems to have something to do with the worker process being spawned from another Julia process like this.

Can you think of any way to test this more directly? I'm almost tempted to go ahead with this without any test, as the change itself is really quite simple, but that of course makes it hard to ensure this functionality doesn't regress.

@JamesWrigley (Collaborator)

I'm a bit hesitant to merge it directly given how common the failures are 😅 I'll try running the tests locally and see if I can reproduce them or come up with a simpler test. Could just be a bug in the tests rather than in the actual change.

@JamesWrigley (Collaborator)

I think this is a race condition caused by improper behaviour in DistributedNext.launch(::PersistentWorkerManager). If I read it correctly, all the function does is assign some properties and then return, without actually ensuring that the worker has been started. But Distributed/DistributedNext requires that the worker be running by the time launch() finishes, because immediately afterwards the head node tries to connect to it, and that's what's causing the connection failures.

This happens in CI because the workers take an extra 1-2s to precompile, which is more than the sleep(1) statement allows for. Adding a sleep(10) fixed it, and you can see in the worker logs how long the precompilation took, but I think the real fix is to ensure that launch(::PersistentWorkerManager) only returns after the worker has started. Then we can also delete the sleep(1) call in the tests.
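
A rough sketch of what a blocking launch() could look like (start_worker and the manager fields here are hypothetical, not the actual PersistentWorkers.jl code):

using Sockets

function DistributedNext.launch(manager::PersistentWorkerManager,
                                params::Dict, launched::Array, c::Condition)
    config = DistributedNext.WorkerConfig()
    config.host = manager.host    # assumed fields on the manager
    config.port = manager.port

    start_worker(manager)         # hypothetical helper that spawns the worker process

    # Block until the worker is actually accepting connections, since the
    # head node connects as soon as launch() returns.
    while true
        try
            close(connect(manager.host, manager.port))
            break
        catch
            sleep(0.1)
        end
    end

    push!(launched, config)
    notify(c)
end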

(Feel free to delete/squash my debugging commits BTW, though I would recommend keeping the calls I added to pipeline() and wait(worker).)


@testset "PersistentWorkers.jl" begin
    cookie = randstring(16)
    port = rand(9128:9999) # TODO: make sure port is available?
Review comment (Collaborator):

You could open a temporary server with Sockets.listenany() and get the port from that.
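
Something along these lines (untested sketch):

using Sockets

# listenany binds to the first free port at or above the hint and returns it
port, server = listenany(9128)
close(server)  # release the port before handing it to the worker

There is still a small window where another process could grab the port between close() and the worker binding it, but it is much narrower than picking a port at random.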
