[Serve] Shared `LongPollClient` for `Router`s #48807
Signed-off-by: Josh Karpel <[email protected]>
@JoshKarpel I can't reproduce the test failure locally, could you try merging master and seeing if the failure is still there?
# Conflicts: # python/ray/serve/_private/router.py
@@ -79,7 +80,6 @@ def __init__(
         key_listeners: Dict[KeyType, UpdateStateCallable],
         call_in_event_loop: AbstractEventLoop,
     ) -> None:
-        assert len(key_listeners) > 0
This case is now handled by https://github.com/ray-project/ray/pull/48807/files#diff-f138b21f7ddcd7d61c0b2704c8b828b9bbe7eb5021531e2c7fabeb20ec322e1aR280-R288 (and is necessary - when the shared client boots up for the first time it will send an RPC with no keys in it)
@@ -324,10 +324,16 @@ def make_nonblocking_calls(expected, expect_blocking=False):
     make_nonblocking_calls({"2": 2})


-def test_reconfigure_with_queries(serve_instance):
+def test_reconfigure_does_not_run_while_there_are_active_queries(serve_instance):
Tried to de-flake this test 🤞🏻
@JoshKarpel with each new router waiting for its dedicated long poll client to make the first RPC call, do we still ever need to rely on the long poll timeout for anything?
Oo, interesting question... If the shared long poll never times out, it won't pick up new key listeners until some key it's already listening to has changed. So we'd converge to the same final state over the long term, but only if those existing keys get an update at some point, which isn't guaranteed (for example, in our setup, some Serve apps receive very little traffic and never need to autoscale - it could be hours or days between replica updates for them). I'd prefer to keep the timeout to ensure that there's a time-bound on how long it takes to get to the desired state.
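To make the time bound concrete, here is a minimal sketch of a simplified model, not the Serve implementation. It assumes a newly registered key listener is only picked up when the shared client's in-flight request returns, which happens on the earlier of a change to an already-watched key or the long poll timeout. `pickup_delay` and its parameters are hypothetical names; the 30 second value is the timeout figure quoted in this review.

```python
import math
from typing import Optional


def pickup_delay(
    time_until_next_change_s: Optional[float],
    timeout_s: Optional[float],
) -> float:
    """Worst-case wait before a newly registered key joins a request.

    Hypothetical model: the in-flight request returns on whichever comes
    first, a change to an already-watched key or the timeout. `None`
    means "never happens".
    """
    candidates = [
        t for t in (time_until_next_change_s, timeout_s) if t is not None
    ]
    return min(candidates) if candidates else math.inf


# A quiet app whose watched keys change once a day, with a 30 s timeout:
quiet_with_timeout = pickup_delay(24 * 3600, 30)
# The same app without a timeout has to wait for the next change:
quiet_without_timeout = pickup_delay(24 * 3600, None)
# No changes and no timeout: the new listener is never picked up.
never = pickup_delay(None, None)
```

Under this model the timeout is what caps the delay for rarely-changing deployments, which is the argument for keeping it.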
# Conflicts: # python/ray/serve/_private/router.py
Oh sorry bad phrasing on my part, I meant do we still need to rely on the long poll timeout for routers to receive the correct updates from the controller, i.e. are routers forced to go through a delay before receiving updates in any situation (which would have been the case without the dedicated long poll client per-router). (Definitely we should keep the long poll timeout!)
Oh I see what you mean! No, I don't think there's any situation where you get a delay now - the dedicated long poll client stays alive until the shared client tells it to stop, which only happens when the shared client gets an update that includes that handle's keys, which means they must be in the shared client now. But I guess that what I wrote above is still true:
The shared client doesn't take over until the handle's key gets an update here or here. So, especially for a rarely-changing deployment, there will be some time (potentially a long time) where both the dedicated and shared clients are running concurrently 🤔 ... which might defeat the purpose of this, at least for us, because we've got a lot of apps that don't change much. Lemme think on this a bit more - this would be an incremental improvement for us still but maybe doesn't fully solve the load issue.
Ah, but I forgot that when the shared client sends its first request that includes the new key listeners, the snapshot id will be So my concern above is not real - the shared client will always take over on the next timeout cycle when it adds the new listeners.
Hmm, I see. I think this can cause a lot of overhead for the controller when apps are first deployed, but should still improve the performance once that first update is received, so that in steady state there aren't tons of separate long poll clients calling into the controller and repeatedly disconnecting/reconnecting. However if that overhead is still a concern, have you tried implementing the cancel that we've discussed before? I am not 100% sure since I haven't implemented it myself, but I think using cancel is safe and will give us what we want. The long poll client uses
Ah, seems like we had a race condition of replying to each other 😅. Yes, after the first update the shared client should take over, so if that removes concerns then I think the current implementation is fine.
Hah, yep! I am not concerned with the performance of the implementation as-is.
Small isolated PR to (hopefully) fix flakiness issues with `python/ray/serve/tests/test_deploy.py::test_reconfigure_with_queries`, noticed while working on #48807 and other PRs. --------- Signed-off-by: Josh Karpel <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
# Conflicts: # python/ray/serve/tests/test_deploy.py
@zcin looks like I had a failing test from a merge conflict but that's resolved now - ready for another round of review
for router in self.routers[deployment_id]:
    router.update_deployment_targets(deployment_target_info)
    router.long_poll_client.stop()
Calling `stop` here means that the router's own long poll client won't "stop" until after the next poll, right? Since the change in the long poll host (that triggered this update) also triggered the router's long poll client.
When there are a lot of applications/deployments and the controller is slowed down, will there be race conditions with multiple long poll clients updating the same state?
Correct, the router will get both updates - I guess I was assuming that those updates are and will continue to be idempotent. Is that not the case?
Hmm yes I believe that's true. Just want to make sure it's thought through carefully, since I haven't touched the long poll code before.
So the router's own long poll client will likely make 2 `listen_for_change` calls to the controller, but that is fine because the updates are idempotent (and they will time out after at most 30 seconds).
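As an illustration of why duplicate updates are harmless when they are idempotent, here is a minimal sketch. `RouterState` and its fields are hypothetical stand-ins, not the real router, and the method takes a plain list of replica ids as a simplifying assumption. The point is only that replacing state wholesale from a snapshot gives the same result whether the snapshot arrives once or twice.

```python
from typing import FrozenSet, Iterable


class RouterState:
    """Hypothetical stand-in for the state a router keeps per deployment."""

    def __init__(self) -> None:
        self.replica_ids: FrozenSet[str] = frozenset()

    def update_deployment_targets(self, replica_ids: Iterable[str]) -> None:
        # Replace the state wholesale instead of mutating it incrementally,
        # so applying the same snapshot a second time changes nothing.
        self.replica_ids = frozenset(replica_ids)


snapshot = ["replica-a", "replica-b"]

once = RouterState()
once.update_deployment_targets(snapshot)

twice = RouterState()
twice.update_deployment_targets(snapshot)  # e.g. delivered by the shared client
twice.update_deployment_targets(snapshot)  # e.g. delivered by the dedicated client
```

If updates were applied incrementally instead (for example, appending replicas), the second delivery would change the result, and the overlap between the two clients would need explicit coordination.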
Sounds right to me!
LGTM!
Why are these changes needed?
In our use case we use Ray Serve with many hundreds/thousands of apps, plus a "router" app that routes traffic to those apps using `DeploymentHandle`s. Right now, that means we have a `LongPollClient` for each `DeploymentHandle` in each router app replica, which could be tens or hundreds of thousands of `LongPollClient`s. This is expensive on both the Serve Controller and on the router app replicas. It can be particularly problematic in resource usage on the Serve Controller - the main thing blocking us from having as many router replicas as we'd like is the stability of the controller.
This PR aims to amortize this cost of having so many `LongPollClient`s by going from one-long-poll-client-per-handle to one-long-poll-client-per-process. Each `DeploymentHandle`'s `Router` now registers itself with a shared `LongPollClient` held by a singleton.
The actual implementation that I've gone with is a bit clunky because I'm trying to bridge the gap between the current solution and a design that only has shared `LongPollClient`s. This could potentially be cleaned up in the future. Right now, each `Router` still gets a dedicated `LongPollClient` that only runs temporarily, until the shared client tells it to stop.
Related: #45957 is the same idea but for handle autoscaling metrics pushing.
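The registration and hand-off described above can be sketched roughly as follows. This is a simplified, standalone illustration and not the code in this PR: `DedicatedClient`, `SketchRouter`, `SharedClientSketch`, `get_shared_client`, and `on_update` are hypothetical names, and there is no real long polling here. Only the inner loop (`self.routers[deployment_id]`, `update_deployment_targets`, `long_poll_client.stop()`) mirrors the snippet discussed in the review.

```python
from collections import defaultdict
from typing import Any, DefaultDict, List, Optional


class DedicatedClient:
    """Hypothetical stand-in for a router's temporary long poll client."""

    def __init__(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False


class SketchRouter:
    """Hypothetical stand-in for a Router; only what the sketch needs."""

    def __init__(self, deployment_id: str) -> None:
        self.deployment_id = deployment_id
        self.long_poll_client = DedicatedClient()
        self.targets: Any = None

    def update_deployment_targets(self, deployment_target_info: Any) -> None:
        self.targets = deployment_target_info


class SharedClientSketch:
    """Hypothetical per-process registry of routers, keyed by deployment."""

    def __init__(self) -> None:
        self.routers: DefaultDict[str, List[SketchRouter]] = defaultdict(list)

    def register(self, router: SketchRouter) -> None:
        self.routers[router.deployment_id].append(router)

    def on_update(self, deployment_id: str, deployment_target_info: Any) -> None:
        # Fan the update out to every router for this deployment, then tell
        # each router's dedicated client that it is no longer needed.
        for router in self.routers[deployment_id]:
            router.update_deployment_targets(deployment_target_info)
            router.long_poll_client.stop()


_shared_client: Optional[SharedClientSketch] = None


def get_shared_client() -> SharedClientSketch:
    """Return the single shared client for this process, creating it lazily."""
    global _shared_client
    if _shared_client is None:
        _shared_client = SharedClientSketch()
    return _shared_client


# Two handles to "app1" and one to "app2" share one client in this process.
r1, r2, r3 = SketchRouter("app1"), SketchRouter("app1"), SketchRouter("app2")
for r in (r1, r2, r3):
    get_shared_client().register(r)

get_shared_client().on_update("app1", ["replica-1"])
```

After the update for `app1`, both of its routers have their targets and their dedicated clients are stopped, while the router for `app2` keeps its dedicated client until its own first update arrives through the shared client.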