Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] Shared LongPollClient for Routers #48807

Merged
merged 27 commits into from
Jan 16, 2025

Conversation

JoshKarpel
Copy link
Contributor

@JoshKarpel JoshKarpel commented Nov 19, 2024

Why are these changes needed?

In our use case we use Ray Serve with many hundreds/thousands of apps, plus a "router" app that routes traffic to those apps using DeploymentHandles. Right now, that means we have a LongPollClient for each DeploymentHandle in each router app replica, which could be tens or hundreds of thousands of LongPollClients. This is expensive on both the Serve Controller and on the router app replicas. It can be particularly problematic in resource usage on the Serve Controller - the main thing blocking us from having as many router replicas as we'd like is the stability of the controller.

This PR aims to amortize this cost of having so many LongPollClients by going from one-long-poll-client-per-handle to one-long-poll-client-per-process. Each DeploymentHandle's Router now registers itself with a shared LongPollClient held by a singleton.

The actual implementation that I've gone with is a bit clunky because I'm trying to bridge the gap between the current solution and a design that only has shared LongPollClients. This could potentially be cleaned up in the future. Right now, each Router still gets a dedicated LongPollClient that only runs temporarily, until the shared client tells it to stop.

Related: #45957 is the same idea but for handle autoscaling metrics pushing.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jcotant1 jcotant1 added the serve Ray Serve Related Issue label Nov 20, 2024
Signed-off-by: Josh Karpel <[email protected]>
@zcin
Copy link
Contributor

zcin commented Nov 27, 2024

@JoshKarpel I can't reproduce the test failure locally, could you try merging master and seeing if the failure is still there?

@zcin zcin self-requested a review December 11, 2024 23:56
@@ -79,7 +80,6 @@ def __init__(
key_listeners: Dict[KeyType, UpdateStateCallable],
call_in_event_loop: AbstractEventLoop,
) -> None:
assert len(key_listeners) > 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is now handled by https://github.com/ray-project/ray/pull/48807/files#diff-f138b21f7ddcd7d61c0b2704c8b828b9bbe7eb5021531e2c7fabeb20ec322e1aR280-R288 (and is necessary - when the shared client boots up for the first time it will send an RPC with no keys in it)

@@ -324,10 +324,16 @@ def make_nonblocking_calls(expected, expect_blocking=False):
make_nonblocking_calls({"2": 2})


def test_reconfigure_with_queries(serve_instance):
def test_reconfigure_does_not_run_while_there_are_active_queries(serve_instance):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to de-flake this test 🤞🏻

@JoshKarpel JoshKarpel marked this pull request as ready for review December 16, 2024 19:19
@zcin
Copy link
Contributor

zcin commented Dec 21, 2024

@JoshKarpel with new routers waiting for its dedicated long poll client to make the first RPC call, do we still ever need to rely on the long poll timeout for anything?

@JoshKarpel
Copy link
Contributor Author

@JoshKarpel with new routers waiting for its dedicated long poll client to make the first RPC call, do we still ever need to rely on the long poll timeout for anything?

Oo, interesting question...

If the shared long poll never times out, it won't pick up new key listeners until some key it's already listening to has changed. So we'd converge to the same final state over the long term, but only if those existing keys get an update at some point, which isn't guaranteed (for example, in our setup, some Serve apps receive very little traffic and never need to autoscale - it could be hours or days between replica updates for them). I'd prefer to keep the timeout to ensure that there's a time-bound on how long it takes to get to the desired state.

# Conflicts:
#	python/ray/serve/_private/router.py
@zcin
Copy link
Contributor

zcin commented Jan 7, 2025

Oh sorry bad phrasing on my part, I meant do we still need to rely on the long poll timeout for routers to receive the correct updates from the controller, i.e. are routers forced to go through a delay before receiving updates in any situation (which would have been the case without the dedicated long poll client per-router).

(Definitely we should keep the long poll timeout!)

@JoshKarpel
Copy link
Contributor Author

Oh sorry bad phrasing on my part, I meant do we still need to rely on the long poll timeout for routers to receive the correct updates from the controller, i.e. are routers forced to go through a delay before receiving updates in any situation (which would have been the case without the dedicated long poll client per-router).

(Definitely we should keep the long poll timeout!)

Oh I see what you mean!

No, I don't think there's any situation where you get a delay now - the dedicated long poll client stays alive until the shared client tells it to stop, which only happens when the shared clients gets an update that includes that handle's keys, which means they must be in the shared client now.

But I guess that what I wrote above is still true:

So we'd converge to the same final state over the long term, but only if those existing keys get an update at some point, which isn't guaranteed (for example, in our setup, some Serve apps receive very little traffic and never need to autoscale - it could be hours or days between replica updates for them). I'd prefer to keep the timeout to ensure that there's a time-bound on how long it takes to get to the desired state.

The shared client doesn't take over until the handle's key gets an update here or here. So, especially for a rarely-changing deployment, there will be some time (potentially a long time) where both the dedicated and shared clients are running concurrently 🤔

... which might defeat the purpose of this, at least for us, because we've got a lot of apps that don't change much. Lemme think on this a bit more - this would be an incremental improvement for us still but maybe doesn't fully solve the load issue.

@JoshKarpel
Copy link
Contributor Author

But I guess that what I wrote above is still true:

So we'd converge to the same final state over the long term, but only if those existing keys get an update at some point, which isn't guaranteed (for example, in our setup, some Serve apps receive very little traffic and never need to autoscale - it could be hours or days between replica updates for them). I'd prefer to keep the timeout to ensure that there's a time-bound on how long it takes to get to the desired state.

The shared client doesn't take over until the handle's key gets an update here or here. So, especially for a rarely-changing deployment, there will be some time (potentially a long time) where both the dedicated and shared clients are running concurrently 🤔

... which might defeat the purpose of this, at least for us, because we've got a lot of apps that don't change much. Lemme think on this a bit more - this would be an incremental improvement for us still but maybe doesn't fully solve the load issue.

Ah, but I forgot that when the shared client sends its first request that includes the new key listeners, the snapshot id will be -1, so it will get an immediate changed response on those!

So my concern above is not real - the shared client will always take over on the next timeout cycle when it adds the new listeners.

@zcin
Copy link
Contributor

zcin commented Jan 7, 2025

... which might defeat the purpose of this, at least for us, because we've got a lot of apps that don't change much. Lemme think on this a bit more - this would be an incremental improvement for us still but maybe doesn't fully solve the load issue.

Hmm, I see. I think this will can cause a lot of overhead for the controller when apps are first deployed, but should still improve the performance once that first update is received, so that in steady state there aren't tons of separate long poll clients calling into the controller and repeatedly disconnecting/reconnecting.

However if that overhead is still a concern, have you tried implementing the cancel that we've discussed before? I am not 100% sure since I haven't implemented it myself, but I think using cancel is safe and will give us what we want. The long poll client uses obj_ref._on_completed for the callback, which should be invoked when the call is cancelled.

@zcin
Copy link
Contributor

zcin commented Jan 7, 2025

Ah, seems like we had a race condition of replying to each other 😅. Yes after the first update the shared client should take over, so if that removes concerns then I think the current implementation is fine.

@JoshKarpel
Copy link
Contributor Author

Ah, seems like we had a race condition of replying to each other 😅. Yes after the first update the shared client should take over, so if that removes concerns then I think the current implementation is fine.

Hah, yep! I am not concerned with the performance of the implementation as-is.

edoakes added a commit that referenced this pull request Jan 8, 2025
Small isolated PR to (hopefully) fix flakiness issues with
`python/ray/serve/tests/test_deploy.py::test_reconfigure_with_queries`,
noticed while working on #48807 and other PRs.

---------

Signed-off-by: Josh Karpel <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
# Conflicts:
#	python/ray/serve/tests/test_deploy.py
roshankathawate pushed a commit to roshankathawate/ray that referenced this pull request Jan 9, 2025
Small isolated PR to (hopefully) fix flakiness issues with
`python/ray/serve/tests/test_deploy.py::test_reconfigure_with_queries`,
noticed while working on ray-project#48807 and other PRs.

---------

Signed-off-by: Josh Karpel <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Roshan Kathawate <[email protected]>
dayshah pushed a commit to dayshah/ray that referenced this pull request Jan 10, 2025
Small isolated PR to (hopefully) fix flakiness issues with
`python/ray/serve/tests/test_deploy.py::test_reconfigure_with_queries`,
noticed while working on ray-project#48807 and other PRs.

---------

Signed-off-by: Josh Karpel <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
HYLcool pushed a commit to HYLcool/ray that referenced this pull request Jan 13, 2025
Small isolated PR to (hopefully) fix flakiness issues with
`python/ray/serve/tests/test_deploy.py::test_reconfigure_with_queries`,
noticed while working on ray-project#48807 and other PRs.

---------

Signed-off-by: Josh Karpel <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: lielin.hyl <[email protected]>
@JoshKarpel
Copy link
Contributor Author

@zcin looks like I had a failing test from a merge conflict but that's resolved now - ready for another round of review

Comment on lines +742 to +744
for router in self.routers[deployment_id]:
router.update_deployment_targets(deployment_target_info)
router.long_poll_client.stop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling stop here means that the router's own long poll client won't "stop" until after the next poll right? Since the change in the long poll host (that triggered this update) also triggered the router's long poll client.

When there is a lot of applications/deployments and controller is slowed down, will there be race conditions with multiple long poll clients updating the same state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, the router will get both updates - I guess I was assuming that those updates are and will continue to be idempotent. Is that not the case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yes I believe that's true. Just want to make sure it's thought through carefully, since I haven't touched the long poll code before.

So the router's own long poll client will likely make 2 listen_for_change calls to the controller, but that is fine because the updates are idempotent (and they will time out after at most 30 seconds).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds right to me!

Copy link
Contributor

@zcin zcin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@zcin zcin added the go add ONLY when ready to merge, run all tests label Jan 15, 2025
@zcin zcin merged commit 7452bc6 into ray-project:master Jan 16, 2025
6 checks passed
@JoshKarpel JoshKarpel deleted the shared-long-poll-client branch January 16, 2025 03:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants