[Serve] Group DeploymentHandle autoscaling metrics pushes by process #45957
Conversation
Signed-off-by: Josh Karpel <[email protected]>
I need to understand this a bit more; some nit comments.
```python
)
shared.register(self)
logger.info(
    f"Registered {self._handle_id} with shared metrics pusher {shared}."
)
else:
```
I'm not sure I understand this `else` block - it seems like even if `RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE = False` we still register the pushing task? It just won't have any data because the `_add_autoscaling_metrics_point` task isn't registered? 🤔
```python
self.metrics_pusher.register_or_update_task(
    self.PUSH_METRICS_TO_CONTROLLER_TASK_NAME,
    self.push_autoscaling_metrics_to_controller,
    autoscaling_config.metrics_interval_s,
```
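For context, the `register_or_update_task` pattern in question can be sketched as a small, hypothetical class. This is not the Ray Serve implementation - the class name, fields, and the manual `tick()` driver are illustrative only (the real pusher runs tasks on a background loop):

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class _Task:
    callback: Callable[[], None]
    interval_s: float
    next_due: float


class MetricsPusher:
    """Minimal sketch of a pusher that runs named periodic tasks.

    Re-registering an existing name updates its callback and interval,
    which is what makes per-deployment intervals awkward for a pusher
    shared across many handles.
    """

    def __init__(self) -> None:
        self._tasks: Dict[str, _Task] = {}

    def register_or_update_task(
        self, name: str, callback: Callable[[], None], interval_s: float
    ) -> None:
        # New registrations (or updates) are due immediately.
        self._tasks[name] = _Task(callback, interval_s, next_due=0.0)

    def tick(self, now: float) -> None:
        # Run every task whose deadline has passed, then reschedule it.
        for task in self._tasks.values():
            if now >= task.next_due:
                task.callback()
                task.next_due = now + task.interval_s


pushes = []
pusher = MetricsPusher()
pusher.register_or_update_task("push_metrics", lambda: pushes.append("push"), 10.0)
pusher.tick(0.0)   # first deadline has passed: push
pusher.tick(5.0)   # 10s interval not yet elapsed: no push
pusher.tick(10.0)  # elapsed: push again
```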
@zcin @edoakes I'm working on resurrecting this PR and I think this was the main sticking point - the shared metrics pusher wouldn't respect the `autoscaling_config.metrics_interval_s` in the current form of the PR. Can we deprecate that parameter? Would we need to feature-flag this PR so it can be introduced gradually? Or maybe we want to preserve it and have a different shared pusher for each `metrics_interval_s`?
I am fine with making the `metrics_interval_s` a cluster-level option configured via env var. WDYT Cindy?
Ah! I like that option!
Yes, I think making the metrics interval a cluster-level option makes sense.
Sounds good - I will take that approach when I get back from the holidays
Oh, it turns out there already is an env var for this, `HANDLE_METRIC_PUSH_INTERVAL_S`:

ray/python/ray/serve/_private/constants.py (lines 148 to 151 in 4742373):

```python
# Handle metric push interval. (This interval will affect the cold start time period)
HANDLE_METRIC_PUSH_INTERVAL_S = float(
    os.environ.get("RAY_SERVE_HANDLE_METRIC_PUSH_INTERVAL_S", "10")
)
```

ray/python/ray/serve/_private/router.py (line 350 in 86da5fa):

```python
HANDLE_METRIC_PUSH_INTERVAL_S,
```
The `autoscaling_config.metrics_interval_s` is also used in a variety of other places, like for controlling how often metrics are recorded:

ray/python/ray/serve/_private/router.py (lines 200 to 203 in 86da5fa):

```python
min(
    RAY_SERVE_HANDLE_AUTOSCALING_METRIC_RECORD_PERIOD_S,
    autoscaling_config.metrics_interval_s,
),
```
Should I remove all of those and just use the single env var, or keep the other uses and only "ignore" the setting here for the shared pusher?
I'm inclined to remove all of them for consistency but that's a much bigger blast radius...
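For illustration, the env-var-only scheme under discussion could look roughly like this. The default values and the helper name here are assumptions for the sketch, not Ray's actual defaults:

```python
import os

# Assumed defaults for illustration only; Ray's real defaults may differ.
RECORD_PERIOD_S = float(
    os.environ.get("RAY_SERVE_HANDLE_AUTOSCALING_METRIC_RECORD_PERIOD_S", "0.1")
)
PUSH_INTERVAL_S = float(
    os.environ.get("RAY_SERVE_HANDLE_METRIC_PUSH_INTERVAL_S", "10")
)


def effective_record_period_s(metrics_interval_s: float) -> float:
    # Mirrors the min(...) quoted above: record at least as often as the
    # per-deployment autoscaling interval, while it still exists. If
    # metrics_interval_s is removed, this collapses to RECORD_PERIOD_S.
    return min(RECORD_PERIOD_S, metrics_interval_s)
```

Recording is typically more frequent than pushing, so the two knobs stay separate: `RECORD_PERIOD_S` governs local sampling, `PUSH_INTERVAL_S` governs the RPC to the controller.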
I think we should remove `autoscaling_config.metrics_interval_s` and just use `RAY_SERVE_HANDLE_AUTOSCALING_METRIC_RECORD_PERIOD_S` for recording metrics, because the interval at which we record metrics is usually more frequent than pushing metrics to the controller. And then we can use `HANDLE_METRIC_PUSH_INTERVAL_S`, but perhaps renaming it to include "autoscaling" would be better going forward, if it is to replace `autoscaling_config.metrics_interval_s`?
## Why are these changes needed?

In our use case we use Ray Serve with many hundreds/thousands of apps, plus a "router" app that routes traffic to those apps using `DeploymentHandle`s. Right now, that means we have a `LongPollClient` for each `DeploymentHandle` in each router app replica, which could be tens or hundreds of thousands of `LongPollClient`s. This is expensive on both the Serve Controller and on the router app replicas. It can be particularly problematic in resource usage on the Serve Controller - the main thing blocking us from having as many router replicas as we'd like is the stability of the controller.

This PR aims to amortize the cost of having so many `LongPollClient`s by going from one-long-poll-client-per-handle to one-long-poll-client-per-process. Each `DeploymentHandle`'s `Router` now registers itself with a shared `LongPollClient` held by a singleton.

The actual implementation that I've gone with is a bit clunky because I'm trying to bridge the gap between the current solution and a design that *only* has shared `LongPollClient`s. This could potentially be cleaned up in the future. Right now, each `Router` still gets a dedicated `LongPollClient` that only runs temporarily, until the shared client tells it to stop.

Related: #45957 is the same idea but for handle autoscaling metrics pushing.

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---------

Signed-off-by: Josh Karpel <[email protected]>
## Why are these changes needed?
We're seeing a lot of pressure on the Serve controller from metrics push tasks when running thousands of Serve apps. A lot of this pressure is purely from the overhead of lots of RPC connections incoming to the controller. We might be able to amortize this overhead (and presumably similar overhead in the handles too) by having the metrics push happen at the per-process level instead of the per-handle level.
We've made this change on our setup and it has reduced CPU time spent on this on the Controller, and also our ingress application replicas that have all the handles.
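The per-process grouping described above can be sketched as a small singleton that handles register with; a single periodic task then sends one combined payload to the controller instead of one RPC per handle. The class and method names here are hypothetical, not the actual Ray Serve API:

```python
import threading
from typing import Callable, Dict


class SharedMetricsPusher:
    """Hypothetical per-process singleton aggregating metrics pushes.

    Instead of each DeploymentHandle pushing to the controller on its own
    schedule (one RPC per handle per interval), every handle registers a
    metrics getter here, and one periodic task pushes a single combined
    payload for the whole process.
    """

    _instance = None
    _lock = threading.Lock()

    def __init__(self) -> None:
        self._getters: Dict[str, Callable[[], dict]] = {}

    @classmethod
    def get_or_create(cls) -> "SharedMetricsPusher":
        # Double-checked creation under a lock: one pusher per process.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def register(self, handle_id: str, get_metrics: Callable[[], dict]) -> None:
        self._getters[handle_id] = get_metrics

    def collect_payload(self) -> Dict[str, dict]:
        # One combined payload means one push RPC, however many handles exist.
        return {hid: get() for hid, get in self._getters.items()}


shared = SharedMetricsPusher.get_or_create()
shared.register("handle-a", lambda: {"queued_requests": 3})
shared.register("handle-b", lambda: {"queued_requests": 7})
payload = shared.collect_payload()
```

With N handles in a process, the controller sees one incoming push per interval rather than N, which is where the CPU savings on both ends come from.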
## Related issue number
Closes #45777
## Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.