[Serve] Provide backpressure on handle metrics push #45776
Labels
enhancement
Request for new feature and/or capability
P1
Issue that should be fixed within a few weeks
serve
Ray Serve Related Issue
Description
It would be nice to provide backpressure on handle metrics pushes to the Serve controller so that the controller does not become overloaded.
Relevant code is around these locations:
ray/python/ray/serve/_private/metrics_utils.py
Lines 48 to 73 in 9835610
ray/python/ray/serve/_private/router.py
Lines 258 to 265 in 9835610
Currently the metrics push is fire-and-forget, and happens on a fixed interval whether or not the previous push has finished.
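One possible shape for the requested behavior is sketched below. It is a minimal illustration in plain asyncio, not Serve's actual implementation. The class name BackpressuredPusher, its tick/run methods, and the slow_push stand-in are all hypothetical. The idea is to keep the fixed interval but skip a tick while the previous push is still pending, so at most one push per handle is outstanding at any time.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class BackpressuredPusher:
    """Hypothetical helper: push on a fixed interval, but never start a new
    push while the previous one is still in flight."""

    def __init__(self, push: Callable[[], Awaitable[None]], interval_s: float):
        self._push = push                  # coroutine function that performs one push
        self._interval_s = interval_s
        self._in_flight: Optional[asyncio.Task] = None
        self.started = 0                   # pushes actually issued
        self.skipped = 0                   # ticks dropped due to backpressure

    def tick(self) -> bool:
        """Start a push unless one is still pending. Returns True if started.

        Must be called from inside a running event loop.
        """
        if self._in_flight is not None and not self._in_flight.done():
            self.skipped += 1
            return False
        self._in_flight = asyncio.create_task(self._push())
        self.started += 1
        return True

    async def run(self, num_ticks: int) -> None:
        """Tick num_ticks times, then wait for the last push to finish."""
        for _ in range(num_ticks):
            self.tick()
            await asyncio.sleep(self._interval_s)
        if self._in_flight is not None:
            await self._in_flight


async def demo() -> BackpressuredPusher:
    async def slow_push() -> None:
        # Stand-in for a controller that is slow to accept the metrics.
        await asyncio.sleep(0.05)

    pusher = BackpressuredPusher(slow_push, interval_s=0.01)
    await pusher.run(num_ticks=10)
    return pusher


if __name__ == "__main__":
    p = asyncio.run(demo())
    print(f"started={p.started} skipped={p.skipped}")
```

Skipping rather than queueing means the controller only ever sees the latest metrics from a slow handle, which avoids the pile-up of stale tasks. The sketch omits error handling for a failed push and any timeout on a push that never completes; a real implementation would need both.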
Use case
Our system is running a very large number of DeploymentHandles (see #44784 for more details). We've noticed that the Serve controller gets overloaded (>100% CPU usage) trying to accept all of the metrics pushes, which leads to an ever-increasing number of increasingly stale record_handle_metrics tasks sitting idle on the controller, which then eventually runs out of memory and crashes.