
[Serve] Provide backpressure on handle metrics push #45776

Open
JoshKarpel opened this issue Jun 6, 2024 · 1 comment
Assignees
Labels
enhancement (Request for new feature and/or capability) · P1 (Issue that should be fixed within a few weeks) · serve (Ray Serve Related Issue)

Comments

@JoshKarpel
Contributor

JoshKarpel commented Jun 6, 2024

Description

It would be nice to provide backpressure on handle metrics pushes to the Serve controller so that the controller does not become overloaded.

Relevant code is around these locations:

  • async def metrics_task(self, name: str):
        """Periodically runs `task_func` every `interval_s` until `stop_event` is set.

        If `task_func` raises an error, an exception will be logged.
        """
        wait_for_stop_event = asyncio.create_task(self.stop_event.wait())
        while True:
            if wait_for_stop_event.done():
                return

            try:
                self._tasks[name].task_func()
            except Exception as e:
                logger.exception(f"Failed to run metrics task '{name}': {e}")

            sleep_task = asyncio.create_task(
                self._async_sleep(self._tasks[name].interval_s)
            )
            await asyncio.wait(
                [sleep_task, wait_for_stop_event],
                return_when=asyncio.FIRST_COMPLETED,
            )
            if not sleep_task.done():
                sleep_task.cancel()

  • self._controller_handle.record_handle_metrics.remote(
        send_timestamp=time.time(),
        deployment_id=self._deployment_id,
        handle_id=self._handle_id,
        actor_id=self._self_actor_id,
        handle_source=self._handle_source,
        **self._get_aggregated_requests(),
    )

Currently the metrics push is fire-and-forget, and happens on a fixed interval whether or not the previous push has finished.
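
For illustration, here is a minimal sketch of one way backpressure could work: await the controller's acknowledgment of the previous push before issuing the next one, so pushes can never pile up faster than the controller drains them. This is not Ray Serve's actual implementation; the HandleMetricsPusher class, its constructor, and the get_metrics callback are invented for this example, and only record_handle_metrics.remote(...) comes from the snippets above.

    # Sketch only: assumes a Ray actor handle to the Serve controller.
    import asyncio
    import time


    class HandleMetricsPusher:
        def __init__(self, controller_handle, interval_s: float):
            self._controller_handle = controller_handle
            self._interval_s = interval_s
            self._in_flight = None  # ObjectRef of the most recent push, if any

        async def push_once(self, **metrics) -> None:
            # Backpressure: block until the controller has processed the
            # previous push before sending another one, instead of
            # fire-and-forget.
            if self._in_flight is not None:
                await self._in_flight
            self._in_flight = self._controller_handle.record_handle_metrics.remote(
                send_timestamp=time.time(), **metrics
            )

        async def run(self, get_metrics) -> None:
            # Push on a fixed interval, but never let pushes outpace the
            # controller.
            while True:
                await self.push_once(**get_metrics())
                await asyncio.sleep(self._interval_s)

An alternative with the same effect would be to keep the existing interval loop but skip a push whenever the previous ObjectRef has not resolved yet, trading extra staleness for bounded load on the controller.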

Use case

Our system runs a very large number of DeploymentHandles (see #44784 for more details). We've noticed that the Serve controller becomes overloaded (>100% CPU usage) trying to accept all of the metrics pushes. This leads to an ever-growing backlog of increasingly stale record_handle_metrics tasks on the controller, which eventually runs out of memory and crashes.

@JoshKarpel JoshKarpel added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 6, 2024
@zcin zcin self-assigned this Jun 6, 2024
@zcin zcin added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 6, 2024
@anyscalesam anyscalesam added the serve Ray Serve Related Issue label Jun 21, 2024
@JoshKarpel
Contributor Author

@zcin FYI with #45957, I suspect this won't be necessary from a performance/scalability perspective, though it may be good design to do it anyway
