
[Serve] Provide backpressure on handle metrics push #45776

Open
JoshKarpel opened this issue Jun 6, 2024 · 1 comment
Assignees
Labels
enhancement (Request for new feature and/or capability) · P1 (Issue that should be fixed within a few weeks) · serve (Ray Serve Related Issue)

Comments

@JoshKarpel
Contributor

JoshKarpel commented Jun 6, 2024

Description

It would be nice to provide backpressure on handle metrics pushes to the Serve controller so that the controller does not become overloaded.

Relevant code is around these locations:

  • async def metrics_task(self, name: str):
        """Periodically runs `task_func` every `interval_s` until `stop_event` is set.

        If `task_func` raises an error, an exception will be logged.
        """
        wait_for_stop_event = asyncio.create_task(self.stop_event.wait())
        while True:
            if wait_for_stop_event.done():
                return

            try:
                self._tasks[name].task_func()
            except Exception as e:
                logger.exception(f"Failed to run metrics task '{name}': {e}")

            sleep_task = asyncio.create_task(
                self._async_sleep(self._tasks[name].interval_s)
            )
            await asyncio.wait(
                [sleep_task, wait_for_stop_event],
                return_when=asyncio.FIRST_COMPLETED,
            )
            if not sleep_task.done():
                sleep_task.cancel()

  • self._controller_handle.record_handle_metrics.remote(
        send_timestamp=time.time(),
        deployment_id=self._deployment_id,
        handle_id=self._handle_id,
        actor_id=self._self_actor_id,
        handle_source=self._handle_source,
        **self._get_aggregated_requests(),
    )

Currently the metrics push is fire-and-forget, and happens on a fixed interval whether or not the previous push has finished.
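
For illustration, here is a minimal sketch of one way backpressure could work: await the controller's acknowledgment of the previous push before issuing the next one, so pushes can never pile up faster than the controller drains them. This is not Ray Serve's actual implementation; the HandleMetricsPusher class, its constructor, and the get_metrics callback are invented for this example, and only record_handle_metrics.remote(...) comes from the snippets above.

    # Sketch only: assumes a Ray actor handle to the Serve controller.
    import asyncio
    import time


    class HandleMetricsPusher:
        def __init__(self, controller_handle, interval_s: float):
            self._controller_handle = controller_handle
            self._interval_s = interval_s
            self._in_flight = None  # ObjectRef of the most recent push, if any

        async def push_once(self, **metrics) -> None:
            # Backpressure: block until the controller has processed the
            # previous push before sending another one, instead of
            # fire-and-forget.
            if self._in_flight is not None:
                await self._in_flight
            self._in_flight = self._controller_handle.record_handle_metrics.remote(
                send_timestamp=time.time(), **metrics
            )

        async def run(self, get_metrics) -> None:
            # Push on a fixed interval, but never let pushes outpace the
            # controller.
            while True:
                await self.push_once(**get_metrics())
                await asyncio.sleep(self._interval_s)

An alternative with the same effect would be to keep the existing interval loop but skip a push whenever the previous ObjectRef has not resolved yet, trading extra staleness for bounded load on the controller.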

Use case

Our system runs a very large number of DeploymentHandles (see #44784 for more details). We've noticed that the Serve controller becomes overloaded (>100% CPU usage) trying to accept all of the metrics pushes. This leads to an ever-growing backlog of increasingly stale record_handle_metrics tasks on the controller, which eventually runs out of memory and crashes.

@JoshKarpel JoshKarpel added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 6, 2024
@zcin zcin self-assigned this Jun 6, 2024
@zcin zcin added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 6, 2024
@anyscalesam anyscalesam added the serve Ray Serve Related Issue label Jun 21, 2024
@JoshKarpel
Contributor Author

@zcin FYI with #45957, I suspect this won't be necessary from a performance/scalability perspective, though it may be good design to do it anyway
