
[Serve] Faster bulk imperative Serve Application deploys #49168

Open
wants to merge 45 commits into master
Conversation

@JoshKarpel (Contributor) commented on Dec 9, 2024

Why are these changes needed?

Our pattern of using Ray Serve has us deploying many hundreds/thousands of apps using the imperative API (serve.run). This ends up being very slow because the Controller needs to checkpoint as part of every RPC. It would be significantly more efficient to batch the deploys so that we can checkpoint fewer times.

This PR adds a new serve.run_many() API, marked as a developer API, that can submit many applications to the Serve Controller in a single RPC, with just one checkpoint saved after all of those applications are registered. The existing code path (including serve.run()) is refactored to use bulk operations under the hood: serve.run() now calls serve.run_many().

To further help with our particular use case, where applications are deployed from an external controller that doesn't need to wait for e.g. ingress deployment creation, the new code path also offers fine-grained control over what is waited for.
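For illustration, here is a rough sketch of how the batch API might be called. The names serve.run_many, serve.RunTarget, wait_for_ingress_deployment_creation, and wait_for_applications_running all appear elsewhere in this PR, but the RunTarget field names and the exact call signature below are assumptions, not the final API.

```python
# Rough usage sketch only; RunTarget field names are assumed.
from ray import serve


@serve.deployment
class Echo:
    def __call__(self, request) -> str:
        return "hello"


targets = [
    serve.RunTarget(
        name=f"app-{i}",
        target=Echo.bind(),
        route_prefix=f"/app-{i}",
    )
    for i in range(100)
]

# Submit the whole batch in one RPC (and one checkpoint), without blocking on
# ingress deployment creation or on the applications reaching RUNNING.
serve.run_many(
    targets,
    wait_for_ingress_deployment_creation=False,
    wait_for_applications_running=False,
)
```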


Just introducing a batch API isn't sufficient to provide a meaningful speedup, though. As mentioned above, checkpointing is the slow part, and right now it is very granular: the various stateful components checkpoint themselves at the bottom of the call stack, so even a single RPC may trigger multiple checkpoints.

Below I've tried to map out all the reasons that the Application/DeploymentStateManagers might checkpoint:

graph TD;
    deployment_state_set_target_state[DeploymentState._set_target_state] --> dsm_checkpoint[DeploymentStateManager._save_checkpoint_func]
    deployment_state_deploy[DeploymentState.deploy] --> deployment_state_set_target_state
    deployment_state_manager_deploy[DeploymentStateManager.deploy] --> deployment_state_deploy
    application_state_apply_deployment_info[ApplicationState.apply_deployment_info] --> deployment_state_manager_deploy
    application_state_reconcile_target_deployments[ApplicationState._reconcile_target_deployments] --x application_state_apply_deployment_info
    application_state_update[ApplicationState.update] --> application_state_reconcile_target_deployments
    application_state_manager_update[ApplicationStateManager.update] --x application_state_update
    serve_controller_run_control_loop[ServeController.run_control_loop] --> application_state_manager_update

    deployment_state_set_target_state_deleting[DeploymentState._set_target_state_deleting] --> dsm_checkpoint
    deployment_state_delete[DeploymentState.delete] --> deployment_state_set_target_state_deleting
    deployment_state_manager_delete_deployment[DeploymentStateManager.delete_deployment] --> deployment_state_delete
    application_state_delete_deployment[ApplicationState._delete_deployment] --> deployment_state_manager_delete_deployment
    application_state_reconcile_target_deployments --> application_state_delete_deployment

    deployment_state_autoscale[DeploymentState.autoscale] --> deployment_state_set_target_state
    deployment_state_manager_update[DeploymentStateManager.update] --> deployment_state_autoscale
    serve_controller_run_control_loop --> deployment_state_manager_update

    as_set_target_state[ApplicationState._set_target_state] --> asm_checkpoint[ApplicationStateManager._save_checkpoint_func]

    as_recover_target_state_from_checkpoint[ApplicationState.recover_target_state_from_checkpoint] --> as_set_target_state
    asm_recover_from_checkpoint[ApplicationStateManager._recover_from_checkpoint] --> as_recover_target_state_from_checkpoint
    asm_init[ApplicationStateManager.__init__] --> asm_recover_from_checkpoint
    sc_init[ServeController.__init__] --> asm_init

    as_set_target_state_deleting[ApplicationState._set_target_state_deleting] --> as_set_target_state
    as_delete[ApplicationState.delete] --> as_set_target_state_deleting
    asm_delete_app[ApplicationStateManager.delete_app] --> as_delete
    sc_delete_apps[ServeController.delete_apps] --x asm_delete_app
    RPC --> sc_delete_apps

    as_clear_target_state_and_store_config[ApplicationState._clear_target_state_and_store_config] --> as_set_target_state
    as_apply_app_config[ApplicationState.apply_app_config] --> as_clear_target_state_and_store_config
    asm_apply_app_configs[ApplicationStateManager.apply_app_configs] --x as_apply_app_config
    sc_apply_config[ServeController.apply_config] --> asm_apply_app_configs
    RPC --> sc_apply_config

    as_deploy_app[ApplicationState.deploy_app] --> as_set_target_state
    asm_deploy_app[ApplicationStateManager.deploy_app] --> as_deploy_app
    sc_deploy_application[ServeController.deploy_application] --> asm_deploy_app
    RPC --> sc_deploy_application

    as_apply_app_config --> as_set_target_state

So, in addition to the batch API that the client sees, I've refactored where these checkpoints are done so that they happen at the top of those call stacks instead of at the bottom.

  • We still checkpoint before returning from an RPC that mutates state (now just before returning, rather than deep inside the call).
  • We still checkpoint after making any changes to internal state and before issuing any commands to the cluster (e.g. starting/stopping replicas), just not immediately after each individual state change. See the sketch below.
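To make the new placement concrete, here is a minimal sketch of the pattern under simplified, stand-in names: handle_deploy_rpc, run_control_loop_step, and issue_replica_commands are illustrative, not the actual ServeController methods; only save_checkpoint() matches a method name that appears in this PR's diffs.

```python
# Illustrative sketch only: checkpoints move to the top of the call stack.
class Controller:
    def handle_deploy_rpc(self, targets):
        # Mutate in-memory target state for every app in the batch...
        for t in targets:
            self.application_state_manager.deploy_app(t)
        # ...then checkpoint once, just before returning from the RPC,
        # instead of once per app deep inside deploy_app().
        self.application_state_manager.save_checkpoint()
        self.deployment_state_manager.save_checkpoint()

    def run_control_loop_step(self):
        # Reconcile internal state first.
        self.application_state_manager.update()
        self.deployment_state_manager.update()
        # Checkpoint after the internal state changes, but before any
        # commands (e.g. starting/stopping replica actors) hit the cluster.
        self.application_state_manager.save_checkpoint()
        self.deployment_state_manager.save_checkpoint()
        self.issue_replica_commands()  # hypothetical placeholder
```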

I did not change the EndpointState's checkpointing because it hasn't shown up in our flamegraphs.


Before these changes, deploying 5k Serve apps, each with one deployment, took >1 hour and would often never finish because the Serve Controller would become unresponsive and KubeRay would end up restarting the cluster.

With these changes, deploying 5k Serve apps with a batch size of 100 per API call only takes about 90 seconds!

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jcotant1 added the serve (Ray Serve Related Issue) label on Dec 9, 2024
Comment on lines 70 to 71
serve.run_many
serve.RunTarget
JoshKarpel (Contributor, Author):
Should these be here, or should they get sectioned off in some "beta APIs" section?

Comment on lines +922 to +927
live_route_prefixes: Dict[str, str] = {
app_state.route_prefix: app_name
for app_name, app_state in self._application_states.items()
if app_state.route_prefix is not None
and not app_state.status == ApplicationStatus.DELETING
}
JoshKarpel (Contributor, Author):
This, along with https://github.com/ray-project/ray/pull/49168/files#diff-31975e75fad7092f05d8d5759dfeb90d8b454fc4193e4970b55914dbb87ad968R942-R946, is an actual change rather than just a loop over the existing code. For efficiency, building this mapping is pulled up a level, and the mapping is updated as we go through all the apps we're deploying; otherwise we'd need to loop over all existing apps for each new app in the batch.
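As a self-contained illustration of that idea (simplified, illustrative names only; this is not the actual ApplicationStateManager code), the live route-prefix map is built once per batch and then kept up to date while the batch is processed:

```python
from typing import Dict, Iterable, Optional, Tuple


def claim_route_prefixes(
    existing_prefixes: Dict[str, Optional[str]],  # app name -> live prefix (or None)
    batch: Iterable[Tuple[str, Optional[str]]],   # (new app name, desired prefix)
) -> None:
    # Build the live prefix -> app-name map once for the whole batch...
    live_route_prefixes = {
        prefix: name
        for name, prefix in existing_prefixes.items()
        if prefix is not None
    }
    # ...then update it in place, instead of rescanning all existing apps
    # for every new app in the batch.
    for name, prefix in batch:
        if prefix is None:
            continue
        owner = live_route_prefixes.get(prefix)
        if owner is not None and owner != name:
            raise ValueError(f"Route prefix {prefix!r} is already used by app {owner!r}")
        # Claim the prefix so later apps in the same batch also see it as taken.
        live_route_prefixes[prefix] = name
```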

else:
client = _private_api.serve_start(
http_options={"location": "EveryNode"},
global_logging_config=logging_config,
global_logging_config=t.logging_config, # implicitly uses the last target
JoshKarpel (Contributor, Author):
I'm struggling with this a bit. Is the logging config intended to be per-app or Serve-wide? Should the global logging config be a separate parameter?

Reviewer (Contributor):
The logging config can be set globally and overridden per-app

In this case, it should be empty

Comment on lines -1112 to -1113
if writeahead_checkpoints is not None:
application_state_info.update(writeahead_checkpoints)
JoshKarpel (Contributor, Author):
I removed the writeahead_checkpoints functionality after discussing with @edoakes - it sounds like this is vestigial from when the Controller was much more async, and there's no functional reason to keep it now (the checkpoint can happen after setting the target state, as long as no changes are made to the Ray cluster between them).

@JoshKarpel marked this pull request as ready for review on December 18, 2024 at 15:38
@JoshKarpel (Contributor, Author) commented:
Ready for a first look - I've left some questions in comments that need some spicier decisions made!

Comment on lines +49 to +50
route_prefix: Optional[str]
logging_config: Optional[LoggingConfig]
JoshKarpel (Contributor, Author):
These are now part of the BuiltApplication to make it easier to group things up without adding an extra intermediate data structure (that would presumably have a BuiltApplication + route_prefix + logging_config).
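Roughly the shape implied by the diff excerpt above, as a hypothetical sketch only: route_prefix and logging_config come from the excerpt, while the other fields and the LoggingConfig import path are assumptions rather than the actual BuiltApplication definition.

```python
from dataclasses import dataclass
from typing import List, Optional

from ray.serve.schema import LoggingConfig  # import path assumed


@dataclass
class BuiltApplication:
    name: str
    ingress_deployment_name: str
    deployments: List  # the app's Deployment objects
    # Carried on the built app itself so callers don't need a separate
    # wrapper holding (BuiltApplication, route_prefix, logging_config).
    route_prefix: Optional[str] = None
    logging_config: Optional[LoggingConfig] = None
```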

)[0]


@PublicAPI(stability="beta")
Reviewer (Contributor):
Let's make this a DeveloperAPI -- I'm not comfortable maintaining it as a public API with the regular guarantees on stability (yet)

JoshKarpel (Contributor, Author) replied:
Oo, didn't realize that was a thing, can do!

Comment on lines +546 to +547
wait_for_ingress_deployment_creation: bool = True,
wait_for_applications_running: bool = True,
Reviewer (Contributor):
are these options necessary for your usage?

JoshKarpel (Contributor, Author) replied:
They're a dramatic speedup for us, because otherwise the client would be waiting for the controller to actually create the actors and for those actors to become ready, which we already check for separately on each of our control-loop cycles. Not needing to wait means we can turn around and submit the next group of targets immediately instead of waiting for the next Serve Controller cycle (which is especially slow for us because we've had to increase the loop interval to give the controller more time to respond to requests, though we might be able to undo that with #48807 and https://github.com/ray-project/ray/pull/45957/files).
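A hedged sketch of the caller-side loop being described: next_batch_of_targets is a hypothetical helper, the run_many signature is the one quoted above, and serve.status() is the existing public status API used to check readiness on a later cycle instead of blocking inside run_many.

```python
import time
from typing import List

from ray import serve


def next_batch_of_targets() -> List["serve.RunTarget"]:
    """Hypothetical helper: return the next batch of RunTargets to deploy."""
    return []


# Hypothetical external control loop: submit each batch without waiting, then
# reconcile application readiness separately on later cycles.
while True:
    batch = next_batch_of_targets()
    if batch:
        serve.run_many(
            batch,
            wait_for_ingress_deployment_creation=False,
            wait_for_applications_running=False,
        )
    # Readiness/health is checked here instead of blocking inside run_many().
    app_statuses = serve.status().applications
    time.sleep(5.0)
```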

Comment on lines +2732 to +2733
# In real code this checkpoint would be done by the caller of .deploy()
dsm.save_checkpoint()
Reviewer (Contributor):
why's this needed?

JoshKarpel (Contributor, Author) replied:
Without this, no one actually saves the checkpoint that https://github.com/ray-project/ray/pull/49168/files#diff-41218c5a0b704f1be5080d99a1c21c89c421e12d28df768be56588ee7a53544dR2738 would try to use. With this PR, dsm.update() does not implicitly save a checkpoint; whoever calls DeploymentStateManager.update() is responsible for that.

@JoshKarpel changed the title from "[Serve] Faster imperative Serve Application deploys" to "[Serve] Faster bulk imperative Serve Application deploys" on Jan 7, 2025
@JoshKarpel requested a review from edoakes on January 7, 2025 at 16:22
@JoshKarpel (Contributor, Author) commented:
@zcin @edoakes this could use another round of review when you get a chance 🙇🏻

@zcin (Contributor) left a comment:
Overall LGTM!

self.application_state_manager.save_checkpoint()
# ApplicationStateManager.update() can also mutate the
# DeploymentStateManager so we need to checkpoint that as well
self.deployment_state_manager.save_checkpoint()
Reviewer (Contributor):
Why do we need to save the deployment_state_manager checkpoint twice?

JoshKarpel (Contributor, Author) replied:
This is re: https://github.com/ray-project/ray/pull/49168/files#diff-1b8f248dbc32098673b5329a85021c686d9e5f03b05c145c9354988280bf491fR404?

My thinking was that it's still best to checkpoint as soon as possible after making a batch of changes - but I'd be fine with moving these two checkpoints down to around https://github.com/ray-project/ray/pull/49168/files#diff-1b8f248dbc32098673b5329a85021c686d9e5f03b05c145c9354988280bf491fR435, after both update steps are complete.

a2, pida2 = serve.get_app_handle("a").remote().result()

assert a1 == a2
assert pida1 == pida2
Reviewer (Contributor):
Could you also add a test to test_local_testing_mode.py?

JoshKarpel (Contributor, Author) replied:
Can do - anything in particular that I should be checking there, or just an equivalent of these tests but with _local_testing_mode=True in the run calls?

Comment on lines -1853 to +1855
# Since we set up the `backoff_sequence_s` to be 999s, this 1s timeout will still
# Since we set up the `backoff_sequence_s` to be 999s, this 10s timeout will still
# capture the extra delay if it was added between scheduling loop.
done, _ = await asyncio.wait([task], timeout=1)
done, _ = await asyncio.wait([task], timeout=10)
JoshKarpel (Contributor, Author):
Got what looked like a flaky failure here on https://buildkite.com/ray-project/microcheck/builds/10134#01948a39-6ed6-48a0-9e2b-5e685c57a3f2, presumably because this timeout is short and CI is slow.

@JoshKarpel requested a review from zcin on January 21, 2025 at 20:36
Labels: serve (Ray Serve Related Issue)
Projects: None yet
4 participants