ci: make dcou partitioned steps not-skewed and cached #4452

Merged: ryoqun merged 16 commits into anza-xyz:master on Jan 30, 2025

@ryoqun ryoqun commented Jan 14, 2025

Problem

the dcou (dev-context-only-utils) ci steps are taking too much time

Summary of Changes

patch upstream cargo hack a bit (todo: create an upstream pr...)

extracted from here: https://github.com/anza-xyz/agave/compare/master...ryoqun:dcou-partition-debug-wip?expand=1

Results

before:

dcou jobs take 5-15 mins, and dcou 3/3 always takes the longest:

sample: https://buildkite.com/anza/agave/builds/17598:

image

after:

dcou jobs are now 5-10 mins (i already primed the machine-specific caches by repeated manual ci runs, lol), and dcou 3/3 is no longer the slowest. all jobs usually take about the same time, depending on cache hit rate:

sample: https://buildkite.com/anza/agave/builds/17833 (note that dcou 3/3 ran on machine with no prior dcou run by coincidence):

image

@ryoqun ryoqun requested a review from yihau January 14, 2025 06:11
@ryoqun ryoqun force-pushed the even-dcou-builds-wip branch 4 times, most recently from 8c95636 to c0978e9, January 24, 2025 01:50
@ryoqun ryoqun marked this pull request as ready for review January 24, 2025 06:58
@@ -6,7 +6,7 @@ ARG \
   RUST_NIGHTLY_VERSION= \
   GOLANG_VERSION=1.21.3 \
   NODE_MAJOR=18 \
-  SCCACHE_VERSION=v0.8.1 \
+  SCCACHE_VERSION=v0.9.1 \

@ryoqun ryoqun Jan 24, 2025

this newer version starts to show cache hit rates in the --show-stats output, which is handy to some extent.

Member

bumped the version here: #4655
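For anyone who wants to look at the hit rates mentioned above, here is a minimal sketch of wrapping a build command so the counters are reset before it and printed after it. `with_sccache_stats` is a hypothetical helper name, not something this PR adds; it only relies on sccache's `--zero-stats` and `--show-stats` flags and quietly does nothing extra when sccache isn't installed.

```bash
#!/usr/bin/env bash
# Sketch only: with_sccache_stats is a hypothetical helper, not part of this PR.
# It runs the given command and, when sccache is available, prints the cache
# statistics that the command accumulated.
with_sccache_stats() {
  local status=0
  if command -v sccache >/dev/null 2>&1; then
    # Reset the counters so the stats below cover only this command.
    sccache --zero-stats >/dev/null 2>&1 || true
  fi
  # Run the wrapped command and remember its exit status.
  "$@" || status=$?
  if command -v sccache >/dev/null 2>&1; then
    # Print the hit/miss counters; never let this mask the real exit status.
    sccache --show-stats || true
  fi
  return "$status"
}
```

Usage would look like `with_sccache_stats cargo check`, assuming `RUSTC_WRAPPER` already points at sccache in the environment.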

mkdir -p "$HOME/.cache/sccache-for-docker"
CONTAINER_HOME="/"
ARGS+=(
--volume "$HOME/.cache/sccache-for-docker:$CONTAINER_HOME/.cache/sccache"
Member Author

fyi, /var/lib/buildkite/... didn't work well due to a permission issue...

@yihau yihau Jan 28, 2025

could you dm me the failed build link?

oh, if this one works, then let's use it
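As a rough illustration of the mount discussed in this thread, the sketch below creates a host-side cache directory under the CI user's home and appends a `--volume` argument that maps it onto the container's sccache directory. `add_sccache_volume` is a hypothetical function name, and stripping the trailing slash from the container home is my own addition so that a home of `/` doesn't produce a `//` path; the actual script in the diff may differ.

```bash
#!/usr/bin/env bash
# Sketch only: add_sccache_volume is a hypothetical helper, not the PR's code.
# It appends a docker `--volume` argument to the global ARGS array so that the
# container's sccache directory is backed by a directory on the host.
ARGS=()

add_sccache_volume() {
  local host_cache="$1"      # e.g. "$HOME/.cache/sccache-for-docker"
  local container_home="$2"  # e.g. "/" as in the diff above
  # Create the host directory up front so it is owned by the CI user;
  # otherwise docker would create it as root.
  mkdir -p "$host_cache"
  # Strip a trailing slash so "/" yields "/.cache/sccache", not "//.cache/...".
  ARGS+=(--volume "$host_cache:${container_home%/}/.cache/sccache")
}
```

After calling `add_sccache_volume "$HOME/.cache/sccache-for-docker" "/"`, the array can be passed along as `docker run "${ARGS[@]}" ...`.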

@@ -1,6 +1,29 @@
#!/usr/bin/env bash

set -eo pipefail
source ./ci/_

(unset RUSTC_WRAPPER; cargo install --force --git https://github.com/ryoqun/cargo-hack.git --branch interleaved-partition cargo-hack)
Member Author

Ideally, this should be moved to the Dockerfile temporarily, once anza's official cargo-hack fork is created...

I'm planning to upstream this cargo-hack change once this pr has landed on our master and we can see the improvements. Currently, it's hard to see the effect of this cargo-hack change by itself due to the general inefficiency of caching...
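To make the idea behind the `interleaved-partition` branch name concrete, here is a small sketch of two ways to choose the items that belong to partition M of N. This is only my reading of the branch name, not cargo-hack's actual code: both function names are hypothetical, and the real patch may assign work differently.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical illustrations, not cargo-hack's implementation.
# Usage for both: <function> M N item...   (M is 1-based)

# Contiguous: partition M gets the M-th consecutive chunk of the list.
contiguous_partition() {
  local m="$1" n="$2"; shift 2
  local total=$# size start i=0 item
  size=$(( (total + n - 1) / n ))   # ceil(total / n)
  start=$(( (m - 1) * size ))
  for item in "$@"; do
    if (( i >= start && i < start + size )); then
      printf '%s\n' "$item"
    fi
    i=$(( i + 1 ))
  done
}

# Interleaved: item i (0-based) goes to partition (i mod N) + 1, round-robin.
interleaved_partition() {
  local m="$1" n="$2"; shift 2
  local i=0 item
  for item in "$@"; do
    if (( i % n == m - 1 )); then
      printf '%s\n' "$item"
    fi
    i=$(( i + 1 ))
  done
}
```

With six items `a b c d e f` and N=3, the contiguous scheme gives partition 3/3 the tail `e f`, while the interleaved scheme gives it `c f`. If the expensive builds happen to cluster at one end of the list, round-robin assignment spreads them across all partitions, which is one plausible explanation for why dcou 3/3 stops being the outlier.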

@ryoqun ryoqun changed the title from "wip: ci: Distribute dcou builds more evenly" to "ci: make dcou partitioned steps not-skewed and cached" Jan 24, 2025
@ryoqun ryoqun force-pushed the even-dcou-builds-wip branch 3 times, most recently from 19f777e to 11ab9be, January 24, 2025 15:00

ryoqun commented Jan 27, 2025

@yihau hey, this pr is ready for code-review. could you update our rust docker image with these changes if things look acceptable to merge into the master branch as an experiment? After that, i'll update the pr. Note that if the time reduction isn't so promising, i want to try a local centralized redis server. In the worst case, I'm fine with reverting this pr altogether a week or two after landing if things get worse.

@ryoqun ryoqun force-pushed the even-dcou-builds-wip branch from 5709051 to 0202e1b, January 27, 2025 04:31
# extremely unrealistic for such diverting compilation behaviors to be desired
# as a sane use-case. So, just unset CI_COMMIT unconditionally to increase
# cache efficiency.
unset CI_COMMIT
Member Author

fyi, I put this unset CI_COMMIT here rather than in ./ci/test-dev-context-only-utils.sh, because it is beneficial regardless of whether sccache is used.
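A scoped variant of the same idea, in case it's ever useful outside the CI script: the sketch below runs a single command with CI_COMMIT removed from its environment while leaving the caller's environment untouched. `without_ci_commit` is a hypothetical helper; the PR itself simply runs `unset CI_COMMIT` unconditionally. The underlying assumption is that some build step reads CI_COMMIT, so a value that changes on every commit would make otherwise identical compilations look different to the cache.

```bash
#!/usr/bin/env bash
# Sketch only: without_ci_commit is a hypothetical helper, not the PR's code.
# It runs the given command in a subshell where CI_COMMIT is unset, so the
# per-commit value cannot influence the build, and the caller keeps its own
# CI_COMMIT afterwards.
without_ci_commit() {
  (
    unset CI_COMMIT
    "$@"
  )
}
```

For example, `without_ci_commit cargo build` would build without the variable, and `echo "$CI_COMMIT"` afterwards would still print the original value.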

@ryoqun ryoqun force-pushed the even-dcou-builds-wip branch from d133f8b to 7c33b9a, January 28, 2025 05:17

yihau commented Jan 29, 2025

1 silly question. how much improvement do we get from this PR? I grabbed these 2 builds

the tip of master:
Screenshot 2025-01-29 at 11 01 41

the tip of this PR:
Screenshot 2025-01-29 at 11 02 10

I assume both of them hit the cache. looks like they have similar performance.


ryoqun commented Jan 30, 2025

1 silly question. how much improvement do we get from this PR? I grabbed these 2 builds
...
I assume both of them hit the cache.

Actually, the sampled build at the tip of this pr didn't hit the cache at all (https://buildkite.com/anza/agave/builds/18013#0194a7e5-eedf-4cb8-a8f0-3e0562ead889):

image

@ryoqun ryoqun force-pushed the even-dcou-builds-wip branch from 65c6ee3 to 239a55d, January 30, 2025 00:57

ryoqun commented Jan 30, 2025

Also, note that with gcs, hitting the cache takes more time than a cache miss, as i said elsewhere. that's why this dcou 3/3 build took 15 min 48 secs without this pr, compared to your sample of 12 min 17 secs:

https://buildkite.com/anza/agave/builds/18170#0194b11c-6cd2-4493-a63e-954f63e1e24a

@yihau yihau left a comment

thank you! let's do it 🫡

@ryoqun ryoqun merged commit 7345def into anza-xyz:master Jan 30, 2025
20 checks passed