Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bootstrap: Build jemalloc with support for 64K pages #135081

Merged
merged 1 commit into from
Jan 7, 2025

Conversation

mrkajetanp
Copy link
Contributor

@mrkajetanp mrkajetanp commented Jan 3, 2025

By default, jemalloc is built to only support the same page size as the host machine. Set an env variable so that jemalloc is built with support for page sizes up to 64K regardless of the host machine.

r? @Kobzol

Resolves #134563
Potentially resolves #133748 (needs verification)


Results from local rustc-perf testing below, within 0.5% on every metric except max-rss.
AArch64:
Screenshot 2025-01-03 at 5 53 13 pm
x86_64:
Screenshot 2025-01-03 at 5 54 16 pm

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) labels Jan 3, 2025
@Kobzol
Copy link
Contributor

Kobzol commented Jan 3, 2025

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jan 3, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 3, 2025
bootstrap: Build jemalloc with support for 64K pages

By default, jemalloc is built to only support the same page size as the host machine. Set an env variable so that jemalloc is built with support for page sizes up to 64K regardless of the host machine.

r? `@Kobzol`

Resolves rust-lang#134563
Potentially resolves rust-lang#133748 (needs verification)

----

Results from local rustc-perf testing below, within 0.5% on every metric except max-rss.
AArch64:
![Screenshot 2025-01-03 at 5 53 13 pm](https://github.com/user-attachments/assets/71705c59-7d7b-4753-a184-8c784233e603)
x86_64:
![Screenshot 2025-01-03 at 5 54 16 pm](https://github.com/user-attachments/assets/ea28aded-3b90-43f4-a965-b081b07b95ab)
@bors
Copy link
Contributor

bors commented Jan 3, 2025

⌛ Trying commit c0a1eb0 with merge 4402a4e...

@jieyouxu
Copy link
Member

jieyouxu commented Jan 3, 2025

Is it remotely possible to have some kind of regression test for this?

@bors
Copy link
Contributor

bors commented Jan 3, 2025

☀️ Try build successful - checks-actions
Build commit: 4402a4e (4402a4e67f6dca0636f58978a204b779123b0360)

@rust-timer

This comment has been minimized.

@Mark-Simulacrum
Copy link
Member

If we see a similar max-rss regression, I wonder if it would be worth comparing against a mimalloc compilation -- if that doesn't have the same 64k/4k page size issues, iirc the max-rss regression from switching to it was of similar magnitude?

@Kobzol
Copy link
Contributor

Kobzol commented Jan 3, 2025

I'm just running a mimalloc benchmark here :)

If we see max RSS regressions here, I would just do the 64 KiB switch only on aarch64.

@Mark-Simulacrum
Copy link
Member

Mark-Simulacrum commented Jan 3, 2025

Doesn't that "unfairly" hurt non-64k page sizes on other platforms (i.e., those with a 4k page size but still aarch64)? Also, should we try 16k pages as the limit in jemalloc (I guess that's what we see in practice if I'm reading the issue reports right?)

@mrkajetanp
Copy link
Contributor Author

Also, should we try 16k pages as the limit in jemalloc (I guess that's what we see in practice if I'm reading the issue reports right?)

If we do that we'll end up with similar reports for 64K page systems soon enough, they're just not very common at the moment. Ubuntu is already shipping 64K page AArch64 server images for instance.

@Kobzol
Copy link
Contributor

Kobzol commented Jan 3, 2025

Doesn't that "unfairly" hurt non-64k page sizes on other platforms (i.e., those with a 4k page size but still aarch64)?

It probably does, but if I understood the jemalloc issue, it's kind of how jemalloc is set up to work. So the decision is whether to support all page sizes or to have slightly better perf for 4k pages.

For aarch64, I'd personally use the general solution, but for x64, I'd stick with 4k, since we haven't received complaints about that so far. But we'll see how perf. turns out.

@mrkajetanp
Copy link
Contributor Author

Yeah that's the sensible approach I'd say. Iiuc x86 doesn't do 4K+ pages so this may not be an issue for x86 in the same way.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (4402a4e): comparison URL.

Overall result: ❌✅ regressions and improvements - please read the text below

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

mean range count
Regressions ❌
(primary)
0.3% [0.2%, 0.7%] 29
Regressions ❌
(secondary)
0.4% [0.1%, 1.5%] 61
Improvements ✅
(primary)
-0.3% [-0.6%, -0.1%] 43
Improvements ✅
(secondary)
-0.3% [-0.7%, -0.1%] 16
All ❌✅ (primary) -0.0% [-0.6%, 0.7%] 72

Max RSS (memory usage)

Results (primary 4.9%, secondary 4.8%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
4.9% [1.0%, 12.8%] 244
Regressions ❌
(secondary)
5.2% [0.9%, 15.1%] 207
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-3.7% [-8.9%, -1.3%] 9
All ❌✅ (primary) 4.9% [1.0%, 12.8%] 244

Cycles

Results (primary -1.0%, secondary 1.2%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
2.8% [1.5%, 3.9%] 19
Improvements ✅
(primary)
-1.0% [-1.5%, -0.7%] 8
Improvements ✅
(secondary)
-6.3% [-9.3%, -3.0%] 4
All ❌✅ (primary) -1.0% [-1.5%, -0.7%] 8

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 762.379s -> 759.643s (-0.36%)
Artifact size: 325.63 MiB -> 325.03 MiB (-0.18%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Jan 3, 2025
@Mark-Simulacrum
Copy link
Member

The binary size wins feel... surprising? Why is a larger page size so much smaller?

In any case, the regressions feel pretty reasonable -- from what I can tell they're mostly in smaller rss workloads, where the jump is likely more significant. I also wonder if we're seeing poor usage of the allocator's internal storage due to rustc_arena's preference for 4kb pages (

const PAGE: usize = 4096;
) -- maybe with jemalloc configured to 64k that needs to also be bumped up? I'd guess that the thresholds for falling back to mmap change the larger the page bucket is?

In any case, I'd be happy to land this personally; mimalloc looks worse to me in the results just gathered.

r=me but will leave this for a bit in case others have thoughts :)

@Kobzol
Copy link
Contributor

Kobzol commented Jan 4, 2025

Regarding the binary size: I did a few experiments with jemalloc and found out that some global variables (like arena_emap_global) become much smaller with a larger page size. I think that it's caused by the fact that the page index has less entries, because each page is larger, which makes sense.

Could you try to bump the page size of rustc_arena to 64 KiB to see the perf. effect, just out of curiosity?

Other than that, the results aren't terrible. Messing with page size can have large effects on performance (https://kobzol.github.io/rust/rustc/2023/10/21/make-rust-compiler-5percent-faster.html), typically at the cost of max RSS. I would still only enable this for aarch64 to keep the status quo though.

@Noratrieb
Copy link
Member

I think we should not do this on x86-64 and other non-aarch64 targets. The Max RSS regressions are not nice, but on aarch64 we need to do, so we need to accept them. But for other targets, I don't see a reason to.

@mrkajetanp
Copy link
Contributor Author

Restricted to AArch64 & now picks the value up from the environment if one is already set.

@rust-log-analyzer

This comment has been minimized.

@Kobzol
Copy link
Contributor

Kobzol commented Jan 6, 2025

Thanks! Let's check that x64 is unaffected. I'll do the rustc-arena 64 KiB page test in a different PR.

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jan 6, 2025
@bors
Copy link
Contributor

bors commented Jan 6, 2025

⌛ Trying commit 83ff26d with merge 5917c7b...

bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 6, 2025
bootstrap: Build jemalloc with support for 64K pages

By default, jemalloc is built to only support the same page size as the host machine. Set an env variable so that jemalloc is built with support for page sizes up to 64K regardless of the host machine.

r? `@Kobzol`

Resolves rust-lang#134563
Potentially resolves rust-lang#133748 (needs verification)

----

Results from local rustc-perf testing below, within 0.5% on every metric except max-rss.
AArch64:
![Screenshot 2025-01-03 at 5 53 13 pm](https://github.com/user-attachments/assets/71705c59-7d7b-4753-a184-8c784233e603)
x86_64:
![Screenshot 2025-01-03 at 5 54 16 pm](https://github.com/user-attachments/assets/ea28aded-3b90-43f4-a965-b081b07b95ab)
@bors
Copy link
Contributor

bors commented Jan 6, 2025

☀️ Try build successful - checks-actions
Build commit: 5917c7b (5917c7b5c110267b8752f94554baf42831f98128)

@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (5917c7b): comparison URL.

Overall result: ❌ regressions - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
0.4% [0.3%, 0.4%] 2
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) - - 0

Max RSS (memory usage)

Results (primary -0.7%, secondary -2.5%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
1.1% [1.1%, 1.1%] 1
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-2.6% [-2.6%, -2.6%] 1
Improvements ✅
(secondary)
-2.5% [-2.9%, -2.1%] 2
All ❌✅ (primary) -0.7% [-2.6%, 1.1%] 2

Cycles

Results (primary 1.0%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
1.0% [1.0%, 1.0%] 1
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 1.0% [1.0%, 1.0%] 1

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 763.649s -> 764.164s (0.07%)
Artifact size: 325.68 MiB -> 325.71 MiB (0.01%)

@rustbot rustbot removed S-waiting-on-perf Status: Waiting on a perf run to be completed. perf-regression Performance regression. labels Jan 6, 2025
@Kobzol
Copy link
Contributor

Kobzol commented Jan 6, 2025

Good!

Sorry, realized one more thing - could you please add a comment on top of the added if condition to explain why is it there (with a link to this PR or the original issue)? Bootstrap is full of one-off hacks like this, and it's always a good idea to at least document why is some custom code there.

By default, jemalloc is built to only support the same page size as the
host machine. For AArch64 targets, set an env variable so that jemalloc
is built with support for page sizes up to 64K regardless of the host machine.
@mrkajetanp
Copy link
Contributor Author

No problem, done

@Kobzol
Copy link
Contributor

Kobzol commented Jan 7, 2025

Thanks! We can try to reapply the ARM dist builder PRs after this change has landed.

@bors r+ rollup

@bors
Copy link
Contributor

bors commented Jan 7, 2025

📌 Commit 53a5857 has been approved by Kobzol

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 7, 2025
@mrkajetanp
Copy link
Contributor Author

We can try to reapply the ARM dist builder PRs after this change has landed.

Indeed :) should I repost it as a new PR or can bors just reapply it from the old one?

@Kobzol
Copy link
Contributor

Kobzol commented Jan 7, 2025

We'll need two new PRs (one for the dist-arm runner and other for the arm64 optimized runner).

bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 7, 2025
…llaumeGomez

Rollup of 9 pull requests

Successful merges:

 - rust-lang#135081 (bootstrap: Build jemalloc with support for 64K pages)
 - rust-lang#135174 ([AIX] Port test case run-make/reproducible-build )
 - rust-lang#135177 (llvm: Ignore error value that is always false)
 - rust-lang#135182 (Transmute from NonNull to pointer when elaborating a box deref (MCP807))
 - rust-lang#135187 (apply a workaround fix for the release roadblock)
 - rust-lang#135189 (Remove workaround from pull request template)
 - rust-lang#135193 (don't bless `proc_macro_deps.rs` unless it's necessary)
 - rust-lang#135198 (Avoid naming variables `str`)
 - rust-lang#135199 (Eliminate an unnecessary `Symbol::to_string`; use `as_str`)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 2b97db2 into rust-lang:master Jan 7, 2025
6 checks passed
@rustbot rustbot added this to the 1.86.0 milestone Jan 7, 2025
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Jan 7, 2025
Rollup merge of rust-lang#135081 - mrkajetanp:jemalloc-64k, r=Kobzol

bootstrap: Build jemalloc with support for 64K pages

By default, jemalloc is built to only support the same page size as the host machine. Set an env variable so that jemalloc is built with support for page sizes up to 64K regardless of the host machine.

r? `@Kobzol`

Resolves rust-lang#134563
Potentially resolves rust-lang#133748 (needs verification)

----

Results from local rustc-perf testing below, within 0.5% on every metric except max-rss.
AArch64:
![Screenshot 2025-01-03 at 5 53 13 pm](https://github.com/user-attachments/assets/71705c59-7d7b-4753-a184-8c784233e603)
x86_64:
![Screenshot 2025-01-03 at 5 54 16 pm](https://github.com/user-attachments/assets/ea28aded-3b90-43f4-a965-b081b07b95ab)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap)
Projects
None yet
10 participants