bootstrap: Build jemalloc with support for 64K pages #135081

mrkajetanp · 2025-01-03T18:10:45Z

By default, jemalloc is built to only support the same page size as the host machine. Set an env variable so that jemalloc is built with support for page sizes up to 64K regardless of the host machine.

r? @Kobzol

Resolves #134563
Potentially resolves #133748 (needs verification)

Results from local rustc-perf testing below, within 0.5% on every metric except max-rss.
AArch64:

x86_64:

Kobzol · 2025-01-03T18:16:37Z

@bors try @rust-timer queue

bootstrap: Build jemalloc with support for 64K pages By default, jemalloc is built to only support the same page size as the host machine. Set an env variable so that jemalloc is built with support for page sizes up to 64K regardless of the host machine. r? `@Kobzol` Resolves rust-lang#134563 Potentially resolves rust-lang#133748 (needs verification) ---- Results from local rustc-perf testing below, within 0.5% on every metric except max-rss. AArch64: ![Screenshot 2025-01-03 at 5 53 13 pm](https://github.com/user-attachments/assets/71705c59-7d7b-4753-a184-8c784233e603) x86_64: ![Screenshot 2025-01-03 at 5 54 16 pm](https://github.com/user-attachments/assets/ea28aded-3b90-43f4-a965-b081b07b95ab)

bors · 2025-01-03T18:17:48Z

⌛ Trying commit c0a1eb0 with merge 4402a4e...

jieyouxu · 2025-01-03T18:29:39Z

Is it remotely possible to have some kind of regression test for this?

bors · 2025-01-03T20:06:39Z

☀️ Try build successful - checks-actions
Build commit: 4402a4e (4402a4e67f6dca0636f58978a204b779123b0360)

Mark-Simulacrum · 2025-01-03T20:45:10Z

If we see a similar max-rss regression, I wonder if it would be worth comparing against a mimalloc compilation -- if that doesn't have the same 64k/4k page size issues, iirc the max-rss regression from switching to it was of similar magnitude?

Kobzol · 2025-01-03T20:54:04Z

I'm just running a mimalloc benchmark here :)

If we see max RSS regressions here, I would just do the 64 KiB switch only on aarch64.

Mark-Simulacrum · 2025-01-03T21:13:05Z

Doesn't that "unfairly" hurt non-64k page sizes on other platforms (i.e., those with a 4k page size but still aarch64)? Also, should we try 16k pages as the limit in jemalloc (I guess that's what we see in practice if I'm reading the issue reports right?)

mrkajetanp · 2025-01-03T21:17:49Z

Also, should we try 16k pages as the limit in jemalloc (I guess that's what we see in practice if I'm reading the issue reports right?)

If we do that we'll end up with similar reports for 64K page systems soon enough, they're just not very common at the moment. Ubuntu is already shipping 64K page AArch64 server images for instance.

Kobzol · 2025-01-03T21:24:17Z

Doesn't that "unfairly" hurt non-64k page sizes on other platforms (i.e., those with a 4k page size but still aarch64)?

It probably does, but if I understood the jemalloc issue, it's kind of how jemalloc is set up to work. So the decision is whether to support all page sizes or to have slightly better perf for 4k pages.

For aarch64, I'd personally use the general solution, but for x64, I'd stick with 4k, since we haven't received complaints about that so far. But we'll see how perf. turns out.

mrkajetanp · 2025-01-03T21:27:58Z

Yeah that's the sensible approach I'd say. Iiuc x86 doesn't do 4K+ pages so this may not be an issue for x86 in the same way.

rust-timer · 2025-01-03T23:51:19Z

Finished benchmarking commit (4402a4e): comparison URL.

Overall result: ❌✅ regressions and improvements - please read the text below

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

	mean	range	count
Regressions ❌ (primary)	0.3%	[0.2%, 0.7%]	29
Regressions ❌ (secondary)	0.4%	[0.1%, 1.5%]	61
Improvements ✅ (primary)	-0.3%	[-0.6%, -0.1%]	43
Improvements ✅ (secondary)	-0.3%	[-0.7%, -0.1%]	16
All ❌✅ (primary)	-0.0%	[-0.6%, 0.7%]	72

Max RSS (memory usage)

Results (primary 4.9%, secondary 4.8%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	4.9%	[1.0%, 12.8%]	244
Regressions ❌ (secondary)	5.2%	[0.9%, 15.1%]	207
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-3.7%	[-8.9%, -1.3%]	9
All ❌✅ (primary)	4.9%	[1.0%, 12.8%]	244

Cycles

Results (primary -1.0%, secondary 1.2%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	2.8%	[1.5%, 3.9%]	19
Improvements ✅ (primary)	-1.0%	[-1.5%, -0.7%]	8
Improvements ✅ (secondary)	-6.3%	[-9.3%, -3.0%]	4
All ❌✅ (primary)	-1.0%	[-1.5%, -0.7%]	8

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 762.379s -> 759.643s (-0.36%)
Artifact size: 325.63 MiB -> 325.03 MiB (-0.18%)

Mark-Simulacrum · 2025-01-04T02:56:28Z

The binary size wins feel... surprising? Why is a larger page size so much smaller?

In any case, the regressions feel pretty reasonable -- from what I can tell they're mostly in smaller rss workloads, where the jump is likely more significant. I also wonder if we're seeing poor usage of the allocator's internal storage due to rustc_arena's preference for 4kb pages (

rust/compiler/rustc_arena/src/lib.rs

Line 111 in 8d2c06d

const PAGE: usize = 4096;

) -- maybe with jemalloc configured to 64k that needs to also be bumped up? I'd guess that the thresholds for falling back to mmap change the larger the page bucket is?

In any case, I'd be happy to land this personally; mimalloc looks worse to me in the results just gathered.

r=me but will leave this for a bit in case others have thoughts :)

src/bootstrap/src/core/build_steps/compile.rs

Kobzol · 2025-01-04T15:34:20Z

Regarding the binary size: I did a few experiments with jemalloc and found out that some global variables (like arena_emap_global) become much smaller with a larger page size. I think that it's caused by the fact that the page index has less entries, because each page is larger, which makes sense.

Could you try to bump the page size of rustc_arena to 64 KiB to see the perf. effect, just out of curiosity?

Other than that, the results aren't terrible. Messing with page size can have large effects on performance (https://kobzol.github.io/rust/rustc/2023/10/21/make-rust-compiler-5percent-faster.html), typically at the cost of max RSS. I would still only enable this for aarch64 to keep the status quo though.

Noratrieb · 2025-01-05T14:33:57Z

I think we should not do this on x86-64 and other non-aarch64 targets. The Max RSS regressions are not nice, but on aarch64 we need to do, so we need to accept them. But for other targets, I don't see a reason to.

mrkajetanp · 2025-01-06T13:14:37Z

Restricted to AArch64 & now picks the value up from the environment if one is already set.

Kobzol · 2025-01-06T16:34:06Z

Thanks! Let's check that x64 is unaffected. I'll do the rustc-arena 64 KiB page test in a different PR.

@bors try @rust-timer queue

bors · 2025-01-06T16:35:19Z

⌛ Trying commit 83ff26d with merge 5917c7b...

bootstrap: Build jemalloc with support for 64K pages By default, jemalloc is built to only support the same page size as the host machine. Set an env variable so that jemalloc is built with support for page sizes up to 64K regardless of the host machine. r? `@Kobzol` Resolves rust-lang#134563 Potentially resolves rust-lang#133748 (needs verification) ---- Results from local rustc-perf testing below, within 0.5% on every metric except max-rss. AArch64: ![Screenshot 2025-01-03 at 5 53 13 pm](https://github.com/user-attachments/assets/71705c59-7d7b-4753-a184-8c784233e603) x86_64: ![Screenshot 2025-01-03 at 5 54 16 pm](https://github.com/user-attachments/assets/ea28aded-3b90-43f4-a965-b081b07b95ab)

bors · 2025-01-06T18:22:54Z

☀️ Try build successful - checks-actions
Build commit: 5917c7b (5917c7b5c110267b8752f94554baf42831f98128)

rust-timer · 2025-01-06T20:54:23Z

Finished benchmarking commit (5917c7b): comparison URL.

Overall result: ❌ regressions - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	0.4%	[0.3%, 0.4%]	2
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-	-	0

Max RSS (memory usage)

Results (primary -0.7%, secondary -2.5%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	1.1%	[1.1%, 1.1%]	1
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-2.6%	[-2.6%, -2.6%]	1
Improvements ✅ (secondary)	-2.5%	[-2.9%, -2.1%]	2
All ❌✅ (primary)	-0.7%	[-2.6%, 1.1%]	2

Cycles

Results (primary 1.0%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	1.0%	[1.0%, 1.0%]	1
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	1.0%	[1.0%, 1.0%]	1

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 763.649s -> 764.164s (0.07%)
Artifact size: 325.68 MiB -> 325.71 MiB (0.01%)

Kobzol · 2025-01-06T20:56:21Z

Good!

Sorry, realized one more thing - could you please add a comment on top of the added if condition to explain why is it there (with a link to this PR or the original issue)? Bootstrap is full of one-off hacks like this, and it's always a good idea to at least document why is some custom code there.

By default, jemalloc is built to only support the same page size as the host machine. For AArch64 targets, set an env variable so that jemalloc is built with support for page sizes up to 64K regardless of the host machine.

mrkajetanp · 2025-01-07T11:26:37Z

No problem, done

Kobzol · 2025-01-07T11:51:20Z

Thanks! We can try to reapply the ARM dist builder PRs after this change has landed.

@bors r+ rollup

bors · 2025-01-07T11:51:23Z

📌 Commit 53a5857 has been approved by Kobzol

It is now in the queue for this repository.

mrkajetanp · 2025-01-07T11:52:43Z

We can try to reapply the ARM dist builder PRs after this change has landed.

Indeed :) should I repost it as a new PR or can bors just reapply it from the old one?

Kobzol · 2025-01-07T12:01:32Z

We'll need two new PRs (one for the dist-arm runner and other for the arm64 optimized runner).

…llaumeGomez Rollup of 9 pull requests Successful merges: - rust-lang#135081 (bootstrap: Build jemalloc with support for 64K pages) - rust-lang#135174 ([AIX] Port test case run-make/reproducible-build ) - rust-lang#135177 (llvm: Ignore error value that is always false) - rust-lang#135182 (Transmute from NonNull to pointer when elaborating a box deref (MCP807)) - rust-lang#135187 (apply a workaround fix for the release roadblock) - rust-lang#135189 (Remove workaround from pull request template) - rust-lang#135193 (don't bless `proc_macro_deps.rs` unless it's necessary) - rust-lang#135198 (Avoid naming variables `str`) - rust-lang#135199 (Eliminate an unnecessary `Symbol::to_string`; use `as_str`) r? `@ghost` `@rustbot` modify labels: rollup

Rollup merge of rust-lang#135081 - mrkajetanp:jemalloc-64k, r=Kobzol bootstrap: Build jemalloc with support for 64K pages By default, jemalloc is built to only support the same page size as the host machine. Set an env variable so that jemalloc is built with support for page sizes up to 64K regardless of the host machine. r? `@Kobzol` Resolves rust-lang#134563 Potentially resolves rust-lang#133748 (needs verification) ---- Results from local rustc-perf testing below, within 0.5% on every metric except max-rss. AArch64: ![Screenshot 2025-01-03 at 5 53 13 pm](https://github.com/user-attachments/assets/71705c59-7d7b-4753-a184-8c784233e603) x86_64: ![Screenshot 2025-01-03 at 5 54 16 pm](https://github.com/user-attachments/assets/ea28aded-3b90-43f4-a965-b081b07b95ab)

rustbot assigned Kobzol Jan 3, 2025

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) labels Jan 3, 2025