Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arm64: Implement region write barriers #111636
base: main
Are you sure you want to change the base?
Arm64: Implement region write barriers #111636
Changes from 3 commits
db6c2cf
5de7e0f
9315aa1
d0e46f0
cb83f53
c615772
1c865f1
15dde1b
b5b28ce
c27d516
d014417
6068fb0
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be lock-free atomic update (CAS). If two threads are setting two different bits in the card table, we need to make sure that one does not overwrite the update done by the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I wasn't sure if this needed to be atomic or not.
I also need to switch this to use
test
instead ofcmp
, to test for just the single bit.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't use LSE atomics for this, we might see perf impact. The suggestion would be to check if LSE atomics is present in write barrier manager and use "BIT" version (precise write barrier) otherwise fallback to "BYTE" version (non-precise write barrier).
Edit: For AOT scenarios, we might be better off using just the BYTE version because we won't know if LSE atomics is available on target machine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The atomic bit store without LSE requires a
ldaxrb+orr+stlxrb+cbnz
loop and an additional temp register. With LSE it can be done with a singlestsetb
. Agree this should only be done for LSE.I'll add an LSE check in the GC code on init, and if false then unset
region_use_bitwise_write_barrier
. Then use LSE for bit write barriers (which I'll have to do in raw hex for it to compile)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be in the VM where we have the infrastructure to detect LSE atomics, it does not need to be in the GC code.
region_use_bitwise_write_barrier
is not a hard requirement - it is ok for VM to ignore the request to use bitwise write barrier.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is up to 30% regression in write barrier micro-benchmarks on Cobalt 100: EgorBot/runtime-utils#271 (comment)
I think it would be a good idea to have multiple static versions of the write barrier to minimize the regression and to provide option to go to non-bitwise write barrier like we have on x64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My hope was to split this work into two pieces. First this PR, and then a second for the multiple versions. But, it sounds like the regressions would block this from going in.
If we assume that splitting into multiple versions will get rid of the regressions, then has the current version of this PR shown enough improvements in the GC for it to be a worthwhile change? I think the stats from OrchardCMS do show that, but I'm not sure how significant they are. If so, then I can look at doing the splitting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we can estimate that locally first?
It probably indeed would be better to start from splitting, e.g. the current WB has a redundant "is ephemeral" checks when Server GC is enabled - I tried to handle it in #106934
The improvements for GC pauses indeed look cool and hopefully will have a noticeable impact for certain workloads, however, if I remember correctly, we got some complains on throughput after the x64 precise write barriers landed (it basically regressed performance in many microbenchmarks, i.e. #74014)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we try to avoid large scale performance regression-improvements zig-zags. They create noise in our performance tracing system that takes extra work to deal with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's avoid committing this as it is then.
Was this before x64 added multiple versions?
Looking at the writebarriermanager code for x64, I think it'll be fairly easy to move all of it into a new file and make it work for Arm64. Resulting in the same number of versions on Arm64. The code to edit the constants would just write to the addresses at the end of the function (instead of inline like x64). That would avoid writing "new" functionality, and it'd be useable for other architectures too. Unless there are any reasons for not reusing the writebarriermanager?