Arm64: Implement region write barriers #111636
base: main
Conversation
Tagging subscribers to this area: @mangod9
src/coreclr/vm/arm64/patchedcode.S (Outdated)

    beq LOCAL_LABEL(Exit)

    // Update the card table
    // TODO: Is this correct? the AMD64 code is odd.
I think this needs to be lock-free atomic update (CAS). If two threads are setting two different bits in the card table, we need to make sure that one does not overwrite the update done by the other.
Right, I wasn't sure if this needed to be atomic or not.
I also need to switch this to use test instead of cmp, to test for just the single bit.
If we don't use LSE atomics for this, we might see a perf impact. The suggestion would be to check in the write barrier manager whether LSE atomics are present and use the "BIT" version (precise write barrier), otherwise fall back to the "BYTE" version (non-precise write barrier).
Edit: For AOT scenarios, we might be better off using just the BYTE version because we won't know if LSE atomics is available on the target machine.
The atomic bit store without LSE requires an ldaxrb+orr+stlxrb+cbnz loop and an additional temp register. With LSE it can be done with a single stsetb. Agree this should only be done for LSE.
I'll add an LSE check in the GC code on init, and if false then unset region_use_bitwise_write_barrier. Then use LSE for bit write barriers (which I'll have to do in raw hex for it to compile).
I'll add an LSE check in the GC code on init
It can be in the VM, where we have the infrastructure to detect LSE atomics; it does not need to be in the GC code. region_use_bitwise_write_barrier is not a hard requirement - it is ok for the VM to ignore the request to use a bitwise write barrier.
This change is up to a 30% regression in write barrier micro-benchmarks on Cobalt 100: EgorBot/runtime-utils#271 (comment)
I think it would be a good idea to have multiple static versions of the write barrier, to minimize the regression and to provide an option to go to a non-bitwise write barrier like we have on x64.
My hope was to split this work into two pieces: first this PR, and then a second for the multiple versions. But it sounds like the regressions would block this from going in.
If we assume that splitting into multiple versions will get rid of the regressions, then has the current version of this PR shown enough improvements in the GC for it to be a worthwhile change? I think the stats from OrchardCMS do show that, but I'm not sure how significant they are. If so, then I can look at doing the splitting.
If we assume that splitting into multiple versions will get rid of the regressions, then has the current version of this PR shown enough improvements in the GC for it to be a worthwhile change?
I assume we can estimate that locally first?
First this PR, and then a second for the multiple versions.
It probably indeed would be better to start from the splitting; e.g. the current WB has a redundant "is ephemeral" check when Server GC is enabled - I tried to handle it in #106934
I think the #111636 (comment) do show that, but I'm not sure how significant they are.
The improvements for GC pauses indeed look cool and hopefully will have a noticeable impact for certain workloads; however, if I remember correctly, we got some complaints about throughput after the x64 precise write barriers landed (it basically regressed performance in many microbenchmarks, i.e. #74014)
Also, we try to avoid large scale performance regression-improvements zig-zags. They create noise in our performance tracing system that takes extra work to deal with.
Ok, let's avoid committing this as it is then.
if I remember correctly, we got some complaints about throughput after the x64 precise write barriers landed (it basically regressed performance in many microbenchmarks, i.e. #74014)
Was this before x64 added multiple versions?
It probably indeed would be better to start from splitting, e.g. the current WB has a redundant "is ephemeral" checks when Server GC is enabled - I tried to handle it in #106934
Looking at the writebarriermanager code for x64, I think it'll be fairly easy to move all of it into a new file and make it work for Arm64, resulting in the same number of versions on Arm64. The code to edit the constants would just write to the addresses at the end of the function (instead of inline like x64). That would avoid writing "new" functionality, and it'd be usable for other architectures too. Unless there are any reasons for not reusing the writebarriermanager?
I also have a bunch of notes where I rewrote the AMD64 and ARM64 write barrier assembly in pseudo code. I'll tidy them up and add them somewhere in docs/
@a74nh I'm just curious, is this ready for benchmarks? (on linux-arm64)
I think all the failures are fixed up now. So, yes, this would be a good time. If you've got something to run that'd be great. I've been using your
Afair it's not bottle-necked in the write barrier + presumably, your PR is supposed to decrease average GC pause rather than WB throughput? So you might want to look at the GC stats? the
@EgorBot -linux_azure_cobalt100 -linux_azure_ampere -profiler

using BenchmarkDotNet.Attributes;

public class MyBench
{
    object Dst1;
    object Dst2;
    object Dst3;
    object Dst4;
    static object Value = new();

    [Benchmark]
    public void WB_nonephemeral()
    {
        // Write a non-ephemeral reference
        Dst1 = Value;
        Dst2 = Value;
        Dst3 = Value;
        Dst4 = Value;
    }

    [Benchmark]
    public void WB_ephemeral()
    {
        // Write an ephemeral reference
        Dst1 = new object();
    }
}
I guess it's sort of expected that it's slower throughput-wise in microbenchmarks. the

    // Check whether the region we're storing into is gen 0 - nothing to do in this case
    ldrb w12, [x12]
    cbz w12, LOCAL_LABEL(Exit)

(I guess I should've added an extra benchmark where the object we're storing is gen2)
PS: feel free to call the bot yourself if needed
Sorry for the delay. I would run the microbenchmarks with and without this change on the pertinent hardware, on the tests given below, for a sufficient number of iterations (as some of these exhibit a considerable amount of variance). The other consideration while running these is to ensure that the number of GCs is equivalent between the baseline and the comparand - this can be achieved by:
Once the microbenchmarks are run, the pertinent metrics would be the % difference in the time of execution of a test + the standard error of the tests. As a note, the following is the regression that was created because of us moving to a More Precise Write Barrier for x64: #73783 - it seems like one of the affected microbenchmarks is already in the aforementioned list. I remember
As we run the benchmarks, I would pay attention to ephemeral GC pause time, in particular the time spent on marking cards.
Running most of the tests as suggested, I don't see any differences. Everything seems within error margins:
Running Orchard with the tracing enabled....
Two runs using head:
Two runs using the PR:
Those figures look quite a bit better on the PR
Added a file to the docs/ folder with pseudo code for the write barrier function. I'm taking this out of draft now.
    *(g_sw_ww_table + (dst>>11)) = 0xff

    // Return if the reference is not in the heap
I do not think that this is correct. This checks whether the reference is not in the ephemeral generation (i.e. Gen 0).
Perhaps it's worth mentioning, then, that there is a Checked version of the same barrier with "is not on heap" checks.
Fixed the doc, including adding JIT_CheckedWriteBarrier() pseudo code.
@EgorBot -linux_azure_cobalt100
I wonder if the GC can make some automatic decision here to avoid wasted effort.
Extend the Arm64 write barrier function to support regions. The assembly is updated similarly to that for AMD64.
This is expected to make the write barrier slower, but to improve the performance of the GC.