Arm64: Implement region write barriers #111636
base: main
Conversation
Tagging subscribers to this area: @mangod9
src/coreclr/vm/arm64/patchedcode.S (Outdated)

    beq LOCAL_LABEL(Exit)

    // Update the card table
    // TODO: Is this correct? the AMD64 code is odd.
I think this needs to be lock-free atomic update (CAS). If two threads are setting two different bits in the card table, we need to make sure that one does not overwrite the update done by the other.
Right, I wasn't sure if this needed to be atomic or not.
I also need to switch this to use test instead of cmp, to test for just the single bit.
If we don't use LSE atomics for this, we might see a perf impact. The suggestion would be to check in the write barrier manager whether LSE atomics are present and use the "BIT" version (precise write barrier), otherwise fall back to the "BYTE" version (non-precise write barrier).
Edit: For AOT scenarios, we might be better off using just the BYTE version because we won't know if LSE atomics is available on the target machine.
The atomic bit store without LSE requires an ldaxrb+orr+stlxrb+cbnz loop and an additional temp register. With LSE it can be done with a single stsetb. Agree this should only be done for LSE.
I'll add an LSE check in the GC code on init, and if false then unset region_use_bitwise_write_barrier. Then use LSE for bit write barriers (which I'll have to do in raw hex for it to compile).
I'll add an LSE check in the GC code on init
It can be in the VM, where we have the infrastructure to detect LSE atomics; it does not need to be in the GC code. region_use_bitwise_write_barrier is not a hard requirement - it is ok for the VM to ignore the request to use a bitwise write barrier.
This change is up to a 30% regression in write barrier micro-benchmarks on Cobalt 100: EgorBot/runtime-utils#271 (comment)
I think it would be a good idea to have multiple static versions of the write barrier, to minimize the regression and to provide an option to go to a non-bitwise write barrier like we have on x64.
My hope was to split this work into two pieces: first this PR, and then a second for the multiple versions. But it sounds like the regressions would block this from going in.
If we assume that splitting into multiple versions will get rid of the regressions, then has the current version of this PR shown enough improvements in the GC for it to be a worthwhile change? I think the stats from OrchardCMS do show that, but I'm not sure how significant they are. If so, then I can look at doing the splitting.
If we assume that splitting into multiple versions will get rid of the regressions, then has the current version of this PR shown enough improvements in the GC for it to be a worthwhile change?
I assume we can estimate that locally first?
First this PR, and then a second for the multiple versions.
It probably indeed would be better to start from the splitting; e.g. the current WB has a redundant "is ephemeral" check when Server GC is enabled - I tried to handle it in #106934
I think the #111636 (comment) do show that, but I'm not sure how significant they are.
The improvements for GC pauses indeed look cool and hopefully will have a noticeable impact for certain workloads; however, if I remember correctly, we got some complaints about throughput after the x64 precise write barriers landed (it basically regressed performance in many microbenchmarks, i.e. #74014)
Also, we try to avoid large scale performance regression-improvements zig-zags. They create noise in our performance tracing system that takes extra work to deal with.
Ok, let's avoid committing this as it is then.
if I remember correctly, we got some complaints about throughput after the x64 precise write barriers landed (it basically regressed performance in many microbenchmarks, i.e. #74014)
Was this before x64 added multiple versions?
It probably indeed would be better to start from splitting, e.g. the current WB has a redundant "is ephemeral" checks when Server GC is enabled - I tried to handle it in #106934
Looking at the writebarriermanager code for x64, I think it'll be fairly easy to move all of it into a new file and make it work for Arm64, resulting in the same number of versions on Arm64. The code to edit the constants would just write to the addresses at the end of the function (instead of inline like x64). That would avoid writing "new" functionality, and it'd be usable for other architectures too. Unless there are any reasons for not reusing the writebarriermanager?
I also have a bunch of notes where I rewrote the AMD64 and ARM64 write barrier assembly in pseudo code. I'll tidy them up and add them somewhere in docs/
@a74nh I'm just curious, is this ready for benchmarks? (on linux-arm64)
I think all the failures are fixed up now. So, yes, this would be a good time. If you've got something to run that'd be great. I've been using your
Afair it's not bottle-necked in the write barrier + presumably, your PR is supposed to decrease average GC pause rather than WB throughput? So you might want to look at the GC stats? the
@EgorBot -linux_azure_cobalt100 -linux_azure_ampere -profiler

using BenchmarkDotNet.Attributes;

public class MyBench
{
    object Dst1;
    object Dst2;
    object Dst3;
    object Dst4;
    static object Value = new();

    [Benchmark]
    public void WB_nonephemeral()
    {
        // Write a non-ephemeral reference
        Dst1 = Value;
        Dst2 = Value;
        Dst3 = Value;
        Dst4 = Value;
    }

    [Benchmark]
    public void WB_ephemeral()
    {
        // Write an ephemeral reference
        Dst1 = new object();
    }
}
I guess it's sort of expected that it's slower throughput-wise in microbenchmarks. the

    // Check whether the region we're storing into is gen 0 - nothing to do in this case
    ldrb w12, [x12]
    cbz w12, LOCAL_LABEL(Exit)

(I guess I should've added an extra benchmark where the object we're storing is gen2)
PS: feel free to call the bot yourself if needed
Sorry for the delay. I would run the microbenchmarks with and without this change on the pertinent hardware, on the tests given below, for a sufficient number of iterations (as some of these exhibit a considerable amount of variance). The other consideration while running these is to ensure that the number of GCs is equivalent between the baseline and the comparand - this can be achieved by:
Once the microbenchmarks are run, the pertinent metrics would be the % difference in the time of execution of a test + the standard error of the tests. As a note, the following is the regression that was created because of us moving to a More Precise Write Barrier for x64: #73783 - it seems like one of the affected microbenchmarks is already in the aforementioned list. I remember
As we run the benchmarks, I would pay attention to ephemeral GC pause time, in particular the time spent on marking cards.
Running most of the tests as suggested, I don't see any differences. Everything seems within error margins:
Running Orchard with the tracing enabled....
Two runs using head:
Two runs using the PR:
Those figures look quite a bit better on the PR
Added a file to the docs/ folder with pseudo code for the write barrier function. I'm taking this out of draft now.
    *(g_sw_ww_table + (dst>>11)) = 0xff

    // Return if the reference is not in the heap
I do not think that this is correct. This checks whether the reference is not in the ephemeral generation (i.e. Gen 0).
Perhaps it's worth mentioning, then, that there is a Checked version of the same barrier with "is not on heap" checks.
Fixed the doc, including adding JIT_CheckedWriteBarrier() pseudo code.
@EgorBot -linux_azure_cobalt100
I wonder if the GC can make some automatic decision here to avoid wasted effort.
Extend the Arm64 write barrier function to support regions. The assembly is updated similarly to that for AMD64.
This is expected to make the write barrier slower, but to improve the performance of the GC.