Arm64: Implement region write barriers #111636

Open: 12 commits to be merged into main

Contributor

a74nh commented Jan 20, 2025

Extend the Arm64 write barrier function to support regions. The assembly is updated along the same lines as the AMD64 version.

This is expected to make the write barrier itself slower, but to improve the performance of the GC.

dotnet-policy-service bot added the community-contribution label (indicates that the PR has been added by a community member) Jan 20, 2025

Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

beq LOCAL_LABEL(Exit)

// Update the card table
// TODO: Is this correct? the AMD64 code is odd.

Member

I think this needs to be a lock-free atomic update (CAS). If two threads are setting two different bits in the card table, we need to make sure that one does not overwrite the update done by the other.
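
To make the lost-update concern concrete, here is a minimal C++ sketch. It is not the runtime's code: `CardByte`, `setCardBitRacy`, and `setCardBitAtomic` are hypothetical names, and it assumes one card byte whose eight bits each cover a separate card. The default (sequentially consistent) memory ordering is a conservative assumption, not a statement about what the barrier requires.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical card byte: each of its 8 bits covers one card.
using CardByte = std::atomic<std::uint8_t>;

// Racy variant: separate load and store. Two threads setting different bits
// of the same byte can both read the old value, and whichever store lands
// last silently drops the other thread's bit.
inline void setCardBitRacy(CardByte& card, unsigned bit) {
    assert(bit < 8);
    const std::uint8_t old = card.load();
    card.store(static_cast<std::uint8_t>(old | (1u << bit)));
}

// Atomic variant: fetch_or is one indivisible read-modify-write, so
// concurrent updates to different bits of the same byte are all preserved.
inline void setCardBitAtomic(CardByte& card, unsigned bit) {
    assert(bit < 8);
    const std::uint8_t mask = static_cast<std::uint8_t>(1u << bit);
    // Only issue the atomic when the bit is still clear; a card that is
    // already dirty needs no write at all.
    if ((card.load() & mask) == 0) {
        card.fetch_or(mask);
    }
}
```

With byte-granular cards the plain store is harmless, because every writer stores the same value to its own byte. The race only appears once several cards share a byte.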

Contributor Author

Right, I wasn't sure if this needed to be atomic or not.

I also need to switch this to use test instead of cmp, to test for just the single bit.

Member

kunalspathak commented Jan 21, 2025

Without LSE atomics this could hurt performance. My suggestion is to have the write barrier manager check whether LSE atomics are available and, if so, use the "BIT" version (precise write barrier); otherwise fall back to the "BYTE" version (non-precise write barrier).

Edit: For AOT scenarios, we might be better off using just the BYTE version, because we won't know whether LSE atomics are available on the target machine.

Contributor Author

The atomic bit store without LSE requires a ldaxrb+orr+stlxrb+cbnz loop and an additional temp register. With LSE it can be done with a single stsetb. Agree this should only be done for LSE.

I'll add an LSE check in the GC code on init, and if false then unset region_use_bitwise_write_barrier. Then use LSE for bit write barriers (which I'll have to do in raw hex for it to compile)

Member

> I'll add an LSE check in the GC code on init

It can be in the VM where we have the infrastructure to detect LSE atomics, it does not need to be in the GC code. region_use_bitwise_write_barrier is not a hard requirement - it is ok for VM to ignore the request to use bitwise write barrier.
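
The selection policy discussed in this thread could be sketched as follows. This is an illustration under stated assumptions, not the VM's implementation: `BarrierInputs`, `CardMarking`, and `chooseCardMarking` are hypothetical names, `gcRequestsBitwise` stands in for region_use_bitwise_write_barrier, and the AOT rule reflects the suggestion above rather than a settled decision.

```cpp
// Which card-marking flavour the write barrier uses.
enum class CardMarking { Byte, Bit };

// Hypothetical inputs to the decision; none of these are real runtime names.
struct BarrierInputs {
    bool gcRequestsBitwise;  // stands in for region_use_bitwise_write_barrier
    bool cpuHasLse;          // result of the VM's LSE atomics detection
    bool isAotImage;         // target CPU features unknown at compile time
};

// The GC's request is treated as a hint: the bitwise barrier is chosen only
// when the atomic bit set is cheap (LSE present) and the CPU is known.
inline CardMarking chooseCardMarking(const BarrierInputs& in) {
    if (in.gcRequestsBitwise && in.cpuHasLse && !in.isAotImage) {
        return CardMarking::Bit;
    }
    return CardMarking::Byte;
}
```

Keeping the decision in one function on the VM side means the GC never has to know about CPU features, which matches the point that the request is not a hard requirement.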

Member

This change causes up to a 30% regression in write barrier micro-benchmarks on Cobalt 100: EgorBot/runtime-utils#271 (comment)

I think it would be a good idea to have multiple static versions of the write barrier, both to minimize the regression and to provide the option of falling back to a non-bitwise write barrier, as we have on x64.

Contributor Author

My hope was to split this work into two pieces. First this PR, and then a second for the multiple versions. But, it sounds like the regressions would block this from going in.

If we assume that splitting into multiple versions will get rid of the regressions, then has the current version of this PR shown enough improvements in the GC for it to be a worthwhile change? I think the stats from OrchardCMS do show that, but I'm not sure how significant they are. If so, then I can look at doing the splitting.

Member

EgorBo commented Jan 29, 2025

> If we assume that splitting into multiple versions will get rid of the regressions, then has the current version of this PR shown enough improvements in the GC for it to be a worthwhile change?

I assume we can estimate that locally first?

> First this PR, and then a second for the multiple versions.

It would probably indeed be better to start with the splitting, e.g. the current WB has redundant "is ephemeral" checks when Server GC is enabled. I tried to handle that in #106934.

> I think the #111636 (comment) do show that, but I'm not sure how significant they are.

The improvements in GC pauses indeed look cool and will hopefully have a noticeable impact for certain workloads. However, if I remember correctly, we got some complaints about throughput after the x64 precise write barriers landed (they regressed performance in many microbenchmarks, e.g. #74014).

Member

Also, we try to avoid large-scale zig-zags where performance regresses and then improves again. They create noise in our performance tracing system that takes extra work to deal with.

Contributor Author

Ok, let's avoid committing this as it is then.

> if I remember correctly, we got some complaints about throughput after the x64 precise write barriers landed (they regressed performance in many microbenchmarks, e.g. #74014)

Was this before x64 added multiple versions?

> It would probably indeed be better to start with the splitting, e.g. the current WB has redundant "is ephemeral" checks when Server GC is enabled. I tried to handle that in #106934.

Looking at the writebarriermanager code for x64, I think it'll be fairly easy to move all of it into a new file and make it work for Arm64, resulting in the same number of versions on Arm64. The code that edits the constants would just write to the addresses at the end of the function (instead of inline, as on x64). That would avoid writing "new" functionality, and it'd be usable for other architectures too. Are there any reasons not to reuse the writebarriermanager?

Member

kunalspathak commented Jan 21, 2025

FYI - @Maoni0
@mrsharm @cshung - what preliminary tests can we run to validate the performance impact?

Contributor Author

a74nh commented Jan 21, 2025

I also have a bunch of notes where I rewrote the AMD64 and ARM64 write barrier assembly in pseudo code. I'll tidy up and add somewhere in docs/

Member

EgorBo commented Jan 23, 2025

@a74nh I'm just curious, is this ready for benchmarks? (on linux-arm64)

Contributor Author

a74nh commented Jan 23, 2025

> @a74nh I'm just curious, is this ready for benchmarks? (on linux-arm64)

I think all the failures are fixed up now. So, yes, this would be a good time. If you've got something to run that'd be great.

I've been using your orchard.sh script that runs on a single machine, on 4 cores (+1 for wrk). I don't see any improvement in reqs per sec, although not sure if that's a good enough test.

Member

EgorBo commented Jan 23, 2025

> I've been using your orchard.sh script that runs on a single machine, on 4 cores (+1 for wrk). I don't see any improvement in reqs per sec, although not sure if that's a good enough test.

AFAIR it's not bottlenecked on the write barrier, and presumably your PR is supposed to decrease the average GC pause rather than improve the WB's throughput? So you might want to look at the GC stats. orchard.sh should have a USE_DOTNET_TRACE property that you need to set to 1 to grab traces (and set DOTNET_TRACE_ARGS to listen to GC events specifically).

Member

EgorBo commented Jan 23, 2025

@EgorBot -linux_azure_cobalt100 -linux_azure_ampere -profiler

using BenchmarkDotNet.Attributes;

public class MyBench
{
    object Dst1;
    object Dst2;
    object Dst3;
    object Dst4;

    static object Value = new();

    [Benchmark]
    public void WB_nonephemeral()
    {
        // Write non-ephemeral reference
        Dst1 = Value;
        Dst2 = Value;
        Dst3 = Value;
        Dst4 = Value;
    }

    [Benchmark]
    public void WB_ephemeral()
    {
        // Write ephemeral reference
        Dst1 = new object();
    }
}

Member

EgorBo commented Jan 23, 2025

I guess it's sort of expected that it's slower throughput-wise in microbenchmarks. Most of the WB_nonephemeral time is spent here: https://gist.github.com/EgorBot/a6db6579aba05de6a25f111513cb54b2#file-diff_asm_bcd38073-asm-L30 which is, I guess,

    // Check whether the region we're storing into is gen 0 - nothing to do in this case
    ldrb w12, [x12]
    cbz  w12, LOCAL_LABEL(Exit)

(I guess I should've added an extra benchmark where the object we're storing is gen2)
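
For readers less used to the assembly, the check in that snippet can be sketched in C++ roughly as below. This covers only the one test quoted above and omits every other check the barrier performs. All names (`RegionMap`, `generationOfRegion`, `heapBase`, `regionShift`, `destinationIsGen0`) are hypothetical and do not correspond to the runtime's actual globals.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical description of a region-to-generation table.
struct RegionMap {
    const std::uint8_t* generationOfRegion;  // one byte per region; 0 means gen 0
    std::uintptr_t heapBase;                 // lowest address covered by the table
    unsigned regionShift;                    // log2 of the region size in bytes
};

// Mirrors the ldrb + cbz pair: load the generation byte of the region that
// contains the destination address and report whether it is gen 0, in which
// case the barrier has nothing further to do.
inline bool destinationIsGen0(const RegionMap& map, std::uintptr_t dst) {
    const std::size_t index = (dst - map.heapBase) >> map.regionShift;
    return map.generationOfRegion[index] == 0;
}
```

The extra byte load is on the path of every reference store, which is consistent with the slowdown seen in WB_nonephemeral.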

PS: feel free to call the bot yourself if needed

Member

mrsharm commented Jan 24, 2025

> FYI - @Maoni0 @mrsharm @cshung - what preliminary tests can we run to validate the performance impact?

Sorry for the delay. I would run the microbenchmarks listed below with and without this change on the pertinent hardware, for a sufficient number of iterations (some of these exhibit considerable variance). Another consideration while running these is to ensure that the number of GCs is equivalent between the baseline and the comparand. This can be achieved by:

  1. Not removing the outliers: --outliers DontRemove.
  2. Setting a fixed number of invocations that'll be high enough to reduce the standard error: --invocationCount {InvocationCount}
  3. Setting a fixed number of iterations: --iterationCount 20.

- System.Numerics.Tests.Perf_BigInteger.Add(arguments: 65536*)
- System.Tests.Perf_GC<Byte>.AllocateArray(length: 1000, *)
- System.Tests.Perf_GC<Char>.AllocateArray(length: 1000, *)
- System.Tests.Perf_GC<Byte>.AllocateArray(length: 10000, *)
- System.Tests.Perf_GC<Char>.AllocateArray(length: 10000, *)
- System.Tests.Perf_GC<Byte>.AllocateUninitializedArray(length: 1000, *)
- System.Tests.Perf_GC<Char>.AllocateUninitializedArray(length: 1000, *)
- System.Tests.Perf_GC<Byte>.AllocateUninitializedArray(length: 10000, *)
- System.Tests.Perf_GC<Char>.AllocateUninitializedArray(length: 10000, *)
- System.Tests.Perf_GC<Byte>.NewOperator_Array(length: 1000)
- System.Tests.Perf_GC<Byte>.NewOperator_Array(length: 10000)
- System.Tests.Perf_GC<Char>.NewOperator_Array(length: 1000)
- System.Tests.Perf_GC<Char>.NewOperator_Array(length: 10000)
- System.IO.Tests.Perf_File.ReadAllBytesAsync(size: 104857600)
- System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 65536*)
- System.Collections.CtorGivenSize<String>.Array(size: 512)
- ByteMark.BenchBitOps
- System.IO.Tests.Perf_File.ReadAllBytes(size: 104857600)
- System.IO.Tests.Perf_File.ReadAllBytesAsync(size: 104857600)
- System.Linq.Tests.Perf_Enumerable.ToArray*
- System.Collections.Tests.Perf_BitArray.BitArrayByteArrayCtor(size: 512)

Once the microbenchmarks are run, the pertinent metrics would be the % difference in the time of execution of a test + the standard error of tests.
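
As a small illustration of those two metrics, the sketch below computes a percentage difference between two means and a standard error. The function names are hypothetical, and it assumes the usual definition of standard error of the mean (standard deviation divided by the square root of the sample count), which the comment above does not spell out.

```cpp
#include <cmath>
#include <cstddef>

// Percentage change of the comparand's mean time relative to the baseline.
// Positive values mean the comparand is slower.
inline double percentDifference(double baselineMean, double comparandMean) {
    return (comparandMean - baselineMean) / baselineMean * 100.0;
}

// Standard error of the mean, assuming the samples are independent.
inline double standardError(double stdDev, std::size_t sampleCount) {
    return stdDev / std::sqrt(static_cast<double>(sampleCount));
}
```

A difference is only meaningful when it is clearly larger than the standard errors of both runs, which is why fixed invocation and iteration counts matter for the comparison.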

As a note, the following issue tracks the regression caused by our move to a more precise write barrier for x64: #73783. It seems that one of the affected microbenchmarks is already in the list above. I remember StackWalk being extremely volatile, but it is still worth trying.

Member

cshung commented Jan 24, 2025

As we run the benchmarks, I would pay attention to ephemeral GC pause time, in particular the time spent on marking cards.

Contributor Author

a74nh commented Jan 27, 2025

> Sorry for the delay. I would run the microbenchmarks listed below with and without this change on the pertinent hardware, for a sufficient number of iterations (some of these exhibit considerable variance). Another consideration while running these is to ensure that the number of GCs is equivalent between the baseline and the comparand. This can be achieved by:

Running most of the tests as suggested, I don't see any differences. Everything seems to be within error margins:



| Method                     | Job        | Toolchain                                                                          | length | pinned | Mean        | Error     | StdDev    | Median      | Min         | Max        | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Gen1   | Gen2   | Allocated | Alloc Ratio |
|--------------------------- |----------- |----------------------------------------------------------------------------------- |------- |------- |------------:|----------:|----------:|------------:|------------:|-----------:|------:|---------------- |--------:|-------:|-------:|-------:|----------:|------------:|
| AllocateArray              | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | False  |   129.78 ns | 53.253 ns | 61.326 ns |   118.07 ns |   108.50 ns |   388.8 ns |  1.08 | Baseline        |    0.54 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
| AllocateArray              | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | False  |   137.49 ns | 53.415 ns | 61.512 ns |   125.97 ns |   116.80 ns |   396.9 ns |  1.15 | Same            |    0.54 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateUninitializedArray | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | False  |   103.60 ns | 51.462 ns | 59.263 ns |    89.10 ns |    88.63 ns |   354.8 ns |  1.11 | Baseline        |    0.66 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
| AllocateUninitializedArray | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | False  |   103.35 ns | 51.294 ns | 59.070 ns |    88.76 ns |    88.21 ns |   353.4 ns |  1.10 | Same            |    0.65 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateArray              | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | True   |   744.34 ns |  7.498 ns |  8.634 ns |   741.62 ns |   735.19 ns |   764.7 ns |  1.00 | Baseline        |    0.02 | 0.6364 | 0.6364 | 0.6364 |   1.98 KB |        1.00 |
| AllocateArray              | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | True   |   743.07 ns |  9.170 ns | 10.561 ns |   740.52 ns |   732.56 ns |   763.7 ns |  1.00 | Same            |    0.02 | 0.6364 | 0.6364 | 0.6364 |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateUninitializedArray | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | True   |   735.06 ns | 10.791 ns | 12.426 ns |   728.98 ns |   720.78 ns |   757.2 ns |  1.00 | Baseline        |    0.02 | 0.6364 | 0.6364 | 0.6364 |   1.98 KB |        1.00 |
| AllocateUninitializedArray | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | True   |   748.82 ns |  8.844 ns | 10.185 ns |   743.99 ns |   736.23 ns |   767.8 ns |  1.02 | Same            |    0.02 | 0.6364 | 0.6364 | 0.6364 |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateArray              | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | False  |   626.94 ns | 39.042 ns | 44.961 ns |   618.03 ns |   588.73 ns |   805.0 ns |  1.00 | Baseline        |    0.09 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
| AllocateArray              | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | False  |   623.92 ns | 74.318 ns | 85.585 ns |   601.31 ns |   589.99 ns |   983.1 ns |  1.00 | Same            |    0.15 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateUninitializedArray | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | False  |   142.84 ns | 17.866 ns | 20.575 ns |   138.18 ns |   134.39 ns |   228.9 ns |  1.01 | Baseline        |    0.17 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
| AllocateUninitializedArray | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | False  |   149.25 ns | 16.513 ns | 19.016 ns |   146.35 ns |   137.79 ns |   227.3 ns |  1.06 | Same            |    0.16 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateArray              | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | True   | 2,592.21 ns | 32.371 ns | 37.278 ns | 2,585.44 ns | 2,550.16 ns | 2,707.3 ns |  1.00 | Baseline        |    0.02 | 6.3182 | 6.3182 | 6.3182 |  19.56 KB |        1.00 |
| AllocateArray              | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | True   | 2,475.21 ns | 76.425 ns | 88.011 ns | 2,436.47 ns | 2,379.59 ns | 2,637.6 ns |  0.96 | Same            |    0.04 | 6.3182 | 6.3182 | 6.3182 |  19.56 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateUninitializedArray | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | True   | 2,438.40 ns | 43.482 ns | 50.074 ns | 2,444.35 ns | 2,330.27 ns | 2,527.3 ns |  1.00 | Baseline        |    0.03 | 6.3182 | 6.3182 | 6.3182 |  19.56 KB |        1.00 |
| AllocateUninitializedArray | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | True   | 2,449.01 ns | 35.429 ns | 40.800 ns | 2,448.20 ns | 2,338.34 ns | 2,520.9 ns |  1.00 | Same            |    0.03 | 6.3182 | 6.3182 | 6.3182 |  19.56 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| NewOperator_Array          | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | ?      |    98.53 ns | 49.747 ns | 57.289 ns |    86.26 ns |    74.80 ns |   340.4 ns |  1.11 | Baseline        |    0.67 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
| NewOperator_Array          | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | ?      |    95.01 ns | 48.560 ns | 55.922 ns |    80.60 ns |    79.98 ns |   331.4 ns |  1.07 | Same            |    0.66 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| NewOperator_Array          | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | ?      |   546.14 ns | 49.634 ns | 57.159 ns |   533.12 ns |   520.12 ns |   784.7 ns |  1.01 | Baseline        |    0.13 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
| NewOperator_Array          | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | ?      |   551.71 ns | 52.751 ns | 60.748 ns |   537.58 ns |   528.97 ns |   807.3 ns |  1.02 | Same            |    0.13 | 0.2879 |      - |      - |  19.55 KB |        1.00 |


| Method | Job        | Toolchain                                                                          | arguments        | Mean        | Error      | StdDev     | Median      | Min         | Max         | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|------- |----------- |----------------------------------------------------------------------------------- |----------------- |------------:|-----------:|-----------:|------------:|------------:|------------:|------:|---------------- |--------:|-------:|----------:|------------:|
| Add    | Job-VIYVLB | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1024,1024 bits   |   205.72 ns | 129.897 ns | 149.589 ns |    84.26 ns |    71.82 ns |   404.32 ns |  1.78 | Baseline        |    1.85 |      - |     160 B |        1.00 |
| Add    | Job-VRIONI | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1024,1024 bits   |   203.72 ns | 129.080 ns | 148.649 ns |    83.54 ns |    72.15 ns |   400.73 ns |  1.76 | Same            |    1.84 |      - |     160 B |        1.00 |
|        |            |                                                                                    |                  |             |            |            |             |             |             |       |                 |         |        |           |             |
| Add    | Job-VIYVLB | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 16,16 bits       |    25.58 ns |   0.439 ns |   0.505 ns |    25.63 ns |    23.68 ns |    26.00 ns |  1.00 | Baseline        |    0.03 |      - |         - |          NA |
| Add    | Job-VRIONI | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 16,16 bits       |    24.67 ns |   1.307 ns |   1.506 ns |    24.99 ns |    21.90 ns |    26.31 ns |  0.97 | Same            |    0.06 |      - |         - |          NA |
|        |            |                                                                                    |                  |             |            |            |             |             |             |       |                 |         |        |           |             |
| Add    | Job-VIYVLB | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 65536,65536 bits | 3,591.60 ns |  74.221 ns |  85.473 ns | 3,559.69 ns | 3,555.19 ns | 3,919.99 ns |  1.00 | Baseline        |    0.03 | 0.1212 |    8224 B |        1.00 |
| Add    | Job-VRIONI | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 65536,65536 bits | 3,571.79 ns |  69.881 ns |  80.475 ns | 3,551.91 ns | 3,546.31 ns | 3,911.55 ns |  0.99 | Same            |    0.03 | 0.1212 |    8224 B |        1.00 |


| Method   | Job        | Toolchain                                                                          | arguments        | Mean        | Error      | StdDev     | Median      | Min         | Max         | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|--------- |----------- |----------------------------------------------------------------------------------- |----------------- |------------:|-----------:|-----------:|------------:|------------:|------------:|------:|---------------- |--------:|-------:|----------:|------------:|
| Subtract | Job-KDZVCP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1024,1024 bits   |   145.80 ns | 116.856 ns | 134.571 ns |    72.70 ns |    72.08 ns |   426.39 ns |  1.59 | Baseline        |    1.70 |      - |     152 B |        1.00 |
| Subtract | Job-KKPRIL | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1024,1024 bits   |   143.24 ns | 118.524 ns | 136.493 ns |    72.22 ns |    71.90 ns |   431.54 ns |  1.57 | Same            |    1.72 |      - |     152 B |        1.00 |
|          |            |                                                                                    |                  |             |            |            |             |             |             |       |                 |         |        |           |             |
| Subtract | Job-KDZVCP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 16,16 bits       |    26.41 ns |   0.836 ns |   0.963 ns |    26.88 ns |    24.34 ns |    27.34 ns |  1.00 | Baseline        |    0.05 |      - |         - |          NA |
| Subtract | Job-KKPRIL | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 16,16 bits       |    26.22 ns |   0.666 ns |   0.767 ns |    26.29 ns |    24.35 ns |    27.18 ns |  0.99 | Same            |    0.05 |      - |         - |          NA |
|          |            |                                                                                    |                  |             |            |            |             |             |             |       |                 |         |        |           |             |
| Subtract | Job-KDZVCP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 65536,65536 bits | 3,483.97 ns |  61.051 ns |  70.306 ns | 3,466.17 ns | 3,458.38 ns | 3,780.31 ns |  1.00 | Baseline        |    0.03 | 0.1212 |    8216 B |        1.00 |
| Subtract | Job-KKPRIL | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 65536,65536 bits | 3,526.84 ns |  71.010 ns |  81.775 ns | 3,504.11 ns | 3,480.61 ns | 3,840.66 ns |  1.01 | Same            |    0.03 | 0.1212 |    8216 B |        1.00 |


| Method | Job        | Toolchain                                                                          | Size | Mean     | Error   | StdDev  | Median   | Min      | Max      | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|------- |----------- |----------------------------------------------------------------------------------- |----- |---------:|--------:|--------:|---------:|---------:|---------:|------:|---------------- |--------:|-------:|----------:|------------:|
| Array  | Job-CZKOLC | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 512  | 152.8 ns | 7.44 ns | 8.56 ns | 149.6 ns | 147.4 ns | 186.8 ns |  1.00 | Baseline        |    0.07 | 0.0606 |   4.02 KB |        1.00 |
| Array  | Job-FQHBTF | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 512  | 155.3 ns | 4.66 ns | 5.36 ns | 154.5 ns | 151.6 ns | 177.2 ns |  1.02 | Same            |    0.06 | 0.0606 |   4.02 KB |        1.00 |


| Method  | Job        | Toolchain                                                                          | input       | Mean      | Error    | StdDev    | Median    | Min       | Max       | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|-------- |----------- |----------------------------------------------------------------------------------- |------------ |----------:|---------:|----------:|----------:|----------:|----------:|------:|---------------- |--------:|-------:|----------:|------------:|
| ToArray | Job-QHOIJP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | ICollection |  41.88 ns | 9.097 ns | 10.476 ns |  37.78 ns |  36.30 ns |  80.16 ns |  1.04 | Baseline        |    0.30 | 0.0061 |     424 B |        1.00 |
| ToArray | Job-GOWGBS | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | ICollection |  43.15 ns | 9.478 ns | 10.915 ns |  36.91 ns |  36.21 ns |  79.58 ns |  1.07 | Same            |    0.31 | 0.0061 |     424 B |        1.00 |
|         |            |                                                                                    |             |           |          |           |           |           |           |       |                 |         |        |           |             |
| ToArray | Job-QHOIJP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | IEnumerable | 287.98 ns | 5.110 ns |  5.885 ns | 286.38 ns | 285.61 ns | 312.59 ns |  1.00 | Baseline        |    0.03 | 0.0061 |     456 B |        1.00 |
| ToArray | Job-GOWGBS | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | IEnumerable | 289.73 ns | 4.845 ns |  5.580 ns | 287.99 ns | 287.74 ns | 313.07 ns |  1.01 | Same            |    0.03 | 0.0061 |     456 B |        1.00 |


| Method                | Job        | Toolchain                                                                          | Size | Mean      | Error     | StdDev    | Median    | Min       | Max       | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|---------------------- |----------- |----------------------------------------------------------------------------------- |----- |----------:|----------:|----------:|----------:|----------:|----------:|------:|---------------- |--------:|-------:|----------:|------------:|
| BitArrayByteArrayCtor | Job-WNOFTX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 4    |  21.91 ns | 11.631 ns | 13.395 ns |  14.75 ns |  14.65 ns |  57.33 ns |  1.24 | Baseline        |    0.88 |      - |      64 B |        1.00 |
| BitArrayByteArrayCtor | Job-QPXJRV | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 4    |  22.39 ns | 11.757 ns | 13.540 ns |  15.89 ns |  15.70 ns |  60.19 ns |  1.27 | Same            |    0.90 |      - |      64 B |        1.00 |
|                       |            |                                                                                    |      |           |           |           |           |           |           |       |                 |         |        |           |             |
| BitArrayByteArrayCtor | Job-WNOFTX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 512  | 142.08 ns |  5.946 ns |  6.848 ns | 140.73 ns | 138.30 ns | 170.18 ns |  1.00 | Baseline        |    0.06 | 0.0076 |     568 B |        1.00 |
| BitArrayByteArrayCtor | Job-QPXJRV | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 512  | 139.35 ns |  5.774 ns |  6.650 ns | 137.68 ns | 136.98 ns | 167.37 ns |  0.98 | Same            |    0.06 | 0.0076 |     568 B |        1.00 |

@a74nh
Contributor Author

a74nh commented Jan 27, 2025

I've been using your orchard.sh script, which runs on a single machine, on 4 cores (+1 for wrk). I don't see any improvement in requests per second, although I'm not sure if that's a good enough test.

AFAIR it's not bottlenecked on the write barrier, and presumably your PR is supposed to decrease the average GC pause rather than improve the write barrier's throughput? So you might want to look at the GC stats. orchard.sh should have a USE_DOTNET_TRACE property that you need to set to 1 to grab traces (and set DOTNET_TRACE_ARGS to listen to GC events specifically).

Running Orchard with tracing enabled:

Two runs using head:

Screenshot 2025-01-27 122757
Screenshot 2025-01-27 124529 base2

Two runs using the PR:

Screenshot 2025-01-27 122824
Screenshot 2025-01-27 124456 new2

Those figures look quite a bit better on the PR.

@a74nh a74nh marked this pull request as ready for review January 27, 2025 17:38
@a74nh
Contributor Author

a74nh commented Jan 27, 2025

Added a file to the docs/ folder with pseudo code for the writebarrier function.

I'm taking this out of draft now.

*(g_sw_ww_table + (dst>>11)) = 0xff


// Return if the reference is not in the heap
Member

I do not think that this is correct. This checks whether the reference is outside the ephemeral generation (i.e. Gen 0), not whether it is outside the heap.

Member

Perhaps it is worth mentioning, then, that there is a Checked version of the same barrier that does the "is not on heap" checks.

Contributor Author

Fixed the doc, including adding JIT_CheckedWriteBarrier() pseudo code.
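
To illustrate the distinction raised in this thread, here is a minimal C model of the two barriers. It is a sketch under stated assumptions, not the runtime's implementation: `toy_heap`, `toy_card_table`, and every `toy_*` function are hypothetical names, the "ephemeral" range is simply the upper half of a small array, and the card index is computed from an offset rather than through a pre-biased table pointer. The unchecked barrier only tests whether the stored reference is ephemeral; the checked variant first tests whether the destination is on the heap at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CARD_SHIFT 11     /* one card byte per 2 KB, mirroring dst>>11 */
#define HEAP_WORDS 4096   /* toy heap: 4096 pointer-sized slots */

/* Hypothetical stand-ins for the GC heap and its card table. */
static uintptr_t toy_heap[HEAP_WORDS];
static uint8_t   toy_card_table[(HEAP_WORDS * sizeof(uintptr_t)) >> CARD_SHIFT];

static uintptr_t heap_low(void)  { return (uintptr_t)&toy_heap[0]; }
static uintptr_t heap_high(void) { return heap_low() + sizeof toy_heap; }
/* Assumption for the model: the upper half of the toy heap is "ephemeral". */
static uintptr_t eph_low(void)   { return heap_low() + sizeof toy_heap / 2; }
static uintptr_t eph_high(void)  { return heap_high(); }

/* Unchecked barrier: dst is assumed to be on the heap. */
void toy_write_barrier(uintptr_t *dst, uintptr_t ref)
{
    *dst = ref;

    /* Return if the reference is not in the ephemeral range. */
    if (ref < eph_low() || ref >= eph_high())
        return;

    /* Mark the card covering dst, skipping the store if already marked. */
    size_t card = ((uintptr_t)dst - heap_low()) >> CARD_SHIFT;
    assert(card < sizeof toy_card_table);
    if (toy_card_table[card] != 0xFF)
        toy_card_table[card] = 0xFF;
}

/* Checked barrier: dst may be anywhere (stack, static data, heap). */
void toy_checked_write_barrier(uintptr_t *dst, uintptr_t ref)
{
    uintptr_t d = (uintptr_t)dst;

    /* Return after a plain store if the destination is not on the heap. */
    if (d < heap_low() || d >= heap_high()) {
        *dst = ref;
        return;
    }
    toy_write_barrier(dst, ref);
}

/* Helper for inspection: number of marked card bytes. */
size_t toy_marked_cards(void)
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof toy_card_table; i++)
        n += (toy_card_table[i] == 0xFF);
    return n;
}
```

In this model, storing an ephemeral reference into a stack slot through `toy_checked_write_barrier` marks no card, while storing the same reference into `toy_heap[0]` through `toy_write_barrier` marks card 0.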

@jkotas
Member

jkotas commented Jan 29, 2025

@EgorBot -linux_azure_cobalt100

using BenchmarkDotNet.Attributes;

public class MyBench
{
    object Dst1;
    object Dst2;
    object Dst3;
    object Dst4;

    static object Value = new();

    [Benchmark]
    public void WB_nonephemeral()
    {
        // Write non-ephemeral reference
        Dst1 = Value;
        Dst2 = Value;
        Dst3 = Value;
        Dst4 = Value;
    }
}

@cshung
Member

cshung commented Jan 30, 2025

I wonder if the GC can make an automatic decision here to avoid wasted effort.
Without a reasonably large Gen2 or a reasonably high false positive rate on the cards, there isn't much reason to go precise.
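
For context on what "going precise" costs, the following C11 sketch contrasts byte-granular and bit-granular card marking. It is only an illustration under assumptions: `toy_cards`, `mark_card_byte`, `mark_card_bit`, and `card_value` are hypothetical names, the 256-bytes-per-bit granularity is assumed, and relaxed memory ordering is used purely for the model. The byte version can use a plain store because every writer stores the same value; the bit version needs an atomic read-modify-write so that two threads setting different bits in the same byte do not overwrite each other.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CARD_BYTE_SHIFT 11   /* one card byte per 2 KB */
#define CARD_BIT_SHIFT   8   /* assumption: one card bit per 256 bytes */
#define TOY_CARD_BYTES  16   /* toy table covering 32 KB of offsets */

/* Hypothetical card table, indexed by offset from the heap base. */
static _Atomic uint8_t toy_cards[TOY_CARD_BYTES];

/* Non-precise ("byte") marking: mark the whole 2 KB card. */
void mark_card_byte(uintptr_t offset)
{
    size_t index = offset >> CARD_BYTE_SHIFT;
    assert(index < TOY_CARD_BYTES);
    /* All writers store 0xFF, so racing stores are harmless. */
    if (atomic_load_explicit(&toy_cards[index], memory_order_relaxed) != 0xFF)
        atomic_store_explicit(&toy_cards[index], 0xFF, memory_order_relaxed);
}

/* Precise ("bit") marking: mark only the 256-byte slice containing offset. */
void mark_card_bit(uintptr_t offset)
{
    size_t index = offset >> CARD_BYTE_SHIFT;
    uint8_t bit = (uint8_t)(1u << ((offset >> CARD_BIT_SHIFT) & 7));
    assert(index < TOY_CARD_BYTES);
    /* Test first to skip the atomic when the bit is already set. */
    if ((atomic_load_explicit(&toy_cards[index], memory_order_relaxed) & bit) == 0)
        atomic_fetch_or_explicit(&toy_cards[index], bit, memory_order_relaxed);
}

/* Helper for inspection. */
uint8_t card_value(size_t index)
{
    assert(index < TOY_CARD_BYTES);
    return atomic_load_explicit(&toy_cards[index], memory_order_relaxed);
}
```

The trade-off this models is the one discussed above: the bit version gives the GC fewer false positives to scan, but pays for an atomic OR on each newly marked slice, so it is only worthwhile when the scanning savings outweigh that cost.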

Labels
arch-arm64 area-VM-coreclr community-contribution Indicates that the PR has been added by a community member

7 participants