Arm64: Implement region write barriers #111636

Open: wants to merge 12 commits into base: main
22 changes: 16 additions & 6 deletions src/coreclr/vm/arm64/asmhelpers.S
@@ -194,8 +194,9 @@ LEAF_END ThePreStubPatch, _TEXT
LEAF_ENTRY JIT_UpdateWriteBarrierState, _TEXT
PROLOG_SAVE_REG_PAIR_INDEXED fp, lr, -16

// x0-x7, x10 will contain intended new state
// x0-x7, x10-x11, x13-x14 will contain intended new state
// x8 will preserve skipEphemeralCheck
// x9 will preserve writeableOffset
// x12 will be used for pointers

mov x8, x0
@@ -231,12 +232,21 @@ LOCAL_LABEL(EphemeralCheckEnabled):
PREPARE_EXTERNAL_VAR g_highest_address, x12
ldr x6, [x12]

PREPARE_EXTERNAL_VAR g_region_to_generation_table, x12
ldr x7, [x12]

PREPARE_EXTERNAL_VAR g_region_shr, x12
ldr w10, [x12]

PREPARE_EXTERNAL_VAR g_region_use_bitwise_write_barrier, x12
ldr w11, [x12]

#ifdef WRITE_BARRIER_CHECK
PREPARE_EXTERNAL_VAR g_GCShadow, x12
ldr x7, [x12]
ldr x13, [x12]

PREPARE_EXTERNAL_VAR g_GCShadowEnd, x12
ldr x10, [x12]
ldr x14, [x12]
#endif

// Update wbs state
@@ -247,12 +257,12 @@ LOCAL_LABEL(EphemeralCheckEnabled):
stp x0, x1, [x12], 16
stp x2, x3, [x12], 16
stp x4, x5, [x12], 16
str x6, [x12], 8
stp x6, x7, [x12], 16
stp w10, w11, [x12], 8
#ifdef WRITE_BARRIER_CHECK
stp x7, x10, [x12], 16
stp x13, x14, [x12], 16
#endif


EPILOG_RESTORE_REG_PAIR_INDEXED fp, lr, 16
EPILOG_RETURN
LEAF_END JIT_UpdateWriteBarrierState
63 changes: 58 additions & 5 deletions src/coreclr/vm/arm64/patchedcode.S
@@ -142,33 +142,80 @@ LOCAL_LABEL(ShadowUpdateEnd):
#ifdef FEATURE_USE_SOFTWARE_WRITE_WATCH_FOR_GC_HEAP
// Update the write watch table if necessary
ldr x12, LOCAL_LABEL(wbs_sw_ww_table)
cbz x12, LOCAL_LABEL(CheckCardTable)
cbz x12, LOCAL_LABEL(CheckCardTableBounds)
add x12, x12, x14, lsr #0xc // SoftwareWriteWatch::AddressToTableByteIndexShift
ldrb w17, [x12]
cbnz x17, LOCAL_LABEL(CheckCardTable)
cbnz x17, LOCAL_LABEL(CheckCardTableBounds)
mov w17, #0xFF
strb w17, [x12]
#endif

LOCAL_LABEL(CheckCardTable):
// Branch to Exit if the reference is not in the Gen0 heap
LOCAL_LABEL(CheckCardTableBounds):
// Branch to Exit if the reference is not in the heap
ldr x12, LOCAL_LABEL(wbs_ephemeral_low)
ldr x17, LOCAL_LABEL(wbs_ephemeral_high)
cmp x15, x12
ccmp x15, x17, #0x2, hs
bhs LOCAL_LABEL(Exit)

// Region Checks

// Check if using regions
ldr x17, LOCAL_LABEL(wbs_region_to_generation_table)
cbz x17, LOCAL_LABEL(CheckCardTable)

// Calculate region locations
ldr w12, LOCAL_LABEL(wbs_region_shr)
lsr x15, x15, x12
add x15, x15, x17 // x15 = (RHS >> wbs_region_shr) + wbs_region_to_generation_table
lsr x12, x14, x12
add x12, x12, x17 // x12 = (LHS >> wbs_region_shr) + wbs_region_to_generation_table

// Check whether the region we're storing into is gen 0 - nothing to do in this case
ldrb w12, [x12]
cbz w12, LOCAL_LABEL(Exit)

// Check this is going from old to young
ldrb w15, [x15]
cmp w15, w12
bhs LOCAL_LABEL(Exit)

// Bitwise write barriers only
ldr w17, LOCAL_LABEL(wbs_region_use_bitwise_write_barrier)
cbz w17, LOCAL_LABEL(CheckCardTable)

// Check if we need to update the card table
lsr w17, w14, 8
and w17, w17, 7
movz w15, 1
lsl w17, w15, w17 // w17 = 1 << ((LHS >> 8) & 7)
ldr x12, LOCAL_LABEL(wbs_card_table)
add x15, x12, x14, lsr #11
ldrb w12, [x15]
ldrb w12, [x15] // w12 = [(LHS >> 11) + g_card_table]
cmp w12, w17
beq LOCAL_LABEL(Exit)

// Update the card table
// TODO: Is this correct? the AMD64 code is odd.
Member

I think this needs to be a lock-free atomic update (CAS). If two threads are setting two different bits in the same card table byte, we need to make sure that one does not overwrite the update done by the other.
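
To make the concern concrete, here is a minimal C sketch of two ways to set one bit in a shared card byte without losing a concurrent update. It is an illustration only, not code from this PR: the helper names `card_set_bit_cas` and `card_set_bit_fetch_or` are hypothetical, and the relaxed memory ordering is an assumption that the real barrier would need to justify separately.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper (not from the PR): set one card bit with a CAS loop. */
static void card_set_bit_cas(_Atomic uint8_t *card, uint8_t bit)
{
    assert(bit != 0 && (bit & (bit - 1)) == 0); /* exactly one bit set */

    uint8_t old = atomic_load_explicit(card, memory_order_relaxed);
    /* Skip the store entirely when the bit is already set. */
    while ((old & bit) == 0) {
        /* On failure 'old' is refreshed with the current byte, so bits set
           by other threads in the meantime are kept in the next attempt. */
        if (atomic_compare_exchange_weak_explicit(card, &old,
                                                  (uint8_t)(old | bit),
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
            break;
        }
    }
}

/* Hypothetical helper (not from the PR): the same update as one atomic OR. */
static void card_set_bit_fetch_or(_Atomic uint8_t *card, uint8_t bit)
{
    assert(bit != 0 && (bit & (bit - 1)) == 0); /* exactly one bit set */
    (void)atomic_fetch_or_explicit(card, bit, memory_order_relaxed);
}
```

A plain load, OR, and store sequence, as in the current diff, has a window between the load and the store in which another thread's bit can be overwritten; both helpers above close that window.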

Contributor Author

Right, I wasn't sure if this needed to be atomic or not.

I also need to switch this to use test instead of cmp, to test for just the single bit.
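
As a rough C model of why the comparison matters (a sketch, not the PR's code): the shifts below are taken from the diff, where `x14` holds the address being written to, and the function names are hypothetical. Comparing the whole card byte against the mask only reports "already marked" when no other bit in the byte is set, whereas a bit test checks just the one bit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Shifts as used in the diff: one card byte per 2 KiB of heap (>> 11),
   one bit within that byte per 256 bytes ((>> 8) & 7). */
static size_t card_byte_index(uintptr_t dst)
{
    return (size_t)(dst >> 11);
}

static uint8_t card_bit_mask(uintptr_t dst)
{
    return (uint8_t)(1u << ((dst >> 8) & 7));
}

/* Models `cmp w12, w17` + `beq`: true only if the byte equals the mask. */
static bool already_marked_cmp(uint8_t card, uint8_t mask)
{
    return card == mask;
}

/* Models a bit test (`tst`): true if the one bit of interest is set. */
static bool already_marked_tst(uint8_t card, uint8_t mask)
{
    return (card & mask) != 0;
}
```

With a card byte of `0x0A` and a mask of `0x08`, the `cmp` form says the card is not marked and falls through to a redundant store, while the `tst` form exits early.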

Member @kunalspathak, Jan 21, 2025

If we don't use LSE atomics for this, we might see a perf impact. The suggestion would be to check in the write barrier manager whether LSE atomics are present and use the "BIT" version (precise write barrier) if so, otherwise fall back to the "BYTE" version (non-precise write barrier).

Edit: For AOT scenarios, we might be better off using just the BYTE version because we won't know whether LSE atomics are available on the target machine.

Contributor Author

The atomic bit store without LSE requires a ldaxrb+orr+stlxrb+cbnz loop and an additional temp register. With LSE it can be done with a single stsetb. Agree this should only be done for LSE.

I'll add an LSE check in the GC code on init and, if LSE is not available, unset region_use_bitwise_write_barrier. I'll then use LSE for the bitwise write barriers (which I'll have to encode in raw hex for it to compile).

Member

> I'll add an LSE check in the GC code on init

It can be in the VM where we have the infrastructure to detect LSE atomics, it does not need to be in the GC code. region_use_bitwise_write_barrier is not a hard requirement - it is ok for VM to ignore the request to use bitwise write barrier.
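
The policy discussed in this thread could be sketched roughly as follows. This is a hypothetical model in C, not the VM's actual write barrier manager: the enum, the function name, and all three parameters are assumptions introduced for illustration.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
typedef enum {
    CARD_MARK_BYTE, /* non-precise: store 0xFF to the card byte */
    CARD_MARK_BIT   /* precise: atomically set one bit in the card byte */
} card_mark_mode;

/* The GC's request for a bitwise barrier is treated as a hint that the VM
   may decline. */
static card_mark_mode choose_card_mark_mode(bool gc_requests_bitwise,
                                            bool cpu_has_lse,
                                            bool target_cpu_unknown)
{
    if (!gc_requests_bitwise) {
        return CARD_MARK_BYTE;
    }
    /* AOT-style case: the target CPU's features are not known up front. */
    if (target_cpu_unknown) {
        return CARD_MARK_BYTE;
    }
    /* Only take the bitwise path when a single-instruction atomic OR exists. */
    return cpu_has_lse ? CARD_MARK_BIT : CARD_MARK_BYTE;
}
```

Keeping the decision in one place means the GC never needs to know about CPU features; it only states a preference.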

Member

This change causes up to a 30% regression in write barrier micro-benchmarks on Cobalt 100: EgorBot/runtime-utils#271 (comment)

I think it would be a good idea to have multiple static versions of the write barrier, both to minimize the regression and to provide an option to fall back to a non-bitwise write barrier, like we have on x64.

Contributor Author

My hope was to split this work into two pieces. First this PR, and then a second for the multiple versions. But, it sounds like the regressions would block this from going in.

If we assume that splitting into multiple versions will get rid of the regressions, then has the current version of this PR shown enough improvements in the GC for it to be a worthwhile change? I think the stats from OrchardCMS do show that, but I'm not sure how significant they are. If so, then I can look at doing the splitting.

Member @EgorBo, Jan 29, 2025

> If we assume that splitting into multiple versions will get rid of the regressions, then has the current version of this PR shown enough improvements in the GC for it to be a worthwhile change?

I assume we can estimate that locally first?

> First this PR, and then a second for the multiple versions.

It would probably indeed be better to start with the splitting, e.g. the current WB has redundant "is ephemeral" checks when Server GC is enabled. I tried to handle that in #106934

> I think the stats from OrchardCMS (#111636 (comment)) do show that, but I'm not sure how significant they are.

The improvements for GC pauses indeed look cool and hopefully will have a noticeable impact for certain workloads. However, if I remember correctly, we got some complaints about throughput after the x64 precise write barriers landed (it basically regressed performance in many microbenchmarks, e.g. #74014)

Member

Also, we try to avoid large-scale performance zig-zags, where a regression lands and is later followed by an improvement. They create noise in our performance tracking system that takes extra work to deal with.

Contributor Author

Ok, let's avoid committing this as it is then.

> if I remember correctly, we got some complaints about throughput after the x64 precise write barriers landed (it basically regressed performance in many microbenchmarks, e.g. #74014)

Was this before x64 added multiple versions?

> It would probably indeed be better to start with the splitting, e.g. the current WB has redundant "is ephemeral" checks when Server GC is enabled. I tried to handle that in #106934

Looking at the writebarriermanager code for x64, I think it'll be fairly easy to move all of it into a new file and make it work for Arm64, resulting in the same number of versions on Arm64. The code to edit the constants would just write to the addresses at the end of the function (instead of inline like x64). That would avoid writing "new" functionality, and it would be usable for other architectures too. Are there any reasons for not reusing the writebarriermanager?

orr w12, w12, w17
strb w12, [x15]
b LOCAL_LABEL(CheckCardBundleTable)

// End of Region Checks

LOCAL_LABEL(CheckCardTable):
// Check if we need to update the card table
ldr x12, LOCAL_LABEL(wbs_card_table)
add x15, x12, x14, lsr #11
ldrb w12, [x15] // w12 = [(LHS >> 11) + g_card_table]
cmp x12, 0xFF
beq LOCAL_LABEL(Exit)

// Update the card table
mov x12, 0xFF
strb w12, [x15]

LOCAL_LABEL(CheckCardBundleTable):
#ifdef FEATURE_MANUALLY_MANAGED_CARD_BUNDLES
// Check if we need to update the card bundle table
ldr x12, LOCAL_LABEL(wbs_card_bundle_table)
@@ -208,6 +255,12 @@ LOCAL_LABEL(wbs_lowest_address):
.quad 0
LOCAL_LABEL(wbs_highest_address):
.quad 0
LOCAL_LABEL(wbs_region_to_generation_table):
.quad 0
LOCAL_LABEL(wbs_region_shr):
.word 0
LOCAL_LABEL(wbs_region_use_bitwise_write_barrier):
.word 0
#ifdef WRITE_BARRIER_CHECK
LOCAL_LABEL(wbs_GCShadow):
.quad 0