-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JIT] [APX] Enable additional General Purpose Registers. #108799
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
CC @jakobbotsch and @tannergooding for code review. |
@DeepakRajendrakumaran What is the status of this PR? It's marked as ready but the description says it's built on top of #108796 that is not marked as ready. |
Thanks for pointing that out. It has some dependencies on other PRs - specifically the Rex2 encoding PR. Considering that, do you have a suggestion on how to mark this for now? |
eb47ede
to
ec3388f
Compare
a3e8331
to
6bbccb4
Compare
Now that CPUID changes have merged, ran superpmi TP and I have a problem Ran the scripts shared by Kunal a while back to debug why this is happening The following is for libraries
|
Trying to further make sure the Rex2 changes are not causing TP regression. We can safely conclude the TP regression is from eGPR enablement The following is with/without Rex2 changes(without reg alloc changes)Overall (+0.08% to +0.23%)
MinOpts (+0.28% to +0.48%)
FullOpts (+0.08% to +0.14%)
With Rex2 as base and eGPR changes as diffOverall (+3.60% to +4.65%)
MinOpts (+6.09% to +8.79%)
FullOpts (+3.47% to +4.29%)
|
5466739
to
a9d71a2
Compare
b93e387
to
9905646
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went through the first pass, need to evaluate where TP regression is coming from. However, I still see some asmdiffs...can you please fix it?
src/coreclr/jit/lsra.cpp
Outdated
@@ -12534,6 +12555,9 @@ void LinearScan::verifyResolutionMove(GenTree* resolutionMove, LsraLocation curr | |||
LinearScan::RegisterSelection::RegisterSelection(LinearScan* linearScan) | |||
{ | |||
this->linearScan = linearScan; | |||
#if defined(TARGET_AMD64) | |||
rbmAllInt = linearScan->compiler->get_RBM_ALLINT(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason why we need it here instead of LinearScan
ctor (which you are already doing)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
src/coreclr/jit/emit.h
Outdated
@@ -742,6 +743,7 @@ class emitter | |||
// The instrDescCGCA struct's member keeping the GC-ness of the first return register is _idcSecondRetRegGCType. | |||
GCtype _idGCref : 2; // GCref operand? (value is a "GCtype") | |||
|
|||
#if !defined(TARGET_AMD64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason for this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alignment - having _idReg1/_idReg2 here with increased size caused padding and increased size even more
src/coreclr/jit/regMaskTPOps.cpp
Outdated
@@ -62,7 +62,12 @@ bool regMaskTP::IsRegNumInMask(regNumber reg, var_types type) const | |||
// | |||
void regMaskTP::AddGprRegs(SingleTypeRegSet gprRegs) | |||
{ | |||
// RBM_ALLINT is not known at compile time on TARGET_AMD64 since it's dependent on APX support. | |||
#if defined(TARGET_AMD64) | |||
assert((gprRegs == RBM_NONE) || ((gprRegs & RBM_ALLINT_STATIC_ALL) != RBM_NONE)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for non-APX machines, gpr will still be 0-15
and with this assert, we will allow float register to get set, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for non-APX machines, gpr will still be
0-15
and with this assert, we will allow float register to get set, right?
Not really. On both APX and non-apx machines bits 0-23 will be eGPR and 24-55 SIMD. We just make sure that 16-23 are not used for non APX machines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just make sure that 16-23 are not used for non APX machines
how are we making sure? worth adding some asserts.
src/coreclr/jit/compiler.hpp
Outdated
|
||
// RBM_ALLINT is not known at compile time on TARGET_AMD64 since it's dependent on APX support. Deprecated???? | ||
#if defined(TARGET_AMD64) | ||
sprintf_s(regmask, cchRegMask, REG_MASK_INT_FMT, (mask & RBM_ALLINT_STATIC_ALL).getLow()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we need RBM_ALLINT_STATIC_ALL
here? it should just use RBM_ALLINT
and it should return the right mask depending on if high int registers are available or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RBM_ALLINT and it should return the right mask depending on if high int registers are available or not
- I'm not sure we can do that here. RBM_ALLINT
calls get_RBM_ALLINT()
. One way to make it work would be to move this method to part of compiler class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking about it...RBM_ALLINT_STATIC_ALL
should be the one we should be using and we can have it for both x86 and x64 for consistency.
Alternatively, if you decide to add rbmAllInt
on x86, we can just use RBM_ALLINT
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have left the *STATIC_ALL
version in all asserts where relevant
https://github.com/dotnet/runtime/pull/108799/files#r1898770314
https://github.com/dotnet/runtime/pull/108799/files#r1898771126
src/coreclr/jit/compiler.hpp
Outdated
// RBM_ALLINT is not known at compile time on TARGET_AMD64 since it's dependent on APX support. These are used by GC | ||
// exclusively | ||
#if defined(TARGET_AMD64) | ||
printf(REG_MASK_INT_FMT, (mask & RBM_ALLINT_STATIC_ALL).getLow()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise here...can just use RBM_ALLINT
?
@@ -3136,4 +3347,51 @@ inline SingleTypeRegSet LinearScan::BuildEvexIncompatibleMask(GenTree* tree) | |||
#endif | |||
} | |||
|
|||
inline bool LinearScan::DoesThisUseGPR(GenTree* op) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add method docs for this and below method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method docs please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
src/coreclr/jit/lsraxarch.cpp
Outdated
return false; | ||
} | ||
|
||
inline SingleTypeRegSet LinearScan::BuildApxIncompatibleGPRMask(GenTree* tree, bool forceLowGpr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the goal of this method?
src/coreclr/jit/lsraxarch.cpp
Outdated
SingleTypeRegSet op1Candidates = candidates; | ||
SingleTypeRegSet op2Candidates = candidates; | ||
int srcCount = 0; | ||
// SingleTypeRegSet op1Candidates = candidates; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are lot of such comments in this file. can you please delete them?
|
||
// We are dealing exclusively with HWIntrinsics here | ||
return (op->AsHWIntrinsic()->OperIsBroadcastScalar() || | ||
(op->AsHWIntrinsic()->OperIsMemoryLoad() && DoesThisUseGPR(op->AsHWIntrinsic()->Op(1)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we only care if Op(1)
uses GPR, not any other operand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. For xarch
, Op(1)
is the memory address for nodes satifying GenTreeHWIntrinsic::OperIsMemoryLoad
with the exception of 4 intrinsics(those 4 will not use this). And GPR is likely to be used only during mem addressing in these cases
src/coreclr/jit/lsraxarch.cpp
Outdated
else | ||
{ | ||
// ToDo-APX : imul currently doesn't have rex2 support. So, cannot use R16-R31. | ||
dstCandidates = BuildApxIncompatibleGPRMask(tree, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calls to BuildApxIncompatibleGPRMask
for many nodes seems expensive. Wondering if we can do something like:
- at the top just set
SingleTypeRegSet incompatibleGprMask = compiler->canUseApxEncoding() ? lowGPRRegs() : RBM_NONE;
- Places where you are passing
forceLowGpr= true
can instead just useincompatibleGprMask
. - Places where you are not forcing lowGPr, can just use
DoesThisUseGPR(tree) ? incompatibleGprMask : RBM_NONE
Also, might worth caching the value of lowGPRRegs()
because currently it is evaluated every time to be (availableIntRegs & RBM_LOWINT.GetIntRegSet())
and I see lowGprRegs()
is used at lot of places.
It seems from your latest change, there are still asmdiffs coming up. I think there are places in vs. what CI is showing what does it show for you locally? |
That was happening because my environment was not setup correctly. I can now see the same TP diffs that is shown in CI. |
This reduced it by somewhere around 0.8%. Without that change for comparison - https://github.com/dotnet/runtime/pull/111004/checks?check_run_id=34998081643 |
Just a note that the TP regression we see here will impact not only non-APX machines but also AMD machines which do not have APX feature. We should add that consideration too while working on this on how we can reduce or have no impact on AMD. |
99438b6
to
dc3a1e8
Compare
Looks like no asm diffs now. Still some small x64 TP regression but I assume that is inevitable and expected. fyi @DeepakRajendrakumaran there are merge conflicts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some minor comments and questions. I think it is in good shape now, might just need one more round of updates.
@@ -771,7 +771,8 @@ class LinearScan : public LinearScanInterface | |||
LSRA_LIMIT_SMALL_SET = 0x3, | |||
#if defined(TARGET_AMD64) | |||
LSRA_LIMIT_UPPER_SIMD_SET = 0x2000, | |||
LSRA_LIMIT_MASK = 0x2003 | |||
LSRA_LIMIT_EXT_GPR_SET = 0x4000, | |||
LSRA_LIMIT_MASK = 0x6003 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you verify if this gets automatically picked up with the JitStressRegs
value we pass in CI? One of the place to see that list can be in:
runtime/src/coreclr/scripts/superpmi_replay.py
Lines 30 to 38 in 05d94d9
"JitStressRegs=0", | |
"JitStressRegs=1", | |
"JitStressRegs=2", | |
"JitStressRegs=3", | |
"JitStressRegs=4", | |
"JitStressRegs=8", | |
"JitStressRegs=0x10", | |
"JitStressRegs=0x80", | |
"JitStressRegs=0x1000", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have checked this setting locally. But didn't know it needs to be enabled on CI. Will add and check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -37,6 +37,7 @@ | |||
DOTNET_EnableSSE41; | |||
DOTNET_EnableSSE42; | |||
DOTNET_EnableSSSE3; | |||
DOTNET_EnableAPX; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, can you please remove it unless we want to add pipeline for it?
src/coreclr/jit/targetamd64.h
Outdated
|
||
#define RBM_ALLINT_INIT (RBM_INT_CALLEE_SAVED | RBM_INT_CALLEE_TRASH_INIT) | ||
#define RBM_ALLINT get_RBM_ALLINT() | ||
#define RBM_INT_CALLEE_TRASH_STATIC_ALL (RBM_INT_CALLEE_TRASH_INIT | RBM_HIGHINT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RBM_INT_CALLEE_TRASH_INIT
serves similar purpose as other #define
with name *_STATIC_*
in it. Can you have consistent nomenclature, either use *_INIT_*
or *_STATIC_*
in their names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently how I use them is as follows
*_INIT*
is the list with registers which are definitely available on targetamd64
For e.g. RBM_INT_CALLEE_TRASH_INIT = (RBM_EAX|RBM_ECX|RBM_EDX|RBM_R8|RBM_R9|RBM_R10|RBM_R11)
(No eGPRs). This allows us to dynamically add available registers during runtime based on APX availability
*_STATIC_ALL
is a static list of all possible registers
For e.g. RBM_INT_CALLEE_TRASH_STATIC_ALL = (RBM_EAX|RBM_ECX|RBM_EDX|RBM_R8|RBM_R9|RBM_R10|RBM_R11) | (RBM_R16|RBM_R17|RBM_R18|RBM_R19|RBM_R20|RBM_R21|RBM_R22|RBM_R23)
(includes eGPRs). The only purpose of this is to be able to do compile time static asserts.
So they have different uses. Are you asking to rename to RBM_INT_CALLEE_TRASH_INIT_ALL
and RBM_INT_CALLEE_TRASH_STATIC_ALL
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that makes sense and *_INIT
seems fine to me....I am little uncomfortable with the work _STATIC_
in the other #define
, so if you can think of a better name that will be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated names by removing STATIC
. RBM_INT_CALLEE_ALL
, RBM_ALLINT_ALL
describes these I think
@@ -43,7 +43,7 @@ inline static bool isHighGPReg(regNumber reg) | |||
#ifdef TARGET_AMD64 | |||
// TODO-apx: the definition here is incorrect, we will need to revisit this after we extend the register definition. | |||
// for now, we can simply use REX2 as REX. | |||
return ((reg >= REG_R8) && (reg <= REG_R15)); | |||
return ((reg >= REG_R16) && (reg <= REG_R23)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the range for non-APX cases still be R8~R15
? I mean we can have it extend to R8~R23
because we know there will not be any registers we will see between R16~R23
for non-APX, but it should at least start from R8
rather than R16
, right?
@kunalspathak Responding to this comment : I believe the consensus #108799 (comment) was to keep the setting |
Can you verify if setting |
I'm able to verify that APX code can be generated using the following Build APX branch with following This is the tricky bit. We need a custom Go to jitutils and run Set following env variables
Run from runtime repo root. |
You don't need coredistools to just compile the method with altjit to make sure it doesn't hit any assert. coredistools is needed for pipelines like asmdiffs, which we don't want to add right now. But we can have a pipeline that runs coreclr Pri0/Pri1 tests with APX=On. |
Can this be done as a follow-up PR? It doesn't seem necessary to gate this PR on that addition. |
I am ok with that as long as we have a tracking issue to add that coverage sooner rather than later. |
Failures in jitstress-isas-avx512 seems to be #112163. Can you confirm if |
Did the following Build repo and tests run tests normally run tests with tiered compilation off
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/ba-g failures seems unrelated |
What this PR does
Currently we are adding just 8 new registers so that total register number does not exceed 64
. This is based on the conversation on this PR and following conclusion : linkA LSRA_LIMIT_EXT_GPR_SET register stress mode to force eGPR register usage when possible.
Some minor changes to turn on Rex2 encoding with eGPR
Temporary changes to mask away eGPR for currently un-supported instructions - primarily ones requiring eEVEX + imul + movszx (This will be removed once we have support for these but are essentially while we do not have eEVEX support)
Minor flags to gets altjit to work
Testing
With APX disabled
for TP/asmdiff : link
With APX enabled
ASMDIFF
Code size increases due to Rex2 but PerfScore improves. Note : This is with just a subset of x64 instructions(those requiring eEVEX will be given access to eGPR as part of upcoming changes) having access to eGPR and with just 8 eGPR enabled