-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clang/LLVM can't generate optimal loop in MIPS. #293
Comments
At O2, I see this output:
This is very close to correct. It seems that the loop is being turned into canonical form, where the loop induction variable counts up from zero. The IR for the loop is: do.body: ; preds = %do.body, %entry
%indvars.iv = phi i64 [ %indvars.iv.next, %do.body ], [ %0, %entry ]
%arrayidx = getelementptr inbounds i8, i8 addrspace(200)* %in, i64 %indvars.iv
%1 = load i8, i8 addrspace(200)* %arrayidx, align 1, !tbaa !3
%arrayidx2 = getelementptr inbounds i8, i8 addrspace(200)* %out, i64 %indvars.iv
store i8 %1, i8 addrspace(200)* %arrayidx2, align 1, !tbaa !3
%indvars.iv.next = add nsw i64 %indvars.iv, -1
%2 = trunc i64 %indvars.iv to i32
%tobool = icmp eq i32 %2, 0
br i1 %tobool, label %do.end, label %do.body It looks as if loop-strength reduction is doing this: do.body: ; preds = %do.body, %entry
%lsr.iv = phi i32 [ %lsr.iv.next, %do.body ], [ %1, %entry ]
%indvars.iv = phi i64 [ %indvars.iv.next, %do.body ], [ %0, %entry ]
%arrayidx = getelementptr inbounds i8, i8 addrspace(200)* %in, i64 %indvars.iv
%2 = load i8, i8 addrspace(200)* %arrayidx, align 1, !tbaa !3
%arrayidx2 = getelementptr inbounds i8, i8 addrspace(200)* %out, i64 %indvars.iv
store i8 %2, i8 addrspace(200)* %arrayidx2, align 1, !tbaa !3
%indvars.iv.next = add nsw i64 %indvars.iv, -1
%lsr.iv.next = add i32 %lsr.iv, 1
%tobool = icmp eq i32 %lsr.iv.next, 0
br i1 %tobool, label %do.end, label %do.body This seems to be where the extra induction variable is introduced. |
This is likely to be difficult to fix, because we typically need LSR to generate good code for MIPS. Without it, we end up with pointers being recomputed from the induction variable, which is a problem for MIPS with the lack of complex addressing modes, because we end up with more complex computations in loops. |
It appears that LSR has a number of hooks to understand target addressing modes that we're not using... |
Unfortunately, adding the the hooks doesn't seem to change what LSR does and without LSR we get even worse code:
This is almost better. The problem appears to be caused by the
This is still not an improvement in the instruction count, but it does mean that the target-independent optimisers are doing the right thing. Now the issue is that we can't fill the delay slot. Currently, the delay-slot filler is fairly simple. To fill this, we'd have to change the branch condition from $2 to $zero. This kind of analysis is pretty hard to do in the machine code (where we know about the delay slot), because we'd need to understand the induction value. In pure MIPS mode, we end up with this, which is even worse:
This appears to be because the MIPS back end isn't using the register+register addressing mode correctly, so we end up needing two adds, which means that the store can move into the delay slot trivially. If the MIPS back end used the register-register addressing mode better, then this wouldn't be needed. |
Hmm, apparently I imagined the register-register addressing mode. That explains why this case doesn't matter for MIPS... |
FWIW, the loop with the nop in the delay slot is the version I kept getting in memcpy. I must have been using longs (the type of "length" is some sort of parameter) and getting the simple case, but couldn't fill the delay. It would move the extra iteration out of the loop instead of using the delay slot, as you've shown above. |
I think that the only way to fix this is to teach the delay slot filler about this pattern. It's probably not worth it for memcpy, but if we see this in other benchmarks then it might be. |
Can we get an assembly macro for a loop with the iterator in the branch delay slot? Exiting the loop when zero (when equal to another value takes an extra register) and decremented by a parameter? MIPSLoopToZero(variable to decrement, c statements to do in loop, amount to decrement each time) |
This should work: .macro copyloop load, store, src, dest, inc, induction=$4, tmp=$0
1:
\load \tmp, \induction, 0(\src)
\store \tmp, \induction, 0(\dest)
daddiu \induction, \induction, \inc
bnez $induction, 1b
.endm
copyloop clb, csb, $c3, $c4, -1
copyloop clw, csw, $c3, $c4, -4
copyloop cld, csd, $c3, $c4, -8
copyloop clc, csc, $c3, $c4, -16 |
Seems like we are actually better than MIPS here so not filling the delay slot isn't too bad. void doLoop_cap(char * __capability in, char * __capability out, int i) {
do {
out[i]=in[i];
} while (i--);
}
void doLoop_ptr(char * in, char * out, int i) {
do {
out[i]=in[i];
} while (i--);
}
|
Closing this since we don't care about improving MIPS codegen anymore. |
MemCopyLoop.zip
C function:
void doLoop(char * __capability in, char * __capability out, int i)
{
do {
out[i]=in[i];
} while (i--);
}
Generated assembly for loop:
not $2, $4
.LBB0_1:
clb $1, $4, 0($c3)
addiu $2, $2, 1
csb $1, $4, 0($c4)
bnez $2, .LBB0_1
daddiu $4, $4, -1
Desired assembly for loop:
.LBB0_1:
clb $1, $4, 0($c3)
csb $1, $4, 0($c4)
bnez $4, .LBB0_1
daddiu $4, $4, -1
Notes:
This particular variation of code generation is new and I've seen other contortions, so hopefully we can find a fix that's deeper than just spotting the pattern and replacing. It seems like, on some early level, branch delay semantics are not known and the instructions are contorted to put everything before the branch, eliminating the opportunity in the later levels to see the obvious use of the branch delay slot.
The text was updated successfully, but these errors were encountered: