Clang/LLVM can't generate optimal loop in MIPS. #293

jonwoodruff · 2017-08-31T09:50:42Z

C function:
void doLoop(char * __capability in, char * __capability out, int i)
{
do {
out[i]=in[i];
} while (i--);
}

Generated assembly for loop:
not $2, $4
.LBB0_1:
clb $1, $4, 0($c3)
addiu $2, $2, 1
csb $1, $4, 0($c4)
bnez $2, .LBB0_1
daddiu $4, $4, -1

Desired assembly for loop:
.LBB0_1:
clb $1, $4, 0($c3)
csb $1, $4, 0($c4)
bnez $4, .LBB0_1
daddiu $4, $4, -1

Notes:
This particular variation of code generation is new and I've seen other contortions, so hopefully we can find a fix that's deeper than just spotting the pattern and replacing. It seems like, on some early level, branch delay semantics are not known and the instructions are contorted to put everything before the branch, eliminating the opportunity in the later levels to see the obvious use of the branch delay slot.

davidchisnall · 2017-08-31T10:04:18Z

At O2, I see this output:

	clb	$1, $4, 0($c3)
	csb	$1, $4, 0($c4)
	addiu	$2, $2, 1
	bnez	$2, .LBB0_1
	daddiu	$4, $4, -1

This is very close to correct. It seems that the loop is being turned into canonical form, where the loop induction variable counts up from zero.

The IR for the loop is:

do.body:                                          ; preds = %do.body, %entry
  %indvars.iv = phi i64 [ %indvars.iv.next, %do.body ], [ %0, %entry ]
  %arrayidx = getelementptr inbounds i8, i8 addrspace(200)* %in, i64 %indvars.iv
  %1 = load i8, i8 addrspace(200)* %arrayidx, align 1, !tbaa !3
  %arrayidx2 = getelementptr inbounds i8, i8 addrspace(200)* %out, i64 %indvars.iv
  store i8 %1, i8 addrspace(200)* %arrayidx2, align 1, !tbaa !3
  %indvars.iv.next = add nsw i64 %indvars.iv, -1
  %2 = trunc i64 %indvars.iv to i32
  %tobool = icmp eq i32 %2, 0
  br i1 %tobool, label %do.end, label %do.body

It looks as if loop-strength reduction is doing this:

do.body:                                          ; preds = %do.body, %entry
  %lsr.iv = phi i32 [ %lsr.iv.next, %do.body ], [ %1, %entry ]
  %indvars.iv = phi i64 [ %indvars.iv.next, %do.body ], [ %0, %entry ]
  %arrayidx = getelementptr inbounds i8, i8 addrspace(200)* %in, i64 %indvars.iv
  %2 = load i8, i8 addrspace(200)* %arrayidx, align 1, !tbaa !3
  %arrayidx2 = getelementptr inbounds i8, i8 addrspace(200)* %out, i64 %indvars.iv
  store i8 %2, i8 addrspace(200)* %arrayidx2, align 1, !tbaa !3
  %indvars.iv.next = add nsw i64 %indvars.iv, -1
  %lsr.iv.next = add i32 %lsr.iv, 1
  %tobool = icmp eq i32 %lsr.iv.next, 0
  br i1 %tobool, label %do.end, label %do.body

This seems to be where the extra induction variable is introduced.

davidchisnall · 2017-08-31T10:08:47Z

This is likely to be difficult to fix, because we typically need LSR to generate good code for MIPS. Without it, we end up with pointers being recomputed from the induction variable, which is a problem for MIPS with the lack of complex addressing modes, because we end up with more complex computations in loops.

davidchisnall · 2017-08-31T10:11:08Z

It appears that LSR has a number of hooks to understand target addressing modes that we're not using...

davidchisnall · 2017-08-31T11:12:31Z

Unfortunately, adding the the hooks doesn't seem to change what LSR does and without LSR we get even worse code:

	clb	$1, $4, 0($c3)
	sll	$2, $4, 0
	daddiu	$3, $4, -1
	csb	$1, $4, 0($c4)
	bnez	$2, .LBB0_1
	move	 $4, $3

This is almost better. The problem appears to be caused by the int. We don't have a 32-bit compare / branch, so we end up doing a truncate and then a 64-bit one. Changing the int to a long in the original example gives us this code:

	daddiu	$2, $zero, -1
.LBB0_1:
	clb	$1, $4, 0($c3)
	csb	$1, $4, 0($c4)
	daddiu	$4, $4, -1
	bne	$4, $2, .LBB0_1
	nop

This is still not an improvement in the instruction count, but it does mean that the target-independent optimisers are doing the right thing. Now the issue is that we can't fill the delay slot.

Currently, the delay-slot filler is fairly simple. To fill this, we'd have to change the branch condition from $2 to $zero. This kind of analysis is pretty hard to do in the machine code (where we know about the delay slot), because we'd need to understand the induction value.

In pure MIPS mode, we end up with this, which is even worse:

	daddiu	$2, $zero, -1
.LBB0_1:                                # %do.body
                                        # =>This Inner Loop Header: Depth=1
	daddu	$1, $4, $6
	lb	$1, 0($1)
	daddu	$3, $5, $6
	daddiu	$6, $6, -1
	bne	$6, $2, .LBB0_1
	sb	$1, 0($3)

This appears to be because the MIPS back end isn't using the register+register addressing mode correctly, so we end up needing two adds, which means that the store can move into the delay slot trivially. If the MIPS back end used the register-register addressing mode better, then this wouldn't be needed.

davidchisnall · 2017-08-31T11:16:47Z

Hmm, apparently I imagined the register-register addressing mode. That explains why this case doesn't matter for MIPS...

jonwoodruff · 2017-08-31T11:23:47Z

FWIW, the loop with the nop in the delay slot is the version I kept getting in memcpy. I must have been using longs (the type of "length" is some sort of parameter) and getting the simple case, but couldn't fill the delay. It would move the extra iteration out of the loop instead of using the delay slot, as you've shown above.

davidchisnall · 2017-08-31T11:31:19Z

I think that the only way to fix this is to teach the delay slot filler about this pattern. It's probably not worth it for memcpy, but if we see this in other benchmarks then it might be.

jonwoodruff · 2017-08-31T11:37:27Z

Can we get an assembly macro for a loop with the iterator in the branch delay slot? Exiting the loop when zero (when equal to another value takes an extra register) and decremented by a parameter?

MIPSLoopToZero(variable to decrement, c statements to do in loop, amount to decrement each time)
e.g.
MIPSLoopToZero(i, out[i]<=in[i]; , 4);

davidchisnall · 2017-08-31T11:45:23Z

This should work:

.macro copyloop load, store, src, dest, inc, induction=$4, tmp=$0
1:
	\load	\tmp, \induction, 0(\src)
	\store	\tmp, \induction, 0(\dest)
	daddiu	\induction, \induction, \inc
	bnez	$induction, 1b
.endm

copyloop clb, csb, $c3, $c4, -1
copyloop clw, csw, $c3, $c4, -4
copyloop cld, csd, $c3, $c4, -8
copyloop clc, csc, $c3, $c4, -16

arichardson · 2019-02-18T17:15:22Z

Seems like we are actually better than MIPS here so not filling the delay slot isn't too bad.

void doLoop_cap(char * __capability in, char * __capability out, int i) {
  do {
    out[i]=in[i];
  } while (i--);
}

void doLoop_ptr(char * in, char * out, int i) {
  do {
    out[i]=in[i];
  } while (i--);
}

	.text
	.abicalls
	.section	.mdebug.abi64,"",@progbits
	.nan	legacy
	.file	"bad-branch-delay-slot-usage-issue-293.c"
	.text
	.globl	doLoop_cap              # -- Begin function doLoop_cap
	.p2align	3
	.type	doLoop_cap,@function
	.set	nomicromips
	.set	nomips16
	.ent	doLoop_cap
doLoop_cap:                             # @doLoop_cap
	.frame	$fp,16,$ra
	.mask 	0x00000000,0
	.fmask	0x00000000,0
	.set	noreorder
	.set	nomacro
	.set	noat
# %bb.0:                                # %entry
	daddiu	$sp, $sp, -16
	sd	$fp, 8($sp)             # 8-byte Folded Spill
	move	$fp, $sp
	daddiu	$2, $zero, -1
.LBB0_1:                                # %do.body
                                        # =>This Inner Loop Header: Depth=1
	clb	$1, $4, 0($c3)
	csb	$1, $4, 0($c4)
	daddiu	$4, $4, -1
	bne	$4, $2, .LBB0_1
	nop
# %bb.2:                                # %do.end
	move	$sp, $fp
	ld	$fp, 8($sp)             # 8-byte Folded Reload
	jr	$ra
	daddiu	$sp, $sp, 16
	.set	at
	.set	macro
	.set	reorder
	.end	doLoop_cap
.Lfunc_end0:
	.size	doLoop_cap, .Lfunc_end0-doLoop_cap
                                        # -- End function
	.globl	doLoop_ptr              # -- Begin function doLoop_ptr
	.p2align	3
	.type	doLoop_ptr,@function
	.set	nomicromips
	.set	nomips16
	.ent	doLoop_ptr
doLoop_ptr:                             # @doLoop_ptr
	.frame	$fp,16,$ra
	.mask 	0x00000000,0
	.fmask	0x00000000,0
	.set	noreorder
	.set	nomacro
	.set	noat
# %bb.0:                                # %entry
	daddiu	$sp, $sp, -16
	sd	$fp, 8($sp)             # 8-byte Folded Spill
	move	$fp, $sp
	daddiu	$2, $zero, -1
.LBB1_1:                                # %do.body
                                        # =>This Inner Loop Header: Depth=1
	daddu	$1, $4, $6
	lb	$1, 0($1)
	daddu	$3, $5, $6
	daddiu	$6, $6, -1
	bne	$6, $2, .LBB1_1
	sb	$1, 0($3)
# %bb.2:                                # %do.end
	move	$sp, $fp
	ld	$fp, 8($sp)             # 8-byte Folded Reload
	jr	$ra
	daddiu	$sp, $sp, 16
	.set	at
	.set	macro
	.set	reorder
	.end	doLoop_ptr
.Lfunc_end1:
	.size	doLoop_ptr, .Lfunc_end1-doLoop_ptr
                                        # -- End function

	.ident	"clang version 8.0.0 "
	.section	".note.GNU-stack","",@progbits
	.addrsig
	.text

arichardson · 2024-02-20T22:30:53Z

Closing this since we don't care about improving MIPS codegen anymore.

arichardson transferred this issue from CTSRD-CHERI/llvm Feb 18, 2019

arichardson added the poor code generation label Feb 18, 2019

arichardson added a commit that referenced this issue Feb 18, 2019

Add test case for #293

3e78d7e

arichardson closed this as completed Feb 20, 2024

arichardson closed this as not planned Won't fix, can't repro, duplicate, stale Feb 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clang/LLVM can't generate optimal loop in MIPS. #293

Clang/LLVM can't generate optimal loop in MIPS. #293

jonwoodruff commented Aug 31, 2017

davidchisnall commented Aug 31, 2017

davidchisnall commented Aug 31, 2017

davidchisnall commented Aug 31, 2017

davidchisnall commented Aug 31, 2017

davidchisnall commented Aug 31, 2017

jonwoodruff commented Aug 31, 2017

davidchisnall commented Aug 31, 2017

jonwoodruff commented Aug 31, 2017

davidchisnall commented Aug 31, 2017

arichardson commented Feb 18, 2019

arichardson commented Feb 20, 2024

Clang/LLVM can't generate optimal loop in MIPS. #293

Clang/LLVM can't generate optimal loop in MIPS. #293

Comments

jonwoodruff commented Aug 31, 2017

davidchisnall commented Aug 31, 2017

davidchisnall commented Aug 31, 2017

davidchisnall commented Aug 31, 2017

davidchisnall commented Aug 31, 2017

davidchisnall commented Aug 31, 2017

jonwoodruff commented Aug 31, 2017

davidchisnall commented Aug 31, 2017

jonwoodruff commented Aug 31, 2017

davidchisnall commented Aug 31, 2017

arichardson commented Feb 18, 2019

arichardson commented Feb 20, 2024