Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clang/LLVM can't generate optimal loop in MIPS. #293

Closed
jonwoodruff opened this issue Aug 31, 2017 · 11 comments
Closed

Clang/LLVM can't generate optimal loop in MIPS. #293

jonwoodruff opened this issue Aug 31, 2017 · 11 comments

Comments

@jonwoodruff
Copy link

MemCopyLoop.zip

C function:
void doLoop(char * __capability in, char * __capability out, int i)
{
do {
out[i]=in[i];
} while (i--);
}

Generated assembly for loop:
not $2, $4
.LBB0_1:
clb $1, $4, 0($c3)
addiu $2, $2, 1
csb $1, $4, 0($c4)
bnez $2, .LBB0_1
daddiu $4, $4, -1

Desired assembly for loop:
.LBB0_1:
clb $1, $4, 0($c3)
csb $1, $4, 0($c4)
bnez $4, .LBB0_1
daddiu $4, $4, -1

Notes:
This particular variation of code generation is new and I've seen other contortions, so hopefully we can find a fix that's deeper than just spotting the pattern and replacing. It seems like, on some early level, branch delay semantics are not known and the instructions are contorted to put everything before the branch, eliminating the opportunity in the later levels to see the obvious use of the branch delay slot.

@davidchisnall
Copy link
Member

At O2, I see this output:

	clb	$1, $4, 0($c3)
	csb	$1, $4, 0($c4)
	addiu	$2, $2, 1
	bnez	$2, .LBB0_1
	daddiu	$4, $4, -1

This is very close to correct. It seems that the loop is being turned into canonical form, where the loop induction variable counts up from zero.

The IR for the loop is:

do.body:                                          ; preds = %do.body, %entry
  %indvars.iv = phi i64 [ %indvars.iv.next, %do.body ], [ %0, %entry ]
  %arrayidx = getelementptr inbounds i8, i8 addrspace(200)* %in, i64 %indvars.iv
  %1 = load i8, i8 addrspace(200)* %arrayidx, align 1, !tbaa !3
  %arrayidx2 = getelementptr inbounds i8, i8 addrspace(200)* %out, i64 %indvars.iv
  store i8 %1, i8 addrspace(200)* %arrayidx2, align 1, !tbaa !3
  %indvars.iv.next = add nsw i64 %indvars.iv, -1
  %2 = trunc i64 %indvars.iv to i32
  %tobool = icmp eq i32 %2, 0
  br i1 %tobool, label %do.end, label %do.body

It looks as if loop-strength reduction is doing this:

do.body:                                          ; preds = %do.body, %entry
  %lsr.iv = phi i32 [ %lsr.iv.next, %do.body ], [ %1, %entry ]
  %indvars.iv = phi i64 [ %indvars.iv.next, %do.body ], [ %0, %entry ]
  %arrayidx = getelementptr inbounds i8, i8 addrspace(200)* %in, i64 %indvars.iv
  %2 = load i8, i8 addrspace(200)* %arrayidx, align 1, !tbaa !3
  %arrayidx2 = getelementptr inbounds i8, i8 addrspace(200)* %out, i64 %indvars.iv
  store i8 %2, i8 addrspace(200)* %arrayidx2, align 1, !tbaa !3
  %indvars.iv.next = add nsw i64 %indvars.iv, -1
  %lsr.iv.next = add i32 %lsr.iv, 1
  %tobool = icmp eq i32 %lsr.iv.next, 0
  br i1 %tobool, label %do.end, label %do.body

This seems to be where the extra induction variable is introduced.

@davidchisnall
Copy link
Member

This is likely to be difficult to fix, because we typically need LSR to generate good code for MIPS. Without it, we end up with pointers being recomputed from the induction variable, which is a problem for MIPS with the lack of complex addressing modes, because we end up with more complex computations in loops.

@davidchisnall
Copy link
Member

It appears that LSR has a number of hooks to understand target addressing modes that we're not using...

@davidchisnall
Copy link
Member

Unfortunately, adding the the hooks doesn't seem to change what LSR does and without LSR we get even worse code:

	clb	$1, $4, 0($c3)
	sll	$2, $4, 0
	daddiu	$3, $4, -1
	csb	$1, $4, 0($c4)
	bnez	$2, .LBB0_1
	move	 $4, $3

This is almost better. The problem appears to be caused by the int. We don't have a 32-bit compare / branch, so we end up doing a truncate and then a 64-bit one. Changing the int to a long in the original example gives us this code:

	daddiu	$2, $zero, -1
.LBB0_1:
	clb	$1, $4, 0($c3)
	csb	$1, $4, 0($c4)
	daddiu	$4, $4, -1
	bne	$4, $2, .LBB0_1
	nop

This is still not an improvement in the instruction count, but it does mean that the target-independent optimisers are doing the right thing. Now the issue is that we can't fill the delay slot.

Currently, the delay-slot filler is fairly simple. To fill this, we'd have to change the branch condition from $2 to $zero. This kind of analysis is pretty hard to do in the machine code (where we know about the delay slot), because we'd need to understand the induction value.

In pure MIPS mode, we end up with this, which is even worse:

	daddiu	$2, $zero, -1
.LBB0_1:                                # %do.body
                                        # =>This Inner Loop Header: Depth=1
	daddu	$1, $4, $6
	lb	$1, 0($1)
	daddu	$3, $5, $6
	daddiu	$6, $6, -1
	bne	$6, $2, .LBB0_1
	sb	$1, 0($3)

This appears to be because the MIPS back end isn't using the register+register addressing mode correctly, so we end up needing two adds, which means that the store can move into the delay slot trivially. If the MIPS back end used the register-register addressing mode better, then this wouldn't be needed.

@davidchisnall
Copy link
Member

Hmm, apparently I imagined the register-register addressing mode. That explains why this case doesn't matter for MIPS...

@jonwoodruff
Copy link
Author

FWIW, the loop with the nop in the delay slot is the version I kept getting in memcpy. I must have been using longs (the type of "length" is some sort of parameter) and getting the simple case, but couldn't fill the delay. It would move the extra iteration out of the loop instead of using the delay slot, as you've shown above.

@davidchisnall
Copy link
Member

I think that the only way to fix this is to teach the delay slot filler about this pattern. It's probably not worth it for memcpy, but if we see this in other benchmarks then it might be.

@jonwoodruff
Copy link
Author

Can we get an assembly macro for a loop with the iterator in the branch delay slot? Exiting the loop when zero (when equal to another value takes an extra register) and decremented by a parameter?

MIPSLoopToZero(variable to decrement, c statements to do in loop, amount to decrement each time)
e.g.
MIPSLoopToZero(i, out[i]<=in[i]; , 4);

@davidchisnall
Copy link
Member

This should work:

.macro copyloop load, store, src, dest, inc, induction=$4, tmp=$0
1:
	\load	\tmp, \induction, 0(\src)
	\store	\tmp, \induction, 0(\dest)
	daddiu	\induction, \induction, \inc
	bnez	$induction, 1b
.endm

copyloop clb, csb, $c3, $c4, -1
copyloop clw, csw, $c3, $c4, -4
copyloop cld, csd, $c3, $c4, -8
copyloop clc, csc, $c3, $c4, -16

@arichardson arichardson transferred this issue from CTSRD-CHERI/llvm Feb 18, 2019
@arichardson
Copy link
Member

Seems like we are actually better than MIPS here so not filling the delay slot isn't too bad.

void doLoop_cap(char * __capability in, char * __capability out, int i) {
  do {
    out[i]=in[i];
  } while (i--);
}

void doLoop_ptr(char * in, char * out, int i) {
  do {
    out[i]=in[i];
  } while (i--);
}
	.text
	.abicalls
	.section	.mdebug.abi64,"",@progbits
	.nan	legacy
	.file	"bad-branch-delay-slot-usage-issue-293.c"
	.text
	.globl	doLoop_cap              # -- Begin function doLoop_cap
	.p2align	3
	.type	doLoop_cap,@function
	.set	nomicromips
	.set	nomips16
	.ent	doLoop_cap
doLoop_cap:                             # @doLoop_cap
	.frame	$fp,16,$ra
	.mask 	0x00000000,0
	.fmask	0x00000000,0
	.set	noreorder
	.set	nomacro
	.set	noat
# %bb.0:                                # %entry
	daddiu	$sp, $sp, -16
	sd	$fp, 8($sp)             # 8-byte Folded Spill
	move	$fp, $sp
	daddiu	$2, $zero, -1
.LBB0_1:                                # %do.body
                                        # =>This Inner Loop Header: Depth=1
	clb	$1, $4, 0($c3)
	csb	$1, $4, 0($c4)
	daddiu	$4, $4, -1
	bne	$4, $2, .LBB0_1
	nop
# %bb.2:                                # %do.end
	move	$sp, $fp
	ld	$fp, 8($sp)             # 8-byte Folded Reload
	jr	$ra
	daddiu	$sp, $sp, 16
	.set	at
	.set	macro
	.set	reorder
	.end	doLoop_cap
.Lfunc_end0:
	.size	doLoop_cap, .Lfunc_end0-doLoop_cap
                                        # -- End function
	.globl	doLoop_ptr              # -- Begin function doLoop_ptr
	.p2align	3
	.type	doLoop_ptr,@function
	.set	nomicromips
	.set	nomips16
	.ent	doLoop_ptr
doLoop_ptr:                             # @doLoop_ptr
	.frame	$fp,16,$ra
	.mask 	0x00000000,0
	.fmask	0x00000000,0
	.set	noreorder
	.set	nomacro
	.set	noat
# %bb.0:                                # %entry
	daddiu	$sp, $sp, -16
	sd	$fp, 8($sp)             # 8-byte Folded Spill
	move	$fp, $sp
	daddiu	$2, $zero, -1
.LBB1_1:                                # %do.body
                                        # =>This Inner Loop Header: Depth=1
	daddu	$1, $4, $6
	lb	$1, 0($1)
	daddu	$3, $5, $6
	daddiu	$6, $6, -1
	bne	$6, $2, .LBB1_1
	sb	$1, 0($3)
# %bb.2:                                # %do.end
	move	$sp, $fp
	ld	$fp, 8($sp)             # 8-byte Folded Reload
	jr	$ra
	daddiu	$sp, $sp, 16
	.set	at
	.set	macro
	.set	reorder
	.end	doLoop_ptr
.Lfunc_end1:
	.size	doLoop_ptr, .Lfunc_end1-doLoop_ptr
                                        # -- End function

	.ident	"clang version 8.0.0 "
	.section	".note.GNU-stack","",@progbits
	.addrsig
	.text

arichardson added a commit that referenced this issue Feb 18, 2019
@arichardson
Copy link
Member

Closing this since we don't care about improving MIPS codegen anymore.

@arichardson arichardson closed this as not planned Won't fix, can't repro, duplicate, stale Feb 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants