Reorder instructions to delay read after load #118

Open
xrchz opened this issue Jul 11, 2016 · 8 comments
Comments

@xrchz
Member

xrchz commented Jul 11, 2016

Summary: in wordLang, moving one or two instructions so that they sit between a memory load and the first read of the loaded value, where possible, can improve performance on pipelined architectures.

More details: https://lists.cakeml.org/pipermail/users/2016-April/000088.html

I believe @cmr may be looking into this.

myreen added a commit that referenced this issue Jul 13, 2016
The composition of bvp-to-word is complicated and difficult
to navigate, but we want to add a few simple optimisation
passes that run right after translation into wordLang and
before register allocation.

The purpose of word_simp is to be an understandable entry
point into this part of the compiler. The correctness
requirements for word_simp are strict and simple:

    evaluate (prog,s) = (res,s2) /\ res <> SOME Error ==>
    evaluate (word_simp$compile_exp prog,s) = (res,s2)

    extract_labels (word_simp$compile_exp p) = extract_labels p

The current implementation of word_simp is the identity
function, but will be expanded to include at least:

 - clean up of wordLang programs: delete Skips and make Seq
   associate to the left so that pattern matching is easier in
   other phases (see the sketch after this list). I'm working on this.

 - combine certain consecutive If statements (issue #103).
   I'm working on this.

 - move load instructions upwards (issue #118?). Someone else
   might be working on this.
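
For illustration, here is a minimal sketch of the first item in OCaml, assuming a toy model of the AST; the real pass would be defined over the HOL4 wordLang datatype, so the type and names below are purely illustrative.

    (* Toy stand-in for wordLang programs (illustrative only). *)
    type prog =
      | Skip
      | Inst of string            (* stand-in for any non-Seq instruction *)
      | Seq of prog * prog

    (* Collect the non-Skip leaves in order, then rebuild a left-associated
       Seq, so later phases only ever see Seq (Seq (..., p), q). *)
    let rec collect p acc =
      match p with
      | Skip -> acc
      | Seq (p1, p2) -> collect p1 (collect p2 acc)
      | _ -> p :: acc

    let simp_seq p =
      match collect p [] with
      | [] -> Skip
      | q :: qs -> List.fold_left (fun l r -> Seq (l, r)) q qs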
@myreen
Contributor

myreen commented Jul 13, 2016

I agree that one should try to insert instructions between memory loads and the first use of the loaded value, i.e.:

x := mem[addr];
<-- insert a few unrelated instructions here if possible -->
instructions that read x

However, I don't understand the part about "before conditionals". Looking at the link, I suspect the "before conditionals" part might be too literal a reading of my example in the email. It's the first use of the loaded value that matters, not whether it was part of a conditional expression.
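
To make the intended reordering concrete, here is a rough OCaml sketch on a toy straight-line representation; the defs/uses/mem fields are an illustrative model of my own, not anything taken from the CakeML sources, and a real wordLang pass would of course also need a semantics-preservation proof.

    (* Toy instructions, described only by what they read and write. *)
    type inst = {
      name : string;        (* pretty-printed instruction, for illustration *)
      defs : string list;   (* variables written *)
      uses : string list;   (* variables read *)
      mem  : bool;          (* loads/stores: keep memory operations in order *)
    }

    let disjoint xs ys = not (List.exists (fun x -> List.mem x ys) xs)

    (* Swapping a and b is safe when neither reads or writes what the other
       writes, and at most one of them touches memory. *)
    let can_swap a b =
      disjoint a.defs b.uses && disjoint a.uses b.defs
      && disjoint a.defs b.defs && not (a.mem && b.mem)

    (* If an instruction reads the value produced by the load right before
       it, pull the next independent instruction in between the two. *)
    let rec delay_read prog =
      match prog with
      | ld :: user :: filler :: rest
        when ld.mem
          && not (disjoint ld.defs user.uses)   (* user reads what ld loaded *)
          && can_swap filler user ->
          ld :: filler :: delay_read (user :: rest)
      | i :: rest -> i :: delay_read rest
      | [] -> []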

xrchz changed the title from "Reorder instructions before branches" to "Reorder instructions to delay read after load" on Jul 13, 2016
@xrchz
Member Author

xrchz commented Jul 13, 2016

I updated the issue summary and name accordingly.

@emberian

@myreen Ramana and I took a look at this and I think this particular optimization is better done at the asm level. Anything we're missing?

@myreen
Copy link
Contributor

myreen commented Jul 27, 2016

I think the asm level (I suspect you mean LabLang) is way too late for this optimisation. At LabLang you cannot see the difference between stack accesses and general memory accesses. Furthermore, the register allocator might have unknowingly picked register names that prevent the optimisation.

The optimisation should be done before register allocation so that the allocator hasn't produced constraints that prevent reordering.

A similar optimisation could be done in StackLang for stack loads: they too should be pushed upwards. Note that this also has to happen in StackLang, before stack accesses have been turned into general memory accesses.
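
As a small, entirely hypothetical example of how allocation can block the move (OCaml again; the variable and register names below are made up for illustration):

    (* Instructions as (defined variable, read variables). *)
    let before_alloc = [ ("z", ["y"]);     (* z := y + 1    *)
                         ("x", ["a"]);     (* x := mem[a]   *)
                         ("w", ["x"]) ]    (* w := x + 2    *)

    (* Suppose the allocator happens to put y's value and the load target x
       in the same register r1: the load now clobbers a register the earlier
       instruction still reads, so it can no longer be moved upwards. *)
    let after_alloc = [ ("r2", ["r1"]);    (* r2 := r1 + 1  *)
                        ("r1", ["r0"]);    (* r1 := mem[r0] *)
                        ("r3", ["r1"]) ]   (* r3 := r1 + 2  *)

    (* Moving the second instruction above the first is only safe when
       neither writes something the other reads or writes. *)
    let can_move_up (d2, u2) (d1, u1) =
      not (List.mem d2 u1) && not (List.mem d1 u2) && d2 <> d1

    let () =
      assert (can_move_up (List.nth before_alloc 1) (List.nth before_alloc 0));
      assert (not (can_move_up (List.nth after_alloc 1) (List.nth after_alloc 0)))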

@SOwens
Member

SOwens commented Jul 27, 2016

Three comments:

This looks like a simple instruction-scheduling optimisation to me. The usual trade-off is that doing it before register allocation can lengthen live ranges and lead to a worse allocation, while doing it after allocation introduces false dependencies that impede the scheduling.

I would guess on a modern out-of-order architecture that the hardware's issue buffer will easily do this optimisation for you, especially if you aren't moving the load very far. I wouldn't expect to see any performance gains on an x86 desktop/laptop. Is there a class of ARMs that are pipelined, but not out-of-order? I'd doubt it: on all of their multi-cores, speculative execution is observable. Would an embedded device be pipelined?

Why does it matter if you can tell the difference between stack accesses and other accesses to memory? I guess stack accesses are likely to hit in L1 cache, but otherwise they are just normal loads, right?

@myreen
Contributor

myreen commented Jul 27, 2016

Even the simplest ARM processors (M0) have pipelines. I doubt there are processors without pipelines nowadays. I agree that on x86 this optimisation will in all likelihood be done on the fly, but small ARMs are not terribly smart with instruction reordering, as far as I know.

The difference between stack and heap accesses: In the semantics of StackLang, it is obvious that you can always move a stack access independently of heap accesses. Once you are in LabLang, all of them become opaque memory accesses and thus reordering might be impossible. That's why it makes sense to do the reordering while they are still proper stack accesses in the semantics.
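
A tiny OCaml sketch of the point, with made-up constructor names: in a StackLang-style semantics stack slots and the heap are separate, so the commutation facts below are immediate; after compilation to LabLang every one of these is just a memory access and the same reordering would need alias information.

    type mem_inst =
      | StackLoad of string * int       (* r := stack[n]  *)
      | HeapLoad  of string * string    (* r := mem[addr] *)
      | HeapStore of string * string    (* mem[addr] := r *)

    (* Ignoring register dependencies, may the two instructions be swapped? *)
    let memory_commutes a b =
      match a, b with
      | StackLoad _, (HeapLoad _ | HeapStore _)
      | (HeapLoad _ | HeapStore _), StackLoad _ -> true  (* disjoint address spaces *)
      | StackLoad _, StackLoad _ -> true                 (* stack reads never conflict *)
      | HeapLoad _, HeapLoad _ -> true                   (* heap reads never conflict *)
      | _ -> false                                       (* a heap store may alias *)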

In case @cmr wants to look at something else, here are some suggestions:

  • constant propagation in WordLang: currently data-to-word never produces things like compare-with-immediate; a simple pass that copies constant values in as immediates would be a good optimisation to have. WordLang's new dead-code elimination phase can then delete the original constant assignments if they become dead.
  • lumping together of single Moves into parallel Moves: again, data-to-word never produces parallel moves, which WordLang supports and word-to-stack compiles well. An inspection of some code Yong Kiam produced with the compiler shows a lot of single Moves that ought to be lumped together.
  • simple optimisations in LabLang: LabLang contains places where an unconditional jump jumps to the next instruction. Such jumps should be rewritten to Skip (which gets deleted by lab_filter); see the sketch after this list.
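
For the third suggestion, here is a minimal OCaml sketch on a made-up LabLang-like list of lines; the real LabLang datatype in the CakeML sources is richer, so this only shows the shape of the rewrite.

    type line =
      | Label of int
      | Jump of int          (* unconditional jump to the label with this number *)
      | Instr of string      (* any other instruction *)

    (* A jump whose target is the label immediately after it does nothing, so
       drop it (in the real compiler it would become a Skip, to be removed by
       a later filtering pass such as lab_filter). *)
    let rec drop_jump_to_next lines =
      match lines with
      | Jump n :: Label m :: rest when n = m ->
          Label m :: drop_jump_to_next rest
      | l :: rest -> l :: drop_jump_to_next rest
      | [] -> []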

@xrchz
Member Author

xrchz commented Jul 27, 2016

I have created new issues for the above suggestions, and I encourage you to create more if you have more. It's good to have simple tasks like this for newcomers to get an introduction to CakeML proofs. I will update the website too, to point to them (and we might consider an issue label to mark such issues, both so newbies can find them and so experienced hackers don't mop them up too quickly). The issues are #129, #130, and #131.

@tanyongkiam
Contributor

This issue would benefit from some empirical experimentation first, e.g., looking at explorer output.
