Reorder instructions to delay read after load #118

Open
xrchz opened this issue Jul 11, 2016 · 8 comments
Comments

@xrchz
Member

xrchz commented Jul 11, 2016

Summary: in wordLang, moving one or two instructions so that they sit between a memory load and the first read of the loaded value, where possible, can improve performance on pipelined architectures.

More details: https://lists.cakeml.org/pipermail/users/2016-April/000088.html

I believe @cmr may be looking into this.

myreen added a commit that referenced this issue Jul 13, 2016
The composition of bvp-to-word is complicated and difficult
to navigate, but we want to add a few simple optimisation
passes that run right after translation into wordLang and
before register allocation.

The purpose of word_simp is to be an understandable entry
point into this part of the compiler. The correctness
requirements for word_simp are strict and simple:

    evaluate (prog,s) = (res,s2) /\ res <> SOME Error ==>
    evaluate (word_simp$compile_exp prog,s) = (res,s2)

    extract_labels (word_simp$compile_exp p) = extract_labels p

The current implementation of word_simp is the identity
function, but will be expanded to include at least:

 - clean up of wordLang programs: delete Skips and make Seq
   associate to the left so that pattern matching is easier in
   other phases (see the sketch after this list). I'm working on this.

 - combine certain consecutive If statements (issue #103).
   I'm working on this.

 - move load instructions upwards (issue #118?). Someone else
   might be working on this.
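
For illustration, here is a minimal sketch of the first item in OCaml, assuming a toy model of the AST; the real pass would be defined over the HOL4 wordLang datatype, so the type and names below are purely illustrative.

    (* Toy stand-in for wordLang programs (illustrative only). *)
    type prog =
      | Skip
      | Inst of string            (* stand-in for any non-Seq instruction *)
      | Seq of prog * prog

    (* Collect the non-Skip leaves in order, then rebuild a left-associated
       Seq, so later phases only ever see Seq (Seq (..., p), q). *)
    let rec collect p acc =
      match p with
      | Skip -> acc
      | Seq (p1, p2) -> collect p1 (collect p2 acc)
      | _ -> p :: acc

    let simp_seq p =
      match collect p [] with
      | [] -> Skip
      | q :: qs -> List.fold_left (fun l r -> Seq (l, r)) q qs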
@myreen
Contributor

myreen commented Jul 13, 2016

I agree that one should try to insert instructions between memory loads and the first use of the loaded value, i.e.:

x := mem[addr];
<-- insert a few unrelated instructions here if possible -->
instructions that read x

However, I don't understand the part about "before conditionals". Looking at the link, I suspect the "before conditionals" part might be too literal a reading of my example in the email. It's the first use of the loaded value that matters, not whether it was part of a conditional expression.
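
To make the intended reordering concrete, here is a rough OCaml sketch on a toy straight-line representation; the defs/uses/mem fields are an illustrative model of my own, not anything taken from the CakeML sources, and a real wordLang pass would of course also need a semantics-preservation proof.

    (* Toy instructions, described only by what they read and write. *)
    type inst = {
      name : string;        (* pretty-printed instruction, for illustration *)
      defs : string list;   (* variables written *)
      uses : string list;   (* variables read *)
      mem  : bool;          (* loads/stores: keep memory operations in order *)
    }

    let disjoint xs ys = not (List.exists (fun x -> List.mem x ys) xs)

    (* Swapping a and b is safe when neither reads or writes what the other
       writes, and at most one of them touches memory. *)
    let can_swap a b =
      disjoint a.defs b.uses && disjoint a.uses b.defs
      && disjoint a.defs b.defs && not (a.mem && b.mem)

    (* If an instruction reads the value produced by the load right before
       it, pull the next independent instruction in between the two. *)
    let rec delay_read prog =
      match prog with
      | ld :: user :: filler :: rest
        when ld.mem
          && not (disjoint ld.defs user.uses)   (* user reads what ld loaded *)
          && can_swap filler user ->
          ld :: filler :: delay_read (user :: rest)
      | i :: rest -> i :: delay_read rest
      | [] -> []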

xrchz changed the title from "Reorder instructions before branches" to "Reorder instructions to delay read after load" on Jul 13, 2016
@xrchz
Member Author

xrchz commented Jul 13, 2016

I updated the issue summary and name accordingly.

@emberian

@myreen Ramana and I took a look at this and I think this particular optimization is better done at the asm level. Anything we're missing?

@myreen
Copy link
Contributor

myreen commented Jul 27, 2016

I think the asm level (I suspect you mean LabLang) is way too late for this optimisation. At LabLang you cannot see the difference between stack accesses and general memory accesses. Furthermore, the register allocator might have unknowingly picked register names that prevent the optimisation.

The optimisation should be done before register allocation so that the allocator hasn't produced constraints that prevent reordering.

A similar optimisation could be done in StackLang for stack loads: they too should be pushed upwards. Note that this also has to happen in StackLang, before stack accesses have been turned into general memory accesses.
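
As a small, entirely hypothetical example of how allocation can block the move (OCaml again; the variable and register names below are made up for illustration):

    (* Instructions as (defined variable, read variables). *)
    let before_alloc = [ ("z", ["y"]);     (* z := y + 1    *)
                         ("x", ["a"]);     (* x := mem[a]   *)
                         ("w", ["x"]) ]    (* w := x + 2    *)

    (* Suppose the allocator happens to put y's value and the load target x
       in the same register r1: the load now clobbers a register the earlier
       instruction still reads, so it can no longer be moved upwards. *)
    let after_alloc = [ ("r2", ["r1"]);    (* r2 := r1 + 1  *)
                        ("r1", ["r0"]);    (* r1 := mem[r0] *)
                        ("r3", ["r1"]) ]   (* r3 := r1 + 2  *)

    (* Moving the second instruction above the first is only safe when
       neither writes something the other reads or writes. *)
    let can_move_up (d2, u2) (d1, u1) =
      not (List.mem d2 u1) && not (List.mem d1 u2) && d2 <> d1

    let () =
      assert (can_move_up (List.nth before_alloc 1) (List.nth before_alloc 0));
      assert (not (can_move_up (List.nth after_alloc 1) (List.nth after_alloc 0)))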

@SOwens
Member

SOwens commented Jul 27, 2016

Three comments:

This looks like a simple instruction-scheduling optimisation to me. The usual trade-off is that doing it before register allocation can lengthen live ranges and lead to a worse allocation, while doing it after allocation introduces false dependencies that impede the scheduling.

I would guess on a modern out-of-order architecture that the hardware's issue buffer will easily do this optimisation for you, especially if you aren't moving the load very far. I wouldn't expect to see any performance gains on an x86 desktop/laptop. Is there a class of ARMs that are pipelined, but not out-of-order? I'd doubt it: on all of their multi-cores, speculative execution is observable. Would an embedded device be pipelined?

Why does it matter if you can tell the difference between stack accesses and other accesses to memory? I guess stack accesses are likely to hit in L1 cache, but otherwise they are just normal loads, right?

@myreen
Contributor

myreen commented Jul 27, 2016

Even the simplest ARM processors (M0) have pipelines. I doubt there are processors without pipelines nowadays. I agree that on x86 this optimisation will in all likelihood be done on the fly, but small ARMs are not terribly smart with instruction reordering, as far as I know.

The difference between stack and heap accesses: In the semantics of StackLang, it is obvious that you can always move a stack access independently of heap accesses. Once you are in LabLang, all of them become opaque memory accesses and thus reordering might be impossible. That's why it makes sense to do the reordering while they are still proper stack accesses in the semantics.
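
A tiny OCaml sketch of the point, with made-up constructor names: in a StackLang-style semantics stack slots and the heap are separate, so the commutation facts below are immediate; after compilation to LabLang every one of these is just a memory access and the same reordering would need alias information.

    type mem_inst =
      | StackLoad of string * int       (* r := stack[n]  *)
      | HeapLoad  of string * string    (* r := mem[addr] *)
      | HeapStore of string * string    (* mem[addr] := r *)

    (* Ignoring register dependencies, may the two instructions be swapped? *)
    let memory_commutes a b =
      match a, b with
      | StackLoad _, (HeapLoad _ | HeapStore _)
      | (HeapLoad _ | HeapStore _), StackLoad _ -> true  (* disjoint address spaces *)
      | StackLoad _, StackLoad _ -> true                 (* stack reads never conflict *)
      | HeapLoad _, HeapLoad _ -> true                   (* heap reads never conflict *)
      | _ -> false                                       (* a heap store may alias *)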

In case @cmr wants to look at something else, here are some suggestions:

  • constant propagation in WordLang: currently data-to-word never produces things like compare-with-immediate; a simple pass that copies constant values in as immediates would be a good optimisation to have. WordLang's new dead-code elimination phase can then delete the original constant assignments if they become dead.
  • lumping together of single Moves into parallel Moves: again, data-to-word never produces parallel moves, which WordLang supports and word-to-stack compiles well. An inspection of some code Yong Kiam produced with the compiler shows a lot of single Moves that ought to be lumped together.
  • simple optimisations in LabLang: LabLang contains places where an unconditional jump jumps to the next instruction. Such jumps should be rewritten to Skip (which gets deleted by lab_filter); see the sketch after this list.
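
For the third suggestion, here is a minimal OCaml sketch on a made-up LabLang-like list of lines; the real LabLang datatype in the CakeML sources is richer, so this only shows the shape of the rewrite.

    type line =
      | Label of int
      | Jump of int          (* unconditional jump to the label with this number *)
      | Instr of string      (* any other instruction *)

    (* A jump whose target is the label immediately after it does nothing, so
       drop it (in the real compiler it would become a Skip, to be removed by
       a later filtering pass such as lab_filter). *)
    let rec drop_jump_to_next lines =
      match lines with
      | Jump n :: Label m :: rest when n = m ->
          Label m :: drop_jump_to_next rest
      | l :: rest -> l :: drop_jump_to_next rest
      | [] -> []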

@xrchz
Member Author

xrchz commented Jul 27, 2016

I have created new issues for the above suggestions, and I encourage you to create more if you have more. It's good to have simple tasks like this for newcomers to get an introduction to CakeML proofs. I will update the website too, to point to them (and we might consider an issue label to mark such issues, both so newbies can find them and so experienced hackers don't mop them up too quickly). The issues are #129, #130, and #131.

@tanyongkiam
Contributor

This issue would benefit from some empirical experimentation first, e.g., looking at explorer output.
