/**
# (Chapter 0) Intro: Not your Grandfather's Compiler
<a name="chap:300"></a>
This part discusses compiler internals. How do embedded compilers compile
their programs?
Outline:
<div id="tableofcontents"></div>
The purely string based representation of staged programs from [Part
2](02_basics.html) does not allow analysis or transformation of embedded
programs. Since LMS is not inherently tied to a particular program
representation it is very easy to pick one that is better suited for
optimization. As a first cut, we switch to an intermediate representation (IR)
based on expression trees, adding a level of indirection between construction
of object programs and code generation (see [below](#chap:310trees)). On this
tree IR we can define traversal and transformation passes and build a
straightforward embedded compiler. We can add new IR node types and new
transformation passes that implement domain specific optimizations. In
particular we can use multiple passes of staging: While traversing
(effectively, interpreting) one IR we can execute staging commands to build
another staged program, possibly in a different, lower-level object language.
However, the extremely high degree of extensibility poses serious challenges.
In particular, the interplay of optimizations implemented as many separate
phases does not yield good results due to the phase ordering problem: It is
unclear in which order and how often to execute these phases, and since each
optimization pass has to make pessimistic assumptions about the outcome of
all other passes the global result is often suboptimal compared to a
dedicated, combined optimization phase
[(*)](veldhuizen:combiningoptimizations,click95combineanalysis). There are
also implementation challenges as each optimization needs to be designed to
treat unknown IR nodes in a sensible way.
Other challenges are due to the fact that embedded compilers are supposed to
be used like libraries. Extending an embedded compiler should be easy, and as
much of the work as possible should be delegated to a library of compiler
components. Newly defined high-level IR nodes should profit from generic
optimizations automatically.
To remedy this situation, we switch to a graph-based ''sea of nodes''
representation (see [below](#chap:320graphs)). This representation links
definitions and uses, and it also reflects the program block structure via
nesting edges. We consider purely functional programs first. A number of
nontrivial optimizations become considerably simpler. Common subexpression
elimination (CSE) and dead code elimination (DCE) are particularly easy. Both
are completely generic and support an open domain of IR node types.
Optimizations that can be expressed as context-free rewrites are also easy to
add in a modular fashion. A scheduling and code motion algorithm transforms
graphs back into trees, moving computations to places where they are less
often executed, e.g. out of loops or functions. Both graph-based and tree-
based transformations are useful: graph-based transformations are usually
simpler and more efficient whereas tree-based transformations, implemented as
multiple staging passes, can be more powerful and employ arbitrary context-
sensitive information.
To support effectful programs, we make effects explicit in the dependency
graph (similar to SSA form). We can support simple effect domains (pure vs
effectful) and more fine-grained ones, such as tracking modifications per
allocation site. The latter relies on alias and points-to analysis.
We then turn to advanced optimizations (see [below](#chap:330opt)). For
combining analyses and optimizations, it is crucial to maintain optimistic
assumptions for all analyses. The key challenge is that one analysis has to
anticipate the effects of the other transformations. The solution is
speculative rewriting [(*)](lerner02composingdataflow): transform a program
fragment in the presence of partial and possibly unsound analysis results and
re-run the analyses on the transformed code until a fixpoint is reached. This
way, different analyses can communicate through the transformed code and need
not anticipate their results in other ways. Using speculative rewriting, we
compose many optimizations into more powerful combined passes. Often, a
single forward simplification pass is sufficient to clean up after
non-optimizing transformations.
However, not all rewrites can fruitfully be combined into a single phase. For
example, high-level representations of linear algebra operations may give rise
to rewrite rules like $I M \rightarrow M$ where $I$ is the identity matrix. At
the same time, there may be rules that define how a matrix multiplication can
be implemented in terms of arrays and while loops, or a call to an external
library (BLAS). To be effective, all the high-level simplifications need to
be applied exhaustively before any of the lowering transformations are
applied. But lowering transformations may create new opportunities for high-
level rules, too. Our solution here is delayed rewriting: programmers can
specify that a certain rewrite should not be applied right now, but have it
registered to be executed at the next iteration of a particular phase. Delayed
rewriting thus provides a way of grouping and prioritizing modularly defined
transformations.
On top of this infrastructure, we build a number of advanced optimizations. A
general pattern is split and merge: We split operations and data structures in
order to expose their components to rewrites and dead-code elimination and
then merge the remaining parts back together. This struct transformation also
allows for more general data structure conversions, including array-of-struct
to struct-of-array representation conversion. Furthermore we present a novel
loop fusion algorithm, a powerful transformation that removes intermediate
data structures.
# (Chapter 1) Intermediate Representation: Trees
\label{chap:310trees}
With the aim of generating code, we could represent staged expressions
directly as strings, as done in [Part 2](02_basics.html). But for optimization
purposes we would rather have a structured intermediate representation that we
can analyze in various ways. Fortunately, LMS makes it very easy to use a
different internal program representation.
# Trees Instead of Strings
\label{sec:301}
Our starting point is an object language _interface_ derived from
[Part 2](02_basics.html):
trait Base {
type Rep[T]
}
trait Arith extends Base {
def infix_+(x: Rep[Double], y: Rep[Double]): Rep[Double]
def infix_*(x: Rep[Double], y: Rep[Double]): Rep[Double]
...
}
trait IfThenElse extends Base {
def __ifThenElse[T](c: Rep[Boolean], a: =>Rep[T], b: =>Rep[T]): Rep[T]
}
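For illustration, here is a hypothetical user program written against this
interface alone; it knows nothing about the eventual representation of
`Rep[T]`. (With the Scala-Virtualized compiler, the syntax `x * x` and the
built-in `if` desugar to `infix_*` and `__ifThenElse`; we spell out the calls
here to stay within plain Scala.)
trait UserProg extends Arith with IfThenElse {
  // "if (c) x * x else x + x", written without syntactic sugar
  def f(c: Rep[Boolean], x: Rep[Double]): Rep[Double] =
    __ifThenElse(c, infix_*(x, x), infix_+(x, x))
}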
The goal will be to build a corresponding _implementation_ hierarchy that
supports optimizing compilation.
Splitting interface and implementation has many advantages, most importantly a
clear separation between the user program world and the compiler
implementation world. For the sake of completeness, let us briefly recast the
string representation in this model:
trait BaseStr extends Base {
type Rep[T] = String
}
trait ArithStr extends BaseStr with Arith {
def infix_+(x: Rep[Double], y: Rep[Double]) = perform(x + " + " + y)
def infix_*(x: Rep[Double], y: Rep[Double]) = perform(x + " * " + y)
...
}
trait IfThenElseStr extends BaseStr with IfThenElse {
def __ifThenElse[T](c: Rep[Boolean], a: =>Rep[T], b: =>Rep[T]) =
perform("if (" + c + ") " + accumulate(a) + " else " + accumulate(b))
}
In this chapter, we will use an IR that is based on expression trees, closely
resembling the abstract syntax tree (AST) of a staged program. This
representation enables separate analysis, optimization and code generation
passes. We will use the following types:
type Exp[T] // atomic: Sym, Const
type Def[T] // composite: Exp + Exp, Exp * Exp, ...
type Stm[T] // statement: val x = Def
type Block[T] // blocks: { Stm; ...; Stm; Exp }
They are defined as follows in a separate trait:
trait Expressions {
// expressions (atomic)
abstract class Exp[T]
case class Const[T](x: T) extends Exp[T]
case class Sym[T](n: Int) extends Exp[T]
def fresh[T]: Sym[T]
// definitions (composite, subclasses provided by other traits)
abstract class Def[T]
// statements
case class Stm[T](sym: Sym[T], rhs: Def[T])
// blocks
case class Block[T](stms: List[Stm[_]], res: Exp[T])
// perform and accumulate
def reflectStm[T](d: Stm[T]): Exp[T]
def reifyBlock[T](b: =>Exp[T]): Block[T]
// bind definitions to symbols automatically
// by creating a statement
implicit def toAtom[T](d: Def[T]): Exp[T] =
reflectStm(Stm(fresh[T], d))
}
This trait `Expressions` will be mixed in at the root of the object language
implementation hierarchy. The guiding principle is that each definition has
an associated symbol and refers to other definitions only via their symbols.
This means that every composite value will be named, similar to
administrative normal form (ANF). Methods `reflectStm` and `reifyBlock` take
over the responsibility of `perform` and `accumulate`.
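To make the ANF correspondence concrete, here is a small illustrative trace
(the symbol numbering is hypothetical):
// staging x * x + x, with x represented by Sym(0), calls toAtom twice
// and records, in order:
//   Stm(Sym(1), Times(Sym(0), Sym(0)))
//   Stm(Sym(2), Plus(Sym(1), Sym(0)))
// the enclosing reifyBlock then returns Block(List(stm1, stm2), Sym(2))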
## Modularity: Adding IR Node Types
We observe that there are no concrete definition classes provided by trait
`Expressions`. Providing meaningful data types is the responsibility of other
traits that implement the interfaces defined previously (`Base` and its
descendants).
Trait `BaseExp` forms the root of the implementation hierarchy and installs
atomic expressions as the representation of staged values by defining
`Rep[T] = Exp[T]`:
trait BaseExp extends Base with Expressions {
type Rep[T] = Exp[T]
}
For each interface trait, there is one corresponding core implementation
trait. Shown below, we have traits `ArithExp` and `IfThenElseExp` as the
running example. Both traits define one definition class for each operation
defined by `Arith` and `IfThenElse`, respectively, and implement the
corresponding interface methods to create instances of those classes.
trait ArithExp extends BaseExp with Arith {
case class Plus(x: Exp[Double], y: Exp[Double]) extends Def[Double]
case class Times(x: Exp[Double], y: Exp[Double]) extends Def[Double]
def infix_+(x: Rep[Double], y: Rep[Double]) = Plus(x, y)
def infix_*(x: Rep[Double], y: Rep[Double]) = Times(x, y)
...
}
trait IfThenElseExp extends BaseExp with IfThenElse {
case class IfThenElse[T](c: Exp[Boolean], a: Block[T], b: Block[T]) extends Def[T]
def __ifThenElse[T](c: Rep[Boolean], a: =>Rep[T], b: =>Rep[T]): Rep[T] =
IfThenElse(c, reifyBlock(a), reifyBlock(b))
}
The framework ensures that code that contains staging operations will always
be executed within the dynamic scope of at least one invocation of
`reifyBlock`, which returns a block object and takes as call-by-name argument
the present-stage expression that will compute the staged block result. Block
objects can be part of definitions, e.g. for loops or conditionals.
Since all operations in interface traits such as `Arith` return `Rep` types,
equating `Rep[T]` and `Exp[T]` in trait `BaseExp` means that conversion to
symbols will take place already within those methods. This fact is important
because it establishes our correspondence between the evaluation order of the
program generator and the evaluation order of the generated program: at the
point where the generator calls `toAtom`, the composite definition is turned
into an atomic value via `reflectStm`, i.e. its evaluation will be recorded
now and played back later in the same relative order with respect to others
within the closest `reifyBlock` invocation.
# Enabling Analysis and Transformation
Given our IR representation it is easy to add traversals and transformations.
## Modularity: Adding Traversal Passes
All that is needed to define a generic in-order traversal is a way to access
all blocks immediately contained in a definition:
def blocks(x: Any): List[Block[Any]]
For example, applying `blocks` to an `IfThenElse` node will return the then
and else blocks. Since definitions are case classes, this method is easy to
implement by using the `Product` interface that all case classes implement.
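A minimal sketch of such a generic implementation (assuming every `Def` is a
case class and thus a `Product`):
def blocks(x: Any): List[Block[Any]] = x match {
  case b: Block[Any] => List(b)                  // found a nested block
  case p: Product => p.productIterator.toList.flatMap(blocks)
  case _ => Nil                                  // constants, symbols, etc.
}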
The basic structural in-order traversal is then defined like this:
trait ForwardTraversal {
val IR: Expressions
import IR._
def traverseBlock[T](b: Block[T]): Unit = b.stms.foreach(traverseStm)
def traverseStm[T](s: Stm[T]): Unit = blocks(s).foreach(traverseBlock)
}
Custom traversals can be implemented in a modular way by extending the
`ForwardTraversal` trait:
trait MyTraversalBase extends ForwardTraversal {
val IR: BaseExp
import IR._
override def traverseStm[T](s: Stm[T]) = s match {
// custom base case or delegate to super
case _ => super.traverseStm(s)
}
}
trait MyTraversalArith extends MyTraversalBase {
val IR: ArithExp
import IR._
override def traverseStm[T](s: Stm[T]) = s match {
case Plus(x,y) => ... // handle specific nodes
case _ => super.traverseStm(s)
}
}
For each unit of functionality such as `Arith` or `IfThenElse` the traversal
actions can be defined separately as `MyTraversalArith` and
`MyTraversalIfThenElse`.
Finally, we can use our traversal as follows:
trait Prog extends Arith {
def main = ... // program code here
}
val impl = new Prog with ArithExp
val res = impl.reifyBlock(impl.main)
val inspect = new MyTraversalArith { val IR: impl.type = impl }
inspect.traverseBlock(res)
## Solving the ''Expression Problem''
In essence, traversals confront us with the classic ''expression problem'' of
independently extending a data model with new data variants and new
operations [(*)](wadlerExprProblem). There are many solutions to this problem
but most of them are rather heavyweight. More lightweight implementations are
possible in languages that support multi-methods, i.e. dispatch method calls
dynamically based on the actual types of all the arguments. We can achieve
essentially the same using pattern matching and mixin composition, making use
of the fact that composing traits is subject to linearization
[(*)](DBLP:conf/oopsla/OderskyZ05). We package each set of specific traversal
rules into its own trait, e.g. `MyTraversalArith` that inherits from
`MyTraversalBase` and overrides `traverseStm`. When the arguments do not match
the rewriting pattern, the overridden method will invoke the ''parent''
implementation using `super`. When several such traits are combined, the super
calls will traverse the overridden method implementations according to the
linearization order of their containing traits. The use of pattern matching
and super calls is similar to earlier work on extensible algebraic data types
with defaults [(*)](DBLP:conf/icfp/ZengerO01), which supported linear
extensions but not composition of independent extensions.
Implementing multi-methods in a statically typed setting usually poses three
problems: separate type checking/compilation, ensuring non-ambiguity and
ensuring exhaustiveness. The described encoding supports separate
type-checking and compilation insofar as traits do. Ambiguity is ruled out
by always following the linearization order and the first-match semantics of
pattern matching. Exhaustiveness is ensured at the type level by requiring a
default implementation, although no guarantees can be made that the default
will not choose to throw an exception at runtime. In the particular case of
traversals, the default is always safe and will just continue the structural
traversal.
## Generating Code
Code generation is just a traversal pass that prints code. Compiling and
executing code can use the same mechanism as described
[above](#sec:230codegen).
## Modularity: Adding Transformations
Transformations work very similarly to traversals. One option is to traverse and
transform an existing program more or less in place, not actually modifying
data but attaching new Defs to existing Syms:
trait SimpleTransformer {
val IR: Expressions
import IR._
def transformBlock[T](b: Block[T]): Block[T] =
Block(b.stms.flatMap(transformStm), transformExp(b.res))
def transformStm[T](s: Stm[T]): List[Stm[T]] =
List(Stm(s.sym, transformDef(s.rhs))) // preserve the existing symbol s.sym
def transformDef[T](d: Def[T]): Def[T] // default: use reflection
// to map over case classes
}
An implementation is straightforward:
trait MySimpleTransformer extends SimpleTransformer {
val IR: IfThenElseExp
import IR._
// override transformDef for each Def subclass
override def transformDef[T](d: Def[T]): Def[T] = d match {
case IfThenElse(c,a,b) =>
IfThenElse(transformExp(c), transformBlock(a), transformBlock(b))
case _ => super.transformDef(d)
}
}
## Transformation by Iterated Staging
\label{sec:310treeTrans}
Another option that is more principled and in line with the idea of making
compiler transforms programmable through the use of staging is to traverse the
old program and create a new program. Effectively we are implementing an IR
interpreter that executes staging commands, which greatly simplifies the
implementation of the transform and removes the need for low-level IR
manipulation.
In the implementation, we will create new symbols instead of reusing existing
ones so we need to maintain a substitution that maps old to new Syms. The core
implementation is given below:
trait ForwardTransformer extends ForwardTraversal {
val IR: Expressions
import IR._
var subst: Map[Exp[_],Exp[_]]
def transformExp[T](s: Exp[T]): Exp[T] = ... // lookup s in subst
def transformDef[T](d: Def[T]): Exp[T] // default
def transformStm[T](s: Stm[T]): Exp[T] = {
val e = transformDef(s.rhs); subst += (s.sym -> e); e
}
override def traverseStm[T](s: Stm[T]): Unit = {
transformStm(s)
}
def reflectBlock[T](b: Block[T]): Exp[T] = withSubstScope {
traverseBlock(b); transformExp(b.res)
}
def transformBlock[T](b: Block[T]): Block[T] = {
reifyBlock(reflectBlock(b))
}
}
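The helper `withSubstScope` used above and below is not shown; a minimal
sketch would run its body in a nested substitution scope, optionally seeded
with extra bindings:
def withSubstScope[A](body: => A): A = {
  val saved = subst                 // enter a fresh scope
  try body finally subst = saved    // restore the outer scope on exit
}
def withSubstScope[A](extra: (Exp[_], Exp[_])*)(body: => A): A =
  withSubstScope { subst ++= extra; body }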
Here is a simple identity transformer implementation for conditionals and
array construction:
trait MyTransformer extends ForwardTransformer {
val IR: IfThenElseExp with ArraysExp
import IR._
override def transformDef[T](d: Def[T]): Exp[T] = d match {
case IfThenElse(c,a,b) =>
__ifThenElse(transformExp(c), reflectBlock(a), reflectBlock(b))
case ArrayFill(n,i,y) =>
arrayFill(transformExp(n), { j => withSubstScope(i -> j) { reflectBlock(y) }})
case _ => ...
}
}
The staged transformer facility can be extended slightly to translate not only
within a single language but also between two languages:
trait FlexTransformer {
val SRC: Expressions
val DST: Base
trait TypeTransform[A,B]
var subst: Map[SRC.Exp[_],DST.Rep[_]]
def transformExp[A,B](s: SRC.Exp[A])(implicit t: TypeTransform[A,B]): DST.Rep[B]
}
It is also possible to add more abstraction on top of the base transforms to
build combinators for rewriting strategies in the style of Stratego
[(*)](spoofax) or Kiama [(*)](DBLP:conf/gttse/Sloane09).
# Problem: Phase Ordering
This all works but is not completely satisfactory. With fine grained separate
transformations we immediately run into phase ordering problems
[(*)](veldhuizen:combiningoptimizations,click95combineanalysis). We could
execute optimization passes in a loop until we reach a fixpoint but even then
we may miss opportunities if the program contains loops. For best results,
optimizations need to be tightly integrated. Optimizations need a different
mechanism than lowering transformations, which have a clearly defined
before-and-after model. In the next chapter, we will thus consider a slightly
different IR representation.
# (Chapter 2) Intermediate Representation: Graphs
\label{chap:320graphs}
To remedy phase ordering problems and overall allow for more flexibility in
rearranging program pieces, we switch to a program representation based on
structured graphs. This representation is not to be confused with control-flow
graphs (CFGs): Since one of our main goals is parallelization, a sequential
CFG would not be a good fit.
# Purely Functional Subset
\label{sec:303purefun}
Let us first consider a purely functional language subset, which offers many
more opportunities for aggressive optimization. We can rely on referential
transparency: The value of an expression is always the same, no matter when
and where it is computed. Thus, optimizations do not need to check
availability or lifetimes of expressions. Global common subexpression
elimination (CSE), pattern rewrites, dead code elimination (DCE) and code
motion are considerably simpler than the usual implementations for imperative
programs.
We switch to a ''sea of nodes''-like [(*)](DBLP:conf/irep/ClickP95)
representation that is a directed (and for the moment, acyclic) graph:
trait Expressions {
// expressions (atomic)
abstract class Exp[T]
case class Const[T](x: T) extends Exp[T]
case class Sym[T](n: Int) extends Exp[T]
def fresh[T]: Sym[T]
// definitions (composite, subclasses provided by other traits)
abstract class Def[T]
// blocks -- no direct links to statements
case class Block[T](res: Exp[T])
// bind definitions to symbols automatically
// by creating a statement
implicit def toAtom[T](d: Def[T]): Exp[T] =
reflectPure(d)
def reifyBlock[T](b: =>Exp[T]): Block[T]
def reflectPure[T](d: Def[T]): Sym[T] =
findOrCreateDefinition(d)
def findDefinition[T](s: Sym[T]): Option[Def[T]]
def findDefinition[T](d: Def[T]): Option[Sym[T]]
def findOrCreateDefinition[T](d: Def[T]): Sym[T]
}
It is instructive to compare the definition of trait `Expressions` with the
one from the previous [chapter](#chap:310trees). Again there are three
categories of objects involved: expressions, which are atomic (subclasses of
`Exp`: constants and symbols; with a ''gensym'' operator `fresh` to create
fresh symbols), definitions, which represent composite operations (subclasses
of `Def`, to be provided by other components), and blocks, which model nested
scopes.
Trait `Expressions` now provides methods to find a definition given a symbol
or vice versa. Direct links between blocks and statements are removed. The
actual graph nodes are `(Sym[T], Def[T])` pairs. They need not be accessible
to clients at this level. Thus method `reflectStm` from the previous chapter
is replaced by `reflectPure`.
Graphs also carry nesting information (boundSyms, see below). This enables
code motion for different kinds of nested expressions such as lambdas, not
only for loops or conditionals. The structured graph representation is also
more appropriate for parallel execution than the traditional sequential
control-flow graph. Pure computation can float freely in the graph and can be
scheduled for execution anywhere.
# Modularity: Adding IR Node Types
The object language implementation code is the same as for the tree
representation:
trait BaseExp extends Base with Expressions {
type Rep[T] = Exp[T]
}
Again, we have separate traits, one for each unit of functionality:
trait ArithExp extends BaseExp with Arith {
case class Plus(x: Exp[Double], y: Exp[Double]) extends Def[Double]
case class Times(x: Exp[Double], y: Exp[Double]) extends Def[Double]
def infix_+(x: Rep[Double], y: Rep[Double]) = Plus(x, y)
def infix_*(x: Rep[Double], y: Rep[Double]) = Times(x, y)
...
}
trait IfThenElseExp extends BaseExp with IfThenElse {
case class IfThenElse[T](c: Exp[Boolean], a: Block[T], b: Block[T]) extends Def[T]
def __ifThenElse[T](c: Rep[Boolean], a: =>Rep[T], b: =>Rep[T]): Rep[T] =
IfThenElse(c, reifyBlock(a), reifyBlock(b))
}
# Simpler Analysis and More Flexible Transformations
Several optimizations are very simple to implement on this purely functional
graph IR. The implementation draws inspiration from previous work on compiling
embedded DSLs [(*)](DBLP:conf/saig/ElliottFM00,DBLP:conf/dsl/LeijenM99) as
well as staged FFT kernels [(*)](DBLP:conf/emsoft/KiselyovST04).
## Common Subexpression Elimination/Global Value Numbering
\label{sec:320cse}
Common subexpressions are eliminated during IR construction using hash
consing:
def findOrCreateDefinition[T](d: Def[T]): Sym[T]
Invoked by `reflectPure` through the implicit conversion method `toAtom`,
this method converts a definition to an atomic expression and links it to the
scope being built up by the innermost enclosing `reifyBlock` call. When the
definition is known to be side-effect free, it will search the already
encountered definitions for a structurally equivalent one. If a matching
previous definition is found, its symbol will be returned, possibly moving the
definition to a parent scope to make it accessible. If the definition may
have side effects or it is seen for the first time, it will be associated
with a fresh symbol and saved for future reference. This simple scheme
provides a powerful global value numbering optimization
[(*)](DBLP:conf/pldi/Click95) that effectively prevents generating duplicate
code.
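A minimal sketch of this mechanism, assuming a single global list of
(Sym, Def) pairs (the real implementation also tracks scopes and effects):
var globalDefs: List[(Sym[Any], Def[Any])] = Nil
def findDefinition[T](d: Def[T]): Option[Sym[T]] =
  globalDefs collectFirst { case (s, d2) if d2 == d => s.asInstanceOf[Sym[T]] }
def findOrCreateDefinition[T](d: Def[T]): Sym[T] =
  findDefinition(d) getOrElse {
    val s = fresh[T]                // first occurrence: create a new node
    globalDefs = globalDefs :+ (s.asInstanceOf[Sym[Any]] -> d.asInstanceOf[Def[Any]])
    s
  }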
## Pattern Rewrites
Using `findDefinition`, we can implement an extractor object
[(*)](DBLP:conf/ecoop/EmirOW07) that enables pattern matching on a symbol to
look up the underlying definition associated with the symbol:
object Def {
def unapply[T](s: Exp[T]): Option[Def[T]] = s match {
case s: Sym[T] => findDefinition(s)
case _ => None
}
}
This extractor object can be used to implement smart
constructors for IR nodes that deeply inspect their arguments:
def infix_*(x: Exp[Double], y: Exp[Double]) = (x,y) match {
case (Const(x), Const(y)) => Const(x * y)
case (Const(k), Def(Times(Const(l), y))) => Const(k * l) * y
case _ => Times(x,y)
}
Smart constructors are a simple yet powerful rewriting facility. If the smart
constructor is the only way to construct `Times` nodes we obtain a strong
guarantee: No `Times` node is ever created without applying all possible
rewrites first.
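As a quick illustration (symbols hypothetical), nested constant factors now
combine without any extra machinery:
// with x bound to Sym(0):
// Const(2.0) * (Const(3.0) * x)
//   the inner call builds Times(Const(3.0), Sym(0)), bound to Sym(1)
//   the outer call matches (Const(k), Def(Times(Const(l), y)))
//   and yields Const(6.0) * Sym(0), i.e. Times(Const(6.0), Sym(0))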
## Modularity: Adding new Optimizations
\label{sec:308addOpts}
Some profitable optimizations, such as the global value numbering described
above, are very generic. Other optimizations apply only to specific aspects of
functionality, for example particular implementations of constant folding (or
more generally symbolic rewritings) such as replacing computations like
`x * 1.0` with `x`. Yet other optimizations are specific to the actual program
being staged. Kiselyov et al. [(*)](DBLP:conf/emsoft/KiselyovST04) describe a
number of rewritings that are particularly effective for the patterns of code
generated by a staged FFT algorithm but not as much for other programs. The
FFT example is discussed in more detail [here](#sec:Afft).
What we want to achieve again is modularity, so that optimizations can be
combined in a way that is most useful for a given task. To implement a
particular rewriting rule (whether specific or generic), say, `x * 1.0`
$\rightarrow$ `x`, we can provide a specialized implementation of `infix_*`
(overriding the one in trait `ArithExp`) that will test its arguments for a
particular pattern. How this can be done in a modular way is shown by the
traits `ArithExpOpt` and `ArithExpOptFFT`, which implement some generic and
program specific optimizations. Note that the use of `x*y` within the body of
`infix_*` will apply the optimization recursively.
The appropriate pattern is to override the smart constructor in a separate
trait and call the super implementation if no rewrite matches. This decouples
optimizations from node type definitions.
trait ArithExpOpt extends ArithExp {
override def infix_*(x:Exp[Double],y:Exp[Double]) = (x,y) match {
case (Const(x), Const(y)) => Const(x * y)
case (x, Const(1)) => x
case (Const(1), y) => y
case _ => super.infix_*(x, y)
}
}
trait ArithExpOptFFT extends ArithExp {
override def infix_*(x:Exp[Double],y:Exp[Double]) = (x,y) match {
case (Const(k), Def(Times(Const(l), y))) => Const(k * l) * y
case (x, Def(Times(Const(k), y))) => Const(k) * (x * y)
case (Def(Times(Const(k), x)), y) => Const(k) * (x * y)
...
case (x, Const(y)) => Const(y) * x
case _ => super.infix_*(x, y)
}
}
Note that the trait linearization order defines the rewriting strategy. We
still maintain our guarantee that no `Times` node could be rewritten further.
\begin{figure*}\centering
\includegraphics[width=\textwidth]{papers/cacm2012/figoverview.pdf}
\caption{\label{fig:overview}Component architecture. Arrows denote extends relationships,
dashed boxes represent units of functionality.}
\vspace{1cm}
\end{figure*}
Figure~\ref{fig:overview} shows the component architecture formed by base
traits and corresponding optimizations.
## Context- and Flow-Sensitive Transformations
Context- and flow-sensitive transformations become very important once we
introduce effects. But even purely functional programs can profit from context
information:
if (c) { if (c) a else b } else ...
The inner test on the same condition is redundant and will always succeed. How
do we detect this situation? In other cases we can use the Def extractor to
look up the definition of a symbol. This will not work here, because Def works
on Exp input and produces a Def object as output. However, we need to work on
the level of Exps, turning a Sym into `Const(true)` based on context
information.
We need to adapt the way we construct IR nodes. When we enter the then
branch, we add `c$\rightarrow$Const(true)` to a substitution. This
substitution needs to be applied to arguments of IR nodes constructed within
the then branch.
One possible solution would be to add yet another type constructor, `Ref`, with
an implicit conversion from Exp to Ref that applies the substitution. A
signature like `IfThenElse(c: Exp, ...)` would become `IfThenElse(c: Ref,
...)`. A simpler solution is to implement `toAtom` in such a way that it
checks whether any inputs of the resulting Def need substitution and, if so,
invokes `mirror` (see below) on the result Def, which will apply the
substitution, call the appropriate smart constructor and finally call `toAtom`
again with the transformed result.
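A hedged sketch of this second variant, using the dependency helper `syms`
introduced below (the name `substTransformer` is assumed):
implicit def toAtom[T](d: Def[T]): Exp[T] =
  if (syms(d) exists (subst contains _))
    mirror(d, substTransformer) // re-run the smart constructor on substituted args
  else
    reflectPure(d)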
## Graph Transformations
In addition to optimizations performed during graph construction, we can also
implement transformations that work directly on the graph structure. This is
useful if we need to do analysis on a larger portion of the program first and
only then apply the transformation. An example would be to find all `FooBar`
statements in a graph, and replace them uniformly with `BarBaz`. All dependent
statements should re-execute their pattern rewrites, which might trigger on
the new `BarBaz` input.
We introduce the concept of _mirroring_: Given an IR node, we want to apply a
substitution (or generally, a `Transformer`) to the arguments and call the
appropriate smart constructor again. For every IR node type we require a
default `mirror` implementation that calls back its smart constructor:
override def mirror[A](e: Def[A], f: Transformer): Exp[A] = e match {
case Plus(a,b) => f(a) + f(b) // calls infix_+
case Times(a,b) => f(a) * f(b)
case _ => super.mirror(e,f)
}
There are some restrictions if we are working directly on the graph level: In
general we have no (or only limited) context information because a single IR
node may occur multiple times in the final program. Thus, attempting to
simplify effectful or otherwise context-dependent expressions will produce
wrong results without an appropriate context. For pure expressions, a smart
constructor called from `mirror` should not create new symbols apart from the
result and it should not call reifyBlock. Otherwise, if we were creating new
symbols when nothing changes, the returned symbol could not be used to check
convergence of an iterative transformation easily.
The `Transformer` argument to `mirror` can be
queried to find out whether `mirror` is allowed to call context
dependent methods:
override def mirror[A](e: Def[A], f: Transformer): Exp[A] = e match {
case IfThenElse(c,a,b) =>
if (f.hasContext)
__ifThenElse(f(c),f.reflectBlock(a),f.reflectBlock(b))
else
ifThenElse(f(c),f(a),f(b)) // context-free version
case _ => super.mirror(e,f)
}
If the context is guaranteed to be set up correctly, we call the regular smart
constructor and use `f.reflectBlock` to call mirror recursively on the
contents of blocks `a` and `b`. Otherwise, we call a more restricted
context-free method.
## Dead Code Elimination
Dead code elimination can be performed purely on the graph level, simply by
finding all statements reachable from the final result and discarding
everything else.
We define a method to find all symbols a given object references directly:
def syms(x: Any): List[Sym[Any]]
If `x` is a Sym itself, `syms(x)` will return `List(x)`. For a case class
instance that implements the `Product` interface such as `Times(a,b)`, it
will return `List(a,b)` if both `a` and `b` are Syms. Since the argument type
is `Any`, we can apply `syms` not only to Def objects directly but also to
lists of Defs, for example.
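Like `blocks` in the previous chapter, `syms` can be implemented generically
via the `Product` interface; a minimal sketch:
def syms(x: Any): List[Sym[Any]] = x match {
  case s: Sym[Any] => List(s)
  case xs: Iterable[Any] => xs.toList.flatMap(syms)
  case p: Product => p.productIterator.toList.flatMap(syms) // case class fields
  case _ => Nil
}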
Then, assuming `R` is the final program result, the set of remaining symbols
in the graph `G` is the least fixpoint of:
G = R $\cup$ syms(G map findDefinition)
Dead code elimination will discard all other nodes.
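A minimal sketch of this fixpoint computation:
def liveSyms(result: List[Sym[Any]]): Set[Sym[Any]] = {
  var g = result.toSet
  var oldSize = -1
  while (g.size != oldSize) {  // grow until the least fixpoint is reached
    oldSize = g.size
    g ++= g.toList.flatMap(s => findDefinition(s).toList.flatMap(syms))
  }
  g                            // nodes outside this set are dead code
}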
# From Graphs Back to Trees
To turn program graphs back into trees for code generation we have to decide
which graph nodes should go where in the resulting program. This is the task
of code motion.
## Code Motion
\label{sec:320codemotion}
Other optimizations can apply transformations optimistically and need not
worry about maintaining a correct schedule: Code motion will fix it up. The
algorithm will try to push statements inside conditional branches and hoist
statements out of loops. Code motion depends on dependency and frequency
information but not directly on data-flow information. Thus it can treat
functions or other user defined compound statements in the same way as loops.
This makes our algorithm different from code motion algorithms based on data
flow analysis such as Lazy Code Motion (LCM, [(*)](DBLP:conf/pldi/KnoopRS92))
or Partial Redundancy Elimination (PRE, [(*)](DBLP:journals/toplas/KennedyCLLTC99)).
The graph IR reflects ''must before'' (ordering) and ''must inside''
(containment) relations, as well as anti-dependence and frequency. These
relations are implemented by the following methods, which can be overridden
for new definition classes:
def syms(e: Any): List[Sym[Any]] // value dependence (must before)
def softSyms(e: Any): List[Sym[Any]] // anti dependence (must not after)
def boundSyms(e: Any): List[Sym[Any]] // nesting dependence (must not outside)
def symsFreq(e: Any): List[(Sym[Any], Double)] // frequency information (classify
                                               // syms as 'hot', 'normal', 'cold')
To give an example, `boundSyms` applied to a loop node
`RangeForeach(range,idx,body)` with index variable `idx` would return
`List(idx)` to denote that `idx` is fixed ''inside'' the loop expression.
Given a subgraph and a list of result nodes, the goal is to identify the graph
nodes that should form the ''current'' level, as opposed to those that should
remain in some ''inner'' scope, to be scheduled later. We will reason about
the paths on which statements can be reached from the result. The first idea
is to retain all nodes on the current level that are reachable on a path that
does not cross any conditionals, i.e. that has no ''cold'' refs. Nodes only
used from conditionals will be pushed down. However, this approach does not
yet reflect the precedence of loops. If a loop is top-level, then
conditionals inside the loop (even if deeply nested) should not prevent
hoisting of statements. So we refine the characterization to retain all nodes
that are reachable on a path that does not cross top-level conditionals.
\begin{figure}
\begin{center}
\includegraphics[width=0.7\textwidth]{fig_graph_nesting.pdf}
\end{center}
\caption{\label{fig:codemotion}Graph IR with regular and nesting edges (boundSyms, dotted line) as
used for code motion.}
\end{figure}
This leads to a simple iterative algorithm: Starting with the known top level
statements, nodes reachable via normal links are added and for each hot ref,
we follow nodes that are forced inside until we reach one that can become top-
level again.
Code Motion Algorithm: Compute the set $L$ of top level statements for the
current block, from a set of available statements $E$, a set of forced-inside
statements $G \subseteq E$ and a block result $R$.
1. Start with $L$ containing the known top level statements, initially the
(available) block result $R \cap E$.
2. Add to $L$ all nodes reachable from $L$ via normal links (neither hot nor
cold) through $E-G$ (not forced inside).
3. For each hot ref from $L$ to a statement in $E-L$, follow any links through
$G$, i.e. the nodes that are forced inside, if there are any. The first
non-forced-inside nodes (the ''hot fringe'') become top level as well
(add to $L$).
4. Continue with 2 until a fixpoint is reached.
To implement this algorithm, we need to determine the set `G` of nodes that
are forced inside and may not be part of the top level. We start with the
block result `R` and a graph `E` that has all unnecessary nodes removed (DCE
already performed):
E = R $\cup$ syms(E map findDefinition)
We then need a way to find all uses of a given symbol `s`, up to but not
including the node where the symbol is bound:
U(s) = {s} $\cup$ { g $\in$ E | syms(findDefinition(g)) $\cap$ U(s) $\ne\emptyset$
&& s $\notin$ boundSyms(findDefinition(g)) }
We collect all bound symbols and their dependencies. These cannot live on the
current level, they are forced inside:
B = boundSyms (E map findDefinition)
G = union (B map U) // must inside
Computing `U(s)` for many symbols `s` individually is costly but
implementations can exploit considerable sharing to optimize the computation
of `G`.
The iteration above uses `G` to follow forced-inside nodes after a hot ref
until a node is found that can be moved to the top level.
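To make the iteration concrete, here is a hedged sketch of steps 1-4; the
helpers `normalRefs` and `hotRefs` are assumed to classify the `syms` of a
statement according to `symsFreq`:
def topLevel(e: Set[Sym[Any]], g: Set[Sym[Any]], r: Set[Sym[Any]]): Set[Sym[Any]] = {
  // step 3 helper: follow forced-inside nodes until one may surface again
  def fringe(s: Sym[Any]): Set[Sym[Any]] =
    if (!g(s)) Set(s)          // not forced inside: the hot fringe
    else syms(findDefinition(s)).toSet.filter(e).flatMap(fringe)
  var l = r intersect e        // step 1: available block results
  var changed = true
  while (changed) {            // step 4: iterate until fixpoint
    val viaNormal = l.flatMap(normalRefs).filter(s => e(s) && !g(s)) // step 2
    val viaHot    = l.flatMap(hotRefs).filter(e).flatMap(fringe)     // step 3
    val next = l ++ viaNormal ++ viaHot
    changed = next.size != l.size
    l = next
  }
  l
}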
Let us consider a few examples to build some intuition about the code motion
behavior. In the code below, the starred conditional is on the fringe (first
statement that can be outside) and on a hot path (through the loop). Thus it
will be hoisted. Statement `foo` will be moved inside:
loop { i =>                      z = if* (x) foo
  if (i > 0)                     loop { i =>
    if* (x)                        if (i > 0)
      foo                            z
}                                }
The situation changes if the inner conditional is forced inside by a value
dependency. Now statement `foo` is on the hot fringe and becomes top level.
loop { i =>                      z = foo*
  if (x)                         loop { i =>
    if (i > 0)                     if (x)
      foo*                           if (i > 0)
}                                      z
                                 }
For loops inside conditionals, the containing statements will be moved inside
(relative to the current level).
if (x)                           if (x)
  loop { i =>                      z = foo
    foo                            loop { i =>
  }                                  z
                                   }
### Pathological Cases
The described algorithm works well and is reasonably efficient in practice.
Being a heuristic, it cannot be optimal in all cases. Future versions could
employ more elaborate cost models instead of the simple hot/cold distinction.
One case worth mentioning is when a statement is used only in conditionals
but in different conditionals:
z = foo                          if (x)
if (x)                             foo
  z                              if (y)
if (y)                             foo
  z
In this situation `foo` will be duplicated. Often this duplication is
beneficial because `foo` can be optimized together with other statements
inside the branches. Of course, there is a danger of slowing down the program
if both conditions are likely to be true at the same time. In that case it
would be a good idea anyway to restructure the program and factor out the
common condition into a separate test.
### Scheduling
Once we have determined which statements should occur on which level, we have
to come up with an ordering for the statements. Before starting the code
motion algorithm, we sort the input graph in topological order and we will use
the same order for the final result. For the purpose of sorting, we include
anti-dependencies in the topological sort although they are disregarded during
dead code elimination. A bit of care must be taken though: If we introduce
loops or recursive functions the graph can be cyclic, in which case no
topological order exists. However, cycles are caused only by inner nodes
pointing back to outer nodes and for sorting purposes we can remove these
back-edges to obtain an acyclic graph.
## Tree-Like Traversals and Transformers
To generate code or to perform transformation by iterated staging (see
[above](#sec:310treeTrans)) we need to turn our graph back into a tree. The
interface to code motion allows us to build a generic tree-like traversal over
our graph structure:
trait Traversal {
val IR: Expressions; import IR._
// perform code motion, maintaining current scope
def focusExactScope[A](r: Exp[Any])(body: List[Stm[Any]] => A): A
// client interface
def traverseBlock[T](b: Block[T]): Unit =
focusExactScope(b.res) { levelScope =>
levelScope.foreach(traverseStm)
}
def traverseStm[T](s: Stm[T]): Unit = blocks(s).foreach(traverseBlock)
}
This is useful for other analyses as well, but in particular for building
transformers that traverse one graph in a tree-like fashion and create another
graph, analogous to the approach described [above](#sec:310treeTrans). The
implementation of trait `ForwardTransformer` carries over almost unmodified.