-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathIntroToDFA.tex
190 lines (134 loc) · 9.67 KB
/
IntroToDFA.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
\newpage
\section{Introduction to Data Flow Analysis}
\subsection{Motivation for Dataflow Analysis}
Some optimizations\footnote{based on \url{https://pages.cs.wisc.edu/~horwitz/CS704-NOTES/2.DATAFLOW.html}} , however, require more "global" information.
For example, consider the code \ref{lst:expr1}
\begin{lstlisting}[language=C,frame=single, caption=An example to illustrate some optimizations require more global information,label = lst:expr1]
a = 1;
b = 2;
c = 3;
if (...) x = a + 5;
else x = b + 4;
c = x + 1;
\end{lstlisting}
In this example, the initial assignment to \texttt{c} (at line 3) is useless, and the expression
\texttt{x + 1} can be simplified to 7, but it is less obvious how a compiler can discover these facts
since they cannot be discovered by looking only at one or two consecutive statements.
A more global analysis is needed so that the compiler knows at each point in the program:
\begin{itemize}
\item which variables are guaranteed to have constant values, and
\item which variables will be used before being redefined.
\end{itemize}
To discover these kinds of properties, we use dataflow analysis.
\subsubsection{What is Data Flow Analysis?}
Local Optimizations only consider optimizations within a node in CFG.
Data flow analysis will take edges into account, which means composing
effects of basic blocks to derive information at basic block boundaries.
Data-flow analysis is a technique for gathering information about the possible
set of values calculated at various points in a computer program. A program's
control-flow graph (CFG) is used to determine those parts of a program to which
a particular value assigned to a variable might propagate. The information gathered
is often used by compilers when optimizing a program.
Typically, we will do local optimization for the first step to know what happens in a
basic block, step 2 is to do data flow analysis. In the third step, we will go back and
revisit the individual instructions inside of the blocks.
Data flow analysis is \textbf{flow-sensitive}, which means we take into account
the effect of control flow. It is also a \textbf{intraprocedural analysis} which means
the analysis is within a procedure.\footnote{\textbf{Interprocedural analysis} uses calling relationships among
procedures} Data-flow analysis computes its solutions over the paths in
a control-flow graph. The well-known, meet-over-all-paths
formulation produces safe, precise solutions for general dataflow problems. All paths-whether feasible or infeasible,
heavily or rarely executed-contribute equally to a solution.
Here are some examples of intraprocedural optimizations:
\begin{itemize}
\item \textbf{constant propagation}. Constant propagation is a well-known global flow analysis
problem. The goal of constant propagation is to discover values that are constant on all possible
executions of a program and to propagate these constant values as far forward through the program
as possible. Expressions whose operands are all constants can be evaluated at compile time and the
results propagated further.
\item \textbf{common subexpression elimination} CSE is a compiler
optimization that searches for instances of identical expressions
(i.e., they all evaluate to the same value),
and analyzes whether it is worthwhile replacing them with a
single variable holding the computed value.
\item \textbf{dead code elimination}. Actually, source code written by programmers doesn't contain
a lot of dead code, dead code happens to occur partly because of how the front end translates code into
the IR. Doing optimizations will also turn code into dead.
\end{itemize}
% \subsection{Static Program vs. Dynamic Execution }
% Static program
\subsubsection{Static Program vs. Dynamic Execution}
Program is statically finite, but there can be infinite many dynamic execution paths. On one hand, analysis
need to be precise, so we will take into account as much dynamic execution as possible. On the other hand, analysis
need to do the analysis quickly. For a compromise, the analysis result is \textbf{conservative} and what it does is for each
point in the program, combines information of all the instances of the same program point.
\subsubsection{Data Flow Analysis Schema}
Before thinking about how to define a dataflow problem, note that there are two kinds of problems:
\begin{itemize}
\item Forward problems (like constant propagation) where the information at a node n summarizes what can happen on paths from "enter" to n. \textbf{So if we care about what happened in the past, it's a forward problem.}
\item Backward problems (like live-variable analysis), where the information at a node n summarizes what can happen on paths from n to "exit". \textbf{So if we care about what will happen in the future, it's a backward problem.}
\end{itemize}
In what follows, we will assume that we're thinking about a forward problem unless otherwise specified.
Another way that many common dataflow problems can be categorized is as may problems or must problems.
The solution to a "may" problem provides information about what may be true at each program point (e.g.,
for live-variables analysis, a variable is considered live after node n if its value may be used before
being overwritten, while for constant propagation, the pair (x, v) holds before node n if x must have the value v at that point).
Now let's think about how to define a dataflow problem so that it's clear what the (best) solution should be.
When we do dataflow analysis "by hand", we look at the CFG and think about:
\begin{itemize}
\item What information holds at the start of the program.
\item When a node n has more than one incoming edge in the CFG, how to combine the incoming
information (i.e., given the information that holds after each predecessor of n, how to
combine that information to determine what holds before n).
\item How the execution of each node changes the information.
\end{itemize}
This intuition leads to the following definition. An instance of a dataflow problem includes:
\begin{itemize}
\item a \(CFG\),
\item a domain \(D\) of "dataflow facts",
\item a dataflow fact "init" (the information true at the start of the program for forward problems,
or at the end of the program for backward problems),
\item an operator \(\wedge\) (used to combine incoming information from multiple predecessors),
\item for each CFG node n, a dataflow function \(f_n\) :\( D \rightarrow D\) (that defines the effect of
executing n).
\end{itemize}
For constant propagation, an individual dataflow fact is a set of pairs of the form (var, val),
so the domain of dataflow facts is the set of all such sets of pairs (the power set).
For live-variable analysis, it is the power set of the set of variables in the program.
For both constant propagation and live-variable analysis, the "init" fact is the empty set
(no variable starts with a constant value, and no variables are live at the end of the program).
For constant propagation, the combining operation \(\wedge\) is set intersection.
This is because if a node n has two predecessors, p1 and p2, then variable x has value v before
node n iff it has value v after both p1 and p2. For live-variable analysis,
\(\wedge\) is set union: if a node n has two successors, s1 and s2, then the value of x after n may be
used before being overwritten iff that holds either before s1 or before s2. In general,
for "may" dataflow problems, \(\wedge\) will be some union-like operator, while it will be an intersection-like
operator for "must" problems.
For constant propagation, the dataflow function associated with a CFG node that does not assign
to any variable (e.g., a predicate) is the identity function. For a node n that assigns to
a variable x, there are two possibilities:
\begin{itemize}
\item 1. The right-hand side has a variable that is not constant. In this case, the function
result is the same as its input except that if variable x was constant the before n,
it is not constant after n.
\item 2. All right-hand-side variables have constant values. In this case, the right-hand side of
the assignment is evaluated producing consant-value c, and the dataflow-function result is the
same as its input except that it includes the pair (x, c) for variable x (and excludes the pair
for x, if any, that was in the input).
\end{itemize}
For live-variable analysis, the dataflow function for each node n has the form:
\(f_n(S) = Gen_n \cup (S - KILL_n)\), where \(KILL_n\) is the set of variables defined at node n,
and \(GEN_n\) is the set of variables used at node n. In other words, for a node that does not
assign to any variable, the variables that are live before n are those that are live after
n plus those that are used at n; for a node that assigns to variable x, the variables that are
live before n are those that are live after n except x, plus those that are used at n
(including x if it is used at n as well as being defined there).
An equivalent way of formulating the dataflow functions for live-variable analysis is:
\(f_n(S) = (S \cap NOT-KILL_n) \cup GEN_n\), where \(NOT-KILL_n\) is the set of variables not defined
at node n. The advantage of this formulation is that it permits the dataflow facts to be
represented using bit vectors, and the dataflow functions to be implemented using simple
bit-vector operations (and or).
It turns out that a number of interesting dataflow problems have dataflow functions of this
same form, where \(GEN_n\) and \(KILL_n\) are sets whose definition depends only on n, and the combining
operator \(\wedge\) is either union or intersection. These problems are called GEN/KILL problems,
or bit-vector problems.