The control-flow problem
In programming languages, and static analysis in particular, the higher-order control-flow problem refers to the fact that the precise target of a function call may not be obvious.
The problem afflicts functional languages in the form of first-class functions, and it afflicts object-oriented languages in the form of dynamically dispatched methods.
For example, in an object-oriented language, the target of the
function call object.method()
depends on the value that flows to the expression object
.
Thus, the control-flow of this call depends on the
data-flow affecting the value of object
.
If the object object
were a function parameter, it's
clear that the control-flow of this function would also determine the
data-flow of its parameters.
As another example, the target of a function call like (g x)
in Lisp depends on where this call appears.
This call could appear in the term (lambda (g) (g x))
,
in which case the target of g
depends on where this procedure flows.
Consider, further, the control-flow possibilities resulting from
the term
(map (lambda (g) (g x)) my-procedures)
.
In summary, in higher-order languages, control-flow affects data-flow, even as data-flow affects control-flow. Any attempt to reason about one facet must grapple with the other.
The value-flow problem
In practice, the scope of the control-flow problem is always widened
to the more general value-flow problem.
To contrast, the control-flow problem asks, "Which procedures may the expression
f
be in the call (f x)
?"; the value-flow
problem asks "To which values may the expressions f
and x
evaluate?"
The challenge of the value-flow problem is that, in general, it is undecidable to determine the exact set of values to which an expression might evaluate.
Solution: Simulate the program
Control-flow analyses (CFAs) are the class of algorithms that
solve the value-flow problem.
Because a precise solution is impossible, CFAs compute conservative
over-approximations.
That is, if a CFA says that the procedure foo
is invoked
at some call site, then it may be invoked at that call site.
If foo
can't actually be invoked, then it is a false
positive.
CFAs attempt to minimize the number of false positives without
resorting to intractable techniques.
The simplest way to write a CFA is to use a "small-step abstract interpretation." The code for an abstract interpretater can be quite similar to that of an ordinary interpreter; in practice, there are three major differences:
- Abstract interpreters use abstract values.
For instance, one might use
positive
ornegative
, instead of3
or-4
. Another common strategy has all pointer addresses allocated at the same program location share the same "abstract" pointer address. - Abstract interpreters can be non-deterministic; often,
when an abstract interpreter encounters a conditional (e.g.
if
), it follows all branches. If a CFA thinks there could be two procedures called at a call site, it will non-deterministically invoke both of them during its analysis. - Abstract interpreters typically operate over a finite number of machine states, to ensure termination. In instances where abstract interpreters don't use a finite state-space, a special technique called widening can accelerate and guarantee termination of the interpreter.
In essence, an abstract interpreter simulates all possible (and, necessarily, some impossible) executions of the program. By looking over all possible executions, one can check whether there is any execution in which an expression gets bound to a number or a string. Similarly, one can look at every state containing a call site of interest, to see which procedures may actually get called there.
The first popular solution to the control-flow problem was k-CFA, a hierarchy of increasingly precise (and increasingly slower) analyses. k-CFA was originally specified as an abstract interpreter for Scheme that had been translated into continuation-passing style. (An implementation of k-CFA for that kind of Scheme is included below.)
The core principles of CFAs translate easily between different languages. That is, once you understand how to implement a CFA for Scheme, it's not hard to implement a CFA for any other language.
How 0CFA thinks
0CFA is a special case of the k-CFA framework, and it is one of the most popular CFAs. 0CFA is notable for the simplicity of its abstraction: values are abstracted to the syntax from which they came. In Scheme, a procedure is abstracted to the lambda term that created it; the environment associated with its closure is ignored entirely.
This coarse abstraction leads to considerable imprecision, but it has a major benefit in the form of a polynomial-time bound (cubic, in fact) for an efficient abstract interpreter.
For the pure lambda calculus, as described by the following grammar:
f,e ::= (lambda (v) e) | (f e) | v v is a variable
0CFA could be thought of as filling in a "flows to" relation according to three rules:
-
For each lambda term
(lambda (v) e)
:(lambda (v) e)
flows to(lambda (v) e)
. -
For each application
(f e)
, if the value(lambda (v) e')
flows to the functionf
and the value value flows to the argumente
, thenvalue flows tov
. -
For each application
(f e)
, if the value(lambda (v) e')
flows to the functionf
and the value value flows to bodye'
, thenvalue flows to(f e)
.
A word about k-CFA
k-CFA improves the precision of the flow analysis by
considering in what context a value flows to an expression.
For instance, 1CFA might say, "(lambda (v) e)
flows to
the variable x
, when the procedure (lambda (x)
e')
is called from the site (g h)
."
Where to learn more
If you'd like to get into the details of how CFAs work, there are a few good papers and books out there. Here's what I recommend:
- Principles of Program Analysis covers many topics in static program analysis, including abstract interpretation, CFAs and data-flow analysis. The book covers both imperative and functional languages. It describes techniques so standard you'll find them in GCC, but it also contains analyses so sophisticated that no one has ever implemented them in a commercial compiler.
- The k-CFA algorithm is well-explained in the first three chapters of Olin Shivers's Ph.D. dissertation. (Olin discovered k-CFA.)
- A modern (simpler) formulation of k-CFA is given in the first five chapters of my own Ph.D. dissertation.
- I recently published a paper describing k-CFA for object-oriented languages. Email me if you'd like a preprint! (The camera-ready version will be posted here soon.)
- I teach courses and seminars on program analysis. If you're near the University of Utah, please drop by, or come get your M.S. or Ph.D.!
A reference implementation of k-CFA in R5RS Scheme
The implementation below of k-CFA is a tutorial/reference implementation. That is, it's purely functional, it uses convenient data structures over efficient data structures, and it doesn't use any form of widening to accelerate convergence. (In fact, it is exponential time instead of cubic.)
The implementation is based on the mathematics found in my Ph.D. dissertation. It uses a small-step abstract interpreter that transforms an abstract machine state into possible successor states. To compute the analysis, it crawls the graph of machine states (starting from an initial state) until all potentially reachable states have been found. The resulting set of reachable states is then summarized to yield a composite abstract heap. This abstract heap is further summarized to produce a map from variables to the lambda terms which may flow to them.
A reference implementation of k-CFA in PLT Scheme
Translation courtesy of Jay McCarthy.
An implementation in Java
Courtesy of Jens Nicolay.