k-CFA: Determining types and/or control-flow in languages like Python, Java and Scheme

[article index] [] [@mattmight] [rss]

Is it possible to assign a type to an expression in a dynamic language like Python or JavaScript? Is it possible to predict which method will be invoked under dynamic dispatch in an object-oriented language like Java? Is it possible to know which function will be called at a higher-order call site in a language like Scheme?

In all three cases, the answer is yes (with a caveat).

There is a class of algorithms--little known outside of academia--that can analyze a program to compute conservative answers to the questions above. These algorithms are called CFAs. (The expansion of the acronym CFA--control-flow analysis--is misleading, because it no longer sufficiently captures what these algorithms do.)

For CFAs, conservative means that if a CFA infers the type of an expression to be "integer or float," then when the program runs, the expression may be an integer or floating point number, or it may be both, but it will never be any other type, such as a string or an object.

Being able to statically bound types, control-flow and dynamic dispatch has important applications to optimization, security, automatic parallelization and even program verification.

This article discusses the difficulty of analyzing dynamic and/or higher-order languages, describes CFAs as a solution, and provides a purely functional reference implementation of the most popular family of CFAs, k-CFA. The article focuses mostly on Scheme, but the principles discussed apply directly to languages like JavaScript, Python and Java.

The control-flow problem

In programming languages, and static analysis in particular, the higher-order control-flow problem refers to the fact that the precise target of a function call may not be obvious.

The problem afflicts functional languages in the form of first-class functions, and it afflicts object-oriented languages in the form of dynamically dispatched methods.

For example, in an object-oriented language, the target of the function call object.method() depends on the value that flows to the expression object. Thus, the control-flow of this call depends on the data-flow affecting the value of object. If the object object were a function parameter, it's clear that the control-flow of this function would also determine the data-flow of its parameters.

As another example, the target of a function call like (g x) in Lisp depends on where this call appears. This call could appear in the term (lambda (g) (g x)), in which case the target of g depends on where this procedure flows. Consider, further, the control-flow possibilities resulting from the term (map (lambda (g) (g x)) my-procedures).

In summary, in higher-order languages, control-flow affects data-flow, even as data-flow affects control-flow. Any attempt to reason about one facet must grapple with the other.

The value-flow problem

In practice, the scope of the control-flow problem is always widened to the more general value-flow problem. To contrast, the control-flow problem asks, "Which procedures may the expression f be in the call (f x)?"; the value-flow problem asks "To which values may the expressions f and x evaluate?"

The challenge of the value-flow problem is that, in general, it is undecidable to determine the exact set of values to which an expression might evaluate.

Solution: Simulate the program

Control-flow analyses (CFAs) are the class of algorithms that solve the value-flow problem. Because a precise solution is impossible, CFAs compute conservative over-approximations. That is, if a CFA says that the procedure foo is invoked at some call site, then it may be invoked at that call site. If foo can't actually be invoked, then it is a false positive. CFAs attempt to minimize the number of false positives without resorting to intractable techniques.

The simplest way to write a CFA is to use a "small-step abstract interpretation." The code for an abstract interpretater can be quite similar to that of an ordinary interpreter; in practice, there are three major differences:

Abstract interpreters use abstract values. For instance, one might use positive or negative, instead of 3 or -4. Another common strategy has all pointer addresses allocated at the same program location share the same "abstract" pointer address.
Abstract interpreters can be non-deterministic; often, when an abstract interpreter encounters a conditional (e.g. if), it follows all branches. If a CFA thinks there could be two procedures called at a call site, it will non-deterministically invoke both of them during its analysis.
Abstract interpreters typically operate over a finite number of machine states, to ensure termination. In instances where abstract interpreters don't use a finite state-space, a special technique called widening can accelerate and guarantee termination of the interpreter.

In essence, an abstract interpreter simulates all possible (and, necessarily, some impossible) executions of the program. By looking over all possible executions, one can check whether there is any execution in which an expression gets bound to a number or a string. Similarly, one can look at every state containing a call site of interest, to see which procedures may actually get called there.

The first popular solution to the control-flow problem was k-CFA, a hierarchy of increasingly precise (and increasingly slower) analyses. k-CFA was originally specified as an abstract interpreter for Scheme that had been translated into continuation-passing style. (An implementation of k-CFA for that kind of Scheme is included below.)

The core principles of CFAs translate easily between different languages. That is, once you understand how to implement a CFA for Scheme, it's not hard to implement a CFA for any other language.

How 0CFA thinks

0CFA is a special case of the k-CFA framework, and it is one of the most popular CFAs. 0CFA is notable for the simplicity of its abstraction: values are abstracted to the syntax from which they came. In Scheme, a procedure is abstracted to the lambda term that created it; the environment associated with its closure is ignored entirely.

This coarse abstraction leads to considerable imprecision, but it has a major benefit in the form of a polynomial-time bound (cubic, in fact) for an efficient abstract interpreter.

For the pure lambda calculus, as described by the following grammar:

 f,e ::= (lambda (v) e)
      |  (f e)
      |  v

   v is a variable

0CFA could be thought of as filling in a "flows to" relation according to three rules:

For each lambda term (lambda (v) e):
(lambda (v) e) flows to (lambda (v) e).
For each application (f e), if the value (lambda (v) e') flows to the function f and the value value flows to the argument e, then
value flows to v.
For each application (f e), if the value (lambda (v) e') flows to the function f and the value value flows to body e', then
value flows to (f e).

A word about k-CFA

k-CFA improves the precision of the flow analysis by considering in what context a value flows to an expression. For instance, 1CFA might say, "(lambda (v) e) flows to the variable x, when the procedure (lambda (x) e') is called from the site (g h)."

Where to learn more

If you'd like to get into the details of how CFAs work, there are a few good papers and books out there. Here's what I recommend:

Principles of Program Analysis covers many topics in static program analysis, including abstract interpretation, CFAs and data-flow analysis. The book covers both imperative and functional languages. It describes techniques so standard you'll find them in GCC, but it also contains analyses so sophisticated that no one has ever implemented them in a commercial compiler.
The k-CFA algorithm is well-explained in the first three chapters of Olin Shivers's Ph.D. dissertation. (Olin discovered k-CFA.)
A modern (simpler) formulation of k-CFA is given in the first five chapters of my own Ph.D. dissertation.
I recently published a paper describing k-CFA for object-oriented languages. Email me if you'd like a preprint! (The camera-ready version will be posted here soon.)
I teach courses and seminars on program analysis. If you're near the University of Utah, please drop by, or come get your M.S. or Ph.D.!

A reference implementation of k-CFA in R5RS Scheme

The implementation below of k-CFA is a tutorial/reference implementation. That is, it's purely functional, it uses convenient data structures over efficient data structures, and it doesn't use any form of widening to accelerate convergence. (In fact, it is exponential time instead of cubic.)

The implementation is based on the mathematics found in my Ph.D. dissertation. It uses a small-step abstract interpreter that transforms an abstract machine state into possible successor states. To compute the analysis, it crawls the graph of machine states (starting from an initial state) until all potentially reachable states have been found. The resulting set of reachable states is then summarized to yield a composite abstract heap. This abstract heap is further summarized to produce a map from variables to the lambda terms which may flow to them.

A reference implementation of k-CFA in PLT Scheme

Translation courtesy of Jay McCarthy.

An implementation in Java

Courtesy of Jens Nicolay.

k-CFA in Java.