What is static program analysis?

The halting problem asks whether the execution of a specific program for a given input will terminate.

The halting problem is famous for being undecidable.

That is, no algorithm can solve it for all programs and all inputs.

This complicates any attempt to predict program behavior: we can make predicting almost any program behavior equivalent to predicting the termination of a nearly identical program.

Static analyses are algorithms that do their best to defy the undecidability of the halting problem: they attempt to predict program behavior.

Predicting program behavior enables program optimization, security audits, automatic parallelization and, if accurate enough, correctness verification.

The article below illustrates some basic principles behind static analysis, and then builds an analyzer for a register-based machine language in Racket.

It does so without invoking the usual jargon associated with static analysis: lattices, monotonicity or fixed points.

Update: I gave a talk that roughly follows this post at Lambda Lounge:

The problem with the halting problem

The halting problem makes reasoning about program behavior difficult.

Suppose, for instance, that you want to know whether a program might have an out-of-bounds error for an array at the following expression:

  a[i]

First, change all possible exit points in the program into infinite loops.

Now the program is guaranteed to non-terminate.

Then, change the indexing expression into:

 ((i >= 0 && i < a.length) ? a[i] : exit())

If we could solve the halting problem, then we could use it to check for array bounds errors.

Consider the other direction: suppose we have a special program that can determine the presence of array-bounds errors.

Perform the same transformation from above on indexing expressions.

This eliminates all array-bounds errors from the program by transforming them into termination.

Now, transform every original exit point into an explicit array-bounds error, e.g., a[a.length+10].

If the the array-bounds error-checker finds an out-of-bounds error, then it has determined that the original program halts.

If it claims the program is free of such errors, then it has determined that the original program does not halt.

If we can find array-bounds errors, then we can solve the halting problem.

Guaranteeing the safety of array bounds and deciding program termination are equivalent in difficulty.

Both are undecidable.

Evading the halting problem

Evading the halting problem rests on a simple observation: it is not possible to construct a general algorithm for the halting problem for all programs and all inputs.

But, both of the following are possible:

an algorithm that works on some programs for some inputs; and
an algorithm that works on some programs for all inputs.

If the algorithm can say "yes, it halts," "no, it doesn't" or "I don't know," then the halting problem is solvable.

After all, it could always say "I don't know."

Example: Analyzing sign

In Cousot and Cousot's classic introduction of static analysis by abstract interpretation, they begin with the example of sign analysis.

It's an example worth repeating, because it highlights one of the key principles of static analysis: abstraction.

Given an arithmetic expression, it is possible to bound its sign -- negative, zero or positive -- without actually computing it.

We can prove that $4 \times -3$ is negative without computing that it is $-12$.

Let $S$ be the set of signs:

\[ S = \{\mathtt{-},\mathtt{0},\mathtt{+}\}\text. \]

We can define abstract integers as sets of these signs:

\[ \hat{\mathbb{Z}} = \mathbb{P}(S) \text. \]

For any integer, we can convert it to a precise abstraction of its sign using an abstraction function $\alpha : \mathbb{Z} \mapsto \hat{\mathbb{Z}}$:

\[ \alpha(z) = \begin{cases} \{-\} & z < 0 \\ \{0\} & z = 0 \\ \{+\} & z > 0 \text. \end{cases} \]

We can also define addition and multiplication on these abstract integers: $\oplus$ and $\otimes$.

For example, $\{-,0\} \oplus \{-\} = \{-\}$ and $\{-\} \oplus \{+\} = \{-,0,+\}$. And, $\{-\} \otimes \{+,0\} = \{-,0\}$ and $\{-,+\} \otimes \{0\} = \{0\}$.

Thus, to analyze $4 \times -3$, we convert it to: $\alpha(4) \otimes \alpha(-3)$ to get $\{-\}$.

Of course, sometimes we don't have enough precision to compute the answer; consider: $-4 + 4$: $\alpha(-4) \oplus \alpha(4) = \{-,0,+\}$, even though $\alpha(-4 + 4) = \{0\}$.

In this case, if asked for the sign of the resulting expression, the analyzer must report "I don't know."

The principle of abstraction

The key principle at play in static analysis (and explicitly so in static analysis by abstract interpretation) is to provide an approximate interpretation -- an abstract interpretation -- of a program under an abstraction.

In many cases -- like the abstraction of signs -- these abstractions are finite. Finite abstractions guarantee the approximate interpretation will terminate. (There are techniques for using infinite abstractions while ensuring termination, but these are specific to each abstraction.)

A simple language to analyze

To tie the toy example of signs from earlier to static analysis of programs, we're going to define a (very) simple register machine language.

We'll construct a sign analysis of the registers in this language. The analysis will be able to bound the sign of every register at every statement.

The grammar for this machine contains statements and expressions:

 <prog> ::= <stmt> ...

 <stmt> ::= <label> :
         |  goto <label> ;
         |  <var> := <exp> ;
         |  if <exp> goto <label> ;

 <exp> ::= <exp> + <exp>
        |  <exp> * <exp>
        |  <exp> = <exp>
        |  <int>
        |  <var>

For example, the following program computes the factorial of n:

         a := 1 ;
 top:    if n = 0 goto done ;
         a := a * n ;
         n := n + -1 ;
         goto top ;
 done:

To simplify the manipulation of programs, we'll encode them according to this s-expression grammar:

 <prog> ::= <stmt> ...

 <stmt> ::= (label <label>)
         |  (goto <label>) 
         |  (:= <var> <exp>)
         |  (if <exp> goto <label>)

 <exp> ::= (+ <exp> <exp>)
        |  (* <exp> <exp>)
        |  (= <exp> <exp>)
        |  <int>
        |  <var>

A interpreter or analyzer can read and parse s-expressions with Racket's (read) procedure.

We're going to preprocess an input program by constructing a hash table that maps labels to lists of statements:

; stmt-map : label => stmt*
(define stmt-map (make-hash))

; constructs the label-to-statements map:
(define (preprocess stmts)
  (match stmts
    [(cons `(label ,label) rest)
     (hash-set! stmt-map label stmts)
     (preprocess rest)]
    
    [(cons _ rest)
     (preprocess rest)]
    
    ['()
     (void)]))

This table will assist in the interpretation of goto statements.

A simple interpreter

To analyze programs in this language, we first need a concrete semantics.

A concrete semantics is an interpreter that avoids the use of effects.

When a program avoids using effects--when it is purely functional--it is no longer a program; it is a mathematical artefact.

From this mathematical artefact, we can derive and prove the correctness of static program analyses.

The meaning of expressions

It's easiest to define the meaning of expressions: we need a function to map an expression to its value.

Of course, expressions contain variables, so that function needs to take a map from variables to values. We call that map an environment, and in code, we will model environments as immutable hash maps.

Putting it all together:

; exp-eval : exp env -> integer
(define (exp-eval exp env)
  (match exp
    [(? symbol?)
     ;=>
     (hash-ref env exp)]
    
    [(? integer?)
     ;=>
     exp]
    
    [`(+ ,exp1 ,exp2)
     ;=>
     (+ (exp-eval exp1 env)
        (exp-eval exp2 env))]
    
    [`(* ,exp1 ,exp2)
     ;=>
     (* (exp-eval exp1 env)
        (exp-eval exp2 env))]
    
    [`(= ,exp1 ,exp2)
     ;=>
     (if (= (exp-eval exp1 env)
            (exp-eval exp2 env))
         1
         0)]))

The meaning of statements

To define the meaning of statements, we first need to define a machine state.

A machine state pairs a list of statements to be executed with an environment:

(struct state {stmts env})

To run a program -- a list of statements -- we inject it into an initial state:

(define (inject prog)
  (state prog env0))

Execution proceeds as steps from one state to the next, until termination:

(define (run prog)
  (preprocess prog)
  (define current-state (inject prog))
  (while (state? current-state)
    (set! current-state (step current-state))))

The current statement determines the step itself:

(define (step state0)
  
  (define stmts (state-stmts state0))
  
  (define env (state-env state0))
  
  (match stmts
    ['()
     ;=>
     env]
    
    [(cons `(label ,l) rest)
     ;=>
     (state rest env)]
    
    [(cons `(:= ,var ,exp) rest)
     ;=>
     (define env* (hash-set env var (exp-eval exp env)))
     (state rest env*)]
    
    [(cons `(if ,exp goto ,label) rest)
     ;=>
     (define condition (exp-eval exp env))
     (if (not (= condition 0))
         (state (hash-ref stmt-map label) env)
         (state rest env))]
    
    [(cons `(goto ,label) rest)
     ;=>
     (state (hash-ref stmt-map label) env)]
         
    [else
     ;=>
     (error (format "unknown instruction: ~a!" (car stmts)))]))

A simple analyzer

The analyzer for the simple register language is going to mimic the interpreter.

But, we're going to replace integers with the abstraction over signs.

It's useful in this case to define that abstraction in code:

(define (α n)
  (cond
    [(< n 0)  {set '-}]
    [(= n 0)  {set  0}]
    [(> n 0)  {set '+}]))

Next, we'll have to define arithmetic over this abstraction:

(define (+/abstract an1 an2)
  (apply set-union
         (for*/list ([s1 an1]
                    [s2 an2])
           (+/α s1 s2))))

(define (+/α s1 s2)
  (match* (s1 s2)
    [('- '-)  {set '-}]
    [('-  0)  {set '-}]
    [('- '+)  {set '- 0 '+}]
    [('0  x)  {set x}]
    [('+ '-)  {set '- 0 '+}]
    [('+  0)  {set '+}]
    [('+ '+)  {set '+}]))

(define (*/abstract an1 an2)
  (apply set-union
         (for*/list ([s1 an1]
                    [s2 an2])
           (*/α s1 s2))))

(define (*/α s1 s2)
  (match* (s1 s2)
    [('- '-)  {set '+}]
    [('-  0)  {set  0}]
    [('- '+)  {set '-}]
    [('0  x)  {set  0}]
    [('+ '-)  {set '-}]
    [('+  0)  {set  0}]
    [('+ '+)  {set '+}]))

Analyzing expressions

The abstract expression evaluator is virtually identical to its concrete cousin:

; exp-eval : exp aenv -> abstract-integer
(define (exp-aeval exp aenv)
  (match exp
    [(? symbol?)
     ;=>
     (hash-ref aenv exp)]
    
    [(? integer?)
     ;=>
     (α exp)]
    
    [`(+ ,exp1 ,exp2)
     ;=>
     (+/abstract (exp-aeval exp1 aenv)
                 (exp-aeval exp2 aenv))]
    
    [`(* ,exp1 ,exp2)
     ;=>
     (*/abstract (exp-aeval exp1 aenv)
                 (exp-aeval exp2 aenv))]
        
    [`(= ,exp1 ,exp2)
     ;=>
     {set 0 '+}]))

Analyzing statements

Abstract states have the same structure as concrete states; in code, we mark them transparent to make them eligible for structural equality comparisons:

(struct astate {stmts aenv}
  #:transparent)

Steps in the abstract look virtually the same:

; astep : astate -> astate*
(define (astep astate0)
  
  (define stmts (astate-stmts astate0))
  
  (define aenv (astate-aenv astate0))
  
  (match stmts
    ['()
     ;=>
     aenv]
    
    [(cons `(label ,l) rest)
     ;=>
     (list (astate rest aenv))]
    
    [(cons `(:= ,var ,exp) rest)
     ;=>
     (define aenv* (hash-set aenv var (exp-aeval exp aenv)))
     (list (astate rest aenv*))]
    
    [(cons `(if ,exp goto ,label) rest)
     ;=>
     (define condition (exp-aeval exp aenv))
     (list
      (astate (hash-ref stmt-map label) aenv)
      (astate rest aenv))]
    
    [(cons `(goto ,label) rest)
     ;=>
     (list 
      (astate (hash-ref stmt-map label) aenv))]
         
    [else
     ;=>
     (error (format "unknown instruction: ~a!" (car stmts)))]))

Analyzing programs

The biggest change from the concrete interpreter to the abstract interpreter is from the run to the analyze procedure:

(define (analyze prog)
  
  (preprocess prog)
  
  ; the initial abstract state:
  (define astate0 (ainject prog))
  
  ; the set of all states ever seen:
  (define visited (set))
  
  ; marks a state as seen:
  (define (mark-seen! astate)
    (set! visited (set-add visited astate)))
  
  ; checks if a state is seen:
  (define (seen? astate)
    (set-member? visited astate))
  
  ; states to explore next:
  (define todo (list astate0))
  
  ; adds states to be explored:
  (define (push-todos astates)
    (set! todo (append astates todo)))
  
  ; grabs the next state to be explored:
  (define (pop-todo)
    (define next (car todo))
    (set! todo (cdr todo))
    next)
  
  (while (not (null? todo))
    (define curr (pop-todo))
    (when (not (seen? curr))
      (mark-seen! curr)
      (define succs (astep curr))
      (when (list? succs)
        (push-todos succs))))
  
  ; return all visited states:
  visited)

The analyze procedure performs a graph-exploration of the astep function, starting with the initial abstract state: (ainject prog).

What have we built?

At the end of the analysis, you get back a collection of visited states.

From this collection, it is possible to conservatively bound the sign of all registers at every program point.

For a given statement $s$ and a given variable $v$, you could find every abstract state at $s$, and then look up $v$ in its environment to find its possible signs.

Even a toy analysis like this not necessarily useless in practice.

In a language like Java, the compiler must insert checks at every array indexing operation for both under- and over-flow.

Simple sign analysis can frequently eliminate all under-flow checks.

What's next?

As one adds features to a language, it ramps up the complexity of the interpreter.

The complexity of an analyzer rises in tandem.

Research in static analysis focuses on combatting and exploiting that rise.

As a field of study, static analysis aims to analyze richer languages in less time with higher precision while recovering deeper kinds of facts.

Code

Code for all of the above is available in Racket.

Exercises

The abstract expression evaluator could be more precise. It naively returns the set $\{0,+\}$ for any comparison. But, there could be enough information to compute an "exact" outcome -- $\{0\}$ or $\{+\}$. Improve the exp-aeval procedure to compute the exact answer when enough information is available from the sign abstraction over integers.
The abstract step function could be more precise. There may be enough information in the abstraction of the conditional value to branch in only one direction, instead of both. Improve the astep function to branch in only one direction whenever possible for a conditional jump.
There is a further way to improve the precision of the abstract step function. Even when there is not enough information to branch in only one direction, it may be possible to modify the environment on either branch to reflect the information picked up from the conditional. For instance, if the conditional is (= x 0), then on the false branch, the environment should not report 0 as a possible value of x.
There is further room to improve the precision of the abstract expression evaluator. Consider an expression like (* x x). This expression can only yield a positive value. Or, if x is strictly positive, then (+ x -1) cannot be negative. Incorporate these rules and one other into the abstract expression evaluator.