Writing CEK-style interpreters (or semantics) in Haskell

[article index] [] [@mattmight] [+mattmight] [rss]

The CEK machine is a mechanical model of the lambda-calculus developed by Matthias Felleisen and Dan Friedman.

As a mechanical model, the CEK machine provides a realistic yet abstract description of how a computer would efficiently execute a program.

According to Google image search, this is also a CEK machine.

CEK machines are a natural design strategy for writing interpreters.

Besides being efficient, CEK-style interpreters have many useful properties:

  • It's easy to add complex language features like continuations.
  • It's easy to freeze interpretation and get a stack trace.
  • It's easy to introduce threads into the language (or fake them!)
  • And, it's easy to step execution in a debugger.

My collaborator David Van Horn and I have recently discovered that CEK machines make an excellent starting point for static analysis as well.

Read below for an introduction to the CEK machine with code in Haskell.

[If you like the tip of this iceberg, I strongly recommend the recent book on Semantics Engineering by Felleisen, Findler and Flatt.]

Background: λ-calculus

If you're not yet familiar with the lambda-calculus, or if you don't fully appreciate its universality as a programming language, you might want to first read a couple other blog posts before charging ahead:

Or, grab Benjamin Pierce's Types and Programming Languages, which has a full, detailed treatment of the lambda-calculus:

For the machine below, we'll import a Syntax module containing an abstract syntax tree datatype for the lambda-calculus:

type Var = String

data Lambda = Var :=> Exp

data Exp = Ref Var      
         | Lam Lambda   
         | Exp :@ Exp  

This table shows how to encode expressions from the lambda-calculus:

λ-calculus   Exp encoding
v Ref "v"
λv . e Lam ("v" :=> e)
f(e) f :@ e

A word on idiomatic Haskell

Most of the Haskell in this article is not idiomatic.

For examples, data constructors take tuples instead of expanding the components into fields.

I took this route to keep the code as aesthetically close to the underlying formal mathematics as possible.

The setting: State machines

Before we implement an interpreter as a CEK machine, let's look at state machines in general, and how they can encode an interpreter.

Mathematically, a state machine is a state-space paired with a transition relation that relates each state to its subsequent states.

In code, a state-space is a type. Values of that type are states.

We'll use the typename Σ for the state-space.

In the simple case, which holds for the CEK machine, the transition relation is deterministic, so a function from one state to another models it fully.

We'll use the name step for the transition relation rendered as a function:

step :: Σ -> Σ

To "execute" a deterministic state-machine, we want to move from state to state until we hit a "final" state.

The function terminal does this:

terminal :: (Σ -> Σ) -> (Σ -> Bool) -> Σ -> Σ
terminal step isFinal ς0 | isFinal ς0 = ς0
                         | otherwise  = terminal step isFinal (step(ς0))

The function terminal takes the deterministic transition relation step; the predicate isFinal that indicates whether a state has no successor; and the initial state ς0 from which to start.

To model an interpreter as a state machine, we need an inject function that maps a program into a state:

inject :: Program -> Σ

In the case of the lambda-calculus, type Program = Exp.

Then, to use terminal, we need a final-state detector:

isFinal :: Σ -> Bool

Now it's easy to define the evaluation function:

evaluate :: Program -> Σ
evaluate pr = terminal step isFinal (inject(pr)) 

The evaluator maps a program to its final state (or never terminates if the program is non-terminating).

The CEK machine

There are only three components of a state in a CEK machine: a control component, an environment component, and a continuation component.

The control component is the expression currently being evaluated.

The environment component is a map from variables to values. The environment is the local evaluation context of the expression.

The continuation component is a stack of frames. The continuation is the dynamic evaluation context of the expression. Each frame marks a context awaiting the value of an expression.

In Haskell, the state-space (Σ) is a triple:

type Σ    = (Exp,Env,Kont)
data D    = Clo (Lambda, Env)
type Env  = Var -> D
data Kont = Mt
          | Ar (Exp,Env,Kont)
          | Fn (Lambda,Env,Kont)

The type D contains values. For the basic lambda-calculus, there is only one kind of value: the closure. A closure is a lambda term paired with an environment that defines the values of its free variables.

If we wanted to add more values to the language, like integers or strings, they would show up as variants in the datatype D.

(The choice of the name D is rooted in the history of programming languages. It means both the "domain of values" and "denotable values.")

There are three kinds of continuations: the empty continuation (Mt), the "I hold an argument to evaluate" continuation (Ar) and the "I contain an evaluated function, and now I'm evaluating an argument term" continuation (Fn).

To use this machine, we first inject expressions into states:

inject :: Exp -> Σ
inject (e) = (e, ρ0, Mt)
 where ρ0 :: Env
       ρ0 = \ x -> error $ "no binding for " ++ x

The step function moves execution forward by one step:

step :: Σ -> Σ

step (Ref x, ρ, κ)                    
   = (Lam lam,ρ',κ) where Clo (lam, ρ') = ρ(x)

step (f :@ e, ρ, κ)                  
   = (f, ρ,  Ar(e, ρ, κ))

step (Lam lam, ρ, Ar(e, ρ', κ))      
   = (e, ρ', Fn(lam, ρ, κ))

step (Lam lam, ρ, Fn(x :=> e, ρ', κ)) 
   = (e, ρ' // [x ==> Clo (lam, ρ)], κ)

A few auxiliary definitions handle function extension in this code:

(==>) :: a -> b -> (a,b)
(==>) x y = (x,y)

(//) :: Eq a => (a -> b) -> [(a,b)] -> (a -> b)
(//) f [(x,y)] = \ x' ->
                 if (x == x') 
                 then y
                 else f(x')

Given a function f :: a -> b, the function f // [x ==> y] is the identical to the function f, except that if given x it returns y.

We can describe the four cases for step informally:

  • Evaluating a reference? Look it up in the environment.
  • Evaluating a function application? First evaluate the function.
  • Evaluated the function? Go evaluate the argument term.
  • Evaluated the argument too? Perform the application.

Recognizing final states is easy--just check for the empty continuation:

isFinal :: Σ -> Bool
isFinal (Lam _, ρ, Mt) = True
isFinal _              = False

And, that's it.

Adding more language features consists mostly of adding new kinds of expressions and continuations corresponding to their contexts.

To handle side effects in an interpreter, you can use side effects in the host language, or if the host language is pure (like math or Haskell), you can upgrade to a CESK machine.

I'll treat CESK machines in a future post.

More resources

Semantics Engineering with PLT Redex is an entire book by Matthias Felleisen, Robby Findler and Matthew Flatt on modeling programming languages (and their complex features) with machines:

The best part? The formal mathematics are encoded in the Redex module for Racket so you can execute the mathematics.

Redex also comes with automated testing tools that can (and will) find bugs in your language implementation.

The classic MIT textbook Structure and Interpretation of Computer Programs is essentially a textbook on interpreters:

Related posts