Writing an interpreter, CESK-style

[article index] [] [@mattmight] [rss]

Matthias Felleisen's CESK machine[1] provides a simple yet powerful architecture for implementing interpreters (among many other benefits).

The CESK approach easily models languages with rich features like:

  • mutation;
  • recursion;
  • exceptions;
  • first-class continuations;
  • garbage collection; and
  • multi-threading.

The host language need not support any of these features.

The CESK machine is a state-machine in which each state has four components: a (C)ontrol component, an (E)nvironment, a (S)tore and a (K)ontinuation. One might imagine these respectively as the instruction pointer, the local variables, the heap and the stack.

This article discusses how to build a CESK machine for A-Normalized lambda calculus (ANF), a high-level intermediate representation for functional programs.

A working interpreter is provided in Racket.


See page 60 of Matthias Felleisen's dissertation for a definition.

Machine-based interpreters

The CESK machine is a machine-based interpreter.

(Most would actually call it a semantics, since it is formally defined.)

At a high-level, a machine-based interpreter has four components:

  1. $\mathit{Prog}$ -- the set of programs.
  2. $\Sigma$ -- the set of machine states.
  3. $\mathit{inject} : \mathit{Prog} \to \Sigma$ -- a program to initial-state injection function.
  4. $\mathit{step} : \Sigma \rightharpoonup \Sigma$ -- a (partial) state to state transition function.

Given a program $p \in \mathit{Prog}$, the interpreter first injects it into an initial machine state $\varsigma_0$:

\[ \varsigma_0 = \mathit{inject}(p)\text. \]

The algorithm for running the interpreter is then simple:

 ς := ς0
 while ς is defined for step:
    ς := step(ς)

A note on determinism

The structure of the $\mathit{step}$ function assumes deterministic evaluation. Nondeterministic evaluation requires a function that yields multiple potential successor states: $\mathit{step} : \Sigma \rightarrow \mathcal{P}(\Sigma)$.

C, E, S and K

The CESK machine takes its name from the four components that comprise a state of its execution: the control string, the environment, the store and the continuation.

C

Depending on the language being interpreted, the control string could be as simple as a program counter, a statement or an expression.

In this article, the control string is an expression.

E

The environment is a structure, almost always a map, that associates variables with an address in the store.

The environment can be implemented as a purely functional map (hash- or tree-based) or even directly as a first-class function.

S

The store, which some might liken to a heap or memory, is a map from addresses to values.

Like the environment, the store can be a map (hash- or tree-based) or a first-class function.

K

The continuation component is a representation of the program stack, often times represented exactly as a list of frames, or as an implicitly linked list.

A-Normal Form

A-Normal Form is a normalized variant of the lambda-calculus.

Transforming a language to ANF is straightforward, and it simplifies the structure of an interpreter.

Here's a sample BNF grammar for a reasonable variant on ANF:

 lam ::= (λ (var1 ... varN) exp)

 aexp ::= lam
       |  var
       |  #t  |  #f
       |  integer
       |  (prim aexp1 ... aexpN)

 cexp ::= (aexp0 aexp1 ... aexpN)
       |  (if aexp exp exp)
       |  (call/cc aexp)
       |  (set! var aexp)
       |  (letrec ((var1 aexp1) ... (varN aexpN)) exp)

 exp ::= aexp
      |  cexp
      |  (let ((var exp)) exp)

 prim ::= +  |  -  |  *  |  =

There are three kinds of expressions:

  • Atomic expressions (aexp) are those for which evaluation must always terminate, never cause an error and never produce a side effect.
  • Complex expressions (cexp) may not terminate, may produce an error and may have a side effect. However, a complex expression may defer execution to only one other complex expression. For instance, letrec defers directly to its body, and if defers to only one of its arms.
  • Expressions (exp) can be atomic, complex or let-bound. A let-bound expression will first defer execution to the binding expression, and then resume execution in the body.

This structure forces order of evaluation to be specified syntactically.

For instance, the meaning of the expression ((f x) (g y)) is undefined until we know whether (f x) or (g y) is executed first. In ANF, this expression is illegal, and must be written:

(let ((fx (f x)))
 (let ((gy (g y)))
   (fx gy)))
or
(let ((gy (g y)))
 (let ((fx (f x)))
   (fx gy)))

so that there is no ambiguity.

A formal definition

A formal definition of the CESK machine guides the code.

If you're unfamiliar with formal mathematical notation, you may want to review my article on the connection between discrete mathematics and code.

If you're only interested in running code, skip ahead.

The state-space, $\Sigma$, of the CESK machine for ANF has four components:

\[ \varsigma \in \Sigma = \mathsf{Exp} \times \mathit{Env} \times \mathit{Store} \times \mathit{Kont} \text. \]

In case it's not clear, $\mathsf{Exp}$ is the set of of all expressions defined by the earlier grammar. Also, the notation $\varsigma \in \Sigma$ is a hint that the symbol $\varsigma$ will be used to denote members of the set $\Sigma$.

Environments

The environment in a machine state is a partial function that maps a variable to its address:

\[ \rho \in \mathit{Env} = \mathsf{Var} \rightharpoonup \mathit{Addr} \]

It has to be a partial function, because not all variables are in every scope.

Once again, the hint $\rho \in \mathit{Env}$ indicates that the symbol $\rho$ will be used to denote environments.

Stores

A store maps addresses to values:

\[ \sigma \in \mathit{Store} = \mathit{Addr} \rightharpoonup \mathit{Value} \text. \]

In a CESK machine, variable look-up is a two-stage process: first to an address (through some environment), then to a value (through the store).

Values

There are five kinds of values in this machine--void, booleans, integers, closures and first-class continuations:

\[ \mathit{val} \in \mathit{Value} ::= \mathbf{void} \mathrel{|} \mathit{z} \mathrel{|} \mathtt{\#t} \mathrel{|} \mathtt{\#f} \mathrel{|} \mathbf{clo}(\mathit{lam}, \rho) \mathrel{|} \mathbf{cont}(\kappa) \]

In the set of values, $z$ is an integer, while $\mathtt{\#t}$ and $\mathtt{\#f}$ are booleans.

A closure pairs a lambda term with an environment to define the values of its free variables. The environment is necessary because a term like (λ () x) is undefined, unless an environment specifies the value of x.

Continuations are included in values because the language includes call/cc, which enables the creation of first-class continuations.

Continuations

A continuation is effectively the program stack.

Creating a continuation allows us to divert to a complex sub-computation and return later.

So, a continuation needs enough information to resume execution.

For this machine, diverting to a sub-computation only happens in let-bound expressions.

Given a let-bound expression (let ([v exp]) body), execution will first go to exp, which means that when it finishes evaluating exp, the result will bind to v, and execution will resume with body.

In a CESK machine, the assumption is that the current computation is always executing on behalf of some continuation awaiting its result. (The special initial continuation, which awaits the result of the program, is called $\mathbf{halt}$.)

Consequently, continuations nest within continuations.

Finally, every continuation must contain the local environment that knows the addresses of the variables in scope.

Putting this all together lets us formally define the space of continuations:

\[ \kappa \in \mathit{Kont} ::= \mathbf{letk}(v, \rho, e, \kappa) \mathrel{|} \mathbf{halt} \]

Evaluating atomic expressions

Atomic expressions ($\mathit{aexp}$ in the grammar) are easy to evaluate with an auxilary semantic function, $\mathcal{A} : \mathsf{AExp} \times \mathit{Env} \times \mathit{Store} \rightharpoonup \mathit{Value}$:

Variables get looked up in the environment:

\[ \mathcal{A}(v, \rho, \sigma) = \sigma(\rho(v)) \]

Integers evaluate to themselves:

\[ \mathcal{A}(z, \rho, \sigma) = z \text. \]

Booleans do too:

\[ \mathcal{A}( \mathtt{\#t}, \rho, \sigma) = \#t \text; \] \[ \mathcal{A}( \mathtt{\#f}, \rho, \sigma) = \#f \text. \]

Lambda terms become closures:

\[ \mathcal{A}( \mathit{lam}, \rho, \sigma) = \mathbf{clo}( \mathit{lam}, \rho ) \text. \]

Primitive expressions are evaluated recursively:

\[ \mathcal{A}( \mathtt{(} \mathit{prim}\; \mathit{aexp}_1 \ldots \mathit{aexp}_n \mathtt{)}, \rho, \sigma) = \\ \mathcal{O}(\mathit{prim}) \langle \mathcal{A}( \mathit{aexp}_1, \rho, \sigma), \ldots, \mathcal{A}( \mathit{aexp}_n, \rho, \sigma) \rangle \text. \]

where $\mathcal{O} : \mathit{Prim} \to (\mathit{Value}^* \rightharpoonup \mathit{Value})$ maps a primitive operation (by name) to its corresponding operation.

Stepping forward

To define the step function, $\mathit{step} : \Sigma \rightharpoonup \Sigma$, for this machine, we need a case for each expression type.

Procedure call

In a procedure call, the $\mathit{step}$ function first evaluates the expression for procedure to be invoked, and then the expressions for the arguments to be supplied.

Then it applies that procedure to those arguments.

\[ \mathit{step}( \mathtt{(} \mathit{aexp}_0 \; \mathit{aexp}_1 \ldots \mathit{aexp}_n \mathtt{)}, \rho, \sigma, \kappa ) = \\ applyproc(proc,\langle \mathit{value}_1, \ldots, \mathit{value}_n \rangle,\sigma,\kappa) \text{, where} \\ \begin{array}{l} \mathit{proc} = \mathcal{A}(\mathit{aexp}_0, \rho, \sigma) \\ \mathit{value}_i = \mathcal{A}(\mathit{aexp}_i, \rho, \sigma) \end{array} \] and $applyproc : \mathit{Value} \times \mathit{Value}^* \times \mathit{Store} \times \mathit{Kont} \rightharpoonup \Sigma$ is an auxiliary function (defined below) that applies a procedure to a value.

Return

When the expression under evaluation is an atomic expression, it indicates that the current sub-computation is finished and we need to return the result to the current continuation, which has been patiently awaiting it:

\[ \mathit{step}(\mathit{aexp}, \rho, \sigma, \kappa) = \mathit{applykont}(\kappa, \mathcal{A}(\mathit{aexp}, \rho, \sigma), \sigma) \text, \]

where the auxilary function $\mathit{applykont} : \mathit{Kont} \times \mathit{Value} \times \mathit{Store} \rightharpoonup \mathit{State}$ (defined below) applies a continuation to a value.

Conditionals

Conditional evaluation is straightforward: the condition is evaluated, and the right expression is chosen for the next state.

\[ \mathit{step}( \mathtt{(if}\; \mathit{aexp}\; \mathit{e}_{\mathrm{true}} \; \mathit{e}_{\mathrm{false}} \mathtt{)}, \rho, \sigma, \kappa) = \begin{cases} ( \mathit{e}_{\mathrm{true}}, \rho, \sigma, \kappa ) & \mathcal{A}(\mathit{aexp}, \rho, \sigma) \neq \mathtt{\#f} \\ ( \mathit{e}_{\mathrm{false}}, \rho, \sigma, \kappa ) & \text{otherwise.} \end{cases} \]

Let

Evaluating let will force the creation of a continuation.

Since execution first evaluates the bound expression, the continuation will contain enough information to resume execution in the body of the let.

\[ \mathit{step}( \mathtt{(let\; ([}\mathit{v}\; \mathit{exp} \mathtt{])}\; \mathit{body} \mathtt{)}, \rho, \sigma, \kappa) = (\mathit{exp}, \rho, \sigma, \kappa')\text, \]

where $\kappa' = \mathbf{letk}(v,\mathit{body}, \rho, \kappa)$.

Mutation

The CESK approach makes mutation straightforward: look up the address to be changed, and then overwrite that address in the store. \[ \mathit{step}( \mathtt{(set!}\; \mathit{v}\; \mathit{aexp} \mathtt{)}, \rho, \sigma, \kappa) = \mathit{applykont}(\kappa,\mathbf{void},\sigma') \text, \] where $\sigma' = \sigma[\rho(v) \mapsto \mathcal{A}(\mathit{aexp}, \rho, \sigma)]$.

Notation. Given a function (or partial function) $f : X \to Y$, the function $f[x \mapsto y]$ is identical to $f$ except that $x$ yields $y$:

\[ (f[x \mapsto y])(x) = y \] \[ (f[x \mapsto y])(x') = f(x') \text{ if } x \neq x' \text. \]

Recursion

Handling recursion requires establishing self-reference. In a language like Scheme, the construct letrec is often compiled into "lets and sets"; that is:

 (letrec ([v1 exp1] 
          ...
          [vN expN])
   body)

becomes:

 (let ([v1 (void)] 
       ...
       [vN (void)])
  (set! v1 exp1)
  ...
  (set! vN expN)
  body)

A CESK machine can fake this by extending the environment first, and then evaluating the expressions in the context of the extended environment:

\[ \mathit{step}( \mathtt{(letrec\; (} \mathtt{[}\mathit{v}_1\; \mathit{aexp}_1\mathtt{]} \ldots \mathtt{[}\mathit{v}_n\; \mathit{aexp}_n\mathtt{]} \mathtt{)}\; \mathit{body} \mathtt{)}, \rho, \sigma, \kappa) = (\mathit{body}, \rho', \sigma', \kappa)\text, \]

where:

\[ a_1,\ldots,a_n \text{ are fresh addresses in } \sigma \\ \rho' = \rho[v_i \mapsto a_i] \\ \mathit{value}_i = \mathcal{A}(\mathit{aexp}_i, \rho', \sigma) \\ \sigma' = \sigma[a_i \mapsto \mathit{value}_i] \text. \]

First-class continuations

First-class continuations are a powerful construct, since they allow the simulation of so many other control constructs. For instance, exceptions are merely syntactic sugar on top of continuations.

And, continuations can do many other things too.

The procedure call/cc captures the current continuation as a first-class procedure:

\[ \mathit{step}( \mathtt{(call/cc}\; \mathit{aexp} \mathtt{)}, \rho, \sigma, \kappa ) = \\ applyproc(proc,\langle \mathit{value}_{\mathrm{cc}} \rangle,\sigma,\kappa) \text{, where} \\ \begin{array}{l} \mathit{proc} = \mathcal{A}(\mathit{aexp}, \rho, \sigma) \\ \mathit{value}_{\mathrm{cc}} = \mathbf{cont}(\kappa) \text. \end{array} \]

Applying procedures

The auxiliar function $applyproc : \mathit{Value} \times \mathit{Value}^* \times \mathit{Store} \times \mathit{Kont} \rightharpoonup \Sigma$ applies a procedure to a value.

\[ \mathit{applyproc}(\mathbf{clo}( \mathtt{(\lambda\; (\mathit{v}_1 \ldots \mathit{v_n})\; \mathit{body})},\rho) \langle \mathit{value}_1, \ldots \mathit{value}_n \rangle, \sigma, \kappa ) = \\ (\mathit{body}, \rho', \sigma', \kappa) \text{, where } \] \[ a_1,\ldots,a_n \text{ are fresh addresses in } \sigma \\ \rho' = \rho[v_i \mapsto a_i] \\ \sigma' = \sigma[a_i \mapsto \mathit{value}_i] \text. \]

Applying continuations

The auxilary function $\mathit{applykont} : \mathit{Kont} \times \mathit{Value} \times \mathit{Store} \rightharpoonup \mathit{State}$ applies a continuation to a return value:

\[ \mathit{applykont}( \mathbf{letk}(v,e,\rho,\kappa), \mathit{value}, \sigma) = (e, \rho[v \mapsto \mathit{a}], \sigma[a \mapsto \mathit{value}], \kappa) \text, \]

where $a \not\in \mathit{dom}(\sigma)$ is a fresh address.

As running code

I've transliterated the math here directly into working Racket code for a CESK interpreter.

Further reading

There are a few good books on implementing compilers and interpreters for functional languages.

The classic MIT text, Structure and Interpretation of Computer Programs, is worth the read:

Lisp in Small Pieces is a consistent recommendation in the courses I teach:

For advanced techniques, Appel's Compiling with Continuations remains my favorite reference:

Related pages