Advanced Topics in Compilation

Instructor	Matt Might
Catalog num.	15536
Time	Tues. and Thurs. 10:45am - 12:05pm
Office hours	Tues. 12:05pm - 2:05pm. Or, drop by any time my door is open. Or, set up an appointment by email.
Location	MEB 3105
Google group	Advanced Compilers, Fall 2009

Notices

Undergrads: Shoot me an email if you need a permission code to sign up for the course.

Disclaimer: This syllabus is tentative and subject to change.

Prerequisites

Technically, the 5000-level compilers course is a pre-requisite. This is meant to be more of a maturity requirement, as this course will strive to be standalone.

Course summary and objective

Compilation transforms high-level programming languages into equivalent low-level machine instructions or virtual machine bytecodes. Modern high-level programming languages are full of powerful features, such as garbage collection, higher-order functions, continuations, macros, templates and advanced type systems. Compiling such languages efficiently and correctly is the central challenge addressed in this course.

At present, the intent is to focus on compilation of functional programming languages.

Topics

Syntactic sugar:
- Church encodings
Advanced preprocessors:
- m4 language
Advanced techniques in lexing:

Derivatives of regular expressions.

Advanced techniques in parsing:

Parser combinators.
Parsing expression grammars.
LL(*) parser generators.
LR parser generators.
Unrestricted CFG parsing.

Generative programming features:

Mixed-hygiene macros.
Templates.
Reflection.

Static semantics/type systems:

Polymorphic type systems.
Hindley-Milner type systems.
Linear and affine type systems.

Intermediate representations:

Administrative normal form (ANF).
Continuation-passing style (CPS).
Register-transfer languages (RTL).
Static single-assignment form (SSA).

Program transformations:

Closure conversion.
Lambda lifting.
Defunctionalization.

Data layout:

Frame layout.
Stackless computation.
Tagged words.

Static analysis:

Abstract interpretation.
Data-flow analysis.
Higher-order control-flow analysis

Optimization:

Constant propagation.
Constant folding.
Inlining.
Redundancy elimination.
Instruction ordering.
Null-check removal.
Bounds-check removal.

Instruction selection:

Maximal munch.
Dynamic programming.
Selecting for speed v. power.

Code generation:

Code obfusaction.
C as intermediate representation.

Sethi-Ullman.
Graph-coloring.

Garbage collection:

Stop-and-copy.
Parallel.
Real-time.

Virtual machines:

Just-in-time compilation.

Run-time systems.
...

Materials

Lecture notes will be updated weekly.

Slides are also available.

Any papers assigned for reading will be provided as pdfs.

Related blog entries

Recommended resources

Sussman and Steele's original lambda papers are a classic reference on how to compile functional languages efficiently.
Structure and Interpretation of Computer Programs is the classic textbook which explores concepts in computation through the lense of Scheme. For years, it was the introductory freshman CS book at MIT. Chapter 4 describes how to build evaluators and optimizing partial evaluators. (Free online.)
How to Design Programs, the Felleisen, Findler, Flatt and Krishnamurthi book on test-driven program development in Scheme. (Free online.)
Lisp in Small Pieces is a broad and deep reference on the compilation of the Lisp family of languages. While the emphasis is on Scheme, the techniques in this book work for advanced features of both functional and object-oriented languages.
The newly updated Compilers: Principles, Techniques and Tools is the reference book for compiling imperative languages like C.

Everyone in compilers owns this book or its predecessor.
Compiling with Continuations is a classic on compiling with continuation-passing style as a high-level IR, but it is definitely an advanced text:

Since continuation-passing style is undergoing a renaissance, this book is back in print. Even though it's old, it's still state-of-the-art, because that it was written at the zenith of the first continuation-passing style epoch. Many of the techniques in this book are now being rediscovered independently.
Scala seemlessly fuses object-oriented and functional programming, and it is a strictly superior replacement for Java. Programming in Scala was co-authored by the language's creator, Martin Odersky:

The book is a complete reference on the Scala language, and it's a practical tutorial for using Scala for real-world programming. Scala has a rich library and it's compatible with the JVM, which means it interacts cleanly with Java programs and the Java API.
The guide on C++ templates and C++ template meta-programming:

Grading

[This may change prior to the start of the course.]

Your final grade is the better of:

your average grade on the exams; or
your average grade for the projects.

plus any bonus points earned throughout the semester.

In other words, if you get an A average on the exams, then you get an A in the class, no matter how badly you did on the projects. Or, if you get a B average on the projects, but an F average on the exams, then you still get a B for the class.

Collaboration policy

For projects, share anything but source code. Sharing of benchmarks is encouraged.

Bonus

There are bonus opportunities to add percentage points to your final grade:

Write or significantly clean and expand a wikipedia article on a topic related to the class. (+2%)
Write a peer-review-quality report of a research paper. (+5%)
Implement and empirically test an analysis from an unpublished research paper. Contact me to gain access to an archive of eligible unpublished papers. [Warning: This will require mastery of virtually all course material.] (+100%)
Propose your own project. (+x%)
Any other bonus projects announced in class.

Let me know if any of these bonus opportunities sound interesting to you, and we'll define the requirements before you get started.

Late policy

Late assignments will have min(2^d - 1, 100) percentage points subtracted from the final grade of the assignment, where d is the number of days late.

Tentative schedule

This schedule is subject to change. Once a pace for the course is established, the rest of the schedule will be fleshed out.

Date

Topic

Assignments

Project 1: A lexer toolkit

[show]

Due: Tues 29 Sep 2009, 11:59:59pm

In this project, you will implement a lexer toolkit based on generalized regular state machines (GRSM). To make this project feasible, it is recommended that you use derivatives of regular expressions. You will use this toolkit to implement a lexer for S-Expressions.

See the course notes for a description of a GRSM and derivatives of regular expressions.

Your toolkit

Lexer-generators like lex accept rules of the following form:

if    pattern pat matches
while in state state
then  perform action action.

For example, in lex each rule has the form:

<state>  pat  { action }

Your lexer toolkit will need a data structure for encoding these kinds of rules. For example, in Java, you might use:

  class LexRule<A> {
   DA          pattern ;
   Action<A>   action ;
  }

  // DA = Deterministic automaton.
  class DA {
   public DAState initState() ;
  }

  class DAState {
   public boolean isAccepting() ;
   public DAState next(char c) ;
  }

  abstract class Action<A> {
   public A onMatch(String text, LazyList<Character> remainder) ;
  }

An indivdual state in the lexer is a collection of rules, and might be encoded like:

 class LexState <A> {
  // The rules. 
  protected Array<LexRule<A>> rules ;

  // Called if end of stream is reached without a match.
  protected A onMatchFail(String unmatchableText) ;

  public LexState<A>(...) { ... }

  // Executes the action of the rule with the longest match.
  public A match(LazyList<Character> input) {
    ...
  } 
 }

Next, you will need an embedding of regular expressions to represent patterns. You don't need to parse regular expressions from an input source. Instead, patterns will be encoded directly as the abstract syntax trees of regular expressions. To avoid tedium, this is a good place to embed a domain-specific language for regular expressions. At a minimum, you'll want a structure that looks like:

 abstract class RegExp {
  // All regexps need to be convertible to DA's for matching.
  public DA toDA() ;
 }

 // Empty set:
 class NullSet extends RegExp {
 }

 // Empty string:
 class Blank extends RegExp {
 }

 // Single character:
 class Char extends RegExp {
  public Char(char c) { ... } 
 }
 
 // r1 r2 ==> new Seq(r1,r2)
 class Seq extends RegExp {
  public Seq (RegExp first, RegExp second) { ... }
 }

 // r1 | r2 ==> new Alt(r1,r2)
 class Alt extends RegExp {
  public Seq (RegExp choice1, RegExp choice2) { ... }
 }

 // r* ==> new Rep(r)
 class Rep extends RegExp {
  public Rep (RegExp pat) { ... }
 }

Lexical specification for S-Expressions

S-Expressions have comments, whitespace, parentheses, quoting symbols, symbols, integers, characters, booleans and strings.

Comments begin with a semicolon ; and end in a newline, and should be ignored.
A character sequence starting with #! and ending with !# should also be ignored as a comment. These "hash-bang" comments are allowed to properly nest, so that #! foo #! bar !# baz !# is a comment.
Whitespace can be a newline, carriage return, space or tab character, and should be ignored.
A left parenthesis ( should be lexed as the token (left-parenthesis).
A right parenthesis ) should be lexed as the token (right-parenthesis).
A left bracket [ should be lexed as the token (left-bracket).
A right bracket ] should be lexed as the token (right-bracket).
A backtick quote ` should be lexed as (backtick).
A regular quote ' should be lexed as (tick).
A comma , should be lexed as (comma).
A unsplicing comma ,@ should be lexed as (comma-unsplice).
An integer (-)?[0-9]+ should be lexed as (integer integer-value)

A character has the form #\char-name, and should be lexed as (char char-name); char-name may be any character allowed in integers and symobols, and uses the following mapping otherwise:

character	char-name
`space`	`space`
`newline`	`newline`
tab	`tab`
`(`	`left-parenthesis`
`)`	`right-parenthesis`
`[`	`left-bracket`
`]`	`right-bracket`
`	`backtick`
`'`	`tick`
`"`	`doublequote`
`,`	`comma`

For example, the token #\" would be lexed as (char doublequote).

A string begins with " and ends with ". The double quote character may appear in a string by being escaped with \. The \ character may be escaped with itself. The sequence \n is interpreted as a newline character. Strings may be spread across multiple lines. A string should be lexed as (text (character-list)), where character-list should contain the characters inside the string based on zero or more repetitions of (char char-name). For example:
```
"foo
bar"
```
would lex into (text ((char f) (char o) (char o) (char newline) (char b) (char a) (char r))).
A boolean is either #t or #f, and should be lexed as (boolean boolean-value), where boolean-value is either true or false.
A symbol is any sequence of characters which does not qualify as another lexical item, and should be lexed as (symbol (character-list))

For this project, the lexer will just output a sequence of tokens to stdout, separated by newlines. For example:

  #!/bin/bash
  scheme $0 $@ #! run scheme      !#
  exit         #! quit the script !#
  !#

  (define (id x)
   ; Return the result.
    x)

  (id 3)

would produce:

  (left-parenthesis)
  (symbol ((char d) (char e) (char f) (char i) (char n) (char e)))
  (left-parenthesis)
  (symbol ((char i) (char d)))
  (symbol ((char x)))
  (right-parenthesis)
  (symbol ((char x)))
  (right-parenthesis)
  (left-parenthesis)
  (symbol ((char i) (char d)))
  (integer 3)
  (right-parenthesis)

Deliverables

A short README document describing the architecture of your lexer toolkit, including what works, what doesn't work, and what almost works. To cover your a**, include any reasonable assumptions you made about the project description.
Readable, well-commented code.
A script called run-sx.sh, so that when I cd to the directory containing run-sx.sh, and run sh run-sx.sh < filename.sx, it produces a sequence of tokens for S-Expressions.
Email me zip file that, when unzipped, produces the following directory structure:
```
    ./lastname-firstname/README
    ./lastname-firstname/run-sx.sh
```
If you could put your code in ./lastname-firstname/src/, I would appreciate it. I will confirm receipt of submissions within 24 hours (but probably much sooner than that).

Resources

Lazy-list-based streams in Scala.
Regular expressions to NFAs with an embedded language in Java.
Matching regular expressions with derivatives in Scheme.

A reference implementation is available. To run it:

scala -cp lib org.ucombinator.languages.lexing.SExpLexerExample \
   < filename.sx

Restrictions

You can't use a lexer generator for this.
If you share code, you fail the project. Keep in mind that I will be reading your code, and I have caught cheaters in the past.

Project 2: Evaluators, partial-evaluator and macros

[show]

Due: Thu 29 Oct 2009, 11:59:59pm

In this project, you will implement an evaluator for a small subset of the Scheme language extended with first-class macros.

The BNF grammar for the language is:

<program> ::= <body>

<body> ::= <def>^*  <exp>^* 

<def> ::= (define (<var> <var>^*) <body>)
       |  (define <var> <exp>)

<exp> ::= <literal>
       |  <variable>
       |  (if <exp> <exp> <exp>)
       |  (if <exp> <exp>)
       |  (and <exp>^*)
       |  (or <exp>^*)
       |  (cond <cond-clause>^*)
       |  (set! <var> <exp>)
       |  (lambda <formals> <body>)
       |  (let <bindings> <body>)
       |  (letrec <bindings> <body>)
       |  (begin <body>)
       |  (macro <exp>)
       |  (<operator> <operand>^*)
       |  (<keyword> <s-exp>^*)

<literal> ::= (quote <s-exp>) | '<s-exp>
           |  <boolean>
           |  <string>
           |  <var>

<s-exp> ::= <symbol>
         |  <boolean>
         |  <number>
         |  <string>
         |  (<s-exp>^*)

<cond-clause> ::= (<exp> <exp>⁺)
               |  (<exp>)
               |  (<exp> => <exp>)
               |  (else <exp>)

<formals> ::= (<var>^*)
             | <var>

<bindings> ::= ((<var> <exp>)^*)

<operator> ::= <exp>

<operand> ::= <exp>

The initial environment for the evaluator should contain bindings for the following symbols:

 apply + not display newline cons car cdr cadr caadr cadar
 cddr cdddr list null? pair? list? number? string? symbol? eq? = gensym

First-class macros

The language should behave like R5RS Scheme except for the (macro ...) construct, which creates first-class macros.

Macros in Scheme/Lisp-like languages extend the syntax of the language by re-writing expressions whose "head" is a macro and then performing evaluation.

For this project, the expression (macro e) is equivalent to (list 'macro e).

When the evaluator encounters an application point:

  (e s-exp₁ ... s-exp_n)

if the expression e evaluates to the s-expression (macro f), then the syntax of the application point is transformed into the result of f(s-exp₁, ..., s-exp_n), and then re-evaluated.

Resources

R5RS.
My testing file: tests.scm.
First-class macros in Arc.
A reference interpreter, compiled with Gambit-C 5.4.2: interp.c. To compile: gsc -exe interpreter.c. The interpreter wraps the input in a (begin ...) expression.
A denotational interpreter in Scala.
Chapter 4 of SICP contains source code for a meta-circular evaluator.

Advice

Use the same data structure to represent run-time values that you use to represent syntax.
Even if you don't know Scheme, I recommend doing this project in Scheme. (It's about 500 lines of well-written, well-abstracted Scheme.)
If you do use Scala, make heavy use of custom patterns.

Turn-in

Details to come.

Project 3: Compiling with continuations

[show]

Due: 10 Dec 2009, 11:59:59pm

In this project, you will implement a Scheme-to-CPS-to-C compiler.

Late policy

The usual late policy does not apply for this project: 1 day late = 10% off, 2 days late = 20% off, 3 days late = 40% off, 4 days late = 80% off.

Grading

The project will be self-graded with a random audit of a third of submissions. Once all projects are in, a test suite will be released.

Input language

The BNF grammar for the input language is:

<program> ::= <body>

<body> ::= <def>^*  <exp>^* 

<def> ::= (define (<var> <var>^*) <body>)
       |  (define <var> <exp>)

<exp> ::= <literal>
       |  <variable>
       |  (if <exp> <exp> <exp>)
       |  (if <exp> <exp>)
       |  (and <exp>^*)
       |  (or <exp>^*)
       |  (cond <cond-clause>^*)
       |  (set! <var> <exp>)
       |  (lambda (<var>^*) <body>)
       |  (let <bindings> <body>)
       |  (letrec <bindings> <body>)
       |  (begin <body>)
       |  (<operator> <operand>^*)

<literal> ::= (quote <symbol>) | '<symbol>
           |  <integer>
           |  <boolean>
           |  <string>
           |  <var>

<cond-clause> ::= (<exp> <exp>⁺)
               |  (<exp>)
               |  (<exp> => <exp>)
               |  (else <exp>)

<bindings> ::= ((<var> <exp>)^*)

<operator> ::= <exp>

<operand> ::= <exp>

You should support the following primitives:

 display newline + * - / = < call/cc

You do not have to support multi-precision arithmetic. 32-bit or 64-bit integers are OK.

Desugared language

It is recommended that you desugar into this language:

 
 <exp> ::= <literal>
        |  <variable>
        |  (if <exp> <exp> <exp>)
        |  (set! <var> <exp>)
        |  (lambda (<var>^*) <body>)
        |  (begin <body>)
        |  (<exp> <exp>^*)

CPS language

The final CPS language should be:

 
 <exp> ::= <literal>
        |  <variable>
        |  (lambda <formals> <call>)

 <call> ::= (if <exp> <call> <call>)
         |  (begin (set! <var> <exp>) <call>)
         |  (<exp> <exp>^*)

Passing the flag --output=cps to the compiler should produce this language.

It is worth noting that the output CPS language allows variable-arity lambdas, but the input language does not. (This is useful for desugaring begin expressions.)

Your output language can use the following primitives:

 display-cps newline-cps +-cps *-cps --cps /-cps =-cps <-cps halt

These primitives take the current continuation as their *last* argument.

C output

For the C output, you have two implementation choices: tail-calls and first-class labels. Tail calls are ANSI C; first-class labels are implementation-specific. I will only require that your code compiles with GCC.

You are encouraged to use a closure conversion technique similar to the one found in this blog post. Of course, when you closure convert in CPS, a label will do instead of a function pointer.

Recommendations

You can exploit a lot of existing code.
Feel free to use more than one implementation language.
If you get stuck, ask questions!
There are several ways to compile CPS to C.
Compile a few examples by hand before you write any code.

Resources

SICP.
HtDP.
R5RS.
A First-Order One-Pass CPS Transformation by Olivier Danvy and Lasse R. Nielsen.
On One-Pass CPS Transformations by Olivier Danvy, Kevin Millikin, and Lasse R. Nielsen.
Programming with continuations.
Compiling Scheme to C.