Advanced Topics in Compilation

Instructor Matt Might
TA Thomas Gilray
Catalog num. CS 6475, 16441
Time Mon. and Wed. 11:50 AM-01:10 PM
Office hours After class or by appointment.
Or, set up an appointment by email.
TA office hours Tuesday/Thursday, noon - 2pm (MEB 3375)
Location WEB L126

Notices

Disclaimer: This syllabus is tentative and subject to change.

Prerequisites

Students should be comfortable with theoretically challenging programming projects.

Students must be comfortable with discrete mathematics.

Course summary and objective

Compilation transforms high-level programming languages into equivalent low-level machine instructions or virtual machine bytecodes.

Modern high-level programming languages are full of powerful features, such as garbage collection, higher-order functions, continuations, macros, templates and advanced type systems.

Compiling such languages efficiently and correctly is the central challenge addressed in this course.

At present, the intent is to focus on compilation of functional programming languages.

Topics

  • Syntactic sugar:
    • Church encodings
  • Advanced techniques in lexing:
    • Derivatives of regular expressions.
  • Advanced techniques in parsing:
    • Parser combinators.
    • Derivatives of parser combinators.
    • Parsing expression grammars.
    • LL(*) parser generators.
    • LR parser generators.
    • Unrestricted CFG parsing.
  • Generative programming features:
    • Mixed-hygiene macros.
    • C++ Template meta-programming.
    • Reflection.
  • Static semantics/type systems:
    • Monomorphic type systems.
    • Polymorphic type systems.
    • Hindley-Milner type systems.
    • Linear and affine type systems.
    • Dependent type systems.
  • Intermediate representations:
    • Administrative normal form (ANF).
    • Continuation-passing style (CPS).
    • Register-transfer languages (RTL).
    • Static single-assignment form (SSA).
  • Advanced interpreter architectures:
    • Reduction interpreters.
    • Denotational interpreters.
    • CESK machines.
  • Program transformations:
    • Closure conversion.
    • Lambda lifting.
    • Defunctionalization.
  • Data layout:
    • Frame layout.
    • Stackless computation.
    • Tagged words.
  • Static analysis:
    • Abstract interpretation.
    • Data-flow analysis.
    • Higher-order control-flow analysis.
  • Optimization:
    • Constant propagation.
    • Constant folding.
    • Inlining.
    • Redundancy elimination.
    • Instruction ordering.
    • Null-check removal.
    • Bounds-check removal.
  • Instruction selection:
    • Maximal munch.
    • Dynamic programming.
    • Selecting for speed v. power.
  • Code generation:
    • Code obfuscation.
    • C as intermediate representation.
  • Register allocation:
    • Sethi-Ullman.
    • Graph-coloring.
  • Garbage collection:
    • Stop-and-copy.
    • Parallel.
    • Real-time.
  • Virtual machines:
    • Just-in-time compilation.
  • Run-time systems.

Materials

Any papers assigned for reading will be provided as pdfs.

Most of the course material will be available in blog posts.

Recommended resources

  • Sussman and Steele's original lambda papers are a classic reference on how to compile functional languages efficiently.
  • Structure and Interpretation of Computer Programs is the classic textbook that explores concepts in computation through the lens of Scheme. For years, it was the introductory freshman CS book at MIT. Chapter 4 describes how to build evaluators and optimizing partial evaluators. (Free online.)
  • How to Design Programs, the Felleisen, Findler, Flatt and Krishnamurthi book on test-driven program development in Scheme. (Free online.)
  • Lisp in Small Pieces is a broad and deep reference on the compilation of the Lisp family of languages. While the emphasis is on Scheme, the techniques in this book work for advanced features of both functional and object-oriented languages.
  • The newly updated Compilers: Principles, Techniques and Tools is the reference book for compiling imperative languages like C.
    Everyone in compilers owns this book or its predecessor.
  • Compiling with Continuations is a classic on compiling with continuation-passing style as a high-level IR, but it is definitely an advanced text:
    Since continuation-passing style is undergoing a renaissance, this book is back in print. Even though it's old, it's still state-of-the-art, because it was written at the zenith of the first continuation-passing style epoch. Many of the techniques in this book are now being rediscovered independently.
  • Scala seamlessly fuses object-oriented and functional programming, and it is a strictly superior replacement for Java. Programming in Scala was co-authored by the language's creator, Martin Odersky:
    The book is a complete reference on the Scala language, and it's a practical tutorial for using Scala for real-world programming. Scala has a rich library and it's compatible with the JVM, which means it interacts cleanly with Java programs and the Java API.
  • The guide on C++ templates and C++ template meta-programming.

Grading

[This may change prior to the start of the course.]

Your final grade is the better of:

  • your average grade on the exams; or
  • your average grade for the projects.
plus any bonus points earned throughout the semester.

In other words, if you get an A average on the exams, then you get an A in the class, no matter how badly you did on the projects. Or, if you get a B average on the projects, but an F average on the exams, then you still get a B for the class.

Collaboration policy

For projects, share anything but source code. Sharing of benchmarks is encouraged.

Bonus

There are bonus opportunities to add percentage points to your final grade:

  • Write or significantly clean and expand a wikipedia article on a topic related to the class. (+2%)
  • Write a peer-review-quality report of a research paper. (+5%)
  • Implement and empirically test an analysis from an unpublished research paper. Contact me to gain access to an archive of eligible unpublished papers. [Warning: This will require mastery of virtually all course material.] (+100%)
  • Propose your own project. (+x%)
  • Any other bonus projects announced in class.
Let me know if any of these bonus opportunities sound interesting to you, and we'll define the requirements before you get started.

Late policy

Late assignments will not be accepted unless an extension has been requested in advance.

Tentative schedule

This schedule is subject to change. Once a pace for the course is established, the rest of the schedule will be fleshed out.

Date Topic

Assignments

These assignments are from the previous version of the course, and are subject to change. The new assignments will be similar in difficulty.

Project 0: A compiler from Scheme to lambda

  • Due: Tuesday, 01 Oct 2013, 6:00pm Utah time

In this project, you will write a compiler from a rich subset of Scheme into the pure lambda calculus. You must use Scala.

In the process, you will review many core techniques in compilation: lexing, parsing, abstract syntax trees and tree transformation.

Your input language is the grammar from this post on compiling up to the lambda calculus:

 <exp> ::= <var>

        |  #t
        |  #f
        |  (if  <exp> <exp> <exp>)
        |  (and <exp> <exp>)
        |  (or  <exp> <exp>)

        |  <nat>
        |  (zero? <exp>)
        |  (- <exp> <exp>)
        |  (= <exp> <exp>)
        |  (+ <exp> <exp>)
        |  (* <exp> <exp>)

        |  <lam>
        |  (let ((<var> <exp>) ...) <exp>)
        |  (letrec ((<var> <lam>)) <exp>)

        |  (cons <exp> <exp>)
        |  (car  <exp>)
        |  (cdr  <exp>)
        |  (pair? <exp>)
        |  (null? <exp>)
        |  (quote ())

        |  (<exp> <exp> ...)

 <lam> ::= (λ (<var> ...) <exp>)

Your output language is the lambda calculus:

 <exp> ::= <var>
        |  (λ (<var>) <exp>)
        |  (<exp> <exp>)

You must use the Church encodings from the aforementioned post, encapsulated in unchurch.rkt.

That is, your code should run in Racket.

It should be possible to run your compiler with:

 $ make run < input.scm > output.lc

If an entire transformed program returns a natural, wrapping it in (natify output) should evaluate to the right number when run in Racket.
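To build intuition for the target encoding: a Church numeral n is the function that applies its argument n times, and "natify" recovers the ordinary number by instantiating that function with successor and zero. A minimal sketch of the idea, in Java purely for illustration (the project itself uses Scala and the Racket encodings from the linked post):

```java
import java.util.function.Function;
import java.util.function.UnaryOperator;

public class Church {
    // A Church numeral n is λf.λx. f(f(...f(x))): apply f to x, n times.
    static Function<UnaryOperator<Integer>, UnaryOperator<Integer>> church(int n) {
        return f -> x -> {
            Integer r = x;
            for (int i = 0; i < n; i++) r = f.apply(r);
            return r;
        };
    }

    // natify: recover the ordinary number by applying the numeral
    // to the successor function and zero.
    static int natify(Function<UnaryOperator<Integer>, UnaryOperator<Integer>> numeral) {
        return numeral.apply(k -> k + 1).apply(0);
    }

    public static void main(String[] args) {
        System.out.println(natify(church(3))); // prints 3
    }
}
```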

You may wish to use the parser from this post.

Turnin

Place your assignment in a tarball so that it expands into a directory with the name of your UID, e.g., u0631296.

The Makefile should be in uNNNNNN/Makefile.

Mail a copy of your project to the TA and Matt with the subject: "CS 6475: Assignment 0"

Project 1: A lexer toolkit

  • Due: Oct 29 2013, 5:00pm Utah time

In this project, you will implement a lexer toolkit based on generalized regular state machines (GRSM). To make this project feasible, it is recommended that you use derivatives of regular expressions. You will use this toolkit to implement a lexer for S-Expressions.

See the course notes for a description of a GRSM and derivatives of regular expressions.
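As a warm-up, the derivative gives a direct matcher: the derivative of r with respect to a character c accepts exactly the strings w such that cw is accepted by r, so a string matches r iff repeatedly deriving by each character leaves a nullable expression. A self-contained, deliberately unoptimized Java sketch (no simplification of derivatives, which the course notes address):

```java
public class Deriv {
    interface Re {}
    record Empty() implements Re {}       // the empty set: matches nothing
    record Blank() implements Re {}       // epsilon: matches only the empty string
    record Ch(char c) implements Re {}    // a single character
    record Seq(Re a, Re b) implements Re {}
    record Alt(Re a, Re b) implements Re {}
    record Rep(Re r) implements Re {}     // Kleene star

    // Does r accept the empty string?
    static boolean nullable(Re r) {
        if (r instanceof Blank || r instanceof Rep) return true;
        if (r instanceof Seq s) return nullable(s.a()) && nullable(s.b());
        if (r instanceof Alt a) return nullable(a.a()) || nullable(a.b());
        return false;  // Empty, Ch
    }

    // The derivative of r with respect to character c.
    static Re d(Re r, char c) {
        if (r instanceof Ch x) return x.c() == c ? new Blank() : new Empty();
        if (r instanceof Seq s) {
            Re left = new Seq(d(s.a(), c), s.b());
            return nullable(s.a()) ? new Alt(left, d(s.b(), c)) : left;
        }
        if (r instanceof Alt a) return new Alt(d(a.a(), c), d(a.b(), c));
        if (r instanceof Rep x) return new Seq(d(x.r(), c), x);
        return new Empty();  // Empty, Blank
    }

    // Match by repeated derivation, then check nullability.
    static boolean matches(Re r, String s) {
        for (char c : s.toCharArray()) r = d(r, c);
        return nullable(r);
    }

    public static void main(String[] args) {
        Re ab = new Rep(new Alt(new Ch('a'), new Ch('b')));  // (a|b)*
        System.out.println(matches(ab, "abba"));  // true
        System.out.println(matches(ab, "abc"));   // false
    }
}
```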

Your toolkit

Lexer-generators like lex accept rules of the following form:

if    pattern pat matches
while in state state
then  perform action action.
 

For example, in lex each rule has the form:
<state>  pat  { action }
 

Your lexer toolkit will need a data structure for encoding these kinds of rules. For example, in Java, you might use:

  class LexRule<A> {
   DA          pattern ;
   Action<A>   action ;
  }

  // DA = Deterministic automaton.
  abstract class DA {
   public abstract DAState initState() ;
  }

  abstract class DAState {
   public abstract boolean isAccepting() ;
   public abstract DAState next(char c) ;
  }

  abstract class Action<A> {
   public abstract A onMatch(String text, LazyList<Character> remainder) ;
  }
 
An individual state in the lexer is a collection of rules, and might be encoded like:
 class LexState<A> {
  // The rules.
  protected List<LexRule<A>> rules ;

  // Called if end of stream is reached without a match.
  protected A onMatchFail(String unmatchableText) { ... }

  public LexState(...) { ... }

  // Executes the action of the rule with the longest match.
  public A match(LazyList<Character> input) {
    ...
  }
 }
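The match method must implement the longest-match ("maximal munch") rule: among all rules, fire the action of the one that accepts the longest prefix of the input, with earlier rules winning ties, as in lex. A simplified, runnable sketch of that selection logic (hypothetical names; in the toolkit proper each matcher would be a DAState driven character by character):

```java
import java.util.List;
import java.util.function.Function;

public class MaximalMunch {
    // A simplified rule: a matcher reporting the longest prefix it accepts
    // (or -1 for no match), plus an action to run on the matched text.
    record Rule<A>(Function<String, Integer> longestPrefix,
                   Function<String, A> action) {}

    static <A> A match(List<Rule<A>> rules, String input) {
        int bestLen = -1;
        Rule<A> best = null;
        for (Rule<A> r : rules) {          // earlier rules win ties
            int len = r.longestPrefix().apply(input);
            if (len > bestLen) { bestLen = len; best = r; }
        }
        if (best == null) throw new IllegalStateException("no rule matched");
        return best.action().apply(input.substring(0, bestLen));
    }

    public static void main(String[] args) {
        Rule<String> ident = new Rule<>(
            s -> { int i = 0;
                   while (i < s.length() && Character.isLetter(s.charAt(i))) i++;
                   return i == 0 ? -1 : i; },
            text -> "(symbol " + text + ")");
        Rule<String> number = new Rule<>(
            s -> { int i = 0;
                   while (i < s.length() && Character.isDigit(s.charAt(i))) i++;
                   return i == 0 ? -1 : i; },
            text -> "(integer " + text + ")");
        System.out.println(match(List.of(ident, number), "foo42"));  // (symbol foo)
    }
}
```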
 
Next, you will need an embedding of regular expressions to represent patterns. You don't need to parse regular expressions from an input source. Instead, patterns will be encoded directly as the abstract syntax trees of regular expressions. To avoid tedium, this is a good place to embed a domain-specific language for regular expressions. At a minimum, you'll want a structure that looks like:
 abstract class RegExp {
  // All regexps need to be convertible to DA's for matching.
  public abstract DA toDA() ;
 }

 // Empty set:
 class NullSet extends RegExp {
 }

 // Empty string:
 class Blank extends RegExp {
 }

 // Single character:
 class Char extends RegExp {
  public Char(char c) { ... } 
 }
 
 // r1 r2 ==> new Seq(r1,r2)
 class Seq extends RegExp {
  public Seq (RegExp first, RegExp second) { ... }
 }

 // r1 | r2 ==> new Alt(r1,r2)
 class Alt extends RegExp {
  public Alt (RegExp choice1, RegExp choice2) { ... }
 }

 // r* ==> new Rep(r)
 class Rep extends RegExp {
  public Rep (RegExp pat) { ... }
 }
 

Lexical specification for S-Expressions

S-Expressions have comments, whitespace, parentheses, quoting symbols, symbols, integers, characters, booleans and strings.

  • Comments begin with a semicolon ; and end in a newline, and should be ignored.
  • A character sequence starting with #! and ending with !# should also be ignored as a comment. These "hash-bang" comments are allowed to properly nest, so that #! foo #! bar !# baz !# is a comment.
  • Whitespace can be a newline, carriage return, space or tab character, and should be ignored.
  • A left parenthesis ( should be lexed as the token (left-parenthesis).
  • A right parenthesis ) should be lexed as the token (right-parenthesis).
  • A left bracket [ should be lexed as the token (left-bracket).
  • A right bracket ] should be lexed as the token (right-bracket).
  • A backtick quote ` should be lexed as (backtick).
  • A regular quote ' should be lexed as (tick).
  • A comma , should be lexed as (comma).
  • An unsplicing comma ,@ should be lexed as (comma-unsplice).
  • An integer (-)?[0-9]+ should be lexed as (integer integer-value)
  • A character has the form #\char-name, and should be lexed as (char char-name); char-name may be any character allowed in integers and symbols, and uses the following mapping otherwise:
    character    char-name
    space        space
    newline      newline
    tab          tab
    (            left-parenthesis
    )            right-parenthesis
    [            left-bracket
    ]            right-bracket
    `            backtick
    '            tick
    "            doublequote
    ,            comma
    For example, the token #\" would be lexed as (char doublequote).
  • A string begins with " and ends with ". The double quote character may appear in a string by being escaped with \. The \ character may be escaped with itself. The sequence \n is interpreted as a newline character. Strings may be spread across multiple lines. A string should be lexed as (text (character-list)), where character-list should contain the characters inside the string based on zero or more repetitions of (char char-name). For example:
    "foo
    bar"
    would lex into (text ((char f) (char o) (char o) (char newline) (char b) (char a) (char r))).
  • A boolean is either #t or #f, and should be lexed as (boolean boolean-value), where boolean-value is either true or false.
  • A symbol matches [-A-Za-z0-9+/*_?%$&^=!@<>:]+, and should be lexed as (symbol (character-list)).
For this project, the lexer will just output a sequence of tokens to stdout, separated by newlines. For example:
  #!/bin/bash
  scheme $0 $@ #! run scheme      !#
  exit         #! quit the script !#
  !#

  (define (id x)
   ; Return the result.
    x)

  (id 3)
  
would produce:
  (left-parenthesis)
  (symbol ((char d) (char e) (char f) (char i) (char n) (char e)))
  (left-parenthesis)
  (symbol ((char i) (char d)))
  (symbol ((char x)))
  (right-parenthesis)
  (symbol ((char x)))
  (right-parenthesis)
  (left-parenthesis)
  (symbol ((char i) (char d)))
  (integer 3)
  (right-parenthesis)
  

Deliverables

  • A short README document describing the architecture of your lexer toolkit, including what works, what doesn't work, and what almost works. To cover your a**, include any reasonable assumptions you made about the project description.
  • Readable, well-commented code.
  • A script called run-sx.sh, so that when I cd to the directory containing run-sx.sh, and run sh run-sx.sh < filename.sx, it produces a sequence of tokens for S-Expressions.
  • Email me and the TA a zip file that, when unzipped, produces the following directory structure:
        ./lastname-firstname/README
        ./lastname-firstname/run-sx.sh
    If you could put your code in ./lastname-firstname/src/, I would appreciate it. The TA will confirm receipt of submissions within 24 hours (but probably much sooner than that). Please use the subject CS 6475: Project 1 Submission.

Resources

Restrictions

  • You can't use a lexer generator for this.
  • If you share code, you fail the project. Keep in mind that I will be reading your code, and I have caught cheaters in the past.

Project 2: A small-step static analyzer

  • Due: 12 Dec 2013, 5:00pm

In this project, you will create a static analyzer for a simple higher-order language. Specifically, you will implement the small-step formulation of Shivers's original 0CFA.

The language for which you will construct the analyzer is continuation-passing style:

 <aexp> ::= (lambda (<var> ...) <cexp>)
         |  <var>
         |  halt

 <cexp> ::=  (<aexp> <aexp> ... ) 
 

You should be able to run the analyzer with a shell script 0cfa like so:

 $ ./0cfa program variable
 

This should perform 0CFA on program and then report the lambda terms which may flow to the variable variable, one per line.

For example, given the program:

 ((lambda (f) (f f))
  (lambda (g) (g g)))
 

Running ./0cfa omega.lc f should produce:

 (lambda (g) (g g))
 

while running ./0cfa omega.lc g should produce:

 (lambda (g) (g g)) 
 

Read more on 0CFA.

Hint: In 0CFA, the allocation function should return the variable name as the address.
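The heart of the analysis is a fixpoint over an abstract store mapping each variable (its own address, per the hint) to the set of lambdas that may flow to it. A toy sketch of that fixpoint in Java, hardcoding the omega-like example from the text; for simplicity it analyzes every call site in the program rather than only reachable ones, which is a coarser but still sound over-approximation:

```java
import java.util.*;

public class ZeroCFA {
    // CPS-language AST: atomic expressions and call sites.
    interface Exp {}
    record Var(String name) implements Exp {}
    record Lam(List<String> params, Call body) implements Exp {}
    record Call(Exp fun, List<Exp> args) {}

    // The abstract store: in 0CFA the address of a binding is just the
    // variable's name, so this maps variables to sets of lambdas.
    static final Map<String, Set<Lam>> store = new HashMap<>();

    static Set<Lam> eval(Exp e) {
        if (e instanceof Lam l) return Set.of(l);
        if (e instanceof Var v) return store.getOrDefault(v.name(), Set.of());
        return Set.of();
    }

    // Collect every call site in the program.
    static void collect(Call c, List<Call> out) {
        out.add(c);
        List<Exp> exps = new ArrayList<>(c.args());
        exps.add(c.fun());
        for (Exp e : exps) if (e instanceof Lam l) collect(l.body(), out);
    }

    public static void main(String[] args) {
        // ((lambda (f) (f f)) (lambda (g) (g g)))
        Lam outer = new Lam(List.of("f"),
                            new Call(new Var("f"), List.of((Exp) new Var("f"))));
        Lam inner = new Lam(List.of("g"),
                            new Call(new Var("g"), List.of((Exp) new Var("g"))));
        Call program = new Call(outer, List.of((Exp) inner));

        List<Call> calls = new ArrayList<>();
        collect(program, calls);

        // Flow arguments to parameters at every call site until no change.
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Call c : calls)
                for (Lam f : List.copyOf(eval(c.fun())))
                    for (int i = 0; i < f.params().size() && i < c.args().size(); i++) {
                        Set<Lam> flow = Set.copyOf(eval(c.args().get(i)));
                        Set<Lam> dest = store.computeIfAbsent(f.params().get(i),
                                                              k -> new HashSet<>());
                        if (dest.addAll(flow)) changed = true;
                    }
        }
        // Only (lambda (g) (g g)) flows to f, matching the expected output.
        System.out.println(store.get("f").equals(Set.of(inner)));  // true
    }
}
```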

Project 3: Evaluators, partial-evaluator and macros

  • Due: Thu 29 Oct 2013, 11:59:59pm

In this project, you will implement an evaluator for a small subset of the Scheme language extended with first-class macros.

The BNF grammar for the language is:

<program> ::= <body>

<body> ::= <def>*  <exp>* 

<def> ::= (define (<var> <var>*) <body>)
       |  (define <var> <exp>)

<exp> ::= <literal>
       |  <variable>
       |  (if <exp> <exp> <exp>)
       |  (if <exp> <exp>)
       |  (and <exp>*)
       |  (or <exp>*)
       |  (cond <cond-clause>*)
       |  (set! <var> <exp>)
       |  (lambda <formals> <body>)
       |  (let <bindings> <body>)
       |  (letrec <bindings> <body>)
       |  (begin <body>)
       |  (macro <exp>)
       |  (<operator> <operand>*)
       |  (<keyword> <s-exp>*)

<literal> ::= (quote <s-exp>) | '<s-exp>
           |  <boolean>
           |  <string>
           |  <var>

<s-exp> ::= <symbol>
         |  <boolean>
         |  <number>
         |  <string>
         |  (<s-exp>*)

<cond-clause> ::= (<exp> <exp>+)
               |  (<exp>)
               |  (<exp> => <exp>)
               |  (else <exp>)

<formals> ::= (<var>*)
            |  <var>

<bindings> ::= ((<var> <exp>)*)

<operator> ::= <exp>

<operand> ::= <exp>

The initial environment for the evaluator should contain bindings for the following symbols:

 apply + not display newline cons car cdr cadr caadr cadar
 cddr cdddr list null? pair? list? number? string? symbol? eq? = gensym

First-class macros

The language should behave like R5RS Scheme except for the (macro ...) construct, which creates first-class macros.

Macros in Scheme/Lisp-like languages extend the syntax of the language by re-writing expressions whose "head" is a macro and then performing evaluation.

For this project, the expression (macro e) is equivalent to (list 'macro e).

When the evaluator encounters an application point:

  (e s-exp1 ... s-expn)
if the expression e evaluates to the s-expression (macro f), then the syntax of the application point is transformed into the result of f(s-exp1, ..., s-expn), and then re-evaluated.
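That dispatch can be sketched as follows, with a stand-in evaluator and hypothetical names (the real evaluator must of course handle the full expression grammar): when the operator evaluates to a macro value, the transformer is applied to the raw, unevaluated operand s-expressions, and the rewritten syntax is then evaluated in turn.

```java
import java.util.List;
import java.util.function.Function;

public class FirstClassMacros {
    // A first-class macro value: a transformer over raw s-expressions.
    record MacroVal(Function<List<Object>, Object> transform) {}

    // Stand-in evaluator: atoms evaluate to themselves.
    static Object eval(Object sexp) { return sexp; }

    // Application: if the operator is a macro, rewrite the syntax and
    // re-evaluate it; ordinary procedure application is elided here.
    static Object apply(Object op, List<Object> rands) {
        if (op instanceof MacroVal m) {
            Object rewritten = m.transform().apply(rands);  // syntax -> syntax
            return eval(rewritten);                         // then re-evaluate
        }
        throw new UnsupportedOperationException("ordinary application elided");
    }

    public static void main(String[] args) {
        // A macro that rewrites (swap a b) into (b a):
        MacroVal swap = new MacroVal(r -> List.of(r.get(1), r.get(0)));
        System.out.println(apply(swap, List.of("x", "y")));  // [y, x]
    }
}
```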

Resources

Advice

  • Use the same data structure to represent run-time values that you use to represent syntax.
  • Even if you don't know Scheme, I recommend doing this project in Scheme. (It's about 500 lines of well-written, well-abstracted Scheme.)
  • If you do use Scala, make heavy use of custom patterns.

Turn-in

Details to come.

Project 4: Compiling with continuations

  • Due: 10 Dec 2013, 11:59:59pm

In this project, you will implement a Scheme-to-CPS-to-C compiler.

Late policy

The usual late policy does not apply for this project: 1 day late = 10% off, 2 days late = 20% off, 3 days late = 40% off, 4 days late = 80% off.

Grading

The project will be self-graded with a random audit of a third of submissions. Once all projects are in, a test suite will be released.

Input language

The BNF grammar for the input language is:

<program> ::= <body>

<body> ::= <def>*  <exp>* 

<def> ::= (define (<var> <var>*) <body>)
       |  (define <var> <exp>)

<exp> ::= <literal>
       |  <variable>
       |  (if <exp> <exp> <exp>)
       |  (if <exp> <exp>)
       |  (and <exp>*)
       |  (or <exp>*)
       |  (cond <cond-clause>*)
       |  (set! <var> <exp>)
       |  (lambda (<var>*) <body>)
       |  (let <bindings> <body>)
       |  (letrec <bindings> <body>)
       |  (begin <body>)
       |  (<operator> <operand>*)

<literal> ::= (quote <symbol>) | '<symbol>
           |  <integer>
           |  <boolean>
           |  <string>
           |  <var>

<cond-clause> ::= (<exp> <exp>+)
               |  (<exp>)
               |  (<exp> => <exp>)
               |  (else <exp>)

<bindings> ::= ((<var> <exp>)*)

<operator> ::= <exp>

<operand> ::= <exp>

You should support the following primitives:

 display newline + * - / = < call/cc

You do not have to support multi-precision arithmetic. 32-bit or 64-bit integers are OK.

Desugared language

It is recommended that you desugar into this language:

 
 <exp> ::= <literal>
        |  <variable>
        |  (if <exp> <exp> <exp>)
        |  (set! <var> <exp>)
        |  (lambda (<var>*) <body>)
        |  (begin <body>)
        |  (<exp> <exp>*)

CPS language

The final CPS language should be:

 
 <exp> ::= <literal>
        |  <variable>
        |  (lambda <formals> <call>)

 <call> ::= (if <exp> <call> <call>)
         |  (begin (set! <var> <exp>) <call>)
         |  (<exp> <exp>*)

Passing the flag --output=cps to the compiler should produce this language.

It is worth noting that the output CPS language allows variable-arity lambdas, but the input language does not. (This is useful for desugaring begin expressions.)

Your output language can use the following primitives:

 display-cps newline-cps +-cps *-cps --cps /-cps =-cps <-cps halt

These primitives take the current continuation as their *last* argument.
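The continuation-last convention can be illustrated in Java (hypothetical names mirroring +-cps and *-cps): each primitive computes its result and hands it to the continuation instead of returning it.

```java
import java.util.function.IntConsumer;

public class CpsPrims {
    // +-cps: compute a + b, then pass the result to the continuation k.
    static void addCps(int a, int b, IntConsumer k) { k.accept(a + b); }

    // *-cps: same convention.
    static void mulCps(int a, int b, IntConsumer k) { k.accept(a * b); }

    public static void main(String[] args) {
        // A CPS rendering of (display (+ (* 2 3) 4)):
        mulCps(2, 3, t1 ->
            addCps(t1, 4, t2 ->
                System.out.println(t2)));   // prints 10
    }
}
```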

C output

For the C output, you have two implementation choices: tail-calls and first-class labels. Tail calls are ANSI C; first-class labels are implementation-specific. I will only require that your code compiles with GCC.

You are encouraged to use a closure conversion technique similar to the one found in this blog post. Of course, when you closure convert in CPS, a label will do instead of a function pointer.
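The essence of closure conversion, sketched in Java under assumed names: a lambda with free variables becomes a pair of a code pointer and an environment record holding the captured values, and the code receives the closure itself as an extra argument so it can reach that environment.

```java
public class ClosureConv {
    // A converted closure: code pointer plus captured environment.
    interface Code { int apply(Closure self, int arg); }
    record Closure(Code code, int[] env) {
        int call(int arg) { return code.apply(this, arg); }
    }

    // (lambda (n) (lambda (x) (+ x n))) after closure conversion:
    // the inner lambda's free variable n lives at env[0].
    static Closure makeAdder(int n) {
        return new Closure((self, x) -> x + self.env()[0], new int[]{ n });
    }

    public static void main(String[] args) {
        Closure add5 = makeAdder(5);
        System.out.println(add5.call(37)); // prints 42
    }
}
```

In the CPS-to-C setting the code pointer becomes a label (or function compiled for tail calls) rather than a Java-style method reference.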

Recommendations

  • You can exploit a lot of existing code.
  • Feel free to use more than one implementation language.
  • If you get stuck, ask questions!
  • There are several ways to compile CPS to C.
  • Compile a few examples by hand before you write any code.

Resources