Advanced Topics in Compilation
Instructor | Matt Might |
TA | Thomas Gilray |
Catalog num. | CS 6475, 16441 |
Time | Mon. and Wed., 11:50 AM - 1:10 PM |
Office hours | After class, or by appointment (set one up by email). |
TA office hours | Tuesday/Thursday, noon - 2pm (MEB 3375) |
Location | WEB L126 |
Google group | Join here! |
Slides | |
Live code | live code |
Notices
Disclaimer: This syllabus is tentative and subject to change.
Prerequisites
Students should be comfortable with theoretically challenging programming projects.
Students must be comfortable with discrete mathematics.
Course summary and objective
Compilation transforms high-level programming languages into equivalent low-level machine instructions or virtual machine bytecodes.
Modern high-level programming languages are full of powerful features, such as garbage collection, higher-order functions, continuations, macros, templates and advanced type systems.
Compiling such languages efficiently and correctly is the central challenge addressed in this course.
At present, the intent is to focus on compilation of functional programming languages.
Topics
- Syntactic sugar:
- Church encodings.
- Advanced techniques in lexing:
- Derivatives of regular expressions.
- Advanced techniques in parsing:
- Parser combinators.
- Derivatives of parser combinators.
- Parsing expression grammars.
- LL(*) parser generators.
- LR parser generators.
- Unrestricted CFG parsing.
- Generative programming features:
- Mixed-hygiene macros.
- C++ Template meta-programming.
- Reflection.
- Static semantics/type systems:
- Monomorphic type systems.
- Polymorphic type systems.
- Hindley-Milner type systems.
- Linear and affine type systems.
- Dependent type systems.
- Intermediate representations:
- Administrative normal form (ANF).
- Continuation-passing style (CPS).
- Register-transfer languages (RTL).
- Static single-assignment form (SSA).
- Advanced interpreter architectures:
- Reduction interpreters.
- Denotational interpreters.
- CESK machines.
- Program transformations:
- Closure conversion.
- Lambda lifting.
- Defunctionalization.
- Data layout:
- Frame layout.
- Stackless computation.
- Tagged words.
- Static analysis:
- Abstract interpretation.
- Data-flow analysis.
- Higher-order control-flow analysis.
- Optimization:
- Constant propagation.
- Constant folding.
- Inlining.
- Redundancy elimination.
- Instruction ordering.
- Null-check removal.
- Bounds-check removal.
- Instruction selection:
- Maximal munch.
- Dynamic programming.
- Selecting for speed v. power.
- Code generation:
- Code obfuscation.
- C as intermediate representation.
- Register allocation:
- Sethi-Ullman.
- Graph-coloring.
- Garbage collection:
- Stop-and-copy.
- Parallel.
- Real-time.
- Virtual machines:
- Just-in-time compilation.
- Run-time systems.
Materials
Any papers assigned for reading will be provided as pdfs.
Most of the course material will be available in blog posts.
Recommended resources
- Sussman and Steele's original lambda papers are a classic reference on how to compile functional languages efficiently.
- Structure and Interpretation of Computer Programs is the classic textbook that explores concepts in computation through the lens of Scheme. For years, it was the introductory freshman CS book at MIT. Chapter 4 describes how to build evaluators and optimizing partial evaluators. (Free online.)
- How to Design Programs, the Felleisen, Findler, Flatt and Krishnamurthi book on test-driven program development in Scheme. (Free online.)
- Lisp in Small Pieces is a broad and deep reference on the compilation of the Lisp family of languages. While the emphasis is on Scheme, the techniques in this book work for advanced features of both functional and object-oriented languages.
- The newly updated Compilers: Principles, Techniques and Tools is the reference book for compiling imperative languages like C. Everyone in compilers owns this book or its predecessor.
- Compiling with Continuations is a classic on compiling with continuation-passing style as a high-level IR, but it is definitely an advanced text. Since continuation-passing style is undergoing a renaissance, this book is back in print. Even though it's old, it's still state-of-the-art, because it was written at the zenith of the first continuation-passing style epoch. Many of the techniques in this book are now being rediscovered independently.
- Scala seamlessly fuses object-oriented and functional programming, and it is a strictly superior replacement for Java. Programming in Scala was co-authored by the language's creator, Martin Odersky. The book is a complete reference on the Scala language, and it's a practical tutorial for using Scala for real-world programming. Scala has a rich library and it's compatible with the JVM, which means it interacts cleanly with Java programs and the Java API.
- The guide on C++ templates and C++ template meta-programming.
Grading
[This may change prior to the start of the course.]
Your final grade is the better of:
- your average grade on the exams; or
- your average grade for the projects.
In other words, if you get an A average on the exams, then you get an A in the class, no matter how badly you did on the projects. Or, if you get a B average on the projects, but an F average on the exams, then you still get a B for the class.
Collaboration policy
For projects, share anything but source code. Sharing of benchmarks is encouraged.
Bonus
There are bonus opportunities to add percentage points to your final grade:
- Write, or significantly clean up and expand, a Wikipedia article on a topic related to the class. (+2%)
- Write a peer-review-quality report of a research paper. (+5%)
- Implement and empirically test an analysis from an unpublished research paper. Contact me to gain access to an archive of eligible unpublished papers. [Warning: This will require mastery of virtually all course material.] (+100%)
- Propose your own project. (+x%)
- Any other bonus projects announced in class.
Late policy
Late assignments will not be accepted without requesting an extension.
Tentative schedule
This schedule is subject to change. Once a pace for the course is established, the rest of the schedule will be fleshed out.
Date | Topic |
Assignments
These assignments are from the previous version of the course, and are subject to change. The new assignments will be similar in difficulty.
Project 0: A compiler from Scheme to lambda
- Due: Tuesday, 01 Oct 2013, 6:00pm Utah time
In this project, you will write a compiler from a rich subset of Scheme into the pure lambda calculus. You must use Scala.
In the process, you will review many core techniques in compilation: lexing, parsing, abstract syntax trees and tree transformation.
Your input language is given by the grammar from this post on compiling up to the lambda calculus:
<exp> ::= <var>
       |  #t | #f
       |  (if <exp> <exp> <exp>)
       |  (and <exp> <exp>)
       |  (or <exp> <exp>)
       |  <nat>
       |  (zero? <exp>)
       |  (- <exp> <exp>)
       |  (= <exp> <exp>)
       |  (+ <exp> <exp>)
       |  (* <exp> <exp>)
       |  <lam>
       |  (let ((<var> <exp>) ...) <exp>)
       |  (letrec ((<var> <lam>)) <exp>)
       |  (cons <exp> <exp>)
       |  (car <exp>)
       |  (cdr <exp>)
       |  (pair? <exp>)
       |  (null? <exp>)
       |  (quote ())
       |  (<exp> <exp> ...)

<lam> ::= (λ (<var> ...) <exp>)
Your output language is the lambda calculus:
<exp> ::= <var> | (λ (<var>) <exp>) | (<exp> <exp>)
You must use the Church encodings from the aforementioned post, encapsulated in unchurch.rkt.
That is, your compiler's output should run in Racket.
It should be possible to run your compiler with:
$ make run < input.scm > output.lc
If an entire transformed program returns a natural, then wrapping the output in (natify output) should evaluate to the right number when run in Racket.
You may wish to use the parser from this post.
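Before diving into Scala, it can help to pin down what one or two of the compilation rules look like. Below is a minimal sketch, assuming a small hand-rolled AST for the output language; the names, and this particular thunked encoding of booleans and if, are illustrative and may differ from the encodings in the post and in unchurch.rkt, which are the ones the project actually requires.

sealed trait Exp
case class Var(name: String) extends Exp
case class Lam(param: String, body: Exp) extends Exp
case class App(f: Exp, arg: Exp) extends Exp

object Church {
  // Church booleans: #t = (λ (t) (λ (f) t)), #f = (λ (t) (λ (f) f)).
  val TRUE: Exp  = Lam("t", Lam("f", Var("t")))
  val FALSE: Exp = Lam("t", Lam("f", Var("f")))

  // (if c t e): select between two thunks so that, under call-by-value,
  // only the chosen branch is evaluated, then force the selected thunk.
  // (A real compiler should generate fresh names instead of "_".)
  def compileIf(cond: Exp, thn: Exp, els: Exp): Exp =
    App(App(App(cond, Lam("_", thn)), Lam("_", els)), Lam("x", Var("x")))

  // Pretty-print into the output language's concrete syntax.
  def show(e: Exp): String = e match {
    case Var(x)    => x
    case Lam(x, b) => s"(λ ($x) ${show(b)})"
    case App(f, a) => s"(${show(f)} ${show(a)})"
  }
}

For instance, compileIf(TRUE, Var("a"), Var("b")) builds a term that reduces to a under call-by-value.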
Turnin
Place your assignment in a tarball that expands into a directory named after your UID, e.g., u0631296. The Makefile should be at uNNNNNN/Makefile.
Mail a copy of your project to the TA and Matt with the subject: "CS 6475: Assignment 0"
Project 1: A lexer toolkit
- Due: Oct 29 2013, 5:00pm Utah time
In this project, you will implement a lexer toolkit based on generalized regular state machines (GRSM). To make this project feasible, it is recommended that you use derivatives of regular expressions. You will use this toolkit to implement a lexer for S-Expressions.
See the course notes for a description of a GRSM and derivatives of regular expressions.
Your toolkit
Lexer generators like lex accept rules of the following form: if pattern pat matches while in state state, then perform action action. For example, in lex each rule has the form:
<state> pat { action }
Your lexer toolkit will need a data structure for encoding these kinds of rules. For example, in Java, you might use:
class LexRule<A> {
  DA pattern ;
  Action<A> action ;
}

// DA = Deterministic automaton.
abstract class DA {
  public abstract DAState initState() ;
}

abstract class DAState {
  public abstract boolean isAccepting() ;
  public abstract DAState next(char c) ;
}

abstract class Action<A> {
  public abstract A onMatch(String text, LazyList<Character> remainder) ;
}

An individual state in the lexer is a collection of rules, and might be encoded like:

abstract class LexState<A> {
  // The rules.
  protected LexRule<A>[] rules ;

  // Called if end of stream is reached without a match.
  protected abstract A onMatchFail(String unmatchableText) ;

  public LexState(...) { ... }

  // Executes the action of the rule with the longest match.
  public A match(LazyList<Character> input) { ... }
}

Next, you will need an embedding of regular expressions to represent patterns. You don't need to parse regular expressions from an input source. Instead, patterns will be encoded directly as the abstract syntax trees of regular expressions. To avoid tedium, this is a good place to embed a domain-specific language for regular expressions. At a minimum, you'll want a structure that looks like:

abstract class RegExp {
  // All regexps need to be convertible to DA's for matching.
  public abstract DA toDA() ;
}

// Empty set:
class NullSet extends RegExp { }

// Empty string:
class Blank extends RegExp { }

// Single character:
class Char extends RegExp {
  public Char(char c) { ... }
}

// r1 r2 ==> new Seq(r1,r2)
class Seq extends RegExp {
  public Seq (RegExp first, RegExp second) { ... }
}

// r1 | r2 ==> new Alt(r1,r2)
class Alt extends RegExp {
  public Alt (RegExp choice1, RegExp choice2) { ... }
}

// r* ==> new Rep(r)
class Rep extends RegExp {
  public Rep (RegExp pat) { ... }
}
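For the pattern-matching core, a matcher based on derivatives of regular expressions is only a few dozen lines. Here is a sketch in Scala (mirroring the Java classes above; Seq is renamed Cat to avoid clashing with scala.Seq, and all names are illustrative) of the two functions everything else hangs off of: nullability and the derivative with respect to a character.

sealed trait Re
case object NullSet extends Re                    // the empty language
case object Blank extends Re                      // the empty string
case class Chr(c: Char) extends Re                // a single character
case class Cat(first: Re, second: Re) extends Re  // concatenation
case class Alt(a: Re, b: Re) extends Re           // alternation
case class Rep(r: Re) extends Re                  // Kleene star

object Derivatives {
  // Does r accept the empty string?
  def nullable(r: Re): Boolean = r match {
    case NullSet   => false
    case Blank     => true
    case Chr(_)    => false
    case Cat(a, b) => nullable(a) && nullable(b)
    case Alt(a, b) => nullable(a) || nullable(b)
    case Rep(_)    => true
  }

  // The derivative of r with respect to c: the language of strings w
  // such that c followed by w is in the language of r.
  def derive(c: Char, r: Re): Re = r match {
    case NullSet   => NullSet
    case Blank     => NullSet
    case Chr(x)    => if (x == c) Blank else NullSet
    case Cat(a, b) =>
      val d = Cat(derive(c, a), b)
      if (nullable(a)) Alt(d, derive(c, b)) else d
    case Alt(a, b) => Alt(derive(c, a), derive(c, b))
    case Rep(x)    => Cat(derive(c, x), Rep(x))
  }

  // A string matches r iff deriving by each character in turn leaves a nullable regex.
  def matches(s: String, r: Re): Boolean =
    nullable(s.foldLeft(r)((re, ch) => derive(ch, re)))
}

For example, matches("aab", Cat(Rep(Chr('a')), Chr('b'))) is true. A DAState can then be represented by a regular expression, with next(c) taking the derivative and isAccepting() testing nullability; memoizing the states that arise yields the deterministic automaton.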
Lexical specification for S-Expressions
S-Expressions have comments, whitespace, parentheses, quoting symbols, symbols, integers, characters, booleans and strings.
- Comments begin with a semicolon ; and end in a newline, and should be ignored.
- A character sequence starting with #! and ending with !# should also be ignored as a comment. These "hash-bang" comments are allowed to properly nest, so that #! foo #! bar !# baz !# is a comment.
- Whitespace can be a newline, carriage return, space or tab character, and should be ignored.
- A left parenthesis ( should be lexed as the token (left-parenthesis).
- A right parenthesis ) should be lexed as the token (right-parenthesis).
- A left bracket [ should be lexed as the token (left-bracket).
- A right bracket ] should be lexed as the token (right-bracket).
- A backtick quote ` should be lexed as (backtick).
- A regular quote ' should be lexed as (tick).
- A comma , should be lexed as (comma).
- An unsplicing comma ,@ should be lexed as (comma-unsplice).
- An integer (-)?[0-9]+ should be lexed as (integer integer-value).
- A character has the form #\char-name, and should be lexed as (char char-name); char-name may be any character allowed in integers and symbols, and otherwise uses the following mapping:

  character | char-name |
  space | space |
  newline | newline |
  tab | tab |
  ( | left-parenthesis |
  ) | right-parenthesis |
  [ | left-bracket |
  ] | right-bracket |
  ` | backtick |
  ' | tick |
  " | doublequote |
  , | comma |

  For example, #\" would be lexed as (char doublequote).
- A string begins with " and ends with ". The double quote character may appear in a string by being escaped with \. The \ character may be escaped with itself. The sequence \n is interpreted as a newline character. Strings may be spread across multiple lines. A string should be lexed as (text (character-list)), where character-list contains the characters inside the string as zero or more repetitions of (char char-name). For example:

  "foo
  bar"

  would lex into (text ((char f) (char o) (char o) (char newline) (char b) (char a) (char r))).
- A boolean is either #t or #f, and should be lexed as (boolean boolean-value), where boolean-value is either true or false.
- A symbol matches [-A-Za-z0-9+/*_?%$&^=!@<>:]+, and should be lexed as (symbol (character-list)).

For example, lexing this file:

#!/bin/bash
scheme $0 $@ #! run scheme !#
exit #! quit the script !#
!#
(define (id x)
  ; Return the result.
  x)
(id 3)

would produce:

(left-parenthesis)
(symbol ((char d) (char e) (char f) (char i) (char n) (char e)))
(left-parenthesis)
(symbol ((char i) (char d)))
(symbol ((char x)))
(right-parenthesis)
(symbol ((char x)))
(right-parenthesis)
(left-parenthesis)
(symbol ((char i) (char d)))
(integer 3)
(right-parenthesis)
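To connect the specification to the toolkit, here is a sketch of how a couple of the rules above (integers and symbols) might be written with a thin combinator layer over regular-expression syntax trees. Everything here is illustrative, in Scala, and not part of any required interface.

// The same regular-expression AST as in the derivative sketch above.
sealed trait Re
case object NullSet extends Re
case object Blank extends Re
case class Chr(c: Char) extends Re
case class Cat(first: Re, second: Re) extends Re
case class Alt(a: Re, b: Re) extends Re
case class Rep(r: Re) extends Re

object SxPatterns {
  def oneOf(cs: Seq[Char]): Re = cs.map(Chr(_)).reduceLeft[Re](Alt(_, _))
  def cat(rs: Re*): Re         = rs.reduceLeft(Cat(_, _))
  def opt(r: Re): Re           = Alt(r, Blank)
  def plus(r: Re): Re          = Cat(r, Rep(r))   // one or more repetitions

  // (-)?[0-9]+  ==>  (integer integer-value)
  val integer: Re = cat(opt(Chr('-')), plus(oneOf('0' to '9')))

  // [-A-Za-z0-9+/*_?%$&^=!@<>:]+  ==>  (symbol (character-list))
  val symbolChars: Seq[Char] =
    "-+/*_?%$&^=!@<>:".toSeq ++ ('A' to 'Z') ++ ('a' to 'z') ++ ('0' to '9')
  val symbol: Re = plus(oneOf(symbolChars))

  // Actions turn the matched lexeme into a token, following the spec above.
  def integerToken(text: String): String = s"(integer ${text.toInt})"
  def symbolToken(text: String): String =
    "(symbol (" + text.map(c => s"(char $c)").mkString(" ") + "))"
}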
Deliverables
- A short README document describing the architecture of your lexer toolkit, including what works, what doesn't work, and what almost works. To cover your a**, include any reasonable assumptions you made about the project description.
- Readable, well-commented code.
- A script called run-sx.sh, so that when I cd to the directory containing run-sx.sh and run sh run-sx.sh < filename.sx, it produces a sequence of tokens for S-Expressions.
- Email me and the TA a zip file that, when unzipped, produces the following directory structure:

  ./lastname-firstname/README
  ./lastname-firstname/run-sx.sh

  If you could put your code in ./lastname-firstname/src/, I would appreciate it. The TA will confirm receipt of submissions within 24 hours (but probably much sooner than that). Please use the subject CS 6475: Project 1 Submission.
Resources
- Lazy-list-based streams in Scala.
- Regular expressions to NFAs with an embedded language in Java.
- Matching regular expressions with derivatives in Scheme.
- A reference implementation is available. To run it:

  scala -cp bin SXLexer < filename.sx
Restrictions
- You can't use a lexer generator for this.
- If you share code, you fail the project. Keep in mind that I will be reading your code, and I have caught cheaters in the past.
Project 2: A small-step static analyzer
- Due: 12 Dec 2013, 5:00pm
In this project, you will create a static analyzer for a simple higher-order language. Specifically, you will implement the small-step formulation of Shivers's original 0CFA.
The language for which you will construct the analyzer is in continuation-passing style:
<aexp> ::= (lambda (<var> ...) <cexp>)
        |  <var>
        |  halt

<cexp> ::= (<aexp> <aexp> ...)
You should be able to run the analyzer with a shell script 0cfa, like so:
$ ./0cfa program variable
This should perform 0CFA on program and then report the lambda terms which may flow to the variable variable, one per line.
For example, given the program (saved as omega.lc):
((lambda (f) (f f)) (lambda (g) (g g)))
Running ./0cfa omega.lc f
should produce:
(lambda (g) (g g))
while running ./0cfa omega.lc g
should produce:
(lambda (g) (g g))
Hint: In 0CFA, the allocation function should return the variable name as the address.
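To make the hint concrete, here is a sketch in Scala of a monovariant flow analysis for this CPS language, using a single global store keyed by variable name and iterated to a fixpoint. The AST and function names are illustrative; the project asks for the small-step formulation, but with the store widened globally the flow sets it computes are in the same 0CFA style as this sketch.

sealed trait AExp
case class Ref(x: String) extends AExp
case class Lam(params: List[String], body: Call) extends AExp
case object Halt extends AExp
case class Call(f: AExp, args: List[AExp])

object ZeroCFA {
  type Store = Map[String, Set[Lam]]   // address = variable name (the 0CFA choice)

  // Atomic evaluation: a lambda denotes itself; a variable denotes whatever
  // has flowed to its address; halt denotes nothing we track.
  def atomEval(a: AExp, sigma: Store): Set[Lam] = a match {
    case Ref(x) => sigma.getOrElse(x, Set.empty)
    case l: Lam => Set(l)
    case Halt   => Set.empty
  }

  // One pass over the reachable call sites: bind arguments into the store
  // and discover newly reachable lambda bodies.
  def step(calls: Set[Call], sigma: Store): (Set[Call], Store) = {
    var nextCalls = calls
    var nextSigma = sigma
    for (Call(f, args) <- calls; Lam(params, body) <- atomEval(f, sigma)) {
      nextCalls += body
      for ((p, a) <- params.zip(args))
        nextSigma += (p -> (nextSigma.getOrElse(p, Set.empty) ++ atomEval(a, sigma)))
    }
    (nextCalls, nextSigma)
  }

  // Iterate to a fixpoint from the program's top-level call.
  def analyze(program: Call): Store = {
    var calls: Set[Call] = Set(program)
    var sigma: Store     = Map.empty
    var changed = true
    while (changed) {
      val (c2, s2) = step(calls, sigma)
      changed = c2 != calls || s2 != sigma
      calls = c2; sigma = s2
    }
    sigma
  }
}

On the omega example above, analyze leaves (lambda (g) (g g)) in the flow sets of both f and g, which is exactly what ./0cfa should print.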
Project 3: Evaluators, partial-evaluator and macros
- Due: Thu 29 Oct 2013, 11:59:59pm
In this project, you will implement an evaluator for a small subset of the Scheme language extended with first-class macros.
The BNF grammar for the language is:
<program> ::= <body>

<body> ::= <def>* <exp>*

<def> ::= (define (<var> <var>*) <body>)
       |  (define <var> <exp>)

<exp> ::= <literal>
       |  <variable>
       |  (if <exp> <exp> <exp>)
       |  (if <exp> <exp>)
       |  (and <exp>*)
       |  (or <exp>*)
       |  (cond <cond-clause>*)
       |  (set! <var> <exp>)
       |  (lambda <formals> <body>)
       |  (let <bindings> <body>)
       |  (letrec <bindings> <body>)
       |  (begin <body>)
       |  (macro <exp>)
       |  (<operator> <operand>*)
       |  (<keyword> <s-exp>*)

<literal> ::= (quote <s-exp>) | '<s-exp> | <boolean> | <string> | <var>

<s-exp> ::= <symbol> | <boolean> | <number> | <string> | (<s-exp>*)

<cond-clause> ::= (<exp> <exp>+) | (<exp>) | (<exp> => <exp>) | (else <exp>)

<formals> ::= (<var>*) | <var>

<bindings> ::= ((<var> <exp>)*)

<operator> ::= <exp>

<operand> ::= <exp>
The initial environment for the evaluator should contain bindings for the following symbols:
apply + not display newline cons car cdr cadr caadr cadar cddr cdddr list null? pair? list? number? string? symbol? eq? = gensym
First-class macros
The language should behave like R5RS Scheme except for the (macro ...) construct, which creates first-class macros.
Macros in Scheme/Lisp-like languages extend the syntax of the language by re-writing expressions whose "head" is a macro and then performing evaluation.
For this project, the expression (macro e) is equivalent to (list 'macro e).
When the evaluator encounters an application point:

(e s-exp1 ... s-expn)

if the expression e evaluates to the s-expression (macro f), then the syntax of the application point is transformed into the result of f(s-exp1, ..., s-expn), and then re-evaluated.
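To illustrate the application rule, here is an evaluator fragment for a tiny language with only variables, fixed-arity lambda, macro, and application, written in Scala. The data types are illustrative, the grammar is far smaller than the project's, and, unlike the project (where a macro value is literally the list (macro f)), macros here get their own Value case.

sealed trait SExp
case class Sym(name: String) extends SExp
case class SList(items: List[SExp]) extends SExp

sealed trait Value
case class Closure(params: List[String], body: SExp, env: Map[String, Value]) extends Value
case class MacroVal(transformer: Value) extends Value
case class Syntax(sexp: SExp) extends Value   // unevaluated syntax handed to macros

object MiniEval {
  def eval(e: SExp, env: Map[String, Value]): Value = e match {
    case Sym(x) => env(x)

    case SList(Sym("lambda") :: SList(ps) :: body :: Nil) =>
      Closure(ps.collect { case Sym(p) => p }, body, env)

    // (macro e) evaluates e and tags the result as a first-class macro.
    case SList(Sym("macro") :: transformer :: Nil) =>
      MacroVal(eval(transformer, env))

    // An application point (e s-exp1 ... s-expn):
    case SList(op :: operands) =>
      eval(op, env) match {
        // If the operator is a macro, hand the transformer the *unevaluated*
        // operand syntax, then re-evaluate whatever syntax it returns.
        case MacroVal(f) =>
          applyProc(f, operands.map(Syntax(_))) match {
            case Syntax(expanded) => eval(expanded, env)
            case other            => sys.error(s"macro did not return syntax: $other")
          }
        // Otherwise it is an ordinary call: evaluate the operands, then apply.
        case proc => applyProc(proc, operands.map(eval(_, env)))
      }

    case SList(Nil) => sys.error("empty application")
  }

  def applyProc(f: Value, args: List[Value]): Value = f match {
    case Closure(params, body, cenv) => eval(body, cenv ++ params.zip(args))
    case other                       => sys.error(s"cannot apply: $other")
  }
}

The essential point is the application case: a macro receives syntax rather than values, and its result goes back through eval.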
Resources
- R5RS.
- My testing file: tests.scm.
- First-class macros in Arc.
- A reference interpreter, compiled with Gambit-C 5.4.2: interp.c. To compile: gsc -exe interp.c. The interpreter wraps the input in a (begin ...) expression.
expression. - A denotational interpreter in Scala.
- Chapter 4 of SICP contains source code for a meta-circular evaluator.
Advice
- Use the same data structure to represent run-time values that you use to represent syntax.
- Even if you don't know Scheme, I recommend doing this project in Scheme. (It's about 500 lines of well-written, well-abstracted Scheme.)
- If you do use Scala, make heavy use of custom patterns.
Turn-in
Details to come.
Project 4: Compiling with continuations
- Due: 10 Dec 2013, 11:59:59pm
In this project, you will implement a Scheme-to-CPS-to-C compiler.
Late policy
The usual late policy does not apply for this project: 1 day late = 10% off, 2 days late = 20% off, 3 days late = 40% off, 4 days late = 80% off.
Grading
The project will be self-graded with a random audit of a third of submissions. Once all projects are in, a test suite will be released.
Input language
The BNF grammar for the input language is:
<program> ::= <body>

<body> ::= <def>* <exp>*

<def> ::= (define (<var> <var>*) <body>)
       |  (define <var> <exp>)

<exp> ::= <literal>
       |  <variable>
       |  (if <exp> <exp> <exp>)
       |  (if <exp> <exp>)
       |  (and <exp>*)
       |  (or <exp>*)
       |  (cond <cond-clause>*)
       |  (set! <var> <exp>)
       |  (lambda (<var>*) <body>)
       |  (let <bindings> <body>)
       |  (letrec <bindings> <body>)
       |  (begin <body>)
       |  (<operator> <operand>*)

<literal> ::= (quote <symbol>) | '<symbol> | <integer> | <boolean> | <string> | <var>

<cond-clause> ::= (<exp> <exp>+) | (<exp>) | (<exp> => <exp>) | (else <exp>)

<bindings> ::= ((<var> <exp>)*)

<operator> ::= <exp>

<operand> ::= <exp>
You should support the following primitives:
display newline + * - / = < call/cc
You do not have to support multi-precision arithmetic. 32-bit or 64-bit integers are OK.
Desugared language
It is recommended that you desugar into this language:
<exp> ::= <literal>
       |  <variable>
       |  (if <exp> <exp> <exp>)
       |  (set! <var> <exp>)
       |  (lambda (<var>*) <body>)
       |  (begin <body>)
       |  (<exp> <exp>*)
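To give a flavor of the rewriting involved, here is a sketch in Scala (over raw s-expressions, with single-expression bodies for brevity; the types and rule set are illustrative) of two desugarings: let into an immediate lambda application, and and into nested ifs. A real desugarer has more forms to cover and must avoid walking under quote.

sealed trait SExp
case class Sym(s: String) extends SExp
case class SInt(n: Int) extends SExp
case class SList(items: List[SExp]) extends SExp

object Desugar {
  def desugar(e: SExp): SExp = e match {
    // (let ((x e) ...) body)  ==>  ((lambda (x ...) body) e ...)
    case SList(Sym("let") :: SList(bindings) :: body :: Nil) =>
      val (vars, rhss) = bindings.map {
        case SList(v :: rhs :: Nil) => (v, desugar(rhs))
        case b                      => sys.error(s"malformed binding: $b")
      }.unzip
      SList(SList(Sym("lambda") :: SList(vars) :: desugar(body) :: Nil) :: rhss)

    // (and)            ==>  #t
    // (and e)          ==>  e
    // (and e1 e2 ...)  ==>  (if e1 (and e2 ...) #f)
    case SList(Sym("and") :: Nil)        => Sym("#t")
    case SList(Sym("and") :: e1 :: Nil)  => desugar(e1)
    case SList(Sym("and") :: e1 :: rest) =>
      SList(List(Sym("if"), desugar(e1), desugar(SList(Sym("and") :: rest)), Sym("#f")))

    // Everything else: recur into subexpressions.
    case SList(items) => SList(items.map(desugar))
    case atom         => atom
  }
}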
CPS language
The final CPS language should be:
<exp> ::= <literal>
       |  <variable>
       |  (lambda <formals> <call>)

<call> ::= (if <exp> <call> <call>)
        |  (begin (set! <var> <exp>) <call>)
        |  (<exp> <exp>*)
Passing the flag --output=cps to the compiler should produce this language.
It is worth noting that the output CPS language allows variable-arity lambdas, but the input language does not. (This is useful for desugaring begin expressions.)
Your output language can use the following primitives:
display-cps newline-cps +-cps *-cps --cps /-cps =-cps <-cps halt
These primitives take the current continuation as their *last* argument.
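For a sense of the shape of the transformation, here is a sketch of a naive CPS transform in Scala over s-expressions. It handles only variables, one-parameter lambdas, and one-argument calls, and it produces the administrative redexes that the one-pass transformations in the Danvy papers below are designed to avoid; the types and helper names are illustrative.

sealed trait SExp
case class Sym(s: String) extends SExp
case class SList(items: List[SExp]) extends SExp

object NaiveCPS {
  private var counter = 0
  private def gensym(base: String): Sym = { counter += 1; Sym(s"$$$base$counter") }

  // M: convert an atomic expression, adding a continuation parameter to lambdas.
  def m(e: SExp): SExp = e match {
    case SList(Sym("lambda") :: SList(List(x)) :: body :: Nil) =>
      val k = gensym("k")
      SList(List(Sym("lambda"), SList(List(x, k)), t(body, k)))
    case v: Sym => v
    case other  => sys.error(s"not atomic: $other")
  }

  // T: convert an expression so that its value is passed to the continuation `cont`.
  def t(e: SExp, cont: SExp): SExp = e match {
    // Atomic expressions are handed straight to the continuation.
    case Sym(_) | SList(Sym("lambda") :: _) =>
      SList(List(cont, m(e)))
    // (f a): evaluate f, then a, then call f on a and the continuation.
    case SList(List(f, a)) =>
      val fv = gensym("f")
      val av = gensym("a")
      t(f, SList(List(Sym("lambda"), SList(List(fv)),
        t(a, SList(List(Sym("lambda"), SList(List(av)),
          SList(List(fv, av, cont))))))))
    case other => sys.error(s"unsupported form: $other")
  }

  def toText(e: SExp): String = e match {
    case Sym(s)    => s
    case SList(xs) => "(" + xs.map(toText).mkString(" ") + ")"
  }
}

For instance, transforming (f x) with continuation halt yields ((lambda ($f1) ((lambda ($a2) ($f1 $a2 halt)) x)) f), which fits the CPS grammar above.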
C output
For the C output, you have two implementation choices: tail-calls and first-class labels. Tail calls are ANSI C; first-class labels are implementation-specific. I will only require that your code compiles with GCC.
You are encouraged to use a closure conversion technique similar to the one found in this blog post. Of course, when you closure convert in CPS, a label will do instead of a function pointer.
Recommendations
- You can exploit a lot of existing code.
- Feel free to use more than one implementation language.
- If you get stuck, ask questions!
- There are several ways to compile CPS to C.
- Compile a few examples by hand before you write any code.
Resources
- SICP.
- HtDP.
- R5RS.
- A First-Order One-Pass CPS Transformation by Olivier Danvy and Lasse R. Nielsen.
- On One-Pass CPS Transformations by Olivier Danvy, Kevin Millikin, and Lasse R. Nielsen.
- Programming with continuations.
- Compiling Scheme to C.