Compilers

Time Tues & Thurs, 10:45 AM-12:05 PM
Location WEB 2230
Instructor    Matt Might (MEB 3450)
Office hours   After lecture, by appointment, or whenever my door is open.

TAs Limou Wang (MEB 4152), Petey Aldous (MEB 3375)
Limou hours 2:00pm - 3:00pm (Mon, Wed), 1:00pm - 2:00pm (Fri)
Petey hours 9:30am - 10:30am (Mon, Wed), 12:30pm - 1:30pm (Mon, Wed)

Group Utah Compilers, Spring 2013
Blog blog.might.net (I'll post lecture notes here.)

Description

The field of compilers concerns the automated translation of high-level programming languages into low-level code.

This course covers the design and implementation of compilers, and it explores related topics such as interpreters, virtual machines and runtime systems.

This is not an easy course, but it is a rewarding one for those who finish.

Few programmers will ever have to work on a production compiler, but knowing how compilers work is an asset.

It makes you a better programmer.

Knowing how a compiler thinks will change how you code. Once you know what compilers can and cannot optimize, you will target the intersection of human and mechanical comprehension.

Your code will look better. Your code will run better.

Many programming tasks can be framed in terms of interpretation or translation. Compilers teaches you how to solve tasks like these properly.

Many companies develop an internal programming or scripting language. Most of them get it wrong. If you've taken compilers, you can fix it.

Finally, if you want to develop your own programming language, take compilers in order to spare the world some accidental indecency.

Schedule

Grading

There are two ways to earn a grade in this class: tests and projects.

Your grade is your tests grade or your projects grade, whichever is higher.

(You do not have to take the final if you are satisfied with your projects.)

Exams

Exams will be impossible to finish in the allotted time.

You should answer those questions which you know best.

The weight of each question is proportional to the percentage of students who failed to answer it or answered it incorrectly. (For example, a question missed by 80% of the class carries four times the weight of one missed by 20%.)

The result will be curved.

Material

I'm morally opposed to the way textbooks are priced.

There is no required textbook for the class.

Instead of a textbook, I maintain a technical blog with articles on topics covered in the class.

I'll link to applicable online material from here, as well as the slides and any code we use in class.

Recommended books

Some students have requested recommended textbooks.

There is no "Python compiler project" textbook, but many books cover some of the necessary concepts.

A good resource on compiling a dynamic language like Python is Lisp in Small Pieces.

An older (yet remarkably up-to-date) resource on compiling with continuation passing style is Compiling with Continuations. (This is a very advanced book and not recommended for struggling students.)

The classic compiler textbook is "The Dragon Book": Compilers: Principles, Techniques and Tools. Some of the techniques in this book work for dynamic languages like Python. Unfortunately, most of the techniques only work well for C-like typed languages.

Relevant blog posts

Lambda calculus

Racket

Python

Lexical analysis

Parsing

Transformation

Collaboration policy

For the projects, you are allowed to pair with a partner and submit a single, joint assignment.

You may also work alone.

I encourage the formation of study groups to discuss the theory and algorithms behind the projects.

You may not share code with anyone other than your partner.

If you want to learn it, you have to implement it.

Violators of the collaboration policy will be dealt with under the University's academic misconduct policy, and they will fail the course.

How to cheat in this class

I've caught so many clueless attempts at cheating in my courses that I feel the need to provide a guide to students.

If you're going to cheat, you need to rewrite the assignment from scratch.

Projects

This semester, we'll be building a Python 3.1 compiler.

You can use any language you want to do the implementation.

I strongly encourage learning and using a functional language such as Racket, Scala, OCaml/SML, Clojure or Haskell.

Anything else is a knife in a gunfight.

  1. Lexical analyzer
  2. Parser
  3. High-level translator
  4. Normalizer & CPS converter
  5. Low-level translator
  6. Register allocator & Assembly code emitter

Bonus opportunities

Valid until the week of finals.

Project 1: A lexer for Python

Due: Friday, Feb 1, 11:59 PM Utah time

Your task is to construct a lexical analyzer for the Python programming language.

Python is a relatively simple language to tokenize, with the notable exception of its significant whitespace.

For reference, the lexical specification for Python is available as part of the official documentation.

Guidelines

  1. You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X.
  2. If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements

  1. Include a README.txt file that contains:
    1. your name;
    2. a list of software dependencies necessary to run your assignment;
    3. a description of what you got working, what's partially working and what's completely broken; and
    4. a manifest briefly describing each file you've turned in.
  2. Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.
  3. The program should use a Makefile:
    1. make should compile the program (if compilation is necessary).
    2. make run should accept input on STDIN and send output to STDOUT.
    3. make test should run any test cases you used.
  4. The program should output one s-expression per line for each token in the input.
  5. There are eight kinds of s-expressions you can emit:
    1. (NEWLINE) -- for a logical newline.
    2. (INDENT) -- for a logical increase in indentation.
    3. (DEDENT) -- for a logical decrease in indentation.
    4. (ID name) -- for an identifier, where name is a Racket-compatible literal string.
    5. (LIT value) -- for a literal value, where value is a Racket-compatible literal value. (If it's a string, it must be properly escaped.)
    6. (KEYWORD symbol) -- for a keyword, where symbol is a Racket-compatible literal symbol.
    7. (PUNCT text) -- for operators and delimiters, where text is a Racket string containing the text of the operator or delimiter.
    8. (ENDMARKER) -- for the end of the input.
  6. If you encounter a lexical error, print (ERROR "explanation") and quit.

Simplifications

  1. You do not have to handle multiple character-set encodings. Assume the input is ASCII. (This nullifies the need to handle regular strings and byte strings differently.) Assume an identifier matches the regular expression [A-Za-z_][A-Za-z_0-9]*
  2. You should not join adjacent strings together. That's easier to do with a later pass.
  3. If a tab appears as an indentation character, it is an error. (You do not have to bump it to the nearest multiple of 8.)

Hints and tips

  1. You can use a lexer-generator like flex or Racket's lexer for this project. (You'll find flex preinstalled on most Unix systems.) If you go this route, be sure to exploit the ^ regular expression pattern for matching the start of a line.
  2. You could also write your own lexical analyzer from scratch, using either existing regex support or derivatives of regular expressions.
  3. It's not unreasonable to write a lexer by hand for a language like Python. If you use a language that supports regular expressions, you can get the longest matching prefix of a string for the regular expression re with /^(re)/. The special pattern ^ matches the start of the input (or in some cases, the start of a line). (A sketch of this approach appears after this list.)
  4. You may find it easier to write a two-pass lexical analyzer: the first pass could resolve significant whitespace, and the second could handle proper tokenization.
  5. For this project, using a functional language holds advantages over languages like C and Java, but it does not hold an advantage over a lexer generator like flex.
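
If you go the hand-written route, here is a minimal sketch (in Racket) of longest-prefix matching against a table of token classes. It is illustrative only: the table is a placeholder, not Python's full lexical specification, and the names token-table and next-token are hypothetical.

    #lang racket
    ;; A sketch of longest-prefix matching for a hand-written lexer.
    ;; The token classes below are placeholders, not Python's lexical spec.
    (define token-table
      (list (cons 'ID    #px"^[A-Za-z_][A-Za-z_0-9]*")
            (cons 'LIT   #px"^[0-9]+")
            (cons 'PUNCT #px"^(==|\\(|\\)|:|\\*|-)")))

    ;; Return (cons kind text) for the longest prefix of the input matched
    ;; by any pattern, or #f if nothing matches.
    (define (next-token input)
      (define matches
        (filter-map
         (lambda (entry)
           (define m (regexp-match (cdr entry) input))
           (and m (cons (car entry) (car m))))
         token-table))
      (and (pair? matches)
           (argmax (lambda (m) (string-length (cdr m))) matches)))

    ;; Example: (next-token "fact(20)")  =>  '(ID . "fact")

A real lexer still has to distinguish keywords from identifiers, handle strings and comments, and maintain the indentation stack for INDENT and DEDENT, but each piece follows the same match-a-prefix, emit-a-token loop.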

Example

The input:
# Comment 1
 # Comment 2

# Factorial:

def fact(  x\
):

  if x == -1:
    return 1.j

  elif x ==0:

    return 1
  else:

        return x* fact(x

- 1)

s = "foo\
\\ \n\'\""


fact(20)
should produce:
(KEYWORD def)
(ID "fact")
(PUNCT "(")
(ID "x")
(PUNCT ")")
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD if)
(ID "x")
(PUNCT "==")
(PUNCT "-")
(LIT 1)
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD return)
(LIT +1.i)
(NEWLINE)
(DEDENT)
(KEYWORD elif)
(ID "x")
(PUNCT "==")
(LIT 0)
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD return)
(LIT 1)
(NEWLINE)
(DEDENT)
(KEYWORD else)
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD return)
(ID "x")
(PUNCT "*")
(ID "fact")
(PUNCT "(")
(ID "x")
(PUNCT "-")
(LIT 1)
(PUNCT ")")
(NEWLINE)
(DEDENT)
(DEDENT)
(ID "s")
(PUNCT "=")
(LIT "foo\\ \n\'\"")
(NEWLINE)
(ID "fact")
(PUNCT "(")
(LIT 20)
(PUNCT ")")
(NEWLINE)
(ENDMARKER)
    

Code

You may find the sample lexers written in flex to be useful.

Reference implementation

A reference implementation is available online: pylex.

Submission instructions

Copy your project to /home/unid/pylex/ on caprica.ucombinator.org.

It should be possible to run your code with:

    cd /home/unid/pylex
    make 
    make run < test1.py > test1.py.out
    make run < test2.py > test2.py.out
    make run < test3.py > test3.py.out
    ...
   

Project 2: A parser for Python

Due: Wed, Feb 27, 11:59 PM Utah time

Your task is to construct a parser for the Python programming language.

Unusually among programming languages, Python is straightforward to parse. In fact, it is one of the few popular programming languages with an LL(1) grammar, making it amenable to a hand-written recursive-descent parser.
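
To make "recursive descent" concrete, here is a minimal sketch in Racket over a toy LL(1) grammar. The grammar and helper names (peek, eat!, parse-term, parse-expr) are illustrative only; they are not the simplified Python grammar used in this project.

    #lang racket
    ;; A sketch of recursive descent for the toy grammar:
    ;;   expr ::= term ('+' term)*
    ;;   term ::= NUM | '(' expr ')'
    (define tokens (box '(3 + "(" 4 + 5 ")")))   ; a pretend token stream

    (define (peek) (if (null? (unbox tokens)) 'eof (car (unbox tokens))))
    (define (eat!) (let ([t (peek)]) (set-box! tokens (cdr (unbox tokens))) t))

    (define (parse-term)
      (cond
        [(number? (peek)) (eat!)]
        [(equal? (peek) "(")
         (eat!)                          ; consume "("
         (let ([e (parse-expr)])
           (eat!)                        ; consume ")"
           e)]
        [else (error 'parse-term "unexpected token: ~a" (peek))]))

    (define (parse-expr)
      (let loop ([left (parse-term)])
        (if (equal? (peek) '+)
            (begin (eat!) (loop `(+ ,left ,(parse-term))))
            left)))

    ;; (parse-expr)  =>  '(+ 3 (+ 4 5))

One small procedure per nonterminal, each deciding what to do by peeking at a single token, is all that LL(1) demands.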

Of course, it is also possible to use a parsing tool.

Guidelines

  1. You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X.
  2. If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements

  1. You must parse the simplified Python grammar.
  2. You should assume that Python source is coming in -- not pre-lexed input.
  3. If you need to use a working lexer, you can assume that the program pylex is in the current PATH, and that it matches the behavior of the web app by consuming a file on STDIN and printing its tokens as S-Expressions line-by-line on STDOUT.
  4. The output S-Expression should fit within the grammar for the concrete AST.
  5. Include a README.txt file that contains:
    1. your name;
    2. a list of software dependencies necessary to run your assignment;
    3. a description of what you got working, what's partially working and what's completely broken; and
    4. a manifest briefly describing each file you've turned in.
  6. Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.
  7. The program should use a Makefile:
    1. make should compile the program (if compilation is necessary).
    2. make run should accept input on STDIN and send output to STDOUT.
    3. make lex should run the lexer on STDIN.
    4. make parse should run the parser on STDIN.
    5. make test should run any test cases you used.
  8. The program should output one s-expression containing the concrete parse tree on success or #f on failure.

Simplifications

  1. Use the simplified grammar provided.

Hints and tips

  1. Do not write the parser by hand, unless you have a lot of time.
  2. I recommend that you use yacc, Racket's parser generator, or a derivative-based parsing tool.
  3. Check out the documentation for derp.
  4. If you want to write your lexer and parser in different languages, you can tie them together with a Unix pipe, like so:
         $ pylex | pyparse
        
  5. You can complete the entire assignment by modifying only the python-ast.grm.sx file provided with the stub code. You just add derp reductions to the rules in this file.
  6. Add a -n flag to pylex that suppresses the ENDMARKER token. To parse subsets of the grammar, change file_input in pyparse.rkt to the name of the production you want to target.

Code

A reference web app is available.

You may find the sample parsers useful.

If you take the Racket route, stub code is available for parsing with derivatives.

Submission instructions

Copy your project to /home/unid/pyparse/ on caprica.ucombinator.org.

It should be possible to run your code with:

    cd /home/unid/pyparse
    make 
    make run < test1.py > test1.py.out
    make run < test2.py > test2.py.out
    make run < test3.py > test3.py.out
    pylex < test1.py | make parse > test1.py.out
    pylex < test2.py | make parse > test2.py.out
    pylex < test3.py | make parse > test3.py.out
    ...
   

Project 3: A high-level translator for Python

Due: Wed, March 27, 11:59 PM Utah time

Your task is to construct a translator from the subset of Python from Project 2 into a high-level intermediate representation (HIR).

The high-level IR shares many features with Python, including dynamic typing and higher-order procedures.

To focus on key concepts, some additional parts of Python have been removed for this project.

Guidelines

  1. You can use any language you want for this project, but you must be able to run your program on the class server. (Non-functional languages are ill-advised unless you have copious free time.)
  2. You will have a chance to modify your assignment slightly after submission to comply with the auto-grader.

Requirements

  1. You must translate the same simplified subset of Python as in the previous project, but assume the input is in the output format of project 2.
  2. If you need to use a working lexer, you can assume that the program pylex is in the current PATH, and that it matches the behavior of the web app by consuming a file on STDIN and printing its tokens as S-Expressions line-by-line on STDOUT.
  3. If you need to use a working parser, you can assume that the program pyparse is in the current PATH, and that it matches the behavior of the reference app by consuming a tokenized input on STDIN and printing the AST on stdout.
  4. The output should fit within the grammar for HIR.
  5. Include a README.txt file that contains:
    1. your name;
    2. a list of software dependencies necessary to run your assignment;
    3. a description of what you got working, what's partially working and what's completely broken; and
    4. a manifest briefly describing each file you've turned in.
  6. Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.
  7. The program should use a Makefile:
    1. make should compile the program (if compilation is necessary).
    2. make trans should accept the output of project 2 on STDIN and send output to STDOUT.
    3. make run should accept a Python program on STDIN and send output to STDOUT.
    4. make test should run any test cases you used.
  8. The program should output one S-expression containing the HIR code on success or print an error message containing the word error on failure.
  9. Your output need not match the reference implementation exactly. It should be semantically equivalent (unless the reference implementation has a bug).
  10. You can also test your output against a Python 3.x interpreter, but note that these may differ in how they display objects using print.

Simplifications

  1. Since project 2 snipped out the class system, you can't create objects, but you should still assume that code can access and set fields with .name. (Note: The code to interpret HIR fails immediately with an error message if these constructs are invoked.)
  2. For numbers, assume only integers are used in the source and at run-time. (No floating point, no imaginary, no complex.)
  3. All try blocks have exactly one except clause, and no else or finally clauses, i.e.:
               try:
                 main code
               except:
                 failure code
             
  4. A raise can raise any value (not just exceptions).
  5. The single except clause for a try block does not specify which exception it catches, nor does it name it.
  6. You cannot delete a variable from the local scope. You can only delete fields and indices. That is, you can write:
              del a[i] 
    or
              del a.f 
    but not:
              del x 
    Of course, you could also write:
              del f(3)["foo"][i] 
  7. You can assume all global variables are defined at the top level before their first reference. If you want to handle global variables first defined locally using the global construct, talk to Matt. That is, you have to handle:
           y = 10
           def f(x):
              global y
              y = x
           
    but not:
           def f(x):
              global y
              y = x
           

Hints and tips

  1. In this project, the expressive advantage of using a functional language like Racket or Haskell over a language like C or Java starts to become substantial. Those wishing to complete this project in a non-functional language need to start immediately and ask for help early!
  2. A prominent feature in functional languages is the ability to pattern-match on data structures. This ability is the killer advantage in this project.
  3. If you want to write your lexer, parser and translator in different languages, you can tie them together with a Unix pipe, like so:
         $ pylex | pyparse | ./pytrans
        
  4. Break the transformation down into a transformer for each kind of Python term: let transform-program transform an entire Python program into HIR; let transform-stmt transform an individual Python stmt into HIR; let transform-exp transform an individual Python expression into HIR. (A sketch of this pattern follows this list.)
  5. Use only very small programs for testing (initially). Start with print(3) and work up from there. Examine the output of the reference implementation on these small programs to unravel its logic.
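
As a rough sketch of that structure (in Racket, using match), here is what a couple of cases might look like. The statement and expression forms are placeholders, not the project's actual AST or HIR grammars, and the output is not what the reference implementation emits.

    #lang racket
    ;; One transformer per class of term; each case pattern-matches an input
    ;; form and rebuilds the corresponding (placeholder) HIR form.
    (define (transform-stmt stmt)
      (match stmt
        [`(expr ,e)           `(begin ,(transform-exp e))]
        [`(return ,e)         `(return ,(transform-exp e))]
        [`(while ,test ,body) `(while ,(transform-exp test)
                                      ,@(map transform-stmt body))]
        [_ (error 'transform-stmt "unhandled statement: ~a" stmt)]))

    (define (transform-exp exp)
      (match exp
        [(? symbol? x)     x]
        [(? number? n)     n]
        [`(call ,f ,args)  `(,(transform-exp f) ,@(map transform-exp args))]
        [_ (error 'transform-exp "unhandled expression: ~a" exp)]))

Most of the project is filling in one such case per Python construct, and the catch-all error clause tells you exactly which form you have not handled yet.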

Code

This code is provided without warranty or guarantee of correctness. If you find a discrepancy between this code and the project description, notify Matt.

A reference web application is available.

To execute HIR code, you can prepend hir-header.rkt to the result of your translation, and then execute it in racket. (hir-header.rkt provides macros and functions to define all of elements of HIR in terms of Racket.)
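
For example, assuming make run prints HIR for the Python program on STDIN (and with myfile.py standing in for any test input), you could do:

     $ make run < myfile.py > myfile.hir
     $ cat hir-header.rkt myfile.hir > myfile.rkt
     $ racket myfile.rkt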

If you want, you can start the project with stub code; this file illustrates a common design pattern for compilers--a transformation function for each class of term. In this case, it contains transform-program, transform-stmt and transform-exp.

You can grab a reference implementation from pytrans_rkt.zo. It's compiled Racket bytecode for Racket 5.3.1. To run, use racket pytrans_rkt.zo and send input on STDIN.

Submission instructions

Copy your project to /home/unid/pytrans/ on caprica.ucombinator.org.

It should be possible to run your code with:

    cd /home/unid/pytrans
    make 
    make run < test1.py > test1.py.out
    make run < test2.py > test2.py.out
    make run < test3.py > test3.py.out
    pylex < test1.py | pyparse | make trans > test1.py.out
    pylex < test2.py | pyparse | make trans > test2.py.out
    pylex < test3.py | pyparse | make trans > test3.py.out
    ...
   

Project 4: A CPS translator for Python

Due: Wed, April 10, 11:59 PM Utah time

Your task is to construct a translator from the subset of Python in Project 3 into continuation-passing style (CPS).

Of course, the expected way of accomplishing this is to translate HIR into CPS.

It is tedious to translate HIR directly into CPS. It is strongly recommended that you first push HIR through a desugaring phase, dropping it down to LIR.
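
To show the shape of the target -- and the convention, required below, that the continuation argument always comes last -- here is a sketch of naive CPS conversion for a bare lambda-calculus core in Racket. It illustrates the style only; it is not the HIR/LIR-to-CPS transform itself, and its output is not the reference implementation's.

    #lang racket
    ;; Naive CPS conversion for a three-form lambda-calculus core.
    ;; LIR has many more forms; this only shows the general recipe.
    (define (cps-convert expr k)
      (match expr
        [(? symbol?) `(,k ,expr)]
        [(? number?) `(,k ,expr)]
        [`(lambda (,x) ,body)
         (let ([k2 (gensym 'k)])
           `(,k (lambda (,x ,k2) ,(cps-convert body k2))))]   ; continuation last
        [`(,f ,arg)
         (let ([fv (gensym 'f)]
               [av (gensym 'a)])
           (cps-convert f `(lambda (,fv)
                             ,(cps-convert arg
                                           `(lambda (,av) (,fv ,av ,k))))))]))

    ;; (cps-convert '((lambda (x) x) 42) 'halt) yields a term equivalent to:
    ;;   ((lambda (f) ((lambda (a) (f a halt)) 42)) (lambda (x k) (k x)))

Desugaring first keeps the number of cases in this match small, which is exactly why the HIR-to-LIR pass is worth the effort.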

Guidelines

  1. You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X. Once again, non-functional languages are at a steep expressiveness disadvantage, and will require substantially more time and code invested. Please plan and request help accordingly.
  2. If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements

  1. You must translate the same simplified subset of Python from Project 2, less the additional restrictions from Project 3.
  2. You should assume that Python source is coming in -- not pre-lexed, pre-parsed, or pre-translated input.
  3. If you need to use a working lexer, you can assume that the program pylex is in the current PATH, and that it matches the behavior of the web app by consuming a file on STDIN and printing its tokens as S-Expressions line-by-line on STDOUT.
  4. If you need to use a working parser, you can assume that the program pyparse is in the current PATH, and that it matches the behavior of the web app by consuming a tokenized input on STDIN and printing the AST on stdout.
  5. If you need to use a working HIR translator, you can assume that the program pytrans is in the current PATH, and that it matches the behavior of the bytecode app by consuming a Python AST on STDIN and printing the HIR AST on stdout.
  6. The output should fit within the grammar for CPS.
  7. Include a README.txt file that contains:
    1. your name;
    2. a list of software dependencies necessary to run your assignment;
    3. a description of what you got working, what's partially working and what's completely broken; and
    4. a manifest briefly (a few words) describing each file you've turned in.
  8. Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.
  9. The program should use a Makefile:
    1. make should compile the program (if compilation is necessary).
    2. make run should accept Python code on STDIN and send CPS code to STDOUT.
    3. make test should run any test cases you used.
  10. The program should output one S-expression containing the CPS code. The output for invalid input is undefined, but it is recommended that you print an error message containing the word error on failure.
  11. Your output need not match the reference implementation exactly. It should be semantically equivalent (unless the reference implementation has a bug).
  12. When CPS-converting, the continuation argument always comes last, even for primitive operations.
  13. You can also test your output against a Python 3.x interpreter.

Simplifications

  1. All object- and field-related operations have been removed from the language.

Hints and tips

  1. This project is optimally split into two phases: desugaring and CPS conversion.
  2. It is strongly recommended that you first translate HIR into LIR, and then translate LIR to CPS.
  3. Even if you don't understand the define-syntax form in detail, you may learn a great deal about how to desugar from hir-header.rkt. This is particularly true for the desugaring of try.
  4. A substantial harness is provided in the stub code. Even if you don't plan to use Racket, it is recommended that you look at the harness. In particular, try:
         $ make test-cps
         $ make test-lir
          
  5. The stub code contains a desugarer with all of the cases stubbed out. It also contains the "top level" for the reference CPS converter. (A sketch of one desugaring case follows this list.)
  6. Pattern-matching tree transforms once again dominate this project.
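
To see what a single desugaring case can look like, here is a sketch that lowers a while-style loop into a letrec. The forms are placeholders, not the actual HIR or LIR grammars.

    #lang racket
    ;; One hypothetical desugaring case: a loop becomes a self-applying thunk.
    (define (desugar exp)
      (match exp
        [`(while ,test ,body)
         `(letrec ([loop (lambda ()
                           (if ,(desugar test)
                               (begin ,(desugar body) (loop))
                               (void)))])
            (loop))]
        ;; ... one case per high-level form being removed ...
        [_ exp]))   ; a real desugarer also recurs into all other subterms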

Reading material

Of the reading material for the course, the following are most helpful:

Code

This code is provided without warranty or guarantee of correctness. If you find a discrepancy between this code and the project description, notify Matt.

To execute LIR code, you can prepend lir-header.rkt to the result of your translation, and then execute it in racket.

To execute CPS code, you can prepend cps-header.rkt to the result of your translation, and then execute it in racket; for example, you could do:

     $ make run < myfile.py > myfile.cps
     $ cat cps-header.rkt myfile.cps > myfile.rkt
     $ racket myfile.rkt
    

You can grab a reference desugarer from pydesugar_rkt.zo.

You can grab a reference cps converter from pycps_rkt.zo.

These are compiled bytecode for Racket 5.3.

Submission instructions

Copy your project to /home/unid/pycps/ on caprica.ucombinator.org.

It should be possible to run your code with:

    cd /home/unid/pycps
    make 
    make run < test1.py > test1.py.cps
    make run < test2.py > test2.py.cps
    make run < test3.py > test3.py.cps
    pylex < test1.py | pyparse | pytrans | make cps > test1.py.cps
    pylex < test2.py | pyparse | pytrans | make cps > test2.py.cps
    pylex < test3.py | pyparse | pytrans | make cps > test3.py.cps
    ...
   

Project 5: A Python-to-C compiler

Due: Wed, April 24, 11:59 PM Utah time

Your task is to construct a translator from the subset of Python in Project 4 into C.

Of course, the expected way of accomplishing this is to translate from CPS.

To make this transformation as easy as possible, it is recommended that you perform mutable variable elimination, flat closure conversion and lambda lifting. If you do, then the result may be more easily transliterated into C.

In terms of restructuring the language for these transforms, it is recommended that you follow this spec.
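
As a rough sketch of what those three transforms do to a single term (all of the output forms -- box, make-closure, env-ref, apply-closure, lam42 -- are illustrative placeholders; follow the spec linked above for the real forms):

    ;; 1. Mutable-variable elimination: box any variable that is set!-ed.
    ;;    before:  (lambda (x) (set! x 1) x)
    ;;    after:   (lambda (x) (let ([x (box x)]) (set-box! x 1) (unbox x)))

    ;; 2. Flat closure conversion: procedures take their environment explicitly,
    ;;    and free variables are packed into a closure record at creation.
    ;;    before:  (lambda (x) (+ x y))                        ; y is free
    ;;    after:   (make-closure (lambda (env x) (+ x (env-ref env 0))) y)
    ;;    Call sites (f 3) become (apply-closure f 3).

    ;; 3. Lambda lifting: once no lambda has free variables, each one can be
    ;;    hoisted to a named top-level definition.
    ;;    after:   (define (lam42 env x) (+ x (env-ref env 0)))
    ;;             (make-closure lam42 y)

After these passes, every procedure is a closed, top-level function over explicit data, which is what makes the final transliteration to C straightforward.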

Guidelines

  1. You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X. Once again, non-functional languages are at a steep expressiveness disadvantage, and will require substantially more time and code invested. Please plan and request help accordingly.
  2. If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements

  1. You must translate the same simplified subset of Python from Project 2, less the additional restrictions from Projects 3 and 4.
  2. You should assume that Python source is coming in -- not pre-lexed, pre-parsed, or pre-translated input.
  3. If you need to use a working lexer, you can assume that the program pylex is in the current PATH, and that it matches the behavior of the web app by consuming a file on STDIN and printing its tokens as S-Expressions line-by-line on STDOUT.
  4. If you need to use a working parser, you can assume that the program pyparse is in the current PATH, and that it matches the behavior of the web app by consuming a tokenized input on STDIN and printing the AST on stdout.
  5. If you need to use a working HIR translator, you can assume that the program pytrans is in the current PATH, and that it matches the behavior of the bytecode app by consuming a Python AST on STDIN and printing the HIR AST on stdout.
  6. If you need to use a working LIR translator, you can assume that the program pydesugar is in the current PATH, and that it matches the behavior of the bytecode app by consuming an HIR AST on STDIN and printing the LIR AST on stdout.
  7. If you need to use a working CPS converter, you can assume that the program pycps is in the current PATH, and that it matches the behavior of the bytecode app by consuming an LIR AST on STDIN and printing the CPS code on stdout.
  8. The output should be accepted by gcc.
  9. Include a README.txt file that contains:
    1. your name;
    2. a list of software dependencies necessary to run your assignment;
    3. a description of what you got working, what's partially working and what's completely broken; and
    4. a manifest briefly (a few words) describing each file you've turned in.
  10. Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.
  11. The program should use a Makefile:
    1. make should compile the program (if compilation is necessary).
    2. make run should accept Python code on STDIN and send C code to STDOUT.
    3. make test should run any test cases you used.
  12. The output for invalid input is undefined, but it is recommended that you print an error message containing the word error on failure.
  13. Your output need not match the reference implementation exactly. It should be semantically equivalent (unless the reference implementation has a bug).
  14. You can also test your output against a Python 3.x interpreter.

Hints and tips

  1. This project is optimally split into four small transforms: mutable variable elimination, closure conversion, lambda-lifting and C code emission.
  2. Tree transforms once again dominate this project.

Code

This code is provided without warranty or guarantee of correctness. If you find a discrepancy between this code and the project description, notify Matt.

To execute MVE'd, closure-converted or lambda-lifted code, you can prepend c-header.rkt to the result of your translation, and then execute it in racket.

You may find pytoc-common.rkt useful for writing these transforms.

You can grab a reference MVE from pymve_rkt.zo.

You can grab a reference closure-converter from pycc_rkt.zo.

You can grab a reference lambda-lifter from pyll_rkt.zo.

You can grab a mostly-completed C translator from pytoc.rkt.

These are compiled bytecode for Racket 5.3.

You can find test cases here.

Submission instructions

Place the project on caprica in ~/pytoc/.