Compilers

Instructor: Matt Might
TA: Youngrok Bahn
Time: Tues & Thurs, 2:00pm - 3:20pm
Location: MEB 3105
Appointments: Office hours after each lecture, or when my door is open.
Group: Utah Compilers, Spring 2011
Blog: blog.might.net (I'll post lecture notes here.)

Description

The field of compilers concerns the automated translation of high-level programming languages into low-level code.

This course covers the design and implementation of compilers, and it explores related topics such as interpreters, virtual machines and runtime systems.

This is not an easy course, but it is a rewarding one for those who finish.

Few programmers will ever have to work on a production compiler, but knowing how compilers work is an asset. It makes you a better programmer.

Knowing how a compiler thinks will change how you code. Once you know what compilers can and cannot optimize, you will target the intersection of human and mechanical comprehension.

Your code will look better.

Many programming tasks can be framed in terms of interpretation or translation. Compilers teaches you how to solve tasks like these properly.

Many companies develop an internal programming or scripting language. Most of them get it wrong. If you've taken compilers, you can fix it.

Finally, if you want to develop your own programming language, take compilers and spare the world some accidental indecency.

Schedule

Grading

Your final grade is the maximum of your average on the projects and your grade on the final.

You do not have to take the final if you are satisfied with your grade.

Final exam

The final exam will be impossible to finish in the allotted time.

You should answer those questions which you know best.

The weight of each question is proportional to the percentage of students that failed to answer it or answered it incorrectly.
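
For example (purely to illustrate the rule): if 60% of the class misses one question and 20% misses another, the first question carries three times the weight of the second.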

The result will be curved.

Material

I'm morally opposed to the way textbooks are priced.

There is no required textbook for the class.

I'll link to applicable online material from here, as well as my notes, the slides and any code we use in class.

Preliminaries

Lambda calculus

Racket

Python

Lexical analysis

Parsing

Transformation

Collaboration policy

I encourage the formation of study groups to discuss the theory and algorithms behind the projects.

You may not share code.

If you want to learn it, you have to implement it.

Violators of this rule will be dealt with under the University's academic misconduct policy.

Projects

This semester, we'll be building a Python 3.1 compiler.

You can use any language you want to do the implementation.

I strongly encourage learning and using a functional language such as Racket, Scala, OCaml/SML, Clojure or Haskell.

Anything else is a knife in a gunfight.

  1. Lexical analyzer
  2. Parser
  3. High-level translator
  4. Normalizer & CPS converter
  5. Low-level translator
  6. Register allocator & Assembly code emitter

Bonus opportunities

Valid until the week of finals.

Project 1: A lexer for Python

Due: Sunday, Feb 6th, 11:59 PM Utah time

Your task is to construct a lexical analyzer for the Python programming language.

Python is a relatively simple language to tokenize, with the notable exception of its significant whitespace.

For reference, the lexical specification for Python is available as part of the official documentation.

Guidelines

  1. You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X.
  2. If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements

  1. Include a README.txt file that contains:
    1. your name;
    2. a list of software dependencies necessary to run your assignment;
    3. a description of what you got working, what's partially working and what's completely broken; and
    4. a manifest briefly describing each file you've turned in.
  2. Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.
  3. The program should use a Makefile:
    1. make should compile the program (if compilation is necessary).
    2. make run should accept input on STDIN and send output to STDOUT.
    3. make test should run any test cases you used.
  4. The program should output one s-expression per line for each token in the input.
  5. There are eight kinds of s-expressions you can emit:
    1. (NEWLINE) -- for a logical newline.
    2. (INDENT) -- for a logical increase in indentation.
    3. (DEDENT) -- for a logical decrease in indentation.
    4. (ID name) -- for an identifier, where name is a Racket-compatible literal string.
    5. (LIT value) -- for a literal value, where value is a Racket-compatible literal value. (If it's a string, it must be properly escaped.)
    6. (KEYWORD symbol) -- for a keyword, where symbol is a Racket-compatible literal symbol.
    7. (PUNCT text) -- for operators and delimiters, where text is a Racket string containing the text of the operator or delimiter.
    8. (ENDMARKER) -- for the end of the input.
  6. If you encounter a lexical error, print (ERROR "explanation") and quit.

Simplifications

  1. You do not have to handle multiple character-set encodings. Assume the input is ASCII. (This nullifies the need to handle regular strings and byte strings differently.) Assume an identifier matches the regular expression [A-Za-z_][A-Za-z_0-9]*
  2. You should not join adjacent strings together. That's easier to do with a later pass.
  3. If a tab appears as an indentation character, it is an error. (You do not have to bump it to the nearest multiple of 8.)

Hints and tips

  1. You can use a lexer-generator like flex for this project. (You'll find flex preinstalled on most Unix systems.) If you go this route, be sure to exploit the ^ regular expression pattern for matching the start of a line.
  2. It's not unreasonable to write a lexer by hand for a language like Python. If you use a language that supports regular expressions, you can get the longest matching prefix of a string for the regular expression re with /^(re)/. The special pattern ^ matches the start of the input (or in some cases, the start of a line).
  3. You may find it easier to write a two-pass lexical analyzer; a sketch of the indentation pass appears after these hints.
  4. For this project, using a functional language holds advantages over languages like C and Java, but it does not hold an advantage over a lexer generator like flex.
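
For the tricky part -- the significant whitespace -- here is a minimal sketch (not the reference lexer) of an indentation pass in Racket, assuming an earlier pass has already joined continuation lines and produced one string per logical line. The names indentation and emit-indents are invented for the sketch, and the tokenization of the rest of each line is elided.

     #lang racket

     ; indentation : string -> exact-nonnegative-integer
     ; Counts leading spaces; a tab in the indentation is an error,
     ; per the simplifications above.
     (define (indentation line)
       (let loop ([i 0])
         (cond
           [(>= i (string-length line))          i]
           [(char=? (string-ref line i) #\space) (loop (+ i 1))]
           [(char=? (string-ref line i) #\tab)   (error "tab in indentation")]
           [else                                 i])))

     ; emit-indents : (listof string) -> void
     ; Maintains a stack of open indentation levels and prints the
     ; (INDENT)/(DEDENT)/(NEWLINE)/(ENDMARKER) markers around each line.
     (define (emit-indents lines)
       (define stack (list 0))
       (for ([line lines]
             #:unless (string=? (string-trim line) ""))   ; blank lines: no tokens
         (define n (indentation line))
         (let pop ()                                      ; close deeper levels
           (when (> (car stack) n)
             (displayln "(DEDENT)")
             (set! stack (cdr stack))
             (pop)))
         (when (> n (car stack))                          ; open a new level
           (displayln "(INDENT)")
           (set! stack (cons n stack)))
         ; ... tokenize the rest of the line here ...
         (displayln "(NEWLINE)"))
       (for ([level (cdr stack)])                         ; close what's left
         (displayln "(DEDENT)"))
       (displayln "(ENDMARKER)"))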

Example

The input:
# Comment 1
 # Comment 2

# Factorial:

def fact(  x\
):

  if x == -1:
    return 1.j

  elif x ==0:

    return 1
  else:

        return x* fact(x

- 1)

s = "foo\
\\ \n\'\""


fact(20)
should produce:
(KEYWORD def)
(ID "fact")
(PUNCT "(")
(ID "x")
(PUNCT ")")
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD if)
(ID "x")
(PUNCT "==")
(PUNCT "-")
(LIT 1)
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD return)
(LIT +1.i)
(NEWLINE)
(DEDENT)
(KEYWORD elif)
(ID "x")
(PUNCT "==")
(LIT 0)
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD return)
(LIT 1)
(NEWLINE)
(DEDENT)
(KEYWORD else)
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD return)
(ID "x")
(PUNCT "*")
(ID "fact")
(PUNCT "(")
(ID "x")
(PUNCT "-")
(LIT 1)
(PUNCT ")")
(NEWLINE)
(DEDENT)
(DEDENT)
(ID "s")
(PUNCT "=")
(LIT "foo\\ \n\'\"")
(NEWLINE)
(ID "fact")
(PUNCT "(")
(LIT 20)
(PUNCT ")")
(NEWLINE)
(ENDMARKER)
    

Code

You may find the sample lexers written in flex to be useful.

Submission instructions

Upload here.

Project 2: A parser for Python

Due: March 14th, 11:59 PM Utah time

Your task is to construct a parser for the Python programming language.

Unusually among programming languages, Python is straightforward to parse. In fact, it is one of the few popular programming languages with an LL(1) grammar, making it amenable to a hand-written recursive descent parser.

Of course, it is also possible to use a parsing tool.
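
To make the recursive-descent option concrete, here is a minimal Racket sketch of the peek/eat style over the Project 1 token stream. The grammar slice (an argument list) and the output node names are made up for illustration; the real parser has to follow the simplified grammar.

     #lang racket

     ; tokens: the remaining token list, as read from pylex's output.
     (define tokens '())

     (define (peek)  (if (null? tokens) '(ENDMARKER) (car tokens)))
     (define (next!) (begin0 (peek)
                             (unless (null? tokens) (set! tokens (cdr tokens)))))

     ; eat! : consume the expected token or signal a parse error.
     (define (eat! expected)
       (unless (equal? (peek) expected)
         (error 'parse "expected ~a, got ~a" expected (peek)))
       (next!))

     ; arglist ::= '(' [expr (',' expr)*] ')'
     (define (parse-arglist)
       (eat! '(PUNCT "("))
       (define args
         (if (equal? (peek) '(PUNCT ")"))
             '()
             (let loop ([acc (list (parse-expr))])
               (if (equal? (peek) '(PUNCT ","))
                   (begin (next!) (loop (cons (parse-expr) acc)))
                   (reverse acc)))))
       (eat! '(PUNCT ")"))
       `(arglist ,@args))

     ; expr ::= ID | LIT      (a stand-in for the real expression grammar)
     (define (parse-expr)
       (match (next!)
         [`(ID ,name)   `(name ,name)]
         [`(LIT ,value) `(lit ,value)]
         [tok           (error 'parse "unexpected token ~a" tok)]))

     ; Example:
     ;   (set! tokens '((PUNCT "(") (ID "x") (PUNCT ",") (LIT 1) (PUNCT ")")))
     ;   (parse-arglist)   ; => '(arglist (name "x") (lit 1))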

Guidelines

  1. You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X.
  2. If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements

  1. You must parse the simplified Python grammar.
  2. You should assume that Python source is coming in -- not pre-lexed input.
  3. If you need to use a working lexer, you can assume that the program pylex is in the current PATH, and that it matches the behavior of the web app by consuming a file on STDIN and printing its tokens as S-Expressions line-by-line on STDOUT.
  4. The output S-Expression should fit within the grammar for the concrete AST.
  5. Include a README.txt file that contains:
    1. your name;
    2. a list of software dependencies necessary to run your assignment;
    3. a description of what you got working, what's partially working and what's completely broken; and
    4. a manifest briefly describing each file you've turned in.
  6. Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.
  7. The program should use a Makefile:
    1. make should compile the program (if compilation is necessary).
    2. make run should accept input on STDIN and send output to STDOUT.
    3. make lex should run the lexer on STDIN.
    4. make parse should run the parser on STDIN.
    5. make test should run any test cases you used.
  8. The program should output one s-expression containing the concrete parse tree on success or #f on failure.

Simplifications

  1. Use the simplified grammar provided.

Hints and tips

  1. Do not write the parser by hand, unless you have a lot of time.
  2. I recommend that you use yacc, Racket's parser generator, or a derivative-based parsing tool.
  3. If you want to write your lexer and parser in a different language, you can tie them together with a Unix pipe, like so:
         $ pylex | pyparse
        

Code

A reference web app is available.

You may find the sample parsers useful.

If you take the Racket route, stub code is available for parsing with derivatives.

Submission instructions

Upload here.

Project 3: A high-level translator for Python

Due: March 28th, 11:59 PM Utah time (+ 1 week)

Your task is to construct a translator from the subset of Python from Project 2 into a high-level intermediate representation (HIR).

The high-level IR shares many features with Python, including dynamic typing and higher-order procedures.

To focus on key concepts, some additional parts of Python have been removed for this project.

Guidelines

  1. You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X. (Non-functional languages are ill-advised unless you have copious free time.)
  2. If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements

  1. You must translate the same simplified subset of Python from the previous project.
  2. You should assume that Python source is coming in -- not pre-lexed input or pre-parsed input.
  3. If you need to use a working lexer, you can assume that the program pylex is in the current PATH, and that it matches the behavior of the web app by consuming a file on STDIN and printing its tokens as S-Expressions line-by-line on STDOUT.
  4. If you need to use a working parser, you can assume that the program pyparse is in the current PATH, and that it matches the behavior of the web app by consuming a tokenized input on STDIN and printing the AST on stdout.
  5. The output should fit within the grammar for HIR.
  6. Include a README.txt file that contains:
    1. your name;
    2. a list of software dependencies necessary to run your assignment;
    3. a description of what you got working, what's partially working and what's completely broken; and
    4. a manifest briefly describing each file you've turned in.
  7. Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.
  8. The program should use a Makefile:
    1. make should compile the program (if compilation is necessary).
    2. make trans should accept input on STDIN and send output to STDOUT.
    3. make test should run any test cases you used.
  9. The program should output one S-expression containing the HIR code on success or print an error message containing the word error on failure.
  10. Your output need not match the reference implementation exactly. It should be semantically equivalent (unless the reference implementation has a bug).
  11. You can also test your output against a Python 3.x interpreter, but note that these may differ in how they display objects using print.

Simplifications

  1. Since project 2 snipped out the class system, you can't create objects, but you should still assume that you can access and set their fields with .name. (Note: The code to interpret HIR fails immediately with an error message if these constructs are invoked.)
  2. For numbers, assume only integers are used in the source and at run-time. (No floating point, no imaginary, no complex.)
  3. All try blocks have exactly one except clause, and no else or finally clauses, i.e.:
               try:
                 main code
               except:
                 failure code
             
  4. A raise can raise any value (not just exceptions).
  5. The single except clause for a try block does not specify which exception it catches, nor does it name it.
  6. You cannot delete a variable from the local scope. You can only delete fields and indices. That is, you can write:
              del a[i] 
    or
              del a.f 
    but not:
              del x 
    Of course, you could also write:
              del f(3)["foo"][i] 
  7. You can assume all global variables are defined at the top level before their first reference. If you want to handle global variables first defined locally using the global construct, talk to Matt. That is, you have to handle:
           y = 10
           def f(x):
              global y
              y = x
           
    but not:
           def f(x):
              global y
              y = x
           

Hints and tips

  1. In this project, the expressive advantage of using a functional language like Racket or Haskell over a language like C or Java starts to become substantial. Those wishing to complete this project in a non-functional language need to start immediately and ask for help early!
  2. A prominent feature in functional languages is the ability to pattern-match on data structures. This ability is the killer advantage in this project. (A small sketch of the style appears after these hints.)
  3. If you want to write your lexer, parser and translator in different languages, you can tie them together with a Unix pipe, like so:
         $ pylex | pyparse | ./pytrans
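
To illustrate the pattern-matching point in hint 2, here is a small Racket sketch of a match-based tree transform. The node shapes below are invented for illustration only; the real translator maps the Project 2 concrete AST onto the HIR grammar.

     #lang racket

     ; translate : illustrative-AST -> illustrative-HIR
     (define (translate e)
       (match e
         ; augmented assignment desugars into an ordinary assignment
         [`(augassign ,x "+=" ,rhs)
          `(set! ,x (call (global +) ,x ,(translate rhs)))]

         ; `and` becomes a conditional that evaluates its left side only once
         [`(and ,a ,b)
          (let ([t (gensym 't)])
            `(let ([,t ,(translate a)])
               (if ,t ,(translate b) ,t)))]

         ; literals and variable references pass through unchanged
         [(? number? n) n]
         [(? symbol? x) x]

         ; anything else: recur structurally into the subtrees
         [`(,op ,args ...)
          `(,op ,@(map translate args))]))

     ; Example:
     ;   (translate '(augassign x "+=" 1))
     ;   ; => '(set! x (call (global +) x 1))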
        

Code

This code is provided without warranty or guarantee of correctness. If you find a discrepancy between this code and the project description, notify Matt.

To execute HIR code, you can prepend hir-header.rkt to the result of your translation, and then execute it in racket. (hir-header.rkt provides macros and functions to define all of the elements of HIR in terms of Racket.)

You can grab a reference implementation from pytrans_rkt.zo. It's compiled Racket bytecode for Racket 5.1. To run, use racket pytrans_rkt.zo and send input on STDIN.

Submission instructions

Upload here.

Project 4: A CPS translator for Python

Due: April 11th, 11:59 PM Utah time (+1 week)

Your task is to construct a translator from the subset of Python in Project 3 into continuation-passing style (CPS).

Of course, the expected way of accomplishing this is to translate HIR into CPS.

It is tedious to translate HIR directly to CPS. It is strongly recommended that you first push HIR through a desugaring phase, dropping it down to LIR.
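
To show the shape of the conversion, here is a minimal Racket sketch of naive CPS conversion for a toy lambda-calculus core (variables, one-argument lambdas, one-argument calls) -- not the actual LIR grammar. Note that the generated continuation argument comes last, matching the requirement below.

     #lang racket

     ; cps : expr (atom -> expr) -> expr
     ; k is a "meta-continuation": a Racket function that, given an atom
     ; standing for the value of e, builds the rest of the CPS program.
     (define (cps e k)
       (match e
         ; a variable is already atomic
         [(? symbol? x) (k x)]

         ; (lambda (x) body)  ==>  (lambda (x $k) body-in-CPS)
         [`(lambda (,x) ,body)
          (define $k (gensym '$k))
          (k `(lambda (,x ,$k)
                ,(cps body (lambda (v) `(,$k ,v)))))]

         ; (f a)  ==>  evaluate f, then a, then call with the continuation last
         [`(,f ,a)
          (cps f (lambda ($f)
            (cps a (lambda ($a)
              (define $rv (gensym '$rv))
              `(,$f ,$a (lambda (,$rv) ,(k $rv)))))))]))

     ; Example:
     ;   (cps '(f x) (lambda (v) `(halt ,v)))
     ;   ; => '(f x (lambda ($rv123) (halt $rv123)))   [the gensym'd name varies]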

Guidelines

  1. You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X. Once again, non-functional languages are at a steep expressiveness disadvantage, and will require substantially more time and code invested. Please plan and request help accordingly.
  2. If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements

  1. You must translate the same simplified subset of Python from Project 2, less the additional restrictions from Project 3.
  2. You should assume that Python source is coming in, not pre-lexed, pre-parsed input or pre-translated input.
  3. If you need to use a working lexer, you can assume that the program pylex is in the current PATH, and that it matches the behavior of the web app by consuming a file on STDIN and printing its tokens as S-Expressions line-by-line on STDOUT.
  4. If you need to use a working parser, you can assume that the program pyparse is in the current PATH, and that it matches the behavior of the web app by consuming a tokenized input on STDIN and printing the AST on stdout.
  5. If you need to use a working HIR translator, you can assume that the program pytrans is in the current PATH, and that it matches the behavior of the bytecode app by consuming a Python AST on STDIN and printing the HIR AST on STDOUT.
  6. The output should fit within the grammar for CPS.
  7. Include a README.txt file that contains:
    1. your name;
    2. a list of software dependencies necessary to run your assignment;
    3. a description of what you got working, what's partially working and what's completely broken; and
    4. a manifest briefly (a few words) describing each file you've turned in.
  8. Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.
  9. The program should use a Makefile:
    1. make should compile the program (if compilation is necessary).
    2. make run should accept Python code on STDIN and send CPS code to STDOUT.
    3. make test should run any test cases you used.
  10. The program should output one S-expression containing the CPS code. The output for invalid input is undefined, but it is recommended that you print an error message containing the word error on failure.
  11. Your output need not match the reference implementation exactly. It should be semantically equivalent (unless the reference implementation has a bug).
  12. When CPS-converting, the continuation argument always comes last, even for primitive operations.
  13. You can also test your output against a Python 3.x interpreter.

Simplifications

  1. All object- and field-related operations have been removed from the language.

Hints and tips

  1. This project is optimally split into two phases: desugaring and CPS conversion.
  2. It is strongly recommended that you first translate HIR into LIR, and then translate LIR to CPS.
  3. Even if you don't understand the define-syntax form in detail, you may learn a great deal about how to desugar from hir-header.rkt. This is particularly true for the desugaring of try.
  4. A substantial harness is provided in the stub code. Even if you don't plan to use Racket, it is recommended that you look at the harness. In particular, try:
         $ make test-cps
         $ make test-lir
          
  5. The stub code contains a desugarer with all of the cases stubbed out. It also contains the "top level" for the reference CPS converter.
  6. Pattern-matching tree transforms once again dominate this project.

Reading material

Of the reading material for the course, the following are most helpful:

Code

This code is provided without warranty or guarantee of correctness. If you find a discrepancy between this code and the project description, notify Matt.

To execute LIR code, you can prepend lir-header.rkt to the result of your translation, and then execute it in racket.

To execute CPS code, you can prepend cps-header.rkt to the result of your translation, and then execute it in racket; for example, you could do:

     $ make run < myfile.py > myfile.cps
     $ cat cps-header.rkt myfile.cps > myfile.rkt
     $ racket myfile.rkt
    

You can grab a reference desugarer from pydesugar_rkt.zo.

You can grab a reference CPS converter from pycps_rkt.zo.

These are compiled bytecode for Racket 5.1.

Submission instructions

Upload here.

Project 5: A Python-to-C compiler

Due: April 25th, 11:59 PM Utah time

Your task is to construct a translator from the subset of Python in Project 4 into C.

Of course, the expected way of accomplishing this is to translate from CPS.

To make this transformation as easy as possible, it is recommended that you perform mutable variable elimination, flat closure conversion and lambda lifting. If you do, then the result may be more easily transliterated into C.

In terms of restructuring the language for these transforms, it is recommended that you follow this spec.
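
To make the first of those transforms concrete, here is a small before/after illustration of mutable variable elimination in ordinary Racket (not in the compiler's intermediate representation): every variable that is the target of a set! becomes an immutable binding to a box, so that closures only ever capture immutable variables.

     #lang racket

     ; Before: the closure captures the mutable variable `counter`.
     (define (make-counter-before)
       (define counter 0)
       (lambda ()
         (set! counter (+ counter 1))
         counter))

     ; After mutable variable elimination: `counter` is an immutable binding
     ; to a box; every read becomes unbox and every write becomes set-box!.
     (define (make-counter-after)
       (define counter (box 0))
       (lambda ()
         (set-box! counter (+ (unbox counter) 1))
         (unbox counter)))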

Guidelines

  1. You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X. Once again, non-functional languages are at a steep expressiveness disadvantage, and will require substantially more time and code invested. Please plan and request help accordingly.
  2. If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements

  1. You must translate the same simplified subset of Python from Project 2, less the additional restrictions from Projects 3 and 4.
  2. You should assume that Python source is coming in, not pre-lexed, pre-parsed input or pre-translated input.
  3. If you need to use a working lexer, you can assume that the program pylex is in the current PATH, and that it matches the behavior of the web app by consuming a file on STDIN and printing its tokens as S-Expressions line-by-line on STDOUT.
  4. If you need to use a working parser, you can assume that the program pyparse is in the current PATH, and that it matches the behavior of the web app by consuming a tokenized input on STDIN and printing the AST on stdout.
  5. If you need to use a working HIR translator, you can assume that the program pytrans is in the current PATH, and that it matches the behavior of the bytecode app by consuming a Python AST on STDIN and printing the HIR AST on STDOUT.
  6. If you need to use a working LIR translator, you can assume that the program pydesugar is in the current PATH, and that it matches the behavior of the bytecode app by consuming a Python AST on STDIN and printing the LIR AST on STDOUT.
  7. If you need to use a working CPS converter, you can assume that the program pycps is in the current PATH, and that it matches the behavior of the bytecode app by consuming a Python AST on STDIN and printing the CPS code on STDOUT.
  8. The output should be accepted by gcc.
  9. Include a README.txt file that contains:
    1. your name;
    2. a list of software dependencies necessary to run your assignment;
    3. a description of what you got working, what's partially working and what's completely broken; and
    4. a manifest briefly (a few words) describing each file you've turned in.
  10. Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.
  11. The program should use a Makefile:
    1. make should compile the program (if compilation is necessary).
    2. make run should accept Python code on STDIN and send C code to STDOUT.
    3. make test should run any test cases you used.
  12. The output for invalid input is undefined, but it is recommended that you print an error message containing the word error on failure.
  13. Your output need not match the reference implementation exactly. It should be semantically equivalent (unless the reference implementation has a bug).
  14. You can also test your output against a Python 3.x interpreter.

Hints and tips

  1. This project is optimally split into four small transforms: mutable variable elimination, closure conversion, lambda-lifting and C code emission. (A closure-conversion sketch appears after these hints.)
  2. Tree transforms once again dominate this project.
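
To make the closure-conversion step concrete, here is a small before/after illustration in ordinary Racket (not in the compiler's intermediate representation): each lambda becomes a flat closure record pairing a code pointer with the values of its free variables, and every call site passes the closure to its own code as an extra argument.

     #lang racket

     ; Before: the lambda has the free variable `n`.
     (define (make-adder-before n)
       (lambda (x) (+ x n)))

     ; After flat closure conversion: the code is closed (its environment
     ; arrives as an explicit argument) and the free variable travels in a
     ; vector-based closure record: (vector code-pointer free-var ...).
     (define (adder-code clo x)
       (+ x (vector-ref clo 1)))          ; the free variable n lives in slot 1

     (define (make-adder-after n)
       (vector adder-code n))

     ; Calling a converted closure: fetch the code pointer out of slot 0 and
     ; pass the closure itself along as the first argument.
     (define (apply-closure clo . args)
       (apply (vector-ref clo 0) clo args))

     ; Example:  (apply-closure (make-adder-after 10) 32)   ; => 42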

Code

This code is provided without warranty or guarantee of correctness. If you find a discrepancy between this code and the project description, notify Matt.

To execute MVE'd, closure-converted or lambda-lifted code, you can prepend c-header.rkt to the result of your translation, and then execute it in racket.

You may find pytoc-common.rkt useful for writing these transforms.

You can grab a reference MVE from pymve_rkt.zo.

You can grab a reference closure-converter from pycc_rkt.zo.

You can grab a reference lambda-lifter from pyll_rkt.zo.

You can grab a mostly-completed C translator from pytoc.rkt.

These are compiled bytecode for Racket 5.1.

You can find test cases here.

Submission instructions

Upload here.