Compiling to Java
In my advanced compilers class, we cover a range of intermediate and target languages. C is a popular target language for performance reasons, but Java has often-overlooked advantages. Some of Java's less-utilized features conspire to make it an easy target for high-level language constructs, e.g., lexically scoped closures become anonymous classes. It takes roughly a third to half the time (and code) to write a compiler that targets Java instead of C. In fact, we can compile the essential core of Scheme to Java with purely local transforms. (Scheme's macro system can easily desugar full Scheme into core Scheme.) As a result, the reference compiler at the bottom of this page is barely over 400 lines of highly-commented code. (The companion compiler into C is just over 1000 lines of code.)
Helpful resources:
- The official Java Language Specification covers the nooks and crannies of the language which, while rarely or never used in ordinary programming, become extremely convenient during compilation.
Pros and cons
On top of these implementation benefits, targeting Java gives the language access to the Java library system, the portability of the JVM, just-in-time compilation/optimization and garbage collection. The lack of tail-call optimization in Java is a downside for the direct compilation strategy, but a less direct compilation into Java can recover proper tail calls through trampolining, as sketched below.
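To make the trampolining idea concrete, here is a minimal, self-contained sketch; it is not how the reference compiler below works, and all of the names in it are hypothetical. The idea: instead of making a tail call directly, a compiled procedure returns a thunk describing that call, and a driver loop runs thunks until a final value appears, so call depth no longer consumes Java stack.

import java.util.function.Supplier;

public class Trampoline {
    // A computation step: either a finished value or a pending tail call.
    static final class Bounce {
        final Object value;          // the result, when finished
        final Supplier<Bounce> next; // the pending tail call, if any
        private Bounce(Object value, Supplier<Bounce> next) {
            this.value = value;
            this.next = next;
        }
        static Bounce done(Object v)           { return new Bounce(v, null); }
        static Bounce call(Supplier<Bounce> t) { return new Bounce(null, t); }
    }

    // The driver loop: keeps "bouncing" until a value appears.
    static Object run(Bounce b) {
        while (b.next != null) b = b.next.get();
        return b.value;
    }

    // A tail-recursive countdown that would overflow the stack if it
    // were compiled into direct recursive Java calls.
    static Bounce countdown(long n) {
        return (n == 0) ? Bounce.done("done") : Bounce.call(() -> countdown(n - 1));
    }

    public static void main(String[] args) {
        System.out.println(run(countdown(10_000_000L))); // prints: done
    }
}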
More broadly, in my class, I teach that language implementation is about trade-offs between implementation complexity and performance:
1. I argue that first, you should try an interpreter for your language.
2. If that's not fast enough, try an SICP-style optimizing interpreter.
3. If that's not good enough, try compiling to Java.
4. If that's still not fast enough, try compiling to C.
5. If that's still too slow, try continuation-passing-style compilation to assembly.
6. If you need more speed, start doing basic compiler optimizations.
7. If you're still out of luck, start doing static-analysis-driven compiler optimizations.
Each time you move from step n to step n + 1 on this ladder, performance will go up, but implementation code size and complexity will go up by about a factor of two. It turns out that Java occupies a sweet spot: relatively low implementation complexity, but the biggest percentage-wise gain in performance.
Core Scheme with Sugar
The compiler I created is for a core Scheme, but it would not be hard to apply these same techniques to compiling a language like Python or Ruby:
<exp> ::= <const>
       |  <prim>
       |  <var>
       |  (lambda (<var> ...) <exp>)
       |  (if <exp> <exp> <exp>)
       |  (set! <var> <exp>)
       |  (let ((<var> <exp>) ...) <exp>)
       |  (letrec ((<var> (lambda (<var> ...) <exp>))) <exp>)
       |  (begin <exp> ...)
       |  (<exp> <exp> ...)

<const> ::= <int>
A Scheme compiler could easily use macros to desugar full Scheme into this language, or in fact, an even simpler one. Many real Scheme compilers do exactly that.
Encoding Scheme values in Java
The first task in compiling to Java is to come up with a Java-level encoding of the corresponding Scheme values. To do that, I created the Java interface Value and had all Scheme values implement it. The sub-classes and sub-interfaces of Value include VoidValue, BooleanValue, IntValue, ProcValue and Primitive.
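To make the encoding concrete, here is a condensed sketch of one way that hierarchy can look. This is illustrative, not the actual stub code: the real classes carry more methods, and the TRUE/FALSE constants are an assumption reused in the examples below.

interface Value { }

// Scheme's unspecified "void" value.
class VoidValue implements Value {
    public static final VoidValue VOID = new VoidValue();
    // Evaluates its argument for effect and yields void; this is what
    // lets (set! ...) compile into a Java expression.
    public static Value Void(Value ignored) { return VOID; }
    public String toString() { return "#<void>"; }
}

class BooleanValue implements Value {
    public static final BooleanValue TRUE  = new BooleanValue(true);
    public static final BooleanValue FALSE = new BooleanValue(false);
    public final boolean b;
    public BooleanValue(boolean b) { this.b = b; }
    public String toString() { return b ? "#t" : "#f"; }
}

class IntValue implements Value {
    public final int n;
    public IntValue(int n) { this.n = n; }
    public String toString() { return Integer.toString(n); }
}

// Marker for built-in procedures.
interface Primitive extends Value { }

// One procedure interface per arity; compiled lambdas implement these.
interface ProcValue extends Value { }
interface ProcValue1 extends ProcValue { Value apply(Value a1); }
interface ProcValue2 extends ProcValue { Value apply(Value a1, Value a2); }

// In the real stub, NullProcValueN provides default definitions for the
// methods in Value; in this sketch it is just a convenient base class.
abstract class NullProcValue1 implements ProcValue1 { }
abstract class NullProcValue2 implements ProcValue2 { }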
The Java stub code also provides a RuntimeEnvironment class, which binds all of the top-level primitives like addition, subtraction and I/O to Java names. Compiled programs are supposed to inherit from RuntimeEnvironment.
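Here is a sketch of the shape of RuntimeEnvironment, in terms of the value types sketched above. The real stub binds many more primitives, and its exact signatures may differ.

class RuntimeEnvironment {
    // (+ a b), restricted to exactly two integer arguments for brevity.
    public static final ProcValue2 sum = new NullProcValue2() {
        public Value apply(final Value a, final Value b) {
            return new IntValue(((IntValue) a).n + ((IntValue) b).n);
        }
    };

    // (display v)
    public static final ProcValue1 display = new NullProcValue1() {
        public Value apply(final Value v) {
            System.out.print(v);
            return VoidValue.VOID;
        }
    };
}

Because compiled programs inherit from this class, generated code can refer to sum or display as bare names.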
The top-level compile function has a typically Schemish dispatching feel:
; java-compile-exp : exp -> string
(define (java-compile-exp exp)
  (cond
    ; core forms:
    ((const? exp)   (java-compile-const exp))
    ((prim? exp)    (java-compile-prim exp))
    ((ref? exp)     (java-compile-ref exp))
    ((lambda? exp)  (java-compile-lambda exp))
    ((if? exp)      (java-compile-if exp))
    ((set!? exp)    (java-compile-set! exp))

    ; syntactic sugar:
    ((let? exp)     (java-compile-exp (let=>lambda exp)))
    ((letrec1? exp) (java-compile-exp (letrec1=>Y exp)))
    ((begin? exp)   (java-compile-exp (begin=>let exp)))

    ; applications:
    ((app? exp)     (java-compile-app exp))))

So, compilation breaks down into individual constructs.
Integers
Integers compile into IntValue objects, rather than to themselves. For example, the Scheme-level 3 compiles to new IntValue(3) in Java.
Primitives
Primitives are looked up in a table and translated into their RuntimeEnvironment name:
(define (java-compile-prim p)
  (cond
    ((eq? '+ p)       "sum")
    ((eq? '- p)       "difference")
    ((eq? '* p)       "product")
    ((eq? '= p)       "numEqual")
    ((eq? 'display p) "display")
    (else (error "unhandled primitive " p))))
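So, assuming sum is bound to a two-argument procedure as in the sketch above, a call like (+ 1 2) would come out roughly as:

((ProcValue2)(sum)).apply(new IntValue(1), new IntValue(2))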
Variable references
Variable references have to be name-mangled, since Scheme identifiers are richer than Java identifiers. Mutable variables (those which are set!'d) are also handled differently: they are wrapped in ValueCell objects and prefixed with m_, since variables captured by anonymous functions in Java have to be marked final. A pre-compilation code walk finds all of the mutable variables.
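A ValueCell is nothing more than a one-slot mutable box; here is a minimal version consistent with how the compiled code below uses it:

// The reference to a cell can be marked final, as Java requires of
// variables captured by anonymous classes, while the cell's contents
// remain assignable.
class ValueCell implements Value {
    public Value value;
    public ValueCell(Value initial) { this.value = initial; }
}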
Lambda terms
Lambda terms are compiled into anonymous classes.
For example, (lambda (v1 ... vN) exp) becomes:

new NullProcValueN () {
  public Value apply (final Value [mangle v1], ..., final Value [mangle vN]) {
    // for each mutable formal vi:
    final ValueCell m_[mangle vi] = new ValueCell([mangle vi]) ;

    return [compile exp] ;
  }}

There is a NullProcValueN class for each procedure arity N. The NullProcValue classes provide default definitions for some of the methods defined in Value.
Clearly, [mangle v] stands for the mangled name of the variable v, and [compile exp] stands for the compiled text of exp.
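For example, the identity function (lambda (x) x), which has no mutable formals (and whose name mangles to itself), becomes:

new NullProcValue1 () {
  public Value apply (final Value x) {
    return x ;
  }}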
Conditionals
The form (if exp1 exp2 exp3) actually compiles into the ternary operator ?: instead of if () {} else {}:
([compile exp1]) ? ([compile exp2]) : ([compile exp3])
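One wrinkle the template glosses over: the compiled condition is a Value, not a Java boolean, so the generated code needs a truth test in between. A plausible shape, using the Scheme rule that every value except #f is true and the BooleanValue.FALSE constant assumed in the sketch above (the actual runtime may do this differently):

(([compile exp1]) != BooleanValue.FALSE)
  ? ([compile exp2])
  : ([compile exp3])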
Mutable variables
The construct (set! var exp) relies on the compilation of lambda terms and variable references to wrap var in a ValueCell, so that it compiles to:
VoidValue.Void(m_[mangle var].value = [compile exp])
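For instance, (set! x 42), where x mangles to itself, emits:

VoidValue.Void(m_x.value = new IntValue(42))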
Variable binding
The let construct desugars into the application of a lambda term.
That is, (let ((v e) ...) body) becomes:
((lambda (v ...) body) e ...)
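Combining this with the compilation of lambda terms and applications, (let ((x 3)) x) ends up as Java along these lines:

((ProcValue1)(new NullProcValue1 () {
  public Value apply (final Value x) {
    return x ;
  }})).apply(new IntValue(3))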
Recursion
Letrec can be desugared into "lets and sets" or the Y combinator. I opted for the Y combinator just to show that it can be done without using side effects. Actually, the compiler generates a new Y combinator on the fly so that it matches the arity of the recursive procedure:
; xargs : nat -> list[symbol]
(define (xargs n)
  (if (<= n 0)
      '()
      (cons (string->symbol (string-append "x" (number->string n)))
            (xargs (- n 1)))))

; Yn generates the Y combinator for n-arity procedures.
(define (Yn n)
  `((lambda (h)
      (lambda (F)
        (F (lambda (,@(xargs n))
             (((h h) F) ,@(xargs n))))))
    (lambda (h)
      (lambda (F)
        (F (lambda (,@(xargs n))
             (((h h) F) ,@(xargs n))))))))
In Java, the Y combinator for one-argument procedures ends up as:
((ProcValue1)(new NullProcValue1 () {
  public Value apply(final Value h) {
    return new NullProcValue1 () {
      public Value apply(final Value F) {
        return ((ProcValue1)(F)).apply(new NullProcValue1 () {
          public Value apply(final Value x) {
            return ((ProcValue1)(((ProcValue1)(((ProcValue1)(h)).apply(h))).apply(F))).apply(x) ;
          }}) ;
      }} ;
  }})).apply(new NullProcValue1 () {
  public Value apply(final Value h) {
    return new NullProcValue1 () {
      public Value apply(final Value F) {
        return ((ProcValue1)(F)).apply(new NullProcValue1 () {
          public Value apply(final Value x) {
            return ((ProcValue1)(((ProcValue1)(((ProcValue1)(h)).apply(h))).apply(F))).apply(x) ;
          }}) ;
      }} ;
  }})
Sequencing
Sequencing statements are desugared into let-bindings of unused variables. That is, (begin e1 ... eN) becomes:
(let ((_ e1)) (begin e2 ... eN))
External resources
- As mentioned, if you want to write a compiler into Java, it helps to know all of Java's features. For that, the official language specification is indispensable: I'd been programming Java for years when I read this book for my Ph.D. qualifier, and I remember being stunned by all the stuff I never knew was in Java.