Yacc is dead: An update

The draft of "Yacc is Dead" that David Darais and I posted on arXiv received some attention, so we'd like to share updates and feedback.

(You don't have to read the draft to read this post; it's self-contained.)


The draft claims that Brzozowski's derivative eases parser implementation; that it handles all CFGs (yes, including left-recursive); that it seems to be efficient in practice; and that it should eventually be efficient in theory too.

We reply to Russ Cox's "Yacc is Not Dead" by running the example that he claims has best-case complexity of O(3ⁿ). It terminates instantly for n > 100.

This post also describes compaction (not in the draft), and makes a formal argument that the cost of parsing with derivatives is O(n|G|) on average.

We also explain the derivative again (with less academic prose), and release our latest implementations in Racket, Scala and Haskell.

[The Racket implementation minimizes the use of laziness, and it abstracts away the impure parts (caching and computing fixed points) behind a purely functional interface: define/memoize and define/fix.]

Read on for more, and stay tuned for updates.


Meta-update: Thanks to the discussion and feedback generated by the arXiv posting, an improved version of parsing with derivatives was published at ICFP! Many thanks to all who contributed!

Quick summary: Brzozowski's derivative

Brzozowski defined the derivative of a regular language in 1964.

The derivative of a language L with respect to a character c, written Dc(L), works in two phases--filter and chop:

  1. Find all of the strings in L that start with c.
  2. Chop the c off all those strings.

Formally, Dc(L) = { w | cw ∈ L }.

For example, Df(foo|frak|bar)* = (oo|rak)(foo|frak|bar)*

Matching with derivatives

It's easy to use the derivative to recognize whether a string is in a language.

Just compute the derivative with respect to the first character, then the second, then the third and so on.

If the final language contains the empty string, then the original string was in the original language. That's all there is to it.

For example, DoDoDf(foo|frak|bar)* = (foo|frak|bar)*, which clearly matches the empty string.
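
In code form, the matching loop is tiny. Here's a minimal Racket sketch (hypothetical helper names, not the paper's code), assuming a derivative function D and a nullability test re-nullable?, both sketched after the recursive definition below:

(define (matches? L chars)
  (if (null? chars)
      (re-nullable? L)                             ; final language contains ""?
      (matches? (D (car chars) L) (cdr chars))))   ; chop one character off the front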

Mathematical notation for regular expressions

We'll use mathematical notation for regular expressions.

If you're used to Unix-style regular expressions, here's the translation guide:

  Math    Unix            Meaning
  ∅       no equivalent   empty set; matches no strings
  ε       no equivalent   empty/null string; matches ""
  c       c               matches the character c
  A ○ B   AB              concatenation/sequence
  A ∪ B   A|B             union/alternative/choice

In examples, we'll still use juxtaposition to denote concatenation.

The derivative for regular languages

The useful property of the derivative from an implementation point of view is that it has a recursive definition over regular languages:

  • Dc(∅) = ∅
  • Dc(ε) = ∅
  • Dc(c) = ε
  • Dc(c') = ∅ if c is not c'
  • Dc(A ○ B) = Dc(A) ○ B if A does not contain the empty string
  • Dc(A ○ B) = (Dc(A) ○ B) ∪ Dc(B) if A contains the empty string
  • Dc(A ∪ B) = Dc(A) ∪ Dc(B).
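
As a sanity check, here is a minimal Racket sketch (hypothetical code, not from the paper) that transcribes these equations for plain regular expressions, adding the standard Kleene-star rule Dc(A*) = Dc(A) ○ A* so the (foo|frak|bar)* examples above can run:

(struct re-empty ()    #:transparent)   ; ∅
(struct re-eps   ()    #:transparent)   ; ε
(struct re-char  (c)   #:transparent)   ; a single character
(struct re-cat   (a b) #:transparent)   ; A ○ B
(struct re-alt   (a b) #:transparent)   ; A ∪ B
(struct re-star  (a)   #:transparent)   ; A*

(define (re-nullable? r)   ; does the language of r contain the empty string?
  (match r
    [(re-empty)   #f]
    [(re-eps)     #t]
    [(re-char _)  #f]
    [(re-cat a b) (and (re-nullable? a) (re-nullable? b))]
    [(re-alt a b) (or  (re-nullable? a) (re-nullable? b))]
    [(re-star _)  #t]))

(define (D c r)            ; Dc(r), clause for clause as in the list above
  (match r
    [(re-empty)   (re-empty)]
    [(re-eps)     (re-empty)]
    [(re-char c*) (if (equal? c c*) (re-eps) (re-empty))]
    [(re-cat a b) (if (re-nullable? a)
                      (re-alt (re-cat (D c a) b) (D c b))
                      (re-cat (D c a) b))]
    [(re-alt a b) (re-alt (D c a) (D c b))]
    [(re-star a)  (re-cat (D c a) (re-star a))]))

; For example, with the matches? loop sketched earlier:
;   (define (word s)
;     (foldr (lambda (ch r) (re-cat (re-char ch) r)) (re-eps) (string->list s)))
;   (matches? (re-star (re-alt (word "foo") (re-alt (word "frak") (word "bar"))))
;             (string->list "foofrak"))   ; => #t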

The derivative for context-free languages

Since regular expressions aren't recursive structures, the naive implementation of the derivative terminates.

That's not the case for context-free grammars, which might be described as "recursive regular expressions."

Consider the classic context-free grammar, the language of balanced parens:

S = S ( S ) ∪ ε

Consider the derivative with respect to an open paren (square brackets added just for grouping):

[D( S] = [D( S] ( S ) ∪ [S )]

It's infinitely recursive.

Fine for math. Bad for implementation.

Derivatives from laziness, memoizing and fixed points

Fortunately, it's not hard to stop this infinite recursion.

If we compute lazily, then the derivative terminates.

But, because the grammar is left-recursive, taking another derivative will force the computation when it tries to check whether the derived language contains the empty string.

So, that nullability check causes non-termination.

But, that's easy to fix too: just memoize the derivative.

In fact, the hardest part about computing the derivative is figuring out whether or not a language contains the empty string. The definition for nullability, δ(L), is also structurally recursive:

  • δ(∅) = false
  • δ(ε) = true
  • δ(c) = false
  • δ(A ○ B) = δ(A) and δ(B)
  • δ(A ∪ B) = δ(A) or δ(B).

Laziness and memoization don't work here.

Instead, the function δ has to be computed as a least fixed point.
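
For concreteness, here is a minimal Racket sketch (hypothetical code, not the actual implementation) of that least-fixed-point computation by plain Kleene iteration. It assumes the grammar-node structs used in the Racket implementation later in this post (empty, eps, token, orp, seqp, redp), and that nodes is a list of every node reachable from the start symbol; laziness is ignored here:

(define (nullability nodes)
  (define approx (make-hasheq))                     ; node -> current guess
  (for ([n nodes]) (hash-set! approx n #f))         ; start everything at bottom (#f)
  (define (step n)                                  ; one application of the δ rules
    (match n
      [(empty)      #f]
      [(eps)        #t]
      [(token _ _)  #f]
      [(orp a b)    (or  (hash-ref approx a) (hash-ref approx b))]
      [(seqp a b)   (and (hash-ref approx a) (hash-ref approx b))]
      [(redp a _)   (hash-ref approx a)]))
  (let loop ()
    (define changed? #f)
    (for ([n nodes])
      (define new (step n))
      (unless (eq? new (hash-ref approx n))
        (hash-set! approx n new)
        (set! changed? #t)))
    (when changed? (loop)))                         ; iterate until nothing changes
  approx)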

So, even though Brzozowski defined the derivative with regular languages in mind, it works unaltered for context-free languages if you toss in laziness, memoization and fixed points.

The draft describes two ways to generate parse forests from this idea.

Origins and motivation

The inspiration for the draft came from my advanced compilers course.

My teaching assumes that to understand is to implement.

For instance, to teach lexical analysis, I had the class implement a lex-like DSEL using Brzozowski's derivatives.

When we moved to parsers, I wanted to have the class build a DSEL that provides the functionality of a tool like ANTLR or yacc.

Wanted: An easy way to do general parsing

I'm against parsing tools that don't protect the user from surprises.

LALR(1) generators are out because they don't accept all grammars.

(Imagine a regex library that didn't accept all regexes!)

Adding shift/reduce annotations to a specification feels wrong to me.

You don't have to know how a car works to drive one.

Why shouldn't we expect the same usability out of parsing tools?

What about parser combinators?

Parser combinators are great because they're in the language itself. That avoids the transaction cost of setting up and learning an external tool.

That lowered cost alone makes people more likely to use them instead of hopelessly flogging regular expressions beyond their design limits.

But, parser combinators are tricky to implement if you want to avoid surprises and handle things like left recursion.

I needed something undergrads could learn and do in a week.

(If you've taken my classes, you know I consider ease of implementation one of the most important attributes of an algorithm.)

Derivatives to the rescue?

It wasn't initially obvious to me that derivatives could work for parsing, or that they would enable the abbreviated implementation of a parsing library.

Eventually David and I came to the conclusion that derivatives actually do work. In fact, there are two distinct ways to make them work!

One approach applies the derivative to parser combinators; the other approach applies the derivative to CFGs to create a small-step parsing machine that emits parse strings as it runs.

David produced a working implementation in Haskell in a couple days.

It took me about three to do it in Scala.

What shocked us was just how simple and flexible the implementation was.

It seemed absurd that in a couple hundred lines of code, we could be generating parse forests for any CFG--left recursive, right recursive or even infinitely recursive.

It was such a fun experience that we felt like sharing it with the community.

So, the draft was born.

ESOP rejection

The draft was rejected by ESOP 2010.

The reviews are not cursory, and they contain good points.

To summarize, the three main complaints about the paper are:

  1. Failure to qualitatively compare with previous work in parsing.
  2. Failure to characterize the complexity of the algorithms.
  3. Failure to benchmark the implementations against existing tools.

These are fair criticisms of the paper.

An implicit failure of the paper is that we didn't convince the peer reviewers that this was a fun, easy way to implement a general parsing library.

One year later: Posted on arXiv

My research area is static analysis, so I didn't have a lot of time to devote to resuscitating a paper on parsing.

But, David needed to cite it for his Ph.D. school applications this year, so we tossed it on arXiv as it stood when it was rejected a year ago.

Had we known the attention it was about to receive, we'd have incorporated reviewer feedback and updated the paper with what we've learned since.

What we've learned

Over the past year, even in the little time we've had to work on the paper, we've learned a lot more about parsing with derivatives.

In the week after the community found it, you all taught us ten times more than that. Thank you!

To highlight just a few interactions:

  • Stuart Kurtz pointed out how parsing with derivatives is elegantly suited for grammars in Greibach Normal Form. Stuart has also shed light on how to relate the idea to the traditional notions of bottom-up and top-down parsing.
  • Daniel Spiewak independently re-implemented the approach, but in less than 150 lines of Scala. (My implementation was 250 lines.) Daniel observed the same practical efficiency we saw in our own tests. Daniel also came up with a good framework for reasoning about complexity when parsing with derivatives.
  • Yitzhak Mandelbaum and Trevor Jim sent suggestions for improvements, asked questions about extending it to dependent parsing, and offered a heap of benchmarks from their Yakker tool.
  • Mitch Wand emailed a marked-up copy of the paper with loads of comments, and pointed to opportunities for collaboration.
  • Russ Cox wrote a thoughtful criticism of complexity, though I disagree with Russ that the complexity comes from backtracking. There's no backtracking in the algorithm. That's what makes it so easy to implement. Specifically, Russ claimed that our implementation would hit Ω(3ⁿ) complexity on a specific invalid string for the following ambiguous grammar:
       S ::= S + S | 1
    
    In actuality, it takes milliseconds even for n > 100. Details below.
  • Burak Emir wrote a thoughtful reply to Russ. Russ correctly asserts that a random grammar is likely to perform poorly, but Burak noted our conjecture was for grammars of human interest.
  • Countless commenters have pointed out (or asked about) potential relationships to existing methods of parsing.

It's been an inspiring (if accidental) experiment in "naked science."

Naive complexity: Exponential

We had inklings that the worst-case complexity of the naive implementation was exponential, and we've since been able to build pathological grammars that exhibit exponential behavior.

It is possible to double the size of the grammar with each derivative.

But, since exponentiality wasn't a problem for us in practice, we haven't been overly concerned with it.

As Alan Perlis once remarked, "for every polynomial-time algorithm you have, there is an exponential algorithm that I would rather run."

Parsing with derivatives might just be one of those exponential algorithms.

Average complexity: Linear?

Daniel Spiewak's model for complexity states that the cost of parsing with the derivative is equal to the sum of the cost of the n derivatives taken during parsing.

The cost of an individual derivative is bounded by the size of the grammar. (In practice, the cost is roughly constant; it takes a pathological grammar to impact every single nonterminal.)

Let Gi be the ith grammar to appear during parsing.

Under Daniel's model, the cost of parsing is proportional to:

|G0| + |G1| + |G2| + ... + |Gn|

During our measurements, we found that the derived grammars remain roughly the same size as the original grammar after compaction, regardless of how many derivatives are taken.

Since the cost of compaction is also bounded by the size of the grammar, the average complexity of parsing with derivatives is O(n|G|).

Optimized complexity: Cubic?

Our intuition has always been that with the right caching and structural reductions, parsing with derivatives ought to be no worse than cubic in the worst case.

In practice, we've found it's easier to exploit structural sharing for the parsing machine approach over the parser combinator approach, but our intuition says that whatever can be done for one is doable for the other.

David has developed a powerful approach to pruning, weeding and compacting derived grammars using a version of the algorithm written in continuation-passing style.

He's now working on a new version of that algorithm that uses zippers and fixed points; our intuition tells us that this approach will perform the fewest possible number of grammar reductions to produce a parse tree.

Daniel Spiewak believes it can be made cubic with a different technique.

Latest implementations

There are several implementations now, in Racket, Scala and Haskell.

As usual, anything David or I have done is released under the CRAPL.

If you create an implementation, please let me know.

Implementation in Racket

The Racket implementation is meant to be instructive. By default, it omits even the simple optimizations made in the original Scala implementation.

The implementation is less than 240 lines of commented code.

It is the most faithful to the paper, because I added define/memoize and define/fix forms using macros. (The implementation is 130 lines without the definitions of these functions.)

All of the side effects hide behind these two forms, which export a purely functional interface, so in some sense, this implementation is also purely functional.

Some complained that parsing with derivatives uses "too much" laziness to be practical in other languages, so I used explicit promises to implement laziness in Racket, and only where strictly necessary.

This version, for instance, does not compute lazy parse forests.

It returns all parse trees at once.

The derive procedure looks exactly like the math:

(define/memoize (parse-derive c l)
  #:order ([l #:eq] [c #:equal])
  (match l
    [(empty)     (empty)]
    [(eps)       (empty)]
    [(token pred class)
     ; =>
     (if (pred c) (eps* (set c)) (empty))]
    
    [(orp l1 l2)
     ; =>
     (alt (parse-derive c l1) 
          (parse-derive c l2))]
    
    [(seqp (and (nullablep?) l1) l2)
     ; =>
     (alt (cat (eps* (parse-null l1)) (parse-derive c l2))
          (cat (parse-derive c l1) l2))]
    
    [(seqp l1 l2)
     ; =>
     (cat (parse-derive c l1) l2)]
    
    [(redp l f)
     ; =>
     (red (parse-derive c l) f)]))

The #:order keyword tells the memoizer in which order to cache the arguments, and which equality test to use.

The nullability procedure appears to be infinitely recursive:

(define/fix (nullable? l)
  #:bottom #f
  (match l
    [(empty)           #f]
    [(eps)             #t]    
    [(token _ _)       #f]
    [(orp l1 l2)       (or (nullable? l1) (nullable? l2))]
    [(seqp l1 l2)      (and (nullable? l1) (nullable? l2))]
    [(redp l1 _)       (nullable? l1)]))

But, define/fix saves it by computing the least fixed point if it detects the function recurring over a cyclic graph instead of a tree. The #:bottom argument specifies where to begin the fixed point computation.

The auxiliary procedure parse-null parses a grammar with respect to the empty string, also using a fixed point:

(define/fix (parse-null l)
  #:bottom (set)
  (match l
    [(empty)        (set)]
    [(eps* S)       S]
    [(eps)          (set l)]
    [(token _ _)    (set)]
    [(orp l1 l2)    (set-union (parse-null l1) (parse-null l2))]
    [(seqp l1 l2)   (for*/set ([t1 (parse-null l1)]
                               [t2 (parse-null l2)])
                              (cons t1 t2))]
    [(redp l1 f)    (for/set ([t (parse-null l1)])
                             (f t))]))

The implementation of parse is short and sweet:

(define (parse l s)
  (cond
    [(stream-null? s) (parse-null l)]
    [else             (parse (parse-derive (stream-car s) l) 
                             (stream-cdr s))]))
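
The recognize? procedure used in the benchmark below can be sketched the same way, swapping the nullable? test in for parse-null. (This is a hedged guess at its shape, not necessarily the actual implementation.)

(define (recognize? l s)
  (cond
    [(stream-null? s) (nullable? l)]
    [else             (recognize? (parse-derive (stream-car s) l)
                                  (stream-cdr s))]))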

Ambiguity: Not a problem

Russ Cox claimed in his post that the following ambiguous grammar would blow up our implementation:

 S ::= S + S | 1

His argument was that our algorithm was somehow doing backtracking.

When we tried it out (valid and invalid inputs), it worked fine:

(define good-input '(N + N + N ... + N)) 
(define bad-input  '(N + N + N ... + + N))

(display (format "good: ~s~n" (length good-input)))

(display (format "bad: ~s~n" (length bad-input)))

(time (recognize? S (list->stream good-input)))
(time (recognize? S (list->stream bad-input)))

returns:

good: 101
bad: 102
cpu time: 73 real time: 76 gc time: 0
#t
cpu time: 43 real time: 44 gc time: 0
#f

Times are measured in milliseconds.

Update: To be clear, Russ is right that trying to non-lazily compute all the parse trees is inherently exponential.

What this example shows is that derivatives can still accept/reject without considering all possible parse trees.

And, if parse trees are computed lazily, you don't have to consider them all.

(Or, you can use Ruzzo's 1979 algorithm to convert a recognizer into a parser with only a logarithmic penalty in the size of the input.)

Parsing with derivatives is different enough from other methods that in order to understand it, you really have to play around with it.

Even my intuitions are still trumped by experience with the implementation.

The Racket and Haskell implementations each include a dot-file renderer that allows insight-building visualizations of derived grammars.

Compaction

The biggest development since the draft is compaction.

Compaction shrinks the size of the grammar, and it can eliminate recursive references which would otherwise stick around, hogging memory.

The draft does mention simplifying reductions on grammars.

For example, ∅ ○ A = ∅ and B ∪ ∅ = B.

If these simplifying reductions are performed recursively and memoized, you end up with a tight compactor for grammars:

; Note: nullp matches languages which are exactly
; the empty string--not languages which just contain 
; the empty string. nullablep matches those.
(define/memoize (compact [l #:eq])
  (match l
    [(empty)       l]
    [(eps)         l]
    [(emptyp)      (empty)]
    [(nullp)       (eps* (parse-null l))]
    [(token p c)   l]

    [(orp (emptyp) l2)  (compact l2)]
    [(orp l1 (emptyp))  (compact l1)]
    
    [(seqp (nullp t) l2)  (red (compact l2) (lambda (w2) (cons t w2)))]
    [(seqp l1 (nullp t))  (red (compact l1) (lambda (w1) (cons w1 t)))]
    
    [(orp l1 l2)   (alt (compact l1) (compact l2))]
    [(seqp l1 l2)  (cat (compact l1) (compact l2))]
    
    [(redp (and e (nullp)) f) 
     ; =>
     (eps* (for/set ([t (parse-null e)]) (f t)))]
    
    [(redp (seqp (nullp t) l2) f)
     ; =>
     (red (compact l2) (lambda (w2) (f (cons t w2))))]
    
    [(redp (redp l f) g) 
     ; =>
     (red (compact l) (lambda (w) (g (f w))))]
        
    [(redp l f)    (red (compact l) f)]))

Interleaving compaction and derivation keeps grammars roughly the same size throughout parsing.
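
For instance, here is a minimal sketch (assumed, not the paper's exact driver) of a parse loop that compacts after every derivative, reusing the procedures defined above:

(define (parse/compact l s)
  (cond
    [(stream-null? s) (parse-null l)]
    [else             (parse/compact (compact (parse-derive (stream-car s) l))
                                     (stream-cdr s))]))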

Without compaction, these are the sizes of the grammar (and memory usage) after each derivative on a simple list of about 100 tokens:

size: 18; mem: 35411644
size: 34; mem: 35425164
size: 56; mem: 35443372
size: 84; mem: 35469044
size: 118; mem: 35504188
size: 158; mem: 35553100
size: 204; mem: 35611372
// ... about 60 derivatives later.
size: 261650; mem: 129317020
size: 272359; mem: 133152924
size: 283338; mem: 137075028
size: 294590; mem: 141042592
size: 306118; mem: 144652724
size: 317925; mem: 148514052
size: 330014; mem: 152762344
size: 342388; mem: 157118524
size: 355050; mem: 161558660
^Cuser break

I had to quit after about 75 derivatives, because my machine was locked up.

With compaction turned on, these are the sizes of the grammar (and memory usage) after each derivative:

size: 12; mem: 217035240
size: 15; mem: 217079664
size: 18; mem: 217115032
size: 21; mem: 217154032
size: 24; mem: 217202808
size: 27; mem: 217256200
size: 30; mem: 217317536
size: 33; mem: 217392216
size: 36; mem: 217471024
size: 39; mem: 217565616
size: 42; mem: 217657380
size: 45; mem: 217760612
size: 43; mem: 217877292
size: 40; mem: 218010620
size: 37; mem: 218113580
size: 34; mem: 218194388
size: 31; mem: 218271924
size: 28; mem: 218337300
size: 25; mem: 218406196
size: 22; mem: 218455836
size: 19; mem: 218500164
size: 16; mem: 218536804
size: 16; mem: 218589524
size: 17; mem: 218640748
size: 17; mem: 218703388
// ... (stays 17 for 94 more derivatives)
size: 17; mem: 222544772
size: 17; mem: 222585892
size: 19; mem: 222620636
size: 22; mem: 222661852
size: 25; mem: 222707660
size: 23; mem: 222758316
size: 20; mem: 222810180
size: 17; mem: 222849524
size: 13; mem: 222897964

And, this time, it returned interactively.

Derivatives not only add a lot of structure to a grammar; they also invalidate a lot of structure and make it available for simplification.

Since the net gain or loss in nodes is usually small, this suggests optimizations that reuse recently deactivated nodes.

In particular, the start node often appears only at the top level.

So, much of the time, it should be safe to destructively transform it into the next derivative, thereby avoiding allocation and deallocation altogether.

A port of this technique to non-garbage-collected languages will certainly want to exploit this fact.

Related work

Total Parser Combinators by Nils Danielsson appeared at ICFP 2010.

Danielsson also uses Brzozowski's derivative, but somewhat differently, and toward a different end: proving termination for parsers.

The key differences between our work and Danielsson's are:

  1. Danielsson's approach requires manual annotation of grammars with delays (sharps) and forces (flats) to guarantee termination. We are viscerally against annotations, like shift/reduce or sharp/flat, that require the user to understand how a parsing library was implemented. We require no such annotations.
  2. Danielsson doesn't handle all grammars. (Some kinds of left recursion don't work.) We handle all grammars, even infinitely recursive ones.
  3. Danielsson doesn't simplify or compact derived grammars, leading to terrible performance in practice.
  4. Because of sharp/flat annotations, Danielsson's definition of the derivative for sequences is markedly more complex.

One ESOP reviewer, commenting on what was then an unpublished draft of what would later evolve into Danielsson's submission to ICFP, noted that there should be room for both papers should Danielsson's be published. We fully agree, which is why we'll be trying again with ours.

How you can help kill yacc

What we need now are more implementations in other languages and "realistic" benchmarks in those implementations.

If you put one together or build some benchmarks, please send it our way.

In our own implementations, we've noticed that execution becomes dominated by garbage collection time as inputs grow larger.

Our diagnosis is a lack of sharing among derived grammars.

The same sub-grammar will reappear frequently during parsing, but we fail to discover this automatically. Instead, our implementations duplicate and release lots of tiny data structures, which stresses the garbage collector.

We (David, I and now also Daniel Spiewak) are working on several solutions to the "sharing" issue, but we'd love to hear from you if you have an idea as well.

More resources