Quick summary: Brzozowski's derivative
Brzozowski defined the derivative of a regular language in 1964.
The derivative of a language L with respect to a character c, written D_{c}(L), works in two phases--filter and chop:
- Find all of the strings in L that start with c.
- Chop the c off all those strings.
Formally, D_{c}(L) = { w | cw ∈ L }.
For example,

D_{f}((foo|frak|bar)*) = (oo|rak)(foo|frak|bar)*
Matching with derivatives
It's easy to use the derivative to recognize whether a string is in a language.
Just compute the derivative with respect to the first character, then the second, then the third and so on.
If the final language contains the empty string, then the original string was in the original language. That's all there is to it.
For example,

D_{o}D_{o}D_{f}((foo|frak|bar)*) = (foo|frak|bar)*,

which clearly matches the empty string.
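Before any regular-expression machinery, the filter-and-chop reading can be tried directly on a finite language. Here is a small Python illustration (my own encoding, not from the draft):

```python
# Brzozowski's derivative on a *finite* language, represented as a
# Python set of strings: filter the strings that start with c, then
# chop the c off.

def derive(lang, c):
    return {w[1:] for w in lang if w.startswith(c)}

def matches(lang, s):
    # Derive with respect to each character in turn; the original
    # string was in the language iff "" is in the final language.
    for c in s:
        lang = derive(lang, c)
    return "" in lang

L = {"foo", "frak", "bar"}
print(derive(L, "f"))      # {'oo', 'rak'}
print(matches(L, "frak"))  # True
print(matches(L, "fro"))   # False
```

The same two-step recipe, derive-then-check-for-ε, is all that the fancier implementations below do; they just represent the language symbolically instead of as an explicit set.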
Mathematical notation for regular expressions
We'll use mathematical notation for regular expressions.
If you're used to Unix-style regular expressions, here's the translation guide:
| Math | Unix | Meaning |
|------|------|---------|
| ∅ | no equivalent | empty set; matches no strings |
| ε | no equivalent | empty/null string; matches "" |
| c | c | matches character c |
| A ○ B | AB | concatenation/sequence |
| A ∪ B | A\|B | union/alternative/choice |
In examples, we'll still use juxtaposition to denote concatenation.
The derivative for regular languages
The useful property of the derivative from an implementation point of view is that it has a recursive definition over regular languages:
- D_{c}(∅) = ∅
- D_{c}(ε) = ∅
- D_{c}(c) = ε
- D_{c}(c') = ∅ if c is not c'
- D_{c}(A ○ B) = D_{c}(A) ○ B if A does not contain the empty string
- D_{c}(A ○ B) = D_{c}(A) ○ B ∪ D_{c}(B) if A contains the empty string
- D_{c}(A ∪ B) = D_{c}(A) ∪ D_{c}(B).
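These equations transcribe almost directly into code. Below is a hedged Python sketch (a tuple-based AST of my own devising, plus the standard rule D_{c}(A*) = D_{c}(A) ○ A* for Kleene star, which the earlier example relies on):

```python
# Regular-expression matching by derivatives, following the equations above.
# AST: ("empty",) for ∅ | ("eps",) for ε | ("char", c) | ("cat", a, b)
#      | ("alt", a, b) | ("star", a)

EMPTY, EPS = ("empty",), ("eps",)

def cat(a, b):
    # Smart constructor: ∅ ○ A = ∅ and ε ○ A = A keep terms small.
    if a == EMPTY or b == EMPTY: return EMPTY
    if a == EPS: return b
    if b == EPS: return a
    return ("cat", a, b)

def alt(a, b):
    # Smart constructor: ∅ ∪ B = B.
    if a == EMPTY: return b
    if b == EMPTY: return a
    return ("alt", a, b)

def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"): return True
    if tag == "cat": return nullable(r[1]) and nullable(r[2])
    if tag == "alt": return nullable(r[1]) or nullable(r[2])
    return False  # empty, char

def derive(r, c):
    tag = r[0]
    if tag in ("empty", "eps"):                   # D_c(∅) = D_c(ε) = ∅
        return EMPTY
    if tag == "char":                             # D_c(c') = ε or ∅
        return EPS if r[1] == c else EMPTY
    if tag == "cat":                              # the two sequence rules
        a, b = r[1], r[2]
        d = cat(derive(a, c), b)
        return alt(d, derive(b, c)) if nullable(a) else d
    if tag == "alt":                              # D_c distributes over ∪
        return alt(derive(r[1], c), derive(r[2], c))
    if tag == "star":                             # D_c(A*) = D_c(A) ○ A*
        return cat(derive(r[1], c), r)

def matches(r, s):
    for c in s:
        r = derive(r, c)
    return nullable(r)

def lit(s):
    r = EPS
    for ch in s:
        r = cat(r, ("char", ch))
    return r

# (foo|frak|bar)*
foos = ("star", alt(alt(lit("foo"), lit("frak")), lit("bar")))
print(matches(foos, "foofrakbar"))  # True
print(matches(foos, ""))            # True
print(matches(foos, "fro"))         # False
```

Note that the `cat` and `alt` smart constructors already perform the simplifying reductions (∅ ○ A = ∅, B ∪ ∅ = B) that keep the derived expressions from growing without bound.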
The derivative for context-free languages
Since regular expressions contain no recursion, the naive structurally recursive implementation of the derivative terminates.
That's not the case for context-free grammars, which might be described as "recursive regular expressions."
Consider the classic context-free grammar, the language of balanced parens:

S = S (S) ∪ ε
Consider the derivative with respect to an open paren:

D_{(}(S) = D_{(}(S) (S) ∪ S)

Because the grammar is recursive, the definition of D_{(}(S) refers to D_{(}(S) itself: it's infinitely recursive.
Fine for math. Bad for implementation.
Derivatives from laziness, memoizing and fixed points
Fortunately, it's not hard to stop this infinite recursion.
If we compute lazily, then the derivative terminates.
But, because the grammar is left-recursive, taking another derivative will force the computation when it tries to check whether the derived language contains the empty string.
So, that nullability check causes non-termination.
But, that's easy to fix too: just memoize the derivative.
In fact, the hardest part about computing the derivative is figuring out whether or not a language contains the empty string. The definition for nullability, δ(L), is also structurally recursive:
- δ(∅) = false
- δ(ε) = true
- δ(c) = false
- δ(A ○ B) = δ(A) and δ(B)
- δ(A ∪ B) = δ(A) or δ(B).
Laziness and memoization don't work here.
Instead, the function δ has to be computed as a least fixed point.
So, even though Brzozowski defined the derivative with regular languages in mind, it works unaltered for context-free languages if you toss in laziness, memoization and fixed points.
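To make the fixed-point idea concrete, here is a minimal nullability computation in Python (an illustration of the technique only, not the paper's code; the dict-of-alternatives grammar encoding is my own):

```python
# Least-fixed-point nullability for a (possibly cyclic) grammar.
# A grammar maps each nonterminal to a list of alternatives; each
# alternative is a list of symbols.  Terminals are any symbols that are
# not keys of the grammar; [] is the empty string.

def nullable_nonterminals(grammar):
    nullable = set()       # bottom: assume nothing is nullable
    changed = True
    while changed:         # iterate up to the least fixed point
        changed = False
        for nt, alts in grammar.items():
            if nt not in nullable and any(
                    all(sym in nullable for sym in alt) for alt in alts):
                nullable.add(nt)   # some alternative is all-nullable
                changed = True
    return nullable

# The balanced-paren grammar S = S(S) ∪ ε is nullable (via ε):
print(nullable_nonterminals({"S": [["S", "(", "S", ")"], []]}))  # {'S'}
# Drop the ε alternative and plain structural recursion would loop
# forever; the fixed point simply answers "not nullable":
print(nullable_nonterminals({"T": [["T", "(", "T", ")"]]}))      # set()
```

Starting every nonterminal at "bottom" (not nullable) and re-checking until nothing changes is exactly the role #:bottom plays in the Racket define/fix form shown later.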
The draft describes two ways to generate parse forests from this idea.
Origins and motivation
The inspiration for the draft came from my advanced compilers course.
My teaching assumes that to understand is to implement.
For instance, to teach lexical analysis, I had the class implement
a lex
-like DSEL using Brzozowski's derivatives.
When we moved to parsers, I wanted to have the class build a DSEL that provides the functionality of a tool like ANTLR or yacc.
Wanted: An easy way to do general parsing
I'm against parsing tools that don't protect the user from surprises.
LALR(1) generators are out because they don't accept all grammars.
(Imagine a regex library that didn't accept all regexes!)
Adding shift/reduce annotations to a specification feels wrong to me.
You don't have to know how a car works to drive one.
Why shouldn't we expect the same usability out of parsing tools?
What about parser combinators?
Parser combinators are great because they're in the language itself. That avoids the transaction cost of setting up and learning an external tool.
That lowered cost alone makes people more likely to use them instead of hopelessly flogging regular expressions beyond their design limits.
But, parser combinators are tricky to implement if you want to avoid surprises and handle things like left recursion.
I needed something undergrads could learn and do in a week.
(If you've taken my classes, you know I consider ease of implementation one of the most important attributes of an algorithm.)
Derivatives to the rescue?
It wasn't initially obvious to me that derivatives could work for parsing, or that they would enable the abbreviated implementation of a parsing library.
Eventually David and I came to the conclusion that derivatives actually do work. In fact, there are two distinct ways to make them work!
One approach applies the derivative to parser combinators; the other approach applies the derivative to CFGs to create a small-step parsing machine that emits parse strings as it runs.
David produced a working implementation in Haskell in a couple days.
It took me about three to do it in Scala.
What shocked us was just how simple and flexible the implementation was.
It seemed absurd that in a couple hundred lines of code, we could be generating parse forests for any CFG--left recursive, right recursive or even infinitely recursive.
It was such a fun experience that we felt like sharing it with the community.
So, the draft was born.
ESOP rejection
The draft was rejected by ESOP 2010.
The reviews are not cursory, and they contain good points.
To summarize, the three main complaints about the paper are:
- Failure to qualitatively compare with previous work in parsing.
- Failure to characterize the complexity of the algorithms.
- Failure to benchmark the implementations against existing tools.
These are fair criticisms of the paper.
An implicit failure of the paper is that we didn't convince the peer reviewers that this was a fun, easy way to implement a general parsing library.
One year later: Posted on arXiv
My research area is static analysis, so I didn't have a lot of time to devote to resuscitating a paper on parsing.
But, David needed to cite it for his Ph.D. school applications this year, so we tossed it on arXiv as it stood when it was rejected a year ago.
Had we known the attention it was about to receive, we'd have incorporated reviewer feedback and updated the paper with what we've learned since.
What we've learned
Over the past year, even in the little time we've had to work on the paper, we've learned a lot more about parsing with derivatives.
In the week after the community found it, you all taught us ten times more than that. Thank you!
To highlight just a few interactions:
- Stuart Kurtz pointed out how parsing with derivatives is elegantly suited for grammars in Greibach Normal Form. Stuart has also shed light on how to relate the idea to the traditional notions of bottom-up and top-down parsing.
- Daniel Spiewak independently re-implemented the approach, but in less than 150 lines of Scala. (My implementation was 250 lines.) Daniel observed the same practical efficiency we saw in our own tests. Daniel also came up with a good framework for reasoning about complexity when parsing with derivatives.
- Yitzhak Mandelbaum and Trevor Jim sent suggestions for improvements, asked questions about extending it to dependent parsing, and offered a heap of benchmarks from their Yakker tool.
- Mitch Wand emailed a marked-up copy of the paper with loads of comments, and pointed to opportunities for collaboration.
- Russ Cox wrote a thoughtful criticism of complexity, though I disagree with Russ that the complexity comes from backtracking. There's no backtracking in the algorithm. That's what makes it so easy to implement. Specifically, Russ claimed that our implementation would hit Ω(3^{n}) complexity on a specific invalid string for the following ambiguous grammar: S ::= S + S | 1. In actuality, it takes milliseconds even for n > 100. Details below.
- Burak Emir wrote a thoughtful reply to Russ. Russ correctly asserts that a random grammar is likely to perform poorly, but Burak noted our conjecture was for grammars of human interest.
- Countless commenters have pointed out (or asked about) potential relationships to existing methods of parsing.
It's been an inspiring (if accidental) experiment in "naked science."
Naive complexity: Exponential
We had inklings that the worst-case complexity of the naive implementation was exponential, and we've since been able to build pathological grammars that exhibit exponential behavior.
It is possible to double the size of the grammar with each derivative.
But, since exponentiality wasn't a problem for us in practice, we haven't been overly concerned with it.
As Alan Perlis once remarked, "for every polynomial-time algorithm you have, there is an exponential algorithm that I would rather run."
Parsing with derivatives might just be one of those exponential algorithms.
Average complexity: Linear?
Daniel Spiewak's model for complexity states that the cost of parsing with the derivative is equal to the sum of the cost of the n derivatives taken during parsing.
The cost of an individual derivative is bounded by the size of the grammar. (In practice, the cost is roughly constant; it takes a pathological grammar to impact every single nonterminal.)
Let G_{i} be the ith grammar to appear during parsing.
Under Daniel's model, the cost of parsing a string of length n is proportional to:

|G_{1}| + |G_{2}| + ... + |G_{n}|
During our measurements, we found that the derived grammars remain roughly the same size as the original grammar after compaction, regardless of how many derivatives are taken.
Since the cost of compaction is also bounded by the size of the grammar, the average complexity of parsing with derivatives is O(n|G|).
Optimized complexity: Cubic?
Our intuition has always been that with the right caching and structural reductions, parsing with derivatives ought to be no worse than cubic in the worst case.
In practice, we've found it's easier to exploit structural sharing for the parsing machine approach over the parser combinator approach, but our intuition says that whatever can be done for one is doable for the other.
David has developed a powerful approach to pruning, weeding and compacting derived grammars using a version of the algorithm written in continuation-passing style.
He's now working on a new version of that algorithm that uses zippers and fixed points; our intuition tells us that this approach will perform the fewest possible number of grammar reductions to produce a parse tree.
Daniel Spiewak believes it can be made cubic with a different technique.
Latest implementations
There are several implementations now:
- David's latest Haskell version, with many unpublished optimizations and the ability to print out intermediate grammars as dot files;
- a newer Scala version I wrote about six months ago, which interacts with a derivative-based lexer I also wrote;
- a Racket version I wrote on Sunday;
- a Common Lisp implementation by Gregory Pfeil;
- a very small Scala implementation by Daniel Spiewak;
- an implementation in Python that uses decorators.
As usual, anything David or I have done is released under the CRAPL.
If you create an implementation, please let me know.
Implementation in Racket
The Racket implementation is meant to be instructive. By default, it omits even the simple optimizations made in the original Scala implementation.
The implementation is less than 240 lines of commented code.
It is the most faithful to the paper, because I added define/memoize and define/fix forms using macros.
(The implementation is 130 lines without the definitions of these functions.)
All of the side effects hide behind these two forms, which export a purely functional interface, so in some sense, this implementation is also purely functional.
Some complained that parsing with derivatives uses "too much" laziness to be practical in other languages, so I used explicit promises to implement laziness in Racket, and then, only where strictly necessary.
This version, for instance, does not compute lazy parse forests.
It returns all parse trees at once.
The derive procedure looks exactly like the math:
```racket
(define/memoize (parse-derive c l)
  #:order ([l #:eq] [c #:equal])
  (match l
    [(empty)     (empty)]
    [(eps)       (empty)]
    [(token pred class)
     ; =>
     (if (pred c) (eps* (set c)) (empty))]
    [(orp l1 l2)
     ; =>
     (alt (parse-derive c l1)
          (parse-derive c l2))]
    [(seqp (and (nullablep?) l1) l2)
     ; =>
     (alt (cat (eps* (parse-null l1)) (parse-derive c l2))
          (cat (parse-derive c l1) l2))]
    [(seqp l1 l2)
     ; =>
     (cat (parse-derive c l1) l2)]
    [(redp l f)
     ; =>
     (red (parse-derive c l) f)]))
```
The #:order keyword tells the memoizer in which order to cache the arguments, and which equality test to use.
The nullability procedure appears to be infinitely recursive:
```racket
(define/fix (nullable? l)
  #:bottom #f
  (match l
    [(empty)      #f]
    [(eps)        #t]
    [(token _ _)  #f]
    [(orp l1 l2)  (or (nullable? l1) (nullable? l2))]
    [(seqp l1 l2) (and (nullable? l1) (nullable? l2))]
    [(redp l1 _)  (nullable? l1)]))
```
But, define/fix saves it by computing the least fixed point if it detects the function recurring over a cyclic graph instead of a tree.
The #:bottom argument specifies where to begin the fixed-point computation.
The auxiliary procedure parse-null parses a grammar with respect to the empty string, also using a fixed point:
```racket
(define/fix (parse-null l)
  #:bottom (set)
  (match l
    [(empty)      (set)]
    [(eps* S)     S]
    [(eps)        (set l)]
    [(token _ _)  (set)]
    [(orp l1 l2)  (set-union (parse-null l1) (parse-null l2))]
    [(seqp l1 l2) (for*/set ([t1 (parse-null l1)]
                             [t2 (parse-null l2)])
                    (cons t1 t2))]
    [(redp l1 f)  (for/set ([t (parse-null l1)])
                    (f t))]))
```
The implementation of parse is short and sweet:
```racket
(define (parse l s)
  (cond
    [(stream-null? s) (parse-null l)]
    [else (parse (parse-derive (stream-car s) l)
                 (stream-cdr s))]))
```
Ambiguity: Not a problem
Russ Cox claimed in his post that the following ambiguous grammar would blow up our implementation:
S ::= S + S | 1
His argument was that our algorithm was somehow doing backtracking.
When we tried it out (valid and invalid inputs), it worked fine:
```racket
(define good-input '(N + N + N ... + N))
(define bad-input  '(N + N + N ... + + N))

(display (format "good: ~s~n" (length good-input)))
(display (format "bad: ~s~n" (length bad-input)))

(time (recognize? S (list->stream good-input)))
(time (recognize? S (list->stream bad-input)))
```
returns:
```
good: 101
bad: 102
cpu time: 73 real time: 76 gc time: 0
#t
cpu time: 43 real time: 44 gc time: 0
#f
```
Times are measured in milliseconds.
Update: To be clear, Russ is right that trying to non-lazily compute all the parse trees is inherently exponential.
What this example shows is that derivatives can still accept/reject without considering all possible parse trees.
And, if parse trees are computed lazily, you don't have to consider them all.
(Or, you can use Ruzzo's 1979 algorithm to convert a recognizer into a parser with only a logarithmic penalty in the size of the input.)
Parsing with derivatives is different enough from other methods that in order to understand it, you really have to play around with it.
Even my intuitions are still trumped by experience with the implementation.
The Racket and Haskell implementations each include a dot-file renderer that allows insight-building visualizations of derived grammars.
Compaction
The biggest development since the draft is compaction.
Compaction shrinks the size of the grammar, and it can eliminate recursive references which would otherwise stick around, hogging memory.
The draft does mention simplifying reductions on grammars.
For example, ∅ ○ A = ∅ and B ∪ ∅ = B.
If these simplifying reductions are performed recursively and memoized, you end up with a tight compactor for grammars:
```racket
; Note: nullp matches languages which are exactly
; the empty string--not languages which just contain
; the empty string.  nullablep matches those.
(define/memoize (compact [l #:eq])
  (match l
    [(empty)      l]
    [(eps)        l]
    [(emptyp)     (empty)]
    [(nullp)      (eps* (parse-null l))]
    [(token p c)  l]

    [(orp (emptyp) l2)  (compact l2)]
    [(orp l1 (emptyp))  (compact l1)]

    [(seqp (nullp t) l2)
     (red (compact l2) (lambda (w2) (cons t w2)))]
    [(seqp l1 (nullp t))
     (red (compact l1) (lambda (w1) (cons w1 t)))]

    [(orp l1 l2)   (alt (compact l1) (compact l2))]
    [(seqp l1 l2)  (cat (compact l1) (compact l2))]

    [(redp (and e (nullp)) f)
     ; =>
     (eps* (for/set ([t (parse-null e)]) (f t)))]
    [(redp (seqp (nullp t) l2) f)
     ; =>
     (red (compact l2) (lambda (w2) (f (cons t w2))))]
    [(redp (redp l f) g)
     ; =>
     (red (compact l) (lambda (w) (g (f w))))]
    [(redp l f)    (red (compact l) f)]))
```
Interleaving compaction and derivation keeps grammars roughly the same size throughout parsing.
Without compaction, these are the sizes of the grammar (and memory usage) after each derivative on a simple list of about 100 tokens:
```
size: 18; mem: 35411644
size: 34; mem: 35425164
size: 56; mem: 35443372
size: 84; mem: 35469044
size: 118; mem: 35504188
size: 158; mem: 35553100
size: 204; mem: 35611372
// ... about 60 derivatives later.
size: 261650; mem: 129317020
size: 272359; mem: 133152924
size: 283338; mem: 137075028
size: 294590; mem: 141042592
size: 306118; mem: 144652724
size: 317925; mem: 148514052
size: 330014; mem: 152762344
size: 342388; mem: 157118524
size: 355050; mem: 161558660
^C user break
```

I had to quit after about 75 derivatives, because my machine was locked up.
With compaction turned on, these are the sizes of the grammar (and memory usage) after each derivative:
```
size: 12; mem: 217035240
size: 15; mem: 217079664
size: 18; mem: 217115032
size: 21; mem: 217154032
size: 24; mem: 217202808
size: 27; mem: 217256200
size: 30; mem: 217317536
size: 33; mem: 217392216
size: 36; mem: 217471024
size: 39; mem: 217565616
size: 42; mem: 217657380
size: 45; mem: 217760612
size: 43; mem: 217877292
size: 40; mem: 218010620
size: 37; mem: 218113580
size: 34; mem: 218194388
size: 31; mem: 218271924
size: 28; mem: 218337300
size: 25; mem: 218406196
size: 22; mem: 218455836
size: 19; mem: 218500164
size: 16; mem: 218536804
size: 16; mem: 218589524
size: 17; mem: 218640748
size: 17; mem: 218703388
// ... (stays 17 for 94 more derivatives)
size: 17; mem: 222544772
size: 17; mem: 222585892
size: 19; mem: 222620636
size: 22; mem: 222661852
size: 25; mem: 222707660
size: 23; mem: 222758316
size: 20; mem: 222810180
size: 17; mem: 222849524
size: 13; mem: 222897964
```
And, this time, it returned interactively.
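The effect in those logs can be reproduced in miniature. The sketch below (a Python regex-level encoding of my own, not the Racket compactor above) derives (a|b)* repeatedly with and without construction-time simplification:

```python
# Derivatives with "dumb" constructors vs. constructors that apply the
# simplifications from the draft: ∅ ○ A = ∅, ε ○ A = A, B ∪ ∅ = B.

EMPTY, EPS = ("empty",), ("eps",)

def dumb_cat(a, b): return ("cat", a, b)
def dumb_alt(a, b): return ("alt", a, b)

def smart_cat(a, b):
    if a == EMPTY or b == EMPTY: return EMPTY  # ∅ ○ A = ∅
    if a == EPS: return b                      # ε ○ A = A
    if b == EPS: return a
    return ("cat", a, b)

def smart_alt(a, b):
    if a == EMPTY: return b                    # ∅ ∪ B = B
    if b == EMPTY: return a
    return ("alt", a, b)

def nullable(r):
    tag = r[0]
    return (tag in ("eps", "star")
            or (tag == "cat" and nullable(r[1]) and nullable(r[2]))
            or (tag == "alt" and (nullable(r[1]) or nullable(r[2]))))

def derive(r, c, cat, alt):
    tag = r[0]
    if tag in ("empty", "eps"): return EMPTY
    if tag == "char": return EPS if r[1] == c else EMPTY
    if tag == "cat":
        d = cat(derive(r[1], c, cat, alt), r[2])
        return alt(d, derive(r[2], c, cat, alt)) if nullable(r[1]) else d
    if tag == "alt":
        return alt(derive(r[1], c, cat, alt), derive(r[2], c, cat, alt))
    if tag == "star":
        return cat(derive(r[1], c, cat, alt), r)

def size(r):
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

ab_star = ("star", ("alt", ("char", "a"), ("char", "b")))
raw = compacted = ab_star
for c in "ab" * 20:
    raw = derive(raw, c, dumb_cat, dumb_alt)
    compacted = derive(compacted, c, smart_cat, smart_alt)

# The uncompacted term keeps growing; the compacted one is unchanged.
print(size(raw), size(compacted))
```

With simplification applied at every step, each derivative of (a|b)* collapses right back to (a|b)*, mirroring how the Racket compactor holds the grammar at a steady size.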
Derivatives not only add a lot of structure to a grammar; they also invalidate a lot of structure, making it available for simplification.
Since the net gain or loss in nodes is usually small, this suggests optimizations that reuse recently deactivated nodes.
In particular, the start node often appears only at the top level.
So, much of the time, it should be safe to destructively transform it into the next derivative, thereby avoiding allocation and deallocation altogether.
A port of this technique to non-garbage-collected languages will certainly want to exploit this fact.
Related work
Total Parser Combinators by Nils Danielsson appeared at ICFP 2010.
Danielsson also uses Brzozowski's derivative, but somewhat differently, and toward a different end: proving termination for parsers.
The key differences between our work and Danielsson's are:
- Danielsson's approach requires manual annotation of grammars with delays (sharps) and forces (flats) to guarantee termination. We are viscerally against annotations, like shift/reduce or sharp/flat, that require the user to understand how a parsing library was implemented. We require no such annotations.
- Danielsson doesn't handle all grammars. (Some kinds of left recursion don't work.) We handle all grammars, even infinitely recursive ones.
- Danielsson doesn't simplify or compact derived grammars, leading to terrible performance in practice.
- Because of sharp/flat annotations, Danielsson's definition of the derivative for sequences is markedly more complex.
One ESOP reviewer, commenting on what was then an unpublished draft of what would later evolve into Danielsson's submission to ICFP, noted that there should be room for both papers should Danielsson's be published. We fully agree, which is why we'll be trying again with ours.
How you can help kill yacc
What we need now are more implementations in other languages and "realistic" benchmarks in those implementations.
If you put one together or build some benchmarks, please send it our way.
In our own implementations, we've noticed that execution becomes dominated by garbage collection time as inputs grow larger.
Our diagnosis is a lack of sharing among derived grammars.
The same sub-grammar will reappear frequently during parsing, but we fail to discover this automatically. Instead, our implementations duplicate and release lots of tiny data structures, which stresses the garbage collector.
We (David, I and now also Daniel Spiewak) are working on several solutions to the "sharing" issue, but we'd love to hear from you if you have an idea as well.
More resources
- An old post of mine on derivatives for regex matching.
- My nonblocking lexing toolkit based on derivatives.
- Parsing Techniques: A Practical Guide. I've been reading this compendium a lot lately, trying to get up to speed on all that's been done in parsing. This book has been indispensable.