Desugaring regular operations in context-free grammars

Parsing tools like yacc simplify the tasking of creating a parser.

But, most don’t simplify it as much as they could.

Tools like yacc don’t accept regular operations (like option and repetition) in their context-free grammar rules.

And, there are no special operations to handle common patterns like comma-separated (or anything-separated) lists.

This need not be the case.

Derivative parsers already easily handle regular operations, and compiling regular operations to yacc- or BNF-style grammars can bring these operations to other tools.

In this post, I’ll describe how to compile away regular (and other) operations in context-free grammars, and I’ll provide a tool – derp2 – to perform this compilation.

Derp2 targets and is ready to use directly with Racket’s parser-tools package.

Regular operations

Parsing tools often require grammars to be in a strict BNF-like format, where each rule contains only a sequence of terminals and non-terminals on the right-hand side.

Yet, in reality, language specifications, such as the Python language specification, contain both regular operations and nested operations.

Common and useful regular operations include:

sequence,
alternation/choice,
repetition and
option.

Consider the definition of Python parameter lists, which makes full use of all of these operations:

varargslist: (vfpdef ['=' test] (',' vfpdef ['=' test])* [','
              [ '*' [vfpdef] (',' vfpdef ['=' test])* [',' '**' vfpdef]
              | '**' vfpdef ]]
           |  '*' [vfpdef] (',' vfpdef ['=' test])* [',' '**' vfpdef]
           | '**' vfpdef)

Specifying this rule without these operations would be cumbersome.

Grammars with regular operations

It’s easy to define an S-Expressed format for context-free grammars that allows all of the regular operations:

<grammar> ::= (<rule>*)

<rule> ::= [<non-terminal> <pat>]

<pat> ::= (seq <pat>*)         # sequence of expressions
       |  (or <pat>*)          # any of these
       |  (opt <pat>)          # 0 or 1 instances
       |  (rep <pat>)          # 0 or more instances
       |  (rep+ <pat>)         # 1 or more instances
       |  <terminal>
       |  <non-terminal>

<non-terminal> ::= <symbol>

<terminal> ::= '<symbol>       # a class of terminals, e.g. 'NUM
            |  <string>        # an exact terminal, e.g. "if"

For instance, the rule for if statements in Python translates from:

if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]

into:

 [if_stmt     (seq "if" test ":" suite
                   (rep (seq "elif" test ":" suite))
                   (opt (seq "else" ":" suite)))]

Desugaring regular operations

To work with standard parsing tools, the grammar needs to fit a simplified format, in which the right hand side contains only a choice among sequences:

<grammar> ::= (<rule>*)

<rule> ::= [<non-terminal> (or <rhs>*)]

<rhs> ::= (seq <pat>*)  

<pat> ::= <terminal>
       |  <non-terminal>

<non-terminal> ::= <symbol>

<terminal> ::= '<symbol>       # a class of terminals, e.g. 'NUM
            |   <string>       # an exact terminal, e.g. "if"

Desugaring option

To desugar an option pattern like:

(opt <pat>)

it becomes a new non-terminal like $opt1, with an additional rule:

[$opt1    (or (seq <pat>)
              (seq))]

Desugaring repetition

To desugar a possibly empty repetition pattern like:

(rep <pat>)

it becomes a new non-terminal like $rep1, with an additional rule:

[$rep1     (or (seq <pat> $rep1)
               (seq))]

while a strict repetition pattern like:

(rep+ <pat>)

becomes a new non-terminal like $rep2, with an additional rule:

[$rep2     (or (seq <pat> $rep1)
               (seq <pat>))]

Desugaring nested operations

Nested operations need to be lifted up to their own rule.

For instance, with a rule like:

[abc    (seq alpha beta (or gamma delta))]

it becomes:

[abc    (seq alpha beta $sub1)]

with an added rule:

[$sub1    (or (seq gamma)
              (seq delta))]

derp2: A parsing tool with regular operations

Racket offers two parsing packages:

parser-tools/yacc implements an LALR(1) parser-generator.
parser-tools/cfg-parser implements a parser-generator for arbitrary CFGs.

I’ve created a tool, derp2, that can compile grammars from a rich language with regular operations (and more) down to the grammar format expected by these tools.

(The original derp was based on derivative parsing.)

Patterns in the derp2 rules are much richer than the core parsing tools:

<pat> ::=

    <terminal>
 |  <non-terminal>

 # Core regular forms:
 |  (seq <pat>*)           # Returns list of values.
 |  (or <pat>*)            # Returns the matching value.
 |  (opt <pat>)            # Returns value of <pat> or #f.
 |  (opt <pat> <exp>)      # Returns value of <pat> or <exp>.
 |  (rep <pat>)            # Returns list of values.
 |  (rep+ <pat>)           # Returns list of values.
 
 # Reductions:
 |  ($--> <pat> <body>)     # Binds the value of <pat> to $$
                            # Binds ($ n) to nth element of $$
                            # Return the result of <body>

 |  (@--> <pat> <exp>)      # Apply <exp> to <pat>

 |  (>--> <pat> <clauses>)  # Match value of <pat> with <clauses>

 |  ($*--> <pat>* <exp>)    # Bind $$ to values from <pat>*
                            # Bind ($ n) to nth value in <pat>
                            # Return result of <exp>

 # Implicit reductions:
 |  (cons <pat> <pat>)    # Same as (@--> (seq <pat> <pat>) cons)
 |  (car <pat>)           # Same as (@--> (seq <pat>) car)
 |  (cdr <pat>)           # Same as (@--> (seq <pat>) cdr)

 # Special compound forms:
 |  (rep/sep <pat> <pat> <bool>)   # Parses lists of second pattern
                                   # separated by first pattern,
                                   # optionally allowing a trailing
                                   # copy of the separating pattern.
                                   # May match nothing.

 |  (rep+/sep <pat> <pat> <bool>)  # Same as above, but one or more
                                   # Copies must match.

 # Selective sequence:
 |  #'(<sq-pat>*)       # Like seq, but only retains
                        # values marked with #,

<sq-pat> ::= #,<pat>
          |  <pat>

<non-terminal> ::= <symbol>

<terminal> ::= '<symbol>       # a class of terminals, e.g. 'NUM
            |  <string>        # an exact terminal, e.g. "if"

Example: A calculator parser

The full code for this example is available on github.

The grammar for the calculator language is:

{[exp      ($--> (seq term (rep (seq (or "+" "-") term)))
                 (process-binops ($ 1) ($ 2)))]

 [term     ($--> (seq factor (rep (seq (or "*" "/") factor)))
                 (process-binops ($ 1) ($ 2)))]

 [factor   (or ($--> (seq "(" exp ")")
                     ($ 2))
               'NUM)]}

and with process-binops defined:

(define (process-binops base ops)
  (match ops 
    ['()
     base]

    [(cons (list op exp) rest)
     (process-binops `(,(string->symbol op) ,base ,exp) rest)]))

this grammar will parse inputs like:

10 * (3 + 4 /5)