Regular operations
Parsing tools often require grammars to be in a strict BNF-like format, where each rule contains only a sequence of terminals and non-terminals on the right-hand side.
Yet, in reality, language specifications, such as the Python language specification, contain both regular operations and nested operations.
Common and useful regular operations include:
- sequence,
- alternation/choice,
- repetition and
- option.
Consider the definition of Python parameter lists, which makes full use of all of these operations:
varargslist: (vfpdef ['=' test] (',' vfpdef ['=' test])* [','
[ '*' [vfpdef] (',' vfpdef ['=' test])* [',' '**' vfpdef]
| '**' vfpdef ]]
| '*' [vfpdef] (',' vfpdef ['=' test])* [',' '**' vfpdef]
| '**' vfpdef)
Specifying this rule without these operations would be cumbersome.
Grammars with regular operations
It’s easy to define an S-Expressed format for context-free grammars that allows all of the regular operations:
<grammar> ::= (<rule>*)
<rule> ::= [<non-terminal> <pat>]
<pat> ::= (seq <pat>*) # sequence of expressions
| (or <pat>*) # any of these
| (opt <pat>) # 0 or 1 instances
| (rep <pat>) # 0 or more instances
| (rep+ <pat>) # 1 or more instances
| <terminal>
| <non-terminal>
<non-terminal> ::= <symbol>
<terminal> ::= '<symbol> # a class of terminals, e.g. 'NUM
| <string> # an exact terminal, e.g. "if"
For instance, the rule for if
statements in Python
translates from:
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
into:
[if_stmt (seq "if" test ":" suite
(rep (seq "elif" test ":" suite))
(opt (seq "else" ":" suite)))]
Desugaring regular operations
To work with standard parsing tools, the grammar needs to fit a simplified format, in which the right hand side contains only a choice among sequences:
<grammar> ::= (<rule>*)
<rule> ::= [<non-terminal> (or <rhs>*)]
<rhs> ::= (seq <pat>*)
<pat> ::= <terminal>
| <non-terminal>
<non-terminal> ::= <symbol>
<terminal> ::= '<symbol> # a class of terminals, e.g. 'NUM
| <string> # an exact terminal, e.g. "if"
Desugaring option
To desugar an option pattern like:
(opt <pat>)
it becomes a new non-terminal like $opt1
, with an additional rule:
[$opt1 (or (seq <pat>)
(seq))]
Desugaring repetition
To desugar a possibly empty repetition pattern like:
(rep <pat>)
it becomes a new non-terminal like $rep1
, with an additional rule:
[$rep1 (or (seq <pat> $rep1)
(seq))]
while a strict repetition pattern like:
(rep+ <pat>)
becomes a new non-terminal like $rep2
, with an additional rule:
[$rep2 (or (seq <pat> $rep1)
(seq <pat>))]
Desugaring nested operations
Nested operations need to be lifted up to their own rule.
For instance, with a rule like:
[abc (seq alpha beta (or gamma delta))]
it becomes:
[abc (seq alpha beta $sub1)]
with an added rule:
[$sub1 (or (seq gamma)
(seq delta))]
derp2: A parsing tool with regular operations
Racket offers two parsing packages:
parser-tools/yacc
implements an LALR(1) parser-generator.parser-tools/cfg-parser
implements a parser-generator for arbitrary CFGs.
I’ve created a tool, derp2, that can compile grammars from a rich language with regular operations (and more) down to the grammar format expected by these tools.
(The original derp
was based on derivative parsing.)
Patterns in the derp2
rules are much richer than the core parsing tools:
<pat> ::=
<terminal>
| <non-terminal>
# Core regular forms:
| (seq <pat>*) # Returns list of values.
| (or <pat>*) # Returns the matching value.
| (opt <pat>) # Returns value of <pat> or #f.
| (opt <pat> <exp>) # Returns value of <pat> or <exp>.
| (rep <pat>) # Returns list of values.
| (rep+ <pat>) # Returns list of values.
# Reductions:
| ($--> <pat> <body>) # Binds the value of <pat> to $$
# Binds ($ n) to nth element of $$
# Return the result of <body>
| (@--> <pat> <exp>) # Apply <exp> to <pat>
| (>--> <pat> <clauses>) # Match value of <pat> with <clauses>
| ($*--> <pat>* <exp>) # Bind $$ to values from <pat>*
# Bind ($ n) to nth value in <pat>
# Return result of <exp>
# Implicit reductions:
| (cons <pat> <pat>) # Same as (@--> (seq <pat> <pat>) cons)
| (car <pat>) # Same as (@--> (seq <pat>) car)
| (cdr <pat>) # Same as (@--> (seq <pat>) cdr)
# Special compound forms:
| (rep/sep <pat> <pat> <bool>) # Parses lists of second pattern
# separated by first pattern,
# optionally allowing a trailing
# copy of the separating pattern.
# May match nothing.
| (rep+/sep <pat> <pat> <bool>) # Same as above, but one or more
# Copies must match.
# Selective sequence:
| #'(<sq-pat>*) # Like seq, but only retains
# values marked with #,
<sq-pat> ::= #,<pat>
| <pat>
<non-terminal> ::= <symbol>
<terminal> ::= '<symbol> # a class of terminals, e.g. 'NUM
| <string> # an exact terminal, e.g. "if"
Example: A calculator parser
The full code for this example is available on github.
The grammar for the calculator language is:
{[exp ($--> (seq term (rep (seq (or "+" "-") term)))
(process-binops ($ 1) ($ 2)))]
[term ($--> (seq factor (rep (seq (or "*" "/") factor)))
(process-binops ($ 1) ($ 2)))]
[factor (or ($--> (seq "(" exp ")")
($ 2))
'NUM)]}
and with process-binops
defined:
(define (process-binops base ops)
(match ops
['()
base]
[(cons (list op exp) rest)
(process-binops `(,(string->symbol op) ,base ,exp) rest)]))
this grammar will parse inputs like:
10 * (3 + 4 /5)
into:
(* 10 (+ 3 (/ 4 5)))