A regular expression matcher in Scheme using derivatives
Many assume that pattern-matching a regular expression requires NFA conversion followed by back-tracking search or DFA conversion. Using the derivative of regular expressions, it's possible to write a simple pattern-matcher without NFA conversion.
Related blog articles:
Papers:
- Brzozozwksi's original paper, "Derivatives of Regular Expressions."
- Owens, Reppy and Turon's, "Regular-expression derivatives re-examined."
Books:
-
MIT prof Michael Sipser's classic Theory of Computation
covers regular languages and automata in exhaustive detail in the very first chapter.
Derivatives of regular expressions
The derivative of a regular expression with respect to a character computes a new regular expression that matches what the original expression would match, assuming it had just matched the character.
For example, the derivative of the expression (foo|frak)*
with respect to the character f is the expression
(oo|rak)(foo|frak)*.
On the other hand, the derivative of the expression
(foo|frak)* with respect to the character c
is the null pattern ∅. The null pattern is not allowed in most
regular expression implementations, but it is necessary in order to
make the derivative a total function. By definition, no string can
match the null pattern.
The matching strategy with derivatives is straightforward:
- If the string to match is empty and the current pattern matches empty, then the match succeeds.
- If the string to match is non-empty, the new pattern is the derivative of the current pattern with respect to the first character of the current string, and the new string to match is the remainder of the current string.
To define the derivative, we first need a function δ that returns the empty string ε if its argument accepts the empty string, and the null pattern ∅ otherwise:
- δ(∅) = ∅
- δ(ε) = ε
- δ(c) = ∅
- δ(re1 re2) = δ(re1) δ(re2)
- δ(re1 | re2) = δ(re1) | δ(re2)
- δ(re*) = ε
- Dc(∅) = ∅
- Dc(∅) = ε
- Dc(c) = ε
- Dc(c') = ∅ if c ≠ c'.
- Dc(re1 re2) = δ(re1) Dc(re2) | Dc(re1) re2
- Dc(re1 | re2) = Dc(re1) | Dc(re2)
- Dc(re*) = Dc(re) re*