A regular expression matcher in Scheme using derivatives

[article index] [] [@mattmight] [+mattmight] [rss]

Many assume that pattern-matching a regular expression requires NFA conversion followed by back-tracking search or DFA conversion. Using the derivative of regular expressions, it's possible to write a simple pattern-matcher without NFA conversion or back-tracking.

The derivative of a regular expression is an algebraic manipulation of the regular expression that calculates its "partially matched" tail with respect to a particular character.

Read on to see how it's done, with examples in Scheme.

More resources

Related blog articles:

Papers:

Books:

  • MIT prof Michael Sipser's Theory of Computation covers regular languages and automata in exhaustive detail in the very first chapter.

Derivatives of regular expressions

The derivative of a regular expression with respect to a character computes a new regular expression that matches what the original expression would match, assuming it had just matched the character.

For example, the derivative of the expression (foo|frak)* with respect to the character f is the expression (oo|rak)(foo|frak)*.

On the other hand, the derivative of the expression (foo|frak)* with respect to the character c is the null pattern ∅. The null pattern is not allowed in most regular expression implementations, but it is necessary in order to make the derivative a total function. By definition, no string can match the null pattern.

The matching strategy with derivatives is straightforward:

  1. If the string to match is empty and the current pattern matches empty, then the match succeeds.
  2. If the string to match is non-empty, the new pattern is the derivative of the current pattern with respect to the first character of the current string, and the new string to match is the remainder of the current string.

To define the derivative, we first need a function δ that returns the empty string ε if its argument accepts the empty string, and the null pattern ∅ otherwise:

  • δ(∅) = ∅
  • δ(ε) = ε
  • δ(c) = ∅
  • δ(re1 re2) = δ(re1) δ(re2)
  • δ(re1 | re2) = δ(re1) | δ(re2)
  • δ(re*) = ε

Let Dc(re) denote the derivative of the regular expression re with respect to the character c; then the derivative can be defined recursively:

  • Dc(∅) = ∅
  • Dc(ε) = ∅
  • Dc(c) = ε
  • Dc(c') = ∅ if cc'.
  • Dc(re1 re2) = δ(re1) Dc(re2) | Dc(re1) re2
  • Dc(re1 | re2) = Dc(re1) | Dc(re2)
  • Dc(re*) = Dc(re) re*

Code

I've implemented the derivative in Scheme as a quick demonstration. Without much code, you can actually create a functioning regular-expression matcher:


[article index] [] [@mattmight] [+mattmight] [rss]