More resources
Related blog articles:
Papers:
- Brzozozwksi's original paper, "Derivatives of Regular Expressions."
- Owens, Reppy and Turon's, "Regular-expression derivatives re-examined."
Books:
- MIT prof Michael Sipser's Theory of Computation covers regular languages and automata in exhaustive detail in the very first chapter.
Derivatives of regular expressions
The derivative of a regular expression with respect to a character computes a new regular expression that matches what the original expression would match, assuming it had just matched the character.
For example, the derivative of the expression (foo|frak)*
with respect to the character f
is the expression
(oo|rak)(foo|frak)*
.
On the other hand, the derivative of the expression
(foo|frak)*
with respect to the character c
is the null pattern ∅. The null pattern is not allowed in most
regular expression implementations, but it is necessary in order to
make the derivative a total function. By definition, no string can
match the null pattern.
The matching strategy with derivatives is straightforward:
- If the string to match is empty and the current pattern matches empty, then the match succeeds.
- If the string to match is non-empty, the new pattern is the derivative of the current pattern with respect to the first character of the current string, and the new string to match is the remainder of the current string.
To define the derivative, we first need a function δ that returns the empty string ε if its argument accepts the empty string, and the null pattern ∅ otherwise:
- δ(∅) = ∅
- δ(ε) = ε
- δ(c) = ∅
- δ(re1 re2) = δ(re1) δ(re2)
- δ(re1 | re2) = δ(re1) | δ(re2)
- δ(re*) = ε
- Dc(∅) = ∅
- Dc(ε) = ∅
- Dc(c) = ε
- Dc(c') = ∅ if c ≠ c'.
- Dc(re1 re2) = δ(re1) Dc(re2) | Dc(re1) re2
- Dc(re1 | re2) = Dc(re1) | Dc(re2)
- Dc(re*) = Dc(re) re*
Code
I've implemented the derivative in Scheme as a quick demonstration. Without much code, you can actually create a functioning regular-expression matcher: