Implementing regular expression pattern-matching and nondeterministic finite automata (NFAs) in Java
Back when I was a grad student, I gave a guest lecture on regular expressions. I gave out a toy Java library that showed how to implement regular expression pattern-matching in Java using NFA-conversion and back-tracking search.
The library works by building the syntax for regular expressions out
of Java syntax.
A regular expression pattern
r can be:
- A literal string:
- A literal character:
- A sequence of patterns:
- A choice of patterns:
- A zero-or-more repetition of a pattern:
The library works by converting each regular expression into an NFA on the fly. For whatever reason, I didn't implement it in a functional fashion, so you can't re-use a sub-expression in more than one regular expression. That is, the following breaks:
NFA foo = s("foo") ; NFA pattern = s(foo,foo) ;but
NFA pattern = s(s("foo"),s("foo")) ; // or s("foo","foo")is just fine. I don't code much in Java these days, but if I still did, I'd make it purely functional, so that sub-expressions could be re-used.
The matching algorithm works by doing a straightforward graph search over a string. The complexity is exponential in theory, but very fast in practice.