A regular expression matcher in Scheme using derivatives

[article index] [] [@mattmight] [rss]

Many assume that pattern-matching a regular expression requires NFA conversion followed by back-tracking search or DFA conversion. Using the derivative of regular expressions, it's possible to write a simple pattern-matcher without NFA conversion or back-tracking.

The derivative of a regular expression is an algebraic manipulation of the regular expression that calculates its "partially matched" tail with respect to a particular character.

Read on to see how it's done, with examples in Scheme.

More resources

Related blog articles:

Papers:

Books:

  • MIT prof Michael Sipser's Theory of Computation covers regular languages and automata in exhaustive detail in the very first chapter.

Derivatives of regular expressions

The derivative of a regular expression with respect to a character computes a new regular expression that matches what the original expression would match, assuming it had just matched the character.

For example, the derivative of the expression (foo|frak)* with respect to the character f is the expression (oo|rak)(foo|frak)*.

On the other hand, the derivative of the expression (foo|frak)* with respect to the character c is the null pattern ∅. The null pattern is not allowed in most regular expression implementations, but it is necessary in order to make the derivative a total function. By definition, no string can match the null pattern.

The matching strategy with derivatives is straightforward:

  1. If the string to match is empty and the current pattern matches empty, then the match succeeds.
  2. If the string to match is non-empty, the new pattern is the derivative of the current pattern with respect to the first character of the current string, and the new string to match is the remainder of the current string.

To define the derivative, we first need a function δ that returns the empty string ε if its argument accepts the empty string, and the null pattern ∅ otherwise:

  • δ(∅) = ∅
  • δ(ε) = ε
  • δ(c) = ∅
  • δ(re1 re2) = δ(re1) δ(re2)
  • δ(re1 | re2) = δ(re1) | δ(re2)
  • δ(re*) = ε

Let Dc(re) denote the derivative of the regular expression re with respect to the character c; then the derivative can be defined recursively:

  • Dc(∅) = ∅
  • Dc(ε) = ∅
  • Dc(c) = ε
  • Dc(c') = ∅ if cc'.
  • Dc(re1 re2) = δ(re1) Dc(re2) | Dc(re1) re2
  • Dc(re1 | re2) = Dc(re1) | Dc(re2)
  • Dc(re*) = Dc(re) re*

Code

I've implemented the derivative in Scheme as a quick demonstration. Without much code, you can actually create a functioning regular-expression matcher. The code is available: regex-derivative.scm.

; Author: Matthew Might 
; Site:   http://matt.might.net/

; In about 100 lines of Scheme, this file implements a 
; regular-expression matcher based on the derivative of
; regular expressions.

; The derivative of a regular expression 're' with respect
; to a character 'c' is a new regular expression which matches
; what the expression 're' would match if it first matched 'c'.

; For example, using POSIX notation, the derivative of 
; (foo|frak)* with respect to 'f' is (oo|rak)(foo|frak)*

; To account for the possibility that a regular expression
; may not match any string, including the empty string,
; regular expressions include an unmatchable pattern:

;  ::= #f                     ; Unmatchable pattern
;          |  #t                     ; Empty/blank pattern
;          |  '              ; Symbol
;          |  #\               ; Character
;          |  (alt  )  ; Alternation
;          |  (seq  )  ; Sequence
;          |  (rep )          ; Repetition

; Further reading:

; [1] Janusz Brzozowski. 
;     "Derivatives of Regular Expressions." 1964. 
; [2] Scott Owens, John Reppy, Aaron Turon. 
;     "Regular expression derivatives re-examined." 2009.


;; Special regular expressions.
(define regex-NULL #f)    ; -- the empty set
(define regex-BLANK #t)   ; -- the empty string

;; Predicates.
(define (regex-alt? re)
  (and (pair? re) (eq? (car re) 'alt)))

(define (regex-seq? re)
  (and (pair? re) (eq? (car re) 'seq)))

(define (regex-rep? re)
  (and (pair? re) (eq? (car re) 'rep)))

(define (regex-null? re)
  (eq? re #f))

(define (regex-empty? re)
  (eq? re #t))

(define (regex-atom? re)
  (or (char? re) (symbol? re)))

;; Regex deconstructors.
(define (match-seq re f)
  (and (regex-seq? re)
       (f (cadr re) (caddr re))))

(define (match-alt re f)
  (and (regex-alt? re)
       (f (cadr re) (caddr re))))

(define (match-rep re f)
  (and (regex-rep? re)
       (f (cadr re))))


;; Simplifying regex constructors.
(define (seq pat1 pat2)
  (cond
    ((regex-null? pat1) regex-NULL)
    ((regex-null? pat2) regex-NULL)
    ((regex-empty? pat1) pat2)
    ((regex-empty? pat2) pat1)
    (else (list 'seq pat1 pat2))))
     
(define (alt pat1 pat2)
  (cond
    ((regex-null? pat1) pat2)
    ((regex-null? pat2) pat1)
    (else (list 'alt pat1 pat2))))

(define (rep pat)
  (cond
    ((regex-null? pat) regex-BLANK)
    ((regex-empty? pat) regex-BLANK)
    (else (list 'rep pat))))

;; Matching functions.

; regex-empty : regex -> boolean
(define (regex-empty re)
  (cond
    ((regex-empty? re) #t)
    ((regex-null? re) #f)
    ((regex-atom? re) #f)
    ((match-seq re (lambda (pat1 pat2)
                     (seq (regex-empty pat1) (regex-empty pat2)))))
    ((match-alt re (lambda (pat1 pat2)
                     (alt (regex-empty pat1) (regex-empty pat2)))))
    ((regex-rep? re) #t)
    (else #f)))

; regex-derivative : regex regex-atom -> regex
(define (regex-derivative re c)
  (cond 
    ((regex-empty? re) regex-NULL)
    ((regex-null? re)  regex-NULL)
    ((eq? c re)        regex-BLANK)
    ((regex-atom? re)  regex-NULL)
    ((match-seq re     (lambda (pat1 pat2) 
                         (alt (seq (d/dc pat1 c) pat2)
                              (seq (regex-empty pat1) (d/dc pat2 c))))))
    ((match-alt re     (lambda (pat1 pat2)
                         (alt (d/dc pat1 c) (d/dc pat2 c)))))
    ((match-rep re     (lambda (pat)
                         (seq (d/dc pat c) (rep pat)))))
    (else regex-NULL)
    ))
                
; d/dc = regex-derivative
(define d/dc regex-derivative)

; regex-match : regex list -> boolean 
(define (regex-match pattern data)
  (if (null? data)
      (regex-empty? (regex-empty pattern))
      (regex-match (d/dc pattern (car data)) (cdr data))))

;; Tests.
(define (check-expect check expect)
  (if (not (equal? check expect))
      (begin (display "check-expect failed; got: ")
             (display check)
             (display "; expected: ")
             (display expect)
             (newline))))
       
(check-expect (d/dc 'baz 'f)
              #f)

(check-expect (d/dc '(seq foo barn) 'foo)
              'barn)

(check-expect (d/dc '(alt (seq foo bar) (seq foo (rep baz))) 'foo)
              '(alt bar (rep baz)))

(check-expect (regex-match '(seq foo (rep bar)) 
                           '(foo bar bar bar))
              #t)

(check-expect (regex-match '(seq foo (rep bar)) 
                           '(foo bar baz bar bar))
              #f)

(check-expect (regex-match '(seq foo (rep (alt bar baz))) 
                           '(foo bar baz bar bar))
              #t)