Lexical analysis in Racket

In compilers and interpreters, lexers transform source code (a sequence of characters) into a sequence of tokens.

By stripping out insignificant characters (often whitespace and comments), the lexer provides the first increase in the level of abstraction.

For instance, a lexer might transform:

  3 + (x * 10)

into:

(NUM 3)
(OP +)
(LPAR)
(IDENT x)
(OP *)
(NUM 10)
(RPAR)

Racket provides facilities for creating lexers from specifications with minimal effort.

Lexers in Racket are unusually powerful because the regular expressions used to describe classes of tokens may use non-standard regular operators like difference, complement and intersection.

The article below covers:

  • basic lexers in Racket;
  • SRE-style regular expressions;
  • a simple lexer for the "calculator" language;
  • multi-lexer lexers (for handling comments);
  • how to "unget" a character from a port; and
  • how to lexically analyze indentation.

Update: Reworked and added significantly more material Jan 2015.

Importing the lexer package

To get access to the lexer package, require parser-tools/lex:

(require parser-tools/lex)

All remaining examples assume this has been imported.

Lexers in Racket

In Racket, a lexer is a sequence of rules; each rule is a trigger and an action:

 (lexer [trigger action] ...)

When evaluated, a lexer form like this becomes a procedure.

The procedure takes an input port (what you get from opening a file), and it matches the input against the triggers.

When it finds the right matching trigger, it returns the result of the action for that trigger.

The longest prefix of the input that can be matched will be matched.

If two trigger patterns both match the longest prefix, then the top-most trigger wins.
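
For instance, here is a minimal sketch (not from the original article) of both rules in action, using strings as triggers:

; Longest match: on "ab", the second rule's longer match beats
; the first rule's shorter one.
(define longest-lexer
  (lexer
    ["a"   'SHORT]
    ["ab"  'LONG]))

; Tie-breaking: both rules match exactly "a", so the top-most wins.
(define tie-lexer
  (lexer
    ["a"  'FIRST]
    ["a"  'SECOND]))

(longest-lexer (open-input-string "ab"))  ; => 'LONG
(longest-lexer (open-input-string "ac"))  ; => 'SHORT
(tie-lexer (open-input-string "a"))       ; => 'FIRST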

Example

Characters act as their own patterns, so the following lexer is valid:

(define ab-lexer 
 (lexer 
   [#\a  (display "You matched a.\n")]
   [#\b  (display "You matched b.\n")]))

We can create an input port from a string:

(define ab-test-in (open-input-string "abba"))

Then, we can run the lexer:

 $ (ab-lexer ab-test-in)
 You matched a.

 $ (ab-lexer ab-test-in)
 You matched b.

 $ (ab-lexer ab-test-in)
 You matched b.

 $ (ab-lexer ab-test-in)
 You matched a.

 $ (ab-lexer ab-test-in)
 'eof

Example

Here's another lexer that grabs characters and prints each out, ignoring whitespace:

(require parser-tools/lex)

(define (string->char s)
  (car (string->list s)))

(define lex
  (lexer
    ; skip spaces:
    [#\space     (lex input-port)]
    
    ; skip newline:
    [#\newline   (lex input-port)]
   
    ; an actual character:
    [any-char    (list 'CHAR (string->char lexeme))]))

And then we can run it:

(define in (open-input-string "foo

 bar baz"))

(lex in) ; '(CHAR #\f)
(lex in) ; '(CHAR #\o)
(lex in) ; '(CHAR #\o)
(lex in) ; '(CHAR #\b)
(lex in) ; '(CHAR #\a)
(lex in) ; '(CHAR #\r)
(lex in) ; '(CHAR #\b)
(lex in) ; '(CHAR #\a)
(lex in) ; '(CHAR #\z)
(lex in) ; 'eof

Triggers

There is a core language for describing triggers in Racket lexers that corresponds to regular expressions.

There are also alternative trigger languages that compile to the core trigger language. These alternative languages are more concise.

The core language has special forms and regular expressions:

trigger ::= re
         |  (eof)

re ::= id
    |  string
    |  character
    |  (repetition lo hi re)
    |  (union re ...)
    |  (intersection re ...)
    |  (complement re)
    |  (concatenation re ...)
    |  (char-range char char)
    |  (char-complement re)
    |  (id datum ...)

Most of these rules correspond to their classic meaning in regular expressions. Surprising additions include intersection and complement.

(The extra power of the Racket lexer comes from its use of derivatives of regular expressions internally.)
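
As a sketch of these non-standard operators (not from the original article), intersection and complement can carve a keyword out of an identifier rule directly in the trigger:

; Match runs of lowercase letters, except the exact keyword "if":
(define non-keyword-lexer
  (lexer
    [(intersection (repetition 1 +inf.0 (char-range #\a #\z))
                   (complement "if"))
     (list 'ID lexeme)]
    ; the keyword itself:
    ["if"  'IF]))

(non-keyword-lexer (open-input-string "foo"))   ; => '(ID "foo")
(non-keyword-lexer (open-input-string "iffy"))  ; => '(ID "iffy")
(non-keyword-lexer (open-input-string "if"))    ; => 'IF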

Abbreviations

The pattern id matches a previously-defined abbreviation. You can define abbreviations with:

 (define-lex-abbrev id re)

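As a quick sketch (with a hypothetical abbreviation name), an abbreviation can then be used directly as a trigger:

(define-lex-abbrev lowercase-word
  (repetition 1 +inf.0 (char-range #\a #\z)))

(define word-lexer
  (lexer
    [lowercase-word  (list 'WORD lexeme)]))

(word-lexer (open-input-string "hello"))  ; => '(WORD "hello")
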
Racket provides several useful abbreviations.

The abbreviation any-char matches any character, while the abbreviation nothing matches no strings at all.

The following abbreviations match sets of characters:

  alphabetic
  lower-case
  upper-case
  title-case
  numeric
  symbolic
  punctuation
  graphic
  whitespace
  blank
  iso-control

Expanded forms

Racket also provides a way to write macro-expanders for pattern forms:

 (define-lex-trans id transformer)

When Racket sees the form (id datum ...), it runs the expander for id.

There is rarely a need to write your own expander. This functionality exists primarily to allow packages to create their own syntax for triggers.
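
Still, here is a small sketch (with hypothetical names) of a lexer macro defined via syntax-rules:

; (twice re) expands to the concatenation of re with itself:
(define-lex-trans twice
  (syntax-rules ()
    [(_ re) (concatenation re re)]))

(define aa-lexer
  (lexer
    [(twice #\a)  'DOUBLE-A]))

(aa-lexer (open-input-string "aa"))  ; => 'DOUBLE-A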

Actions

The action code for a rule in Racket may reference special variables like start-pos, end-pos, lexeme and input-port.

Positional data is in start-pos and end-pos, while lexeme contains the matched text as a string.

The port the lexer is reading from is in input-port. This is useful when calling the lexer recursively on itself to continue, or when calling a different lexer.
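
For instance, here is a small sketch (not from the original article) that reports each identifier along with its character offsets:

(define id-pos-lexer
  (lexer
    [(repetition 1 +inf.0 (char-range #\a #\z))
     ; lexeme is the matched string; start-pos and end-pos are position structs:
     (list lexeme (position-offset start-pos) (position-offset end-pos))]
    [whitespace  (id-pos-lexer input-port)]))

; Offsets count from 1, and end-pos sits just past the match:
(id-pos-lexer (open-input-string "  hello"))  ; => '("hello" 3 8)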

A basic lexer

The following lexer looks for tokens that match the regex [a-z]+ and prints them out, ignoring intervening whitespace.

(define basic-printing-lexer
  (lexer 
   [(repetition 1 +inf.0 
                (char-range #\a #\z))
    ; =>
    (begin (display "found an id: ")
           (display lexeme)
           (newline))]
   
   [(union #\space #\newline)
    ; =>
    (void)]))

The core trigger language is verbose: the regex [a-z]+ became (repetition 1 +inf.0 (char-range #\a #\z)).

Even so, we can repeatedly call this lexer on a port until it prints out all the tokens:

(define (run-basic-printing-lexer port)
  (when (not (eq? 'eof (basic-printing-lexer port)))
    (run-basic-printing-lexer port)))

so that:

 $ (run-basic-printing-lexer (open-input-string "foo    bar baz"))
 found an id: foo
 found an id: bar
 found an id: baz

It's convenient to call the lexer once, yet get all of the tokens in the input.

To accomplish that, we have the lexer return a list of tokens rather than print them out one by one.

The lexer will also have to call itself recursively to continue processing:

(define basic-lexer
  (lexer 
   [(repetition 1 +inf.0
                (char-range #\a #\z))
    ; =>
    (cons `(ID ,lexeme)
          (basic-lexer input-port))]
   
   [(union #\space #\newline)
    ; =>
    (basic-lexer input-port)]
   
   [(eof)
    ; =>
    '()]))

Now, one call returns all tokens:

 $ (basic-lexer (open-input-string "foo bar   baz"))
 '((ID "foo") (ID "bar") (ID "baz"))

SRE triggers

You can import Olin Shivers's S-Expression regular expression (SRE) syntax for use with Racket lexers.

To import this syntax, use:

 (require parser-tools/lex-sre)

This will add the following pattern forms:

 (* re ...)
 (+ re ...)
 (? re ...)
 (= n re ...) 
 (>= n re ...)
 (** n m re ...)
 (or re ...)
 (: re ...)
 (seq re ...)
 (& re ...)
 (- re ...)
 (~ re ...)
 (/ char-or-string ...)

Most of these have their expected meanings; for the unfamiliar ones, see the documentation on SRE forms.

Since this imports names like * that clash with common operators, you probably want a prefixed import:

 (require (prefix-in : parser-tools/lex-sre))

This imports all of the above names prefixed with :.

SRE Examples

Assuming the character : for the prefix, it's possible to translate UNIX/Perl-style regular expressions into this format:

 regex           SRE
 foo             "foo"
 foo|bar         (:or "foo" "bar")
 (foo)*          (:* "foo")
 (foo|bar)*      (:* (:or "foo" "bar"))
 foo(foo|bar)*   (:: "foo" (:* (:or "foo" "bar")))

The calc lexer

The calc language is the language of simple arithmetic expressions.

The following are all examples of calc expressions:

  x + 3*(y + 1)
  a*x + b*z
  3 * 12

Lexically, its token classes are numbers, identifiers, delimiters (like parentheses), and operators. Whitespace is ignored.

To convert these to a token stream, we need regular descriptions of each class:

 identifiers:  [a-zA-Z]+
 delimiters:   "("|")"
 operators:    "+"|"*"
 integers:     -?[0-9]+
 whitespace:   [ \n]+

Next, encode the spec as a lexer:

(define calc-lexer
  (lexer
   [(:+ (:or (char-range #\a #\z) (char-range #\A #\Z)))
    ; =>
    (cons `(ID ,(string->symbol lexeme))
          (calc-lexer input-port))]
   
   [#\( 
    ; =>
    (cons '(LPAR)
          (calc-lexer input-port))]
   
   [#\)
    ; =>
    (cons '(RPAR) 
          (calc-lexer input-port))]
   
   [(:: (:? #\-) (:+ (char-range #\0 #\9)))
    ; =>
    (cons `(INT ,(string->number lexeme))
          (calc-lexer input-port))]
   
   [(:or #\+ #\*)
    ; =>
    (cons `(OP ,(string->symbol lexeme))
          (calc-lexer input-port))]
   
   [whitespace 
    ; =>
    (calc-lexer input-port)]
   
   [(eof)
    '()]))

And, run it:

 $ (calc-lexer (open-input-string "-3 * (foo + 12)"))
 '((INT -3) (OP *) (LPAR) (ID foo) (OP +) (INT 12) (RPAR))

Multiple lexers

Using multiple states inside a lexer is convenient for analyzing complex comments and string literals.

In a tool like flex, one can define different states for the lexical analyzer using the directives %s and %x.

To mimic the effect of multiple states in a Racket lexer, you need multiple Racket lexers.

Each lexer is, in effect, its own state.

Calling from one lexer to another "changes" lexical analysis states.

Example

To illustrate the concept of multiple states, consider adding C-style /* */ comments to the calc language.

Rather than describe comments as a regular expression, we can change to a new lexer when we see /*:

(define calc+-lexer
  (lexer
   [(:+ (:or (char-range #\a #\z) (char-range #\A #\Z)))
    ; =>
    (cons `(ID ,(string->symbol lexeme))
          (calc+-lexer input-port))]
   
   
   [(:: #\/ #\*)
    ; =>
    (comment-lexer input-port)]
   
   [#\( 
    ; =>
    (cons '(LPAR)
          (calc+-lexer input-port))]
   
   [#\)
    ; =>
    (cons '(RPAR) 
          (calc+-lexer input-port))]
   
   [(:: (:? #\-) (:+ (char-range #\0 #\9)))
    ; =>
    (cons `(INT ,(string->number lexeme))
          (calc+-lexer input-port))]
   
   [(:or #\+ #\*)
    ; =>
    (cons `(OP ,(string->symbol lexeme))
          (calc+-lexer input-port))]
   
   [whitespace 
    ; =>
    (calc+-lexer input-port)]
   
   [(eof)
    '()]))

(define comment-lexer 
  (lexer
   [(:: #\* #\/)
    (calc+-lexer input-port)]
   
   [any-char
    (comment-lexer input-port)]))

Ungetting characters

It may be useful to "unget" a character from a port.

If the port was based on a file (not stdin!) or created with open-input-string, then the position can be set backward.

To move back a single character, use the procedure unget:

(define (unget port)
  (file-position port (- (file-position port) 1)))
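
A quick sketch of unget at work on a string port:

(define p (open-input-string "abc"))
(read-char p)  ; => #\a
(read-char p)  ; => #\b
(unget p)
(read-char p)  ; => #\b, again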

Analyzing indentation

For languages like Python, it is important to handle indentation.

It's reasonable to use two lexers in Racket to handle indentation -- one lexer for the indentation itself, and another lexer for after the indentation has been seen.

To make this work well, it may be helpful to put a character back on the input stream once the end of an indentation has been seen.

The following lexer does all of this to print out the amount of indentation at the start of each line:

(require parser-tools/lex)

; set a port back one char:
(define (unget port)
  (file-position port (- (file-position port) 1)))

; count spaces:
(define space-count 0)

(define (inc-space!)
  (set! space-count (+ space-count  1)))

(define (reset-spaces!)
  (set! space-count 0))

; a lexer for measuring indentation:
(define indent-lex
  (lexer
    ; end of file:
    [(eof)      (void)]

    ; count the spaces:
    [#\space    (begin
                  (inc-space!)
                  (indent-lex input-port))]

    ; return to normal lexing:
    [any-char   (begin
                  (unget input-port)
                  (display (format "<indent ~a>" space-count))
                  (reset-spaces!)
                  (lex input-port))]))
     

; a lexer for printing characters:
(define lex
  (lexer
    ; end:
    [(eof)       (void)]

    ; skip newline:
    [#\newline   (begin (newline)
                        (indent-lex input-port))]
   
    ; an actual character:
    [any-char    (begin (display lexeme)
                        (lex input-port))]))


So that:

(define in (open-input-string "foo

   bar
  baz
"))

(indent-lex in)

prints out:

<indent 0>foo
<indent 0>
<indent 3>bar
<indent 2>baz

A word on stack usage

Some lexers in this post call themselves in a non-tail position.

While this works for small inputs, it can overflow the stack on larger ones.

Production lexers should always return the next token and expect to be called again later.

Matthew Flatt dropped by my office to correct me on this. In Racket, the natural way to write the code works just fine; he says:

 but there's no such thing as "stack overflow" in Racket.

 The `calc+-lexer', for example, will nest calls N deep to 
 build an O(N) data structure --- nothing wrong with that
 at any scale.

 In other words, if you need to represent a continuation
 to build up the size-N data structure, a size-N Racket
 continuation is a good choice, not a bad one.
