Importing the lexer package
To get access to the lexer package,
require parser-tools/lex
:
(require parser-tools/lex)
All remaining examples assume this has been imported.
Lexers in Racket
In Racket, a lexer is a sequence of rules; each rule is a trigger and an action:
(lexer [trigger action] ...)
When evaluated, a lexer form like this becomes a procedure.
The procedure takes an input port (what you get from opening a file), and it matches the input against the triggers.
When it finds the right matching trigger, it returns the result of the action for that trigger.
The longest prefix of the input that can be matched will be matched.
If two trigger patterns both match the longest prefix, then the top-most trigger wins.
Example
Characters act as their own patterns, so the following lexer is valid:
(define ab-lexer (lexer [#\a (display "You matched a.\n")] [#\b (display "You matched b.\n")]))
We can create an input port from a string:
(define ab-test-in (open-input-string "abba"))
Then, we can run the lexer:
$ (ab-lexer ab-test-in) You matched a. $ (ab-lexer ab-test-in) You matched b. $ (ab-lexer ab-test-in) You matched b. $ (ab-lexer ab-test-in) You matched a. $ (ab-lexer ab-test-in) 'eof
Example
Here's another lexer that grabs characters and prints each out, ignoring whitespace:
(require parser-tools/lex) (define (string->char s) (car (string->list s))) (define lex (lexer ; skip spaces: [#\space (lex input-port)] ; skip newline: [#\newline (lex input-port)] ; an actual character: [any-char (list 'CHAR (string->char lexeme))]))
And, then it can run:
(define in (open-input-string "foo bar baz")) (lex in) ; '(CHAR #\f) (lex in) ; '(CHAR #\o) (lex in) ; '(CHAR #\o) (lex in) ; '(CHAR #\b) (lex in) ; '(CHAR #\a) (lex in) ; '(CHAR #\r) (lex in) ; '(CHAR #\b) (lex in) ; '(CHAR #\a) (lex in) ; '(CHAR #\z) (lex in) ; 'eof
Triggers
There is a core language for describing triggers in Racket lexers that corresponds to regular expressions.
There are also alternative trigger languages that compile to the core trigger language. These alternative languages are more concise.
The core language has special forms and regular expressions:
trigger ::= re | (eof) re ::= id | string | character | (repetition lo hi re) | (union re ...) | (intersection re ...) | (complement re) | (concatenation re ...) | (char-range char char) | (char-complement re) | (id datum ...)
Most of these rules correspond to their classic meaning in regular expressions. Surprising additions include intersection and complement.
(The extra power of the Racket lexer comes from its use of derivatives of regular expressions internally.)
Abbreviations
The pattern id
matches a previously-defined abbreviation.
You can define abbreviations with:
(define-lex-abbrev id re)
Racket provides several useful abbreviations.
The abbrevation any-char
matches any character,
while the abbreviation nothing
matches no strings at all.
The following abbreviations match sets of characters:
alphabetic lower-case upper-case title-case numeric symbolic punctuation graphic whitespace blank iso-control
Expanded forms
Racket also provides a way to write macro-expanders for pattern forms:
(define-lex-trans id transformer)
When Racket sees the form (id datum ...)
,
it runs the expander for id
.
There is rarely a need to write your own expander. This functionality exists primarily to allow packages to create their own syntax for triggers.
Actions
The action code for a rule in Racket
may reference special
variables
like start-pos
,
end-pos
,
lexeme
and
input-port
.
Positional data is in start-pos
and end-pos
,
while lexeme
contains the matched text as a string.
The port just read by the lexer is in input-port
.
This is useful when calling the lexer recursively
on itself to continue, or when
calling a different lexer.
A basic lexer
The following lexer looks for tokens that
match the regex [a-z]+
and
prints them out, ignoring intervening whitespace.
(define basic-printing-lexer (lexer [(repetition 1 +inf.0 (char-range #\a #\z)) ; => (begin (display "found an id: ") (display lexeme) (newline))] [(union #\space #\newline) ; => (void)]))
The core trigger language is verbose: the regex [a-z]+
became (repetition 1 +inf.0 (char-range #\a #\z))
.
Even so, we can repeatedly call this lexer on a port until it prints out all the tokens:
(define (run-basic-printing-lexer port) (when (not (eq? 'eof (basic-printing-lexer port))) (run-basic-printing-lexer port)))
so that:
$ (run-basic-printing-lexer (open-input-string "foo bar baz")) found an id: foo found an id: bar found an id: baz
It's convenient to call the lexer once, yet get all of the tokens in the input.
To accomplish that, we have the lexer return a list of tokens rather than print them out one by one.
The lexer will also have to call itself recursively to continue processing:
(define basic-lexer (lexer [(repetition 1 +inf.0 (char-range #\a #\z)) ; => (cons `(ID ,lexeme) (basic-lexer input-port))] [(union #\space #\newline) ; => (basic-lexer input-port)] [(eof) ; => '()]))
Now, one call returns all tokens:
$ (basic-lexer (open-input-string "foo bar baz")) '((ID "foo") (ID "bar") (ID "baz"))
SRE triggers
You can import Olin Shivers's S-Expression regular expression (SRE) syntax for use with Racket lexers.
To import this syntax, use:
(require parser-tools/lex-sre)
This will add the following pattern forms:
(* re ...) (+ re ...) (? re ...) (= n re ...) (>= n re ...) (** n m re ...) (or re ...) (: re ...) (seq re ...) (& re ...) (- re ...) (~ re ...) (/ char-or-string ...)
Most of them have the expected meaning, and for new ones, see the documentation on SRE forms.
Since this imports names like *
that clash with common operators,
you probably want a prefixed import:
(require (prefix-in : parser-tools/lex-sre))
This imports all of the above names prefixed with :
.
SRE Examples
Assuming the character :
for the prefix,
it's possible to translate UNIX/Perl-style
regular expressions into this format:
regex | SRE |
foo |
"foo" |
foo|bar |
(:or "foo" "bar") |
(foo)* |
(:* "foo") |
(foo|bar)* |
(:* (:or "foo" "bar")) |
foo(foo|bar)* |
(:: "foo" (:* (:or "foo" "bar"))) |
The calc lexer
The calc language is the language of simple arithmetic expressions.
The following are all examples of calc expressions:
x + 3*(y + 1) a*x + b*z 3 * 12
Lexically, its token classes are numbers, identifiers, delimiters like parentheses and operators. Whitespace is ignored.
To convert these to a token stream, we need regular descriptions of each class:
identifiers: [a-zA-Z]+ delimiters: "("|")" operators: "+"|"*" integers: -?[0-9]+ whitespace: [ \n]+
Next, encode the spec as a lexer:
(define calc-lexer (lexer [(:+ (:or (char-range #\a #\z) (char-range #\A #\Z))) ; => (cons `(ID ,(string->symbol lexeme)) (calc-lexer input-port))] [#\( ; => (cons '(LPAR) (calc-lexer input-port))] [#\) ; => (cons '(RPAR) (calc-lexer input-port))] [(:: (:? #\-) (:+ (char-range #\0 #\9))) ; => (cons `(INT ,(string->number lexeme)) (calc-lexer input-port))] [(:or #\+ #\*) ; => (cons `(OP ,(string->symbol lexeme)) (calc-lexer input-port))] [whitespace ; => (calc-lexer input-port)] [(eof) '()]))
And, run it:
$ (calc-lexer (open-input-string "-3 * (foo + 12)")) '((INT -3) (OP *) (LPAR) (ID foo) (OP +) (INT 12) (RPAR))
Multiple lexers
Using multiple states inside a lexer is convenient for analyzing complex comments and string literals.
In a tool like flex, one can define
different states for the lexical analyzer
using the directives %s
and %x
.
To mimic the effect of multiple states in a Racket lexer, you need multiple Racket lexers.
Each lexer is, in effect, its own state.
Calling from one lexer to another "changes" lexical analysis states.
Example
To illustrate the concept of multiple states,
consider adding C-style /* */
comments to the calc language.
Rather than describe comments as a regular expression,
we can change to a new lexer when we see /*
:
(define calc+-lexer (lexer [(:+ (:or (char-range #\a #\z) (char-range #\A #\Z))) ; => (cons `(ID ,(string->symbol lexeme)) (calc+-lexer input-port))] [(:: #\/ #\*) ; => (comment-lexer input-port)] [#\( ; => (cons '(LPAR) (calc+-lexer input-port))] [#\) ; => (cons '(RPAR) (calc+-lexer input-port))] [(:: (:? #\-) (:+ (char-range #\0 #\9))) ; => (cons `(INT ,(string->number lexeme)) (calc+-lexer input-port))] [(:or #\+ #\*) ; => (cons `(OP ,(string->symbol lexeme)) (calc+-lexer input-port))] [whitespace ; => (calc+-lexer input-port)] [(eof) '()])) (define comment-lexer (lexer [(:: #\* #\/) (calc+-lexer input-port)] [any-char (comment-lexer input-port)]))
Ungetting characters
It may be useful to "unget" a character from a port.
If the port was based on a file (not stdin!) or created with
open-input-string
, then the position can be set backward.
To move back a single character, use the procedure unget
:
(define (unget port) (file-position port (- (file-position port) 1)))
Analyzing indentation
For languages like Python, it is important to handle indentation.
It's reasonable to use two lexers in Racket to handler indentantion -- one lexer for indentation, and another lexer for after the indentation has been seen.
To make this work well, it may be helpful to put a character back on the input stream once the end of an indentation has been seen.
The following lexer does all of this to print out the amount of indentation at the start of each line:
(require parser-tools/lex) ; set a port back one char: (define (unget port) (file-position port (- (file-position port) 1))) ; count spaces: (define space-count 0) (define (inc-space!) (set! space-count (+ space-count 1))) (define (reset-spaces!) (set! space-count 0)) ; a lexer for measuring indentation: (define indent-lex (lexer ; end of file: [(eof) (void)] ; count the spaces: [#\space (begin (inc-space!) (indent-lex input-port))] ; return to normal lexing: [any-char (begin (unget input-port) (display "") (reset-spaces!) (lex input-port))])) ; a lexer for printing characters: (define lex (lexer ; end: [(eof) (void)] ; skip newline: [#\newline (begin (newline) (indent-lex input-port))] ; an actual character: [any-char (begin (display lexeme) (lex input-port))]))
So that:
(define in (open-input-string "foo bar baz ")) (indent-lex in)
prints out:
<indent 0>foo <indent 0> <indent 3>bar <indent 2>baz
A word on stack usage
Some lexers in this post called themselves in a non-tail recursive
position.
While this works for small inputs, it can overflow the stack on larger ones.
Production lexers should always return the next token and expect to be called again later.
Matthew Flatt dropped by my office to correct me on this. In Racket, the natural way to write the the code works just fine; he says:
but there's no such thing as "stack overflow" in Racket. The `calc+-lexer', for example, will nest calls N deep to build an O(N) data structure --- nothing wrong with that at any scale. I other words, if you need to represent a continuation to build up the size-N data structure, a size-N Racket continuation is a good choice, not a bad one.