The tool lex in a nutshell
The program lex consumes a lexical specification and generates C source code for a lexical analyzer.
Typically, regular expressions describe classes of tokens in a spec.
[See my post on sculpting text for a definition of regular expressions.]
For example, an informal lexical specification might be:

- identifiers match [A-Za-z_][A-Za-z_0-9]*;
- integers match -?[1-9][0-9]*;
- =, * and + are operators;
- ( and ) are delimiters;
- if and print are keywords; and
- whitespace is ignored.
In ordinary usage, lex generates a file that can be linked in with the rest of the compiler/interpreter toolchain.
To communicate with downstream phases, the C code that lex generates exports the procedure yylex: when a downstream phase needs the next token, it calls yylex().
Alternatively, the generated C program can be made standalone, in which case it dumps the result of its lexical analysis into a file or to standard out.
(In choosing a format to dump, it helps to pick one with widespread parsing library support, such as XML or S-Expressions.)
For example, for the code def f(x): return 3, the generated S-Expressions might be:
(keyword "def") (id "f") (delim "(") (id "x") (delim ")") (delim ":") (keyword "return") (int 3)
Using (f)lex
On most Unix systems, flex is the modern stand-in for lex. As such, this article covers flex.
To use flex, run it on a specification file:

$ flex file.l
By default, this will produce a C source code file called lex.yy.c, meant to be compiled with a tool like gcc, e.g.:

$ gcc -o mylexer lex.yy.c
Use the flex flag -o outfile to change the name of the output file.
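For example, to name the generated file mylexer.c (the file names here are just placeholders):

$ flex -o mylexer.c file.l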
For more documentation, run info flex.
Basic file structure
The specification files for lex have three sections, separated by %%:
definitions, options, declarations
%%
rules
%%
C code
For a standalone lexer, the main function can go in the C code section. This main function can repeatedly call yylex() and print out each token.
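For instance, a minimal such main might look like the following sketch; it assumes each rule's action ends in a return statement with a nonzero token code, so that yylex() returns 0 only at end of input:

int main(int argc, char* argv[])
{
  int token ;

  /* yylex() returns 0 once the input is exhausted. */
  while ((token = yylex()) != 0)
    printf("(token %d \"%s\")\n", token, yytext) ;

  return 0 ;
}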
Definitions, options and declarations
The first section can include definitions, options, declarations and C code.
The format for a definition is:
name regex
A regex named name may be referenced with curly braces: {name}.
For example:
digit   [0-9]
decimal [1-9]{digit}*\.{digit}+
By convention, indented lines in this section are treated as declaration-level C code and emitted toward the top of the generated lexer. It is also possible to wrap unindented C code in %{ ... %} brackets.
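For example, the top of a specification might look like this sketch (the token_count variable is purely illustrative):

%{
#include <stdio.h>

/* Declaration-level C code, copied verbatim into the generated lexer. */
int token_count = 0 ;
%}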
Options, preceded by %option, may also be specified here. Common options include:
- %option case-insensitive, which makes the lexer case-insensitive;
- %option yylineno, which makes the current line number available in the global C variable yylineno; and
- %option noyywrap, which prevents the generated lexer from invoking the user-supplied function yywrap when it reaches the end of the current file.
It is also possible to declare lexical analysis states using the %x and %s directives; for example:

%x INSTRING INCOMMENT

declares two "exclusive" analysis states, INSTRING and INCOMMENT.
(Lex calls these "start conditions" instead of "lexical analysis states.")
Lexical specification rules can be associated with specific lexical analysis states, and rules can trigger movement from one state to another.
(The directive %x defines "exclusive" states, while the directive %s defines "inclusive" states.)
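As a quick sketch of how such states get used (the rule syntax is explained in the next section), a lexer might enter INSTRING at an opening quote and return to INITIAL at the closing one:

<INITIAL>\"     { BEGIN(INSTRING) ; }   /* Opening quote: enter string mode. */
<INSTRING>\"    { BEGIN(INITIAL) ; }    /* Closing quote: leave string mode. */
<INSTRING>.|\n  { /* Consume string contents. */ }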
Rules
To understand the rules section, one must understand how the generated lexer works.
To invoke the generated lexer, the user-supplied C code will call the function yylex. The function yylex will begin consuming input from the file handle yyin (which is standard input by default). To determine what to do, yylex will look to the rules section.
The rules section has a sequence of rules.
Each rule has one of two forms: either a default form:
regex action
or a form parameterized by lexical analysis states:
<state1,...,stateN> regex action
All lex programs have at least one (inclusive) state: INITIAL. Lex begins in the INITIAL state. Unparameterized rules are valid for all inclusive states (declared with %s).
While in an exclusive state, the lexer uses only rules declared for that state.
The procedure yylex will find the rule with the regex that matches the longest possible prefix of the input, and execute the corresponding C code. For example, if the remaining input is foobarfoo foobar, and the rules are:
foobar      { printf("1: ") ; ECHO ; printf("\n"); }
(foo|bar)*  { printf("2: ") ; ECHO ; printf("\n"); }
[ ]         { printf("space\n") ; }
the lexer will print:
2: foobarfoo
space
1: foobar
When two rules tie for the longest match, the rule listed first wins.
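For instance, given these two rules (a sketch):

foo     { printf("keyword\n") ; }
[a-z]+  { printf("identifier\n") ; }

on the input foo, both regexes match all three characters, so the rule listed first fires and the lexer prints keyword.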
To create a standalone lexer, the C code should print the token (or ignore it as appropriate) in the specified format.
Inside an action, lex provides a few variables, functions and macros:
- char* yytext points to the matched text.
- unsigned int yyleng is the length of the matched text.
- unsigned int yylineno, if enabled, is the current line number.
- BEGIN(state) changes to a new lexical state.
- ECHO; will print the matched text.
- REJECT; will fall through to the next best match.
- unput(char) puts a character on the front of the input stream.
- return value; will return a value from yylex.
To match end-of-file, use the special pattern <<EOF>>.
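For instance, a rules section exercising a few of these might read (an illustrative sketch):

[0-9]+   { printf("(int %s)\n", yytext) ; }   /* yytext holds the matched lexeme. */
[-+*/]   { ECHO ; printf("\n") ; }            /* Echo the matched operator. */
[ \t\n]  { /* Ignore whitespace. */ }
<<EOF>>  { return 0 ; }                       /* Hand control back to the caller. */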
Lex supports a richer regex vocabulary than standard POSIX. It is worth perusing info flex to see what it supports.
C code
The final section contains C code.
For a standalone lexer, this is a good place to put a main function that repeatedly invokes yylex and prints each token.
Example: Comment-density checker
When I was a TA, I discovered that comment density in code roughly correlated with final grade.
It was reliable enough that I eventually sorted homeworks to grade in order from highest to lowest comment density.
Sorting like this helped me reliably calibrate my scoring rubric.
Here's a simple lex tool to compute comment density for C programs:
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* For strlen. */

unsigned int code = 0 ;  /* Bytes of code. */
unsigned int comm = 0 ;  /* Bytes of comments. */

#define CODE {code += strlen(yytext);}
#define COMM {comm += strlen(yytext);}
%}

%option noyywrap

%x INCOMMENT INSTRING

%%

 /* Switch to comments on '/*'. */
<INITIAL>"/*"     { COMM ; BEGIN(INCOMMENT) ; }
<INCOMMENT>"*/"   { COMM ; BEGIN(INITIAL) ; }
<INCOMMENT>.|\n   { COMM ; }

 /* Switch to string mode on '"'. */
<INITIAL>\"       { CODE ; BEGIN(INSTRING) ; }

 /* Escaped quote. */
<INSTRING>\\\"    { CODE ; }
<INSTRING>\"      { CODE ; BEGIN(INITIAL) ; }
<INSTRING>.|\n    { CODE ; }

 /* Character quote. */
<INITIAL>['](\\)?\"[']  { CODE ; }

<INITIAL>.|\n     { CODE ; }

<<EOF>>           { return 0 ; }

%%

int main(int argc, char* argv[])
{
  yylex() ;

  /* Prints code bytes, comment bytes, comment density. */
  printf("%u %u %lf\n", code, comm,
         (double)comm/(double)(code+comm)) ;

  return EXIT_SUCCESS ;
}
Example: Significant whitespace (Python)
When I teach compilers, I make students implement Python 3.x. I do this in part because handling significant whitespace keeps lexical analysis from becoming mere transcriptive tedium.
The lexical specification for the Python language explains how to lexically analyze significant whitespace.
In short, the lexer inserts INDENT and DEDENT tokens as indentation changes.
For example, the following Python code:
def f(x):
  if x >= 1:
    return x * x
  else:
    return x
print 3
tokenizes as:
(KEYWORD def) (ID "f") (PUNCT "(") (ID "x") (PUNCT ")") (PUNCT ":") (NEWLINE)
(INDENT) (KEYWORD if) (ID "x") (PUNCT ">=") (LIT 1) (PUNCT ":") (NEWLINE)
(INDENT) (KEYWORD return) (ID "x") (PUNCT "*") (ID "x") (NEWLINE)
(DEDENT) (KEYWORD else) (PUNCT ":") (NEWLINE)
(INDENT) (KEYWORD return) (ID "x") (NEWLINE)
(DEDENT) (DEDENT) (ID "print") (LIT 3) (NEWLINE)
Getting this behavior right with lex is tricky. One way to make it easier is to use the little-known unput(char) procedure, which puts characters back on the input stream. Here's a short lex program that uses this technique to insert curly braces and semicolons into "Python-like" input:
%{
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>

#define MAX_DEPTH 72

int nesting = 0 ;                       /* Parenthesis nesting depth. */
unsigned int indent_stack[MAX_DEPTH] ;  /* Stack of active indentation levels. */
unsigned int level = 0 ;

unsigned int first = 1 ;

void process_indent(char* line) ;
%}

PUNCT [-+*/=<>:]+
ID    [A-Za-z_][A-Za-z0-9_]*
INT   [0-9]+

%option noyywrap

%%

^[ ]*\n       {/* Ignore blank lines. */}
^[ ]*[^ \n]+  {int last = yyleng - 1;
               process_indent(yytext) ;
               /* Put everything after the indent back on the input. */
               while ((last >= 0) && (yytext[last] != ' ')) {
                 unput(yytext[last]);
                 last-- ;}}

"("           {printf(" "); ECHO; nesting++ ; }
")"           {printf(" "); ECHO; nesting-- ; }

{ID}|{INT}|{PUNCT}  {printf(" ") ; ECHO;}

[ \r]         {}
\n            {}

<<EOF>>       { process_indent("") ; return 0 ; }

%%

unsigned int white_count(char* line)
{
  unsigned int count = 0 ;
  while (*line == ' ')
    count++, line++ ;
  return count ;
}

void process_indent(char* line)
{
  if (nesting)
    /* Ignore indents while nested. */
    return ;

  unsigned int indent = white_count(line) ;

  if (indent == indent_stack[level]) {
    if (!first) printf(" ;") ;
    first = 0 ;
    return ;
  }

  if (indent > indent_stack[level]) {
    printf(" {") ;
    assert(level+1 < MAX_DEPTH) ;
    indent_stack[++level] = indent ;
    return ;
  }

  while (indent < indent_stack[level]) {
    --level ;
    printf(" }") ;
  }

  assert(level >= 0) ;
}

int main(int argc, char* argv[])
{
  indent_stack[0] = 0 ;
  yylex() ;
  printf("\n") ;
  return EXIT_SUCCESS ;
}
On this input:
def f(x):
  y = x * x
  e = (m * c * c)
  if x > 0:
    return e
  else:
    return y
the transformer outputs:
def f ( x ) : { y = x * x ; e = ( m * c * c ) ; if x > 0 : { return e } else : { return y } }
(Inside parentheses, the off-side rule is temporarily disabled.)
Example: Standalone calc lexer
No introduction to lex would be complete without the calculator example.
A few examples illuminate this language better than a specification:
3 + 2
x = 3 ; 4 + x
f(x) = x^2 ; f(20) + 10
The following lex specification is the standalone S-Expression-producing variant of the lexer for a simple calculator language:
%{
#include <stdio.h>
#include <stdlib.h>
%}

%option yylineno
%option noyywrap

ID  [A-Za-z_][A-Za-z_0-9]*
INT -?[1-9][0-9]*
OP  [-+*/^=]

%%

 /* Print delimiters. */
[(]   {printf("(left-parenthesis %u)\n", yylineno);}
[)]   {printf("(right-parenthesis %u)\n", yylineno);}
[;]   {printf("(semicolon %u)\n", yylineno);}

 /* Print identifiers, integers and operators. */
{INT} {printf("(int %s %u)\n", yytext, yylineno);}
{ID}  {printf("(id \"%s\" %u)\n", yytext, yylineno);}
{OP}  {printf("(op \"%s\" %u)\n", yytext, yylineno);}

 /* Ignore comments and whitespace. */
#[^\n]*   {}
[ \t\r\n] {}

<<EOF>>   {printf("(eof %u)\n", yylineno); return 0;}

%%

int main(int argc, char* argv[])
{
  yylex() ;
  return EXIT_SUCCESS ;
}
Line-number information has been inserted into the S-Expressions, in case the downstream consumer wants to report errors positionally.
Common lex pitfalls and mistakes
In my experience using and teaching lex, there are many common mistakes:
- Using . instead of (.|\n). In most regex implementations, the dot pattern . matches any character. In lex, it matches any character except newline.
- Using .* when . will do. The pattern .* will almost certainly consume more characters than you expect, and the same goes for patterns of the form [^...]*. These patterns end up overruling the rule that you think is going to fire, leading to bizarre bugs.
- Using ^[ ]* to match an indent. If you need to match an indent, it's tempting to use this pattern. Unfortunately, it will not be the longest match when there is no indent. To match indents, use the unput() trick from the second example.
- Using [ \t\r\n]+ to match whitespace. Generally, you want to match whitespace one character at a time. If you match aggressively with this pattern, it will consume multiple blank lines and prevent rules anchored with ^ and $ from firing.
- Using yywrap() to detect end of file. If you want to match end of file, use the <<EOF>> pattern instead of hacking it with yywrap().
- Using yytext and unput together carelessly. By default, unput mangles yytext. To use unput safely, first duplicate yytext, or use the directive %array; see the sketch after this list.
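For the last pitfall, here is a sketch of the duplication approach; it assumes an ID definition like the ones above, handle_identifier is a hypothetical user function, and strdup comes from <string.h> on POSIX systems:

{ID}   { char* copy = strdup(yytext) ;   /* Duplicate before unput() clobbers yytext. */
         unput(';') ;                    /* Pushing back input may mangle yytext... */
         handle_identifier(copy) ;       /* ...but the copy remains intact. */
         free(copy) ; }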
Related items
- The flex manual is a good place to start.
- A decade ago, my textbook for scripting language implementation was O'Reilly's lex & yacc; the updated version is flex & bison. If what you want is a manual on these tools, this book is great. If you want a book on compiler/interpreter construction, it is incomplete: its major motivating example is SQL.
- Despite its age, there is still much agreement that the best text on lex and yacc is Introduction to Compiler Construction With Unix.