Standalone lexers with lex: synopsis, examples, and pitfalls

Lexical analysis is the first phase of compilation.

During this phase, source code received character-by-character is transformed into a sequence of "tokens."

For example, for the following Python expression:

  print (3 + x 
    *2 ) # comment

the resulting stream of tokens might be (encoded as S-Expressions as):

 (keyword "print")
 (delim "(")
 (int 3)
 (punct "+")
 (id "x")
 (punct "*")
 (int 2)
 (delim ")")

Lexical analysis strips away insignificant whitespace and comments, and it groups the remaining characters into individual tokens.

Using the Unix tool lex, it's possible to create "standalone" lexers.

In a compiler/interpreter toolchain where each phase is standalone, a shell pipeline plugs them together, e.g.:

  tokenize < input | parse | interpret

Read on for an introduction to lex with synopses, examples and pitfalls.

Worked examples include:

a comment-density calculator for C;
a desugarer for Python-like significant whitespace; and
a standalone lexer for the obligatory calc language.

The tool `lex` in a nutshell

The program lex consumes a lexical specification and generates C source code for a lexical analyzer.

Typically, regular expressions describe classes of tokens in a spec.

[See my post on sculpting text for a definition of regular expressions.]

For example, an informal lexical specification might be:

identifiers match [A-Za-z_][A-Za-z_0-9]*;
integers match -?[1-9][0-9]+;
=, * and + are operators;
;, (, and ) are delimiters;
if and print are keywords; and
whitespace is ignored.

In ordinary usage, lex generates a file that can be linked in with the rest of the compiler/interpreter toolchain.

To communicate with downstream phases, the C code that lex generates exports the procedure yylex: when a downstream phase needs the next token, it calls yylex().

Alternatively, the generated C program can be made standalone, in which case it dumps the result of its lexical analysis into a file or to standard out.

(In choosing a format to dump, it helps to pick one with widespread parsing library support, such as XML or S-Expressions.)

For example, for the code def f(x): return 3, the generated S-Expressions might be:

 (keyword "def")
 (id "f")
 (delim "(")
 (id "x")
 (delim ")")
 (delim ":")
 (keyword "return")
 (int 3)

Using (f)lex

On most Unix systems, flex is the modern stand-in for lex.

As such, this article covers flex.

To use flex, run it on a specification file:

 $ flex file.l

By default, this will produce a C source code file called lex.yy.c meant to be compiled with a tool like gcc, e.g.:

 $ gcc -o mylexer lex.yy.c

Use the flex flag -o outfile to change the name of the output.

For more documentation, run info flex.

Basic file structure

The specification files for lex have three sections, each separated by %%.

definitions, options, declarations
%%
rules
%%
C code

For a standalone lexer, the main function can go in the C code section.

This main function can repeatedly call yylex() and print out each token.

Definitions, options and declarations

The first section can include definitions, options, declarations and C code.

The format for a definition is:

 name regex

A regex named name may be referenced with curly braces: {name}.

For example:

digit [0-9]
decimal [1-9]{digit}+\.{digit}*

By convention, indented lines in this section are treated as declaration-level C code and emitted toward the top of the generated lexer.

It is also possible to wrap unindented C code in %{ ... %} brackets.

Options, preceeded by %option, may also be specified here.

Common options include:

%option case-insensitive, which makes the lexer case-insensitive;
%option yylineno, which makes the current line number available in the global C variable yylineno; and
%option noyywrap, which prevents the generated lexer from invoking the user-supplied function yywrap when it reaches the end of the current file.

It is also possible to declare lexical analysis states using the %x and %s directives; for example:

 %x INSTRING INCOMMENT

declares two "exclusive" analysis states, INSTRING and INCOMMENT.

(Lex calls these "start conditions" instead of "lexical analysis states.")

Lexical specification rules can be associated with specific lexical analysis states, and rules can trigger movement from one state to another.

(The directive %x defines "exclusive" states, while the directive %s defines "inclusive" states.)

Rules

To understand the rules section, one must understand how the generated lexer works.

To invoke the generated lexer, the user-supplied C code will call the function yylex.

The function yylex will begin consuming input from the file handle yyin (which is standard input by default).

To determine what to do, yylex will look to the rules section.

The rules section has a sequence of rules.

Each rule has one of two forms--either a default form:

 regex action

or a form parameterized by lexical analysis states:

 <state₁,...,state_n> regex action

All lex programs have at least one (inclusive) state: INITIAL.

Lex begins in the INITIAL state.

Unparameterized rules are valid for all inclusive states (declared with %s).

While in an exclusive state, the lexer uses only rules declared for that state.

The procedure yylex will find the rule with the regex that matches the longest possible prefix of the input, and execute the corresponding C code.

For example, if the remaining input is foobarfoo foobar, and the rules are:

 foobar       { printf("1: ") ; ECHO ; printf("\n"); }
 (foo|bar)*   { printf("2: ") ; ECHO ; printf("\n"); }
 [ ]          { printf("space\n") ; }

the lexer will print:

2: foobarfoo
space
1: foobar

When two rules tie for the longest match, the rule listed first wins.

To create a standalone lexer, the C code should print the token (or ignore it as appropriate) in the specified format.

Inside an action, lex provides a few variables, functions and macros:

char* yytext points to the matched text.
unsigned int yyleng is the length of the matched text.
unsigned int yylineno, if enabled, is the current line.
BEGIN(state) changes to a new lexical state.
ECHO; will print the matched text.
REJECT; will fall through to the next best match.
unput(char) puts a character on the front of the input stream.
return value; will return a value from yylex.

To match end-of-file, use the special pattern <<EOF>>.

Lex supports a richer regex vocabulary than standard POSIX. It is worth perusing info flex to see what it supports.

C code

The final section contains C code.

For a standalone lexer, this is good place to put a main function that repeatedly invokes yylex and prints each token.

Example: Comment-density checker

When I was a TA, I discovered that comment density in code roughly correlated with final grade.

It was reliable enough that I eventually sorted homeworks to grade in order from highest to lowest comment density.

Sorting like this helped me reliably calibrate my scoring rubric.

Here's a simple lex tool to compute comment density for C programs:

%{
#include <stdio.h>
#include <stdlib.h>

unsigned int code = 0 ; /* Bytes of code. */
unsigned int comm = 0 ; /* Bytes of comments. */

#define CODE {code += strlen(yytext);}
#define COMM {comm += strlen(yytext);}
%}

%option noyywrap

%x INCOMMENT INSTRING 

%%

 /* Switch to comments on '/*' */
<INITIAL>"/*"      { COMM ; BEGIN(INCOMMENT) ; }
<INCOMMENT>"*/"    { COMM ; BEGIN(INITIAL) ; }
<INCOMMENT>.|\n    { COMM ; }

 /* Switch to string mode on '"' */
<INITIAL>\"      { CODE ; BEGIN(INSTRING) ; }
<INSTRING>\\\"   { CODE ; } /* Escaped quote. */
<INSTRING>\"     { CODE ; BEGIN(INITIAL) ; }
<INSTRING>.|\n   { CODE ; }

<INITIAL>['](\\)?\"['] { CODE ; } /* Character quote. */
<INITIAL>.|\n          { CODE ; }

<<EOF>>   { return 0 ; } 

%%

int main(int argc, char* argv[]) {

  yylex() ; 

  /* Prints code bytes, comment bytes, comment density. */
  printf("%u %u %lf\n", 
         code,
         comm,
         (double)comm/(double)(code+comm)) ; 

  return EXIT_SUCCESS ;
}

Example: Significant whitespace (Python)

When I teach compilers, I make students implement Python 3.x.

I do this to keep lexical analysis from becoming mere transcriptive tedium.

The lexical specification for the Python language explains how to lexically analyze significant whitespace.

In short, the lexer inserts INDENT and DEDENT tokens as indentation changes.

For example, the following Python code:

def f(x):
  if x >= 1:
    return x * x
  else:
    return x

print 3

tokenizes as:

(KEYWORD def)
(ID "f")
(PUNCT "(")
(ID "x")
(PUNCT ")")
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD if)
(ID "x")
(PUNCT ">=")
(LIT 1)
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD return)
(ID "x")
(PUNCT "*")
(ID "x")
(NEWLINE)
(DEDENT)
(KEYWORD else)
(PUNCT ":")
(NEWLINE)
(INDENT)
(KEYWORD return)
(ID "x")
(NEWLINE)
(DEDENT)
(DEDENT)
(ID "print")
(LIT 3)
(NEWLINE)

Getting this behavior right with lex is tricky.

One way to make it easier is to use the little-known unput(char) procedure, which puts characters back on the input stream.

Here's a short lex program that uses this technique to insert curly braces and semicolons into "Python-like" input:

%{
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>

#define MAX_DEPTH 72

int nesting = 0 ;
unsigned int indent_stack[MAX_DEPTH] ;
unsigned int level = 0 ;

unsigned int first = 1 ;

void process_indent(char* line) ;

%}

PUNCT [-+*/=<>:]*
ID  [A-Za-z_][A-Za-z0-9_]*
INT [0-9]+

%option noyywrap

%%


^[ ]*\n       {/* Ignore blank lines. */}
^[ ]*[^ \n]+  {int last = yyleng - 1;
               process_indent(yytext) ;
               while ((last >= 0) &&
                      (yytext[last] != ' ')) {
                unput(yytext[last]);
                last-- ;}}


"("  {printf(" "); ECHO; nesting++ ; }
")"  {printf(" "); ECHO; nesting-- ; }

{ID}|{INT}|{PUNCT} {printf(" ") ; ECHO;}

[ \r] {}
\n    {}

<<EOF>>  { process_indent("") ; return 0 ; }

%%

unsigned int white_count(char* line) {
  unsigned int count = 0 ;
  while (*line == ' ')
    count++, line++ ;
  return count ;
}

void process_indent(char* line) {
  if (nesting)
    /* Ignore indents while nested. */
    return ;

  unsigned int indent = white_count(line) ;

  if (indent == indent_stack[level]) {
    if (!first) printf(" ;") ;
    first = 0 ;
    return ;
  }

  if (indent > indent_stack[level]) {
    printf(" {") ;
    assert(level+1 < MAX_DEPTH) ;
    indent_stack[++level] = indent ;
    return ;
  }

  while (indent < indent_stack[level]) {
    --level ;
    printf(" }") ;
  }

  assert(level >= 0) ;
}

int main(int argc, char* argv[]) {
  indent_stack[0] = 0 ;
  yylex() ;
  printf("\n") ;
}

On this input:

def f(x):
  y = x * x
  e = (m * c
* c)
  if x > 0: 
    return e 
  else:
    return y

the transformer outputs:

 def f ( x ) : { y = x * x ; e = ( m * c * c ) ; 
 if x > 0 : { return e } else : { return y } }

(Inside parentheses, the off-side rule is temporarily disabled.)

Example: Standalone calc lexer

No introduction to lex would be complete without the calculator example.

A few examples illuminate this language better than a specification:

3 + 2

x = 3 ;
4 + x

f(x) = x^2 ;
f(20) + 10

The following lex specification is the standalone S-Expression-producing variant of the lexer for a simple calculator language:

%{
#include <stdio.h>
#include <stdlib.h>
%}

%option yylineno
%option noyywrap

ID     [A-Za-z_][A-Za-z_0-9]*
INT    -?[1-9][0-9]*
OP     [-+*/^=]

%%

 /* Print delimiters. */
[(]         {printf("(left-parenthesis %u)\n", yylineno);}
[)]         {printf("(right-parenthesis %u)\n", yylineno);}
[;]         {printf("(semicolon %u)\n", yylineno);}

 /* Print identifiers, integers and operators. */
{INT}       {printf("(int %s %u)\n",yytext, yylineno);}
{ID}        {printf("(id \"%s\" %u)\n", yytext, yylineno);}
{OP}        {printf("(op \"%s\" %u)\n", yytext, yylineno);}

 /* Ignore comments and whitespace. */
#[^\n]*     {}
[ \t\r\n]   {}

<<EOF>>     {printf("(eof %u)\n", yylineno); return 0;}

%%

int main(int argc, char* argv[]) {
  yylex() ;
  return EXIT_SUCCESS ;
}

Line-number information has been inserted into the S-Expressions, in case the downstream consumer wants to report errors positionally.

Common lex pitfalls and mistakes

In my experience using and teaching lex, there are many common mistakes:

. instead of (.|\n). In most regex implementations, the dot pattern, ., matches any character. In lex, it matches any character except newline.
.* instead of . The pattern .* will almost certainly consume more characters than you expect. It's the same for patterns of the form [^...]*. When these patterns are used, they end up overruling the rule that you think is going to fire, leading to bizarre bugs.
^[ ]* to match indent. If you need to match an indent, it's tempting to use this pattern. Unfortunately, it will not be the longest match if there is no indent. To match indents, use the unput() trick from the second example.
[ \t\r\n]+ to match whitespace. Generally, you want to match whitespace one character at a time. If you match aggressively with this pattern, it will consume multiple blank lines, and prevent rules with the anchors ^ and $ from firing.
Use yywrap() for <<EOF>> . If you want to match end of file, use the <<EOF>> pattern instead of hacking it with yywrap().
Use yytext and unput. By default, unput mangles yytext. To use unput, first duplicate yytext, or use the directive %array.

Related items

The flex manual is a good place to start.
A decade ago, my textbook for scripting language implementation was O'Reilly's lex & yacc; the updated version is flex & bison:

If what you want is a manual on these tools, this book is great. If you want a book on compiler/interpreter construction, this book is incomplete. Its major motivating example is SQL.
Despite its age, there is still much agreement that the best text on lex and yacc is Introduction to Compiler Construction With Unix.