McLexer an McHighlight: Lexical analyzers and syntax-highlighters in JavaScript

[article index] [] [@mattmight] [+mattmight] [rss]

Until now, all of the code on this blog has appeared in boring black-on-grey. To fix this, I wrote a lex-like lexical-analyzer system in JavaScript called McLexer. Then, I added a syntax-highlighting layer (called McHighlight), and created highlighters for JavaScript, Scheme, CSS and plain text. (It even works in IE6!)

For maximum control, the highlighters wrap span tags around elements in the source code, and give them class names indicating to which lexical class they belong, e.g. keyword, identifier, comment, regular expression. This means that an external CSS file can control the highlighting for each language.

I actually had two additional excuses to write this up: I'm building a web-based interface and visualizer for my static analysis tools, and I'm teaching my research seminar how to do static analysis of JavaScript. (In my experience, a great way to learn how to program in a language is to write a compiler in it.)

This article includes (syntax-highlighted) source code for McLexer, McHighlight and the syntax-highlighters. Check back over time; I'll continue making improvements and adding highlighters for more languages.

More resources

McLexer: Lexers for JavaScript

McLexer is a lexical analysis framework for JavaScript. It's not a lexer-generator like lex; rather it's a domain-specific language embedded in JavaScript that produces lexical analyzers dynamically. For speed, simplicity and flexibility, it exploits JavaScript's built-in regular-expression utilities. For performance and scalability, the framework enables stackless, continuation-passing lexers, since JavaScript doesn't support tail-call optimization.

Quick comparison to lex

With a tool like lex, you might have rules that look like:

<INIT>   [_A-Za-z]+    { return(ID) ; }
<INIT>   [0-9]+        { return(NUM) ; } 

In McLexer, these might look like:

INIT    (/[_A-Za-z]+/)    (function () { return ID ; }) ;
INIT    (/[0-9]+/)        (function () { return NUM ; }) ;

Specifying lexers

A specification for a lexer starts by declaring the lexical analysis states, each an instance of McLexer.State. For example:

var INIT       = new McLexer.State() ;
var IN_COMMENT = new McLexer.State() ;

Then, each lexical analysis state is associated with a list of rules, where each rule is a regular expression paired with an action.

Each rule declaration has the form:

state (regex) (action) ;

Informally, such a rule means, "While in the state state, if a prefix of the input matches the pattern regex, then call action."

The parameter action is a function of the form:

  function (match,rest,state) {
    // match is an array produced from matching the prefix
    // rest is the remainder of the input
    // state is the current lexical analysis state

To run a lexer against an input, each state has a run method. The run method makes a single match against the input its given, and then performs the action.

In most cases, lexers should not stop after a single match. To achieve this behavior, functional lexers tail-call the lexer after each match. JavaScript has no tail-call optimzation, so a lexer which uses tail calls would run out of stack space on large inputs. To get around this restriction, McLexer actions can return a continuation to continue lexing.

To support this continuation-based protocol, one can invoke the lex method on a state instead of the run method. The lex method will repeatedly execute the continuations returned by actions until null comes back.

Every state provides a continuation method that, given the remainder of the input, produces a continuation that will continue lexing once invoked.

As a result, a common pattern for actions is:

  function (match,rest,state) {
    // Do something with match[0].
    return state.continuation(rest) ;

If the only behavior desired is to switch states, then the McCONTINUE function abbreviates this action:

 McCONTINUE(newState) == function (match,rest,state) {
                             return newState.continuation(rest) ;

McLexer tokenizer example

McLexer source code

McHighlight: Syntax-highlighting

McHighlight is a framework for building syntax-highlighters on top of McLexer. A new McHighligter object takes an initial lexical analysis state, and it expects to have its methods called by the actions in the state.

A new McHighligter object provides convenience methods that produce actions. These actions abbreviate the code required to, for example, wrap the matched input in a span with a given class, or "harden" whitespace by replacing each space with &nbsp and each newline with <br />. The highlight(element[,code]) method inserts highlighted code into the body of the supplied element; if no code is supplied, then the value of element.innerHTML is used as the code to highlight.

To use a highlighter on your site:

  • Make a copy of mchighlight.js, mchighlight.js and mchighlight-language.js on your site.
  • Include these files on your web page.
  • Include CSS style tag or file that defines how to style the classes for the language.
  • Call Highlighter.highlight(element) on a pre element that has code inside of it.
  • Report bugs to me!

McHighlight source code

McHighlight example: JavaScript

McHighlight example: Scheme

McHighlight example: CSS

McHighlight example: Plain text