Parsing S-Expressions in Scala

[article index] [] [@mattmight] [rss]

S-Expressions are a text-based representation of tree-structured data.

For instance, one might S-Express the data backing a "ship" as:

 (ship (registry 1701)
       (x        42)
       (y        13))

S-Expressions are the low-calorie version of XML.

S-Expressions are simple to parse, even without access to lexing and parsing tools such as lex and yacc.

When I teach, I require assignments to read and write S-Expressions, since this allows students to use any implementation language.

As a demonstration, this article creates a parser for simplified S-Expressions in Scala without the aid of lexing and parsing tools.

It then shows how to parse these S-Expressions into the abstract syntax tree for a simple Lisp-like programming language.

A simple format

Lisp-like programming languages employ S-Expressions for syntax.

In a synergistic twist, the universal run-time data-type in Lisps is also the S-Expression (or, rather, the run-time data structures that encode them).

The total unification of code and data is what makes Lisp macros unique among all programming languages.

S-Expressions (Symbolic Expressions) encode tree-like data with parentheses to denote grouping.

Minimalist S-Expressions

The grammar for the simplified dialect of S-Expressions in this post is:

  <file> ::= <s-exp-list> 

  <s-exp> ::= <atom>
           |  '(' <s-exp-list> ')'

  <s-exp-list> ::= <sexp> <s-exp-list>
                |  

  <atom> ::= <symbol>
          |  <integer>
          |  #t  |  #f

Comments begin with semicolon (;) and run to the end of the line.

Examples

As data, S-Expressions look like:

 (users
   ((uid 1)   (name root) (gid 1))
   ((uid 108) (name matt) (gid 108))
   ((uid 109) (name ralf) (gid 109)))

As code, S-Expressions look like:

 (define (fact n)
  (if (= n 0)
     1
     (* n (fact (- n 1)))))

Representing S-Expressions

We'll represent S-Expressions with the type SExp:

abstract class SExp {

  /**
   Returns a true S-Exp list as Scala list of S-Exps.
   */
  def toList : List[SExp]

  /**
   Returns true if this is a true S-Exp list; false otherwise.
   */
  def isList : Boolean 
}

Sub-classes represent the core forms as a tree:

case class SInt(val value : Int) extends SExp { ... }
case class SSymbol(val value : String) extends SExp { ...}
case class STrue() extends SExp { ... }
case class SFalse() extends SExp { ... }
case class SCons(val car : SExp, val cdr : SExp) extends SExp { ... }
case class SNil() extends SExp { ... }

Lexing S-Expressions

The class SExpLexer tokenizes a stream of characters:

class SExpLexer(var input : Stream[Char]) {

  // External interface to lexer:
  def peek() : SExpToken 
  def next() : SExpToken 
  def eatLPAR() 
  def eatRPAR() 

  // Internal lexical analysis state:
  private var state : SExpLexerState 

  private trait SExpLexerState {
    def process(c : Char) : SExpLexerState 
    def processEOF() : SExpLexerState
  }

  ...  
}

Tokens themselves have a simple representation:

private trait SExpToken

private case object LPAR extends SExpToken
private case object RPAR extends SExpToken
private case object EOS  extends SExpToken

private case class INT(value : Int) extends SExpToken
private case class HASH(value : String) extends SExpToken
private case class ID(value : String) extends SExpToken

The lexer has six internal states:

  private case object DONE extends SExpLexerState 
  private case object INCOMMENT extends SExpLexerState 
  private case class  INID(buf : List[Char]) extends SExpLexerState 
  private case class  INHASH(buf : List[Char]) extends SExpLexerState 
  private case class  INNUM(buf : List[Char]) extends SExpLexerState 
  private case object INWHITESPACE extends SExpLexerState 

For each kind of state, the methods process(Char) and processEOF() handle the next input, return the new state and may emit zero or more tokens.

The parser interfaces with the lexer by invoking peek(), next() and eatToken().

Parsing with recursive descent

A recursive descent parser has a parsing function for each grammar rule, and it uses look-ahead to craft procedures that follow the natural structure of these rules:

class SExpParser(private val input : Stream[Char]) {

  private val lex = new SExpLexer(input)

  // <file> ::= <s-exp-list> 
  def nextFile () : List[SExp] = 
    lex.peek() match {

      case EOS => List.empty

      case _ => {
        val head = nextSExp()
        val tail = nextFile()
        head :: tail
      }
    }

  // <s-exp> ::= <atom> 
  //         |  '(' <s-exp-list> ')'
  def nextSExp () : SExp = 
    lex.peek() match {

      case EOS => throw ParseException("expected s-exp; got end of input") 
   
      case LPAR => {
        lex.eatLPAR() 
        val sexp = nextSExpList()
        lex.eatRPAR()
        sexp
      }

      case INT(value) => { lex.next() ; SInt(value) }
      case ID(value)  => { lex.next() ; SSymbol(value) }
      case HASH("t")  => { lex.next() ; STrue() }
      case HASH("f")  => { lex.next() ; SFalse() }
    }

  // <s-exp-list> ::= <sexp> <s-exp-list>
  //               |  
  private def nextSExpList () : SExp = 
    lex.peek() match {

      case RPAR => SNil()

      case _ => {
        val head = nextSExp()
        val tail = nextSExpList() 
        SCons(head,tail)
      }
    }
}

From S-Expressions to syntax trees

Once the S-Expressions have been read, a straightforward tree walk can convert them into an internal data structure.

For instance, suppose we want to parse a small, Scheme-like language.

First, we render the grammar as an abstract syntax tree in Scala:

abstract class Exp

/* Core lambda calculus forms. */
case class RefExp(val id : String) extends Exp
case class LambdaExp(val params : List[String],
                     val body : Exp) extends Exp
case class AppExp(val fun : Exp,
                  args : List[Exp]) extends Exp


/* Scheme forms. */

// Conditionals:
case class BoolExp(val value : Boolean) extends Exp
case class IfExp(val cond : Exp, 
                 val ifTrue : Exp, 
                 val ifFalse : Exp) extends Exp
case class AndExp(val cond1 : Exp, val cond2 : Exp) extends Exp
case class OrExp(val cond1 : Exp, val cond2 : Exp) extends Exp


// Numerics:
case class IntExp(val value : Int) extends Exp
case class ZeroPExp(val test : Exp) extends Exp
case class SubExp(val exp1 : Exp, val exp2 : Exp) extends Exp
case class EqExp(val exp1 : Exp, val exp2 : Exp) extends Exp
case class PlusExp(val exp1 : Exp, val exp2 : Exp) extends Exp
case class TimesExp(val exp1 : Exp, val exp2 : Exp) extends Exp


// Binding and recursion:
case class LetExp(val vars : List[String],
                  val exps : List[Exp],
                  val body : Exp) extends Exp

case class LetRecExp(val fun : String,
                     val lambda : Exp,
                     val body : Exp) extends Exp


// Lists:
case class ConsExp(val car : Exp, val cdr : Exp) extends Exp
case class CarExp(val arg : Exp) extends Exp
case class CdrExp(val arg : Exp) extends Exp
case class PairPExp(val arg : Exp) extends Exp
case class NullPExp(val arg : Exp) extends Exp
case class NullExp() extends Exp

And, a companion object the procedure from, a tree-walker that extracts a Scheme expression from an S-Expression:

object Exp {
  def from(sexp : SExp) : Exp = {
    sexp match {

      // References:
      case SSymbol(id) => RefExp(id)

      // Lambda terms:
      case SList(SSymbol("lambda"),params,body) => {
        val vars = params.toList map { case SSymbol(id) => id }
        LambdaExp(vars,from(body))
      }
       

      // Conditionals:
      case STrue() => BoolExp(true)

      case SFalse() => BoolExp(false)

      case SList(SSymbol("if"),cond,ifTrue,ifFalse) =>
       IfExp(from(cond), from(ifTrue), from(ifFalse))

      case SList(SSymbol("and"),a,b) =>
       AndExp(from(a),from(b))

      case SList(SSymbol("or"),a,b) =>
       OrExp(from(a),from(b))


      // Numerics:
      case SInt(value) => IntExp(value) 

      case SList(SSymbol("zero?"), arg) =>
       ZeroPExp(from(arg))

      case SList(SSymbol("-"),a,b) =>
       SubExp(from(a),from(b))

      case SList(SSymbol("+"),a,b) =>
       PlusExp(from(a), from(b))

      case SList(SSymbol("*"),a,b) =>
       TimesExp(from(a),from(b))

      case SList(SSymbol("="),a,b) =>
       EqExp(from(a), from(b))

      
     // Lists:
     case SList(SSymbol("cons"),car,cdr) =>
      ConsExp(from(car),from(cdr))

     case SList(SSymbol("car"),arg) =>
      CarExp(from(arg))

     case SList(SSymbol("cdr"),arg) =>
      CdrExp(from(arg))

     case SList(SSymbol("quote"),SList()) => 
      NullExp()

     case SList(SSymbol("pair?"),arg) => 
      PairPExp(from(arg))

     case SList(SSymbol("null?"),arg) => 
      NullPExp(from(arg))


     // Binding forms:
     case SList(SSymbol("let"),
                bindings,
                body) => {
        val varexps = 
         bindings.toList map {
           case SList(SSymbol(id),exp) =>
             (id,from(exp))
         }
        val (vars,exps) = varexps.unzip
        LetExp(vars,exps,from(body))
      }
    

     case SList(SSymbol("letrec"),
                SList(SList(SSymbol(fun),lambda)),
                body) =>
       LetRecExp(fun,from(lambda),from(body))
    

     // Application
     case SCons(fun,args) => 
       AppExp(from(fun), args.toList map from)

    } 
  }
}

where the procedure from makes significant use of custom extractor SList:

object SList {

  /**
   Matches if the underlying S-Expression is a true list, and
   converts it to a Scala list.
   */ 
  def unapplySeq (sexp : SExp) : Option[List[SExp]] = {
    if (sexp.isList) {
      Some(sexp.toList)
    } else {
      None
    }
  }
}

Code

The full code is available: SExp.scala, Exp.scala.

Related pages