Parsing BibTeX in Racket and generating S-Expressions, JSON, XML and BibTeX


BibTeX is a tool for managing and automatically generating bibliographies when using the TeX/LaTeX document preparation system.

It takes much of the tedium out of the process of writing and formatting bibliographies.

At the heart of BibTeX is a plaintext .bib file with entries like:

  author =       {Matthew Might and Olin Shivers},
  title =        {Environment Analysis via {$\Delta$CFA}},
  booktitle =    {Proceedings of the 33rd Annual {ACM} Symposium 
                  on the Principles of Programming Languages
                  ({POPL} 2006)},
  pages =        {127--140},
  year =         {2006},
  address =      {Charleston, South Carolina},
  month =        {January},
  abstract = {
    We describe a new program-analysis framework, based on CPS
    and procedure-string abstractions, that can handle critical
    analyses which the k-CFA framework cannot. We present the main
    theorems concerning correctness, show an application analysis,
    and describe a running implementation.
  },
  annote = {
    My first publication.

    TODO: Explore whether this works better with pushdown systems.
  }

Researchers also take notes in their .bib files, so that the files serve as a searchable record of their journey through the literature.

To make it easier to search, transform and extract from BibTeX files, I created a simple command-line Racket script for parsing BibTeX.

The script supports:

  • inlining all @string declarations;
  • transforming into S-Expressions;
  • transforming into JSON;
  • transforming into XML; and
  • transforming into a canonical BibTeX format.

The ability to transform .bib files into a wide variety of accessible formats unlocks this richly structured data for more than simply managing citations: the file becomes a platform for auto-generating CVs, research reports, web sites and more.

Read on for a short article on how to parse BibTeX.

The tool: bib2sx

If you just want the tool, there’s no reason to read the full article.

bib2sx is on GitHub.

By default, it takes a .bib file on stdin and outputs S-Expressions on stdout.

Flags unlock more features:

  • --inline inlines @string definitions;
  • --flatten joins string values together, preserving BibTeX quoting;
  • --json outputs JSON;
  • --xml outputs XML; and
  • --bib outputs canonicalized BibTeX.

Output format

The S-Expression output conforms to the following grammar:

 <bibtex-ast> ::= (<item> ...)

 <item> ::= (<item-type> <id> <attr> ...)

 <attr> ::= (<name> <expr> ...)
         |  (<name> . <string>)  ; with --flatten

 <expr> ::= <string>
         |  '<expr>
         |  '(<expr> ...)
         |  <id>

 <name> ::= <id>

 <item-type> ::= inproceedings | article | ...
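
To give a feel for how this shape can be consumed, here is a minimal sketch (not part of bib2sx) that walks a hand-written database in the --flatten form. The names sample-db, item-attr and titles-in-year are made up for illustration, the item type article is an assumption, and the second entry is entirely hypothetical:

```racket
#lang racket

;; A hand-written database in the --flatten shape:
;;   (<item-type> <id> (<name> . <string>) ...)
;; The item type `article` is assumed; the second entry is hypothetical.
(define sample-db
  '((article Might:2015:BibTeX
      (author . "Matthew Might")
      (title . "Why parsing {{Bib}TeX} is hard")
      (journal . "Journal of LaTeX")
      (year . "2015"))
    (misc Example:2014:Note
      (title . "A hypothetical entry")
      (year . "2014"))))

;; item-attr : item symbol -> (or/c string #f)
;; Looks up one attribute of an item; #f if the attribute is absent.
(define (item-attr item name)
  (match item
    [(list* _type _id attrs)
     (cond [(assq name attrs) => cdr]
           [else #f])]))

;; titles-in-year : db string -> (listof string)
;; Collects the titles of all items published in the given year.
(define (titles-in-year db year)
  (for/list ([item (in-list db)]
             #:when (equal? (item-attr item 'year) year))
    (item-attr item 'title)))
```

With this in place, (titles-in-year sample-db "2015") evaluates to a one-element list holding "Why parsing {{Bib}TeX} is hard". Because the flattened form is just an association list per item, ordinary list functions are enough to filter and extract.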

LaTeX resources

If you’d like to learn more about LaTeX (and BibTeX) in general, I recommend The LaTeX Companion.

If you want to get deep into the guts of TeX itself, Knuth’s very own reference is one of the few resources out there.

Why write a BibTeX parser?

I’m tired of manually extracting data from my BibTeX file.

In the past year, I have had to manually transform my bibliographic data for each of the following:

  • generating my CV
  • updating publications for my web site
  • updating publications for my lab
  • uploading publications for the faculty annual review site
  • uploading publications produced under an NSF grant (x4)
  • uploading publications for a DARPA quarterly report (x9)
  • formatting publications for a DoE annual report
  • formatting publications for an NSF biosketch (x2)
  • formatting publications for an expanded NSF biosketch (x2)
  • formatting publications for a DoE proposal
  • formatting publications for an NIH biosketch (x2)

It’s insane to keep doing all of this manually.

The data is all inside of a BibTeX file.

It just needs to be moved into a format that’s easy to transform.

Why write yet another BibTeX parser?

Unfortunately, despite its apparent simplicity, there are a few quirks in the file format that make it challenging to properly use with pretty much any tool other than BibTeX.

I tried a few BibTeX parsing packages out there, only to find they all had subtle issues with formatting.

For instance, the built-in Racket BibTeX parser in scriblib/bibtex doesn’t preserve nested braces inside strings, and it deletes some whitespace, while preserving other whitespace.

As a result, an entry like this:

  author = { Matthew Might },
  title = { Why parsing {BibTeX} is hard },
  journal = { Journal of {LaTeX} },
  year = 2015

turns into this:

  #hash((type . "Article")
        ("author" . " Matthew Might ")
        ("title" . " Why parsing BibTeXis hard ")
        ("journal" . " Journal of LaTeX")
        ("year" . "2015")))

Unfortunately, the nested braces have semantic value to the formatter: they’re telling it not to change capitalization when emitting bibliographic entries.

(One of the odd parts of BibTeX is its aggressively ignorant recapitalization in the service of formatting guidelines, and the nested braces serve as a manual override for that ignorance.)

I’ve run this tool through some sizeable BibTeX files from a variety of sources, so the tool should be able to parse most .bib files.

If you find a file it can’t handle, let me know.

Lexically analyzing BibTeX

As with languages such as Python (indentation sensitivity), JavaScript (regex versus division) and C (typename versus identifier), the complexity in BibTeX is easily handled by granting context-awareness to the lexical analyzer.

In the case of BibTeX, the context of which the lexer needs to be aware is:

  • Am I currently inside a string literal value? (E.g., "foo bar")

  • What is the current level of curly-brace nesting?

The same lexeme may become different tokens depending on the answer to these questions.

In short, inside a string literal, virtually everything (including whitespace) should be treated as a literal string token.

And, the same becomes true once the level of {}-nesting is two deep or more, except that the whitespace inside the second level of braces should still be ignored.
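
To make the idea concrete, here is a small sketch of how the same lexeme can map to different tokens depending on those two pieces of context. It is a simplification for illustration only: classify and string-context? are hypothetical names, tokens are modeled as plain symbols and pairs rather than real lexer tokens, and the special handling of whitespace next to braces is left out.

```racket
#lang racket

;; string-context? : natural boolean -> boolean
;; True when the lexer should treat input as string material:
;; inside "..." or at least two levels of {} deep.
(define (string-context? nesting in-quotes?)
  (or in-quotes? (>= nesting 2)))

;; classify : string natural boolean -> (or/c symbol pair #f)
;; Returns a punctuation symbol, a (KIND . text) pair, or #f to skip.
(define (classify lexeme nesting in-quotes?)
  (define in-string? (string-context? nesting in-quotes?))
  (cond
    ;; whitespace is dropped outside strings and kept inside them
    [(regexp-match? #px"^\\s+$" lexeme)
     (if in-string? (cons 'SPACE lexeme) #f)]
    ;; punctuation is only punctuation outside strings
    [(member lexeme '("@" "#" "=" ","))
     (if in-string? (cons 'STRING lexeme) (string->symbol lexeme))]
    ;; numbers are always string values
    [(regexp-match? #px"^[0-9]+$" lexeme)
     (cons 'STRING lexeme)]
    ;; anything else is an identifier outside strings, text inside
    [else
     (if in-string?
         (cons 'STRING lexeme)
         (cons 'ID (string->symbol lexeme)))]))
```

For example, (classify "," 1 #f) yields the punctuation symbol for a comma, while (classify "," 2 #f) yields the pair (STRING . ","): the same character, but a different token once the lexer is two braces deep.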

A lexer in Racket

If you are not familiar with the lexical analysis package in Racket, I recommend my post on lexing in Racket.

The first step is to create abbreviations and token types for the lexer:

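; lexer and parser libraries; the ":" prefix yields :+, :* and ::
(require parser-tools/lex
         (prefix-in : parser-tools/lex-sre)
         parser-tools/yacc)

(define-lex-abbrev bibtex-id 
  (:+ (char-complement (char-set " \t\r\n{}@#,\\\""))))

(define-empty-tokens PUNCT (@ |{| |}| |"| |#| |,| =))
(define-empty-tokens EOF (EOF))

(define-tokens EXPR (ID STRING SPACE))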

And, then, the lexer itself:

(define (bibtex-lexer port [nesting 0] [in-quotes? #f])
  ; helpers to recursively call the lexer with defaults:
  (define (lex port) 
    (bibtex-lexer port nesting in-quotes?))
  (define (lex+1 port) 
    ; increase {}-nesting
    (bibtex-lexer port (+ nesting 1) in-quotes?))
  (define (lex-1 port) 
    ; decrease {}-nesting
    (bibtex-lexer port (- nesting 1) in-quotes?))
  (define (lex-quotes port)
    ; toggle inside quotes
    (bibtex-lexer port nesting (not in-quotes?)))
  (define (not-quotable?) 
    ; iff not inside a string context
    (and (not in-quotes?) (< nesting 2)))
  ; NOTE: each rule's pattern is inferred from the token its branch emits
  ((lexer
    ; assumed end-of-input rule: the parser expects an EOF token
    [(eof)
     (stream-cons (token-EOF) empty-stream)]
    [(:+ whitespace)
     (if (not-quotable?)
         (lex input-port)
         (stream-cons (token-SPACE lexeme)
                      (lex input-port)))]
    ["#"
     (stream-cons (if (not-quotable?) 
                      (token-#) (token-STRING lexeme))
                  (lex input-port))]
    ["@"
     (stream-cons (if (not-quotable?) 
                      (token-@) (token-STRING lexeme))
                  (lex input-port))]
    ["="
     (stream-cons (if (not-quotable?) 
                      (token-=) (token-STRING lexeme))
                  (lex input-port))]
    [","
     (stream-cons (if (not-quotable?) 
                      (|token-,|) (token-STRING lexeme))
                  (lex input-port))]
    ["\""
     (cond
       [in-quotes?
        ; pretend we're closing a {}-string
        (stream-cons (|token-}|)
                     (lex-quotes input-port))]
       [(and (not in-quotes?) (= nesting 1))
        ; pretend we're opening a {}-string
        (stream-cons (|token-{|)
                     (lex-quotes input-port))]
       [(and (not in-quotes?) (>= nesting 2))
        (stream-cons (token-STRING lexeme)
                     (lex input-port))])]
    ; a lone backslash
    ["\\"
     (stream-cons (token-STRING "\\")
                  (lex input-port))]
    ; an escaped open brace
    ["\\{"
     (stream-cons (token-STRING "{")
                  (lex input-port))]
    ; an escaped close brace
    ["\\}"
     (stream-cons (token-STRING "}")
                  (lex input-port))]
    [(:: "{" (:* whitespace))
     (stream-cons (|token-{|)
                  (if (and (<= nesting 1) (not in-quotes?))
                      (lex+1 input-port)
                      (if (= (string-length lexeme) 1)
                          (lex+1 input-port)
                          (stream-cons (token-SPACE (substring lexeme 1))
                                       (lex+1 input-port)))))]
    [(:: (:* whitespace) "}")
     (stream-cons (|token-}|)
                  (if (and (<= nesting 2) (not in-quotes?))
                      (lex-1 input-port)
                      (if (= (string-length lexeme) 1)
                          (lex-1 input-port)
                          (stream-cons (token-SPACE (substring 
                                                     lexeme 0 
                                                     (- (string-length
                                                         lexeme) 1)))
                                       (lex-1 input-port)))))]
    ; numbers are always string values; this rule must precede bibtex-id
    [(:+ numeric)
     (stream-cons (token-STRING lexeme)
                  (lex input-port))]
    [bibtex-id
     (stream-cons (if (not-quotable?)
                      (token-ID (string->symbol lexeme)) 
                      (token-STRING lexeme))
                  (lex input-port))])
   port))

Parsing BibTeX

With most of the complexity moved to the lexer, the grammar for the parser is simplified:

<bibtex> ::= <item> ...

<item> ::= @ <id> { <taglist> }

<taglist> ::= <tag> , <taglist>
           |  <tag>

<tag> ::= <id>
       |  <id> = <expr>

<expr> ::= <atom> # <expr>
        |  <atom>

<atom> ::= <id>
        |  <string>
        |  <space>
        |  { <atom> ... }

It is straightforward to transcribe this grammar into Racket’s parser form:

; symbol-downcase and flatten+simplify are helpers defined in bib2sx
(define bibtex-parse
  (parser
   [grammar
    (itemlist [{item itemlist}  (cons $1 $2)]
              [{}               '()]
              [{ID   itemlist}  $2]
              [{|,|  itemlist}  $2])
    (item [{|@| ID |{| taglist |}|} 
           ; =>
           (cons (symbol-downcase $2) $4)])

    (tag [{ID}           $1]
         [{ID = expr}    (cons (symbol-downcase $1)
                               (flatten+simplify $3))])
    (expr [{atom |#| expr}       (cons $1 $3)]
          [{atom}                (list $1)])
    (atom  [{ID}                  (symbol-downcase $1)]
           [{STRING}              $1]
           [{SPACE}               $1]
           [{ |{| atomlist |}| }  (list 'quote $2)])
    (atomlist [{atom atomlist}    (cons $1 $2)]
              [{}                 '()])
    (taglist [{tag |,| taglist}   (cons $1 $3)]
             [{tag}               (list $1)]
             [{}                 '()])]
   [tokens PUNCT EOF EXPR]
   [start itemlist]
   [end EOF]
   [error (lambda (tok-ok? tok-name tok-value)
            (error (format "parsing error: ~a ~a ~a"
                           tok-ok? tok-name tok-value)))]))

Once the parser runs, the BibTeX database is available as a giant, easy-to-manipulate S-Expression.

It is worth noting that to preserve the structure of BibTeX strings, strings are split into lists of “expressions.”

An expression can be a literal string, an identifier (assignable by @string), a quoted string (indicating {}-wrapping) and a quoted sequence of expressions.
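
As a rough illustration of why this representation is convenient, the sketch below inlines @string identifiers and then flattens an expression list back into a single BibTeX-quoted string. This is not the code inside bib2sx: the names quoted?, inline-exprs, flatten-expr and flatten-exprs are hypothetical, and the sketch assumes every @string value is a single plain string.

```racket
#lang racket

;; quoted? : any -> boolean
;; True for a two-element list of the form (quote <something>).
(define (quoted? e)
  (and (list? e) (= (length e) 2) (eq? (first e) 'quote)))

;; inline-exprs : (listof expr) (hash/c symbol string) -> (listof expr)
;; Replaces top-level identifiers with their @string values, if known.
(define (inline-exprs es table)
  (for/list ([e (in-list es)])
    (if (symbol? e) (hash-ref table e e) e)))

;; flatten-expr : expr -> string
;; Renders one expression, restoring {} around quoted parts.
(define (flatten-expr e)
  (cond
    [(string? e) e]
    [(symbol? e) (symbol->string e)] ; an identifier that was not inlined
    [(quoted? e)
     (define body (second e))
     (string-append "{"
                    (if (and (list? body) (not (quoted? body)))
                        (flatten-exprs body)  ; quoted sequence
                        (flatten-expr body))  ; quoted single expression
                    "}")]
    [else (error 'flatten-expr "unexpected expression: ~e" e)]))

;; flatten-exprs : (listof expr) -> string
;; Concatenates the rendered expressions.
(define (flatten-exprs es)
  (apply string-append (map flatten-expr es)))

;; Example data, written by hand in the parser's output shape.
(define title-exprs   '("Why parsing " '('"Bib" "TeX") " is hard"))
(define journal-exprs '("Journal of " latex))
(define strings       (hash 'latex "LaTeX"))
```

Here (flatten-exprs title-exprs) gives "Why parsing {{Bib}TeX} is hard", and (flatten-exprs (inline-exprs journal-exprs strings)) gives "Journal of LaTeX". Keeping the quoting as structure until the last moment means the braces that protect capitalization are never lost along the way.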

The tool bib2sx comes equipped with several useful transformers, including XML, JSON and BibTeX.


Given the following:

@string{ latex = "LaTeX" }

  author = "Matthew Might",
  title = {Why parsing {{Bib}TeX} is hard},
  journal = "Journal of " # latex,
  year = 2015

bib2sx yields the following:

$ bib2sx
((string (latex "LaTeX"))
  (author "Matthew Might")
  (title "Why parsing " '('"Bib" "TeX") " is hard")
  (journal "Journal of " latex)
  (year "2015")))
$ bib2sx --inline
  (author "Matthew Might")
  (title "Why parsing " '('"Bib" "TeX") " is hard")
  (journal "Journal of " "LaTeX")
  (year "2015")))
$ bib2sx --inline --flatten
  (author . "Matthew Might")
  (title . "Why parsing {{Bib}TeX} is hard")
  (journal . "Journal of LaTeX")
  (year . "2015")))
$ bib2sx --inline --json
{ "author": ["Matthew Might"],
 "title": ["Why parsing ",[["Bib"], "TeX"]," is hard"],
 "journal": ["Journal of ","LaTeX"],
 "year": ["2015"],
 "bibtexKey": "Might:2015:BibTeX" }
$ bib2sx --inline --flatten --json
{ "author": "Matthew Might",
 "title": "Why parsing {{Bib}TeX} is hard",
 "journal": "Journal of LaTeX",
 "year": "2015",
 "bibtexKey": "Might:2015:BibTeX" }
$ bib2sx --inline --bib
  author = { Matthew Might },
  title = { Why parsing {{Bib}TeX} is hard },
  journal = { Journal of LaTeX },
  year = { 2015 },
$ bib2sx --inline --flatten --xml
    <author value="Matthew Might">
    <title value="Why parsing {{Bib}TeX} is hard">
    <journal value="Journal of LaTeX">
    <year value="2015">
$ bib2sx --inline --xml
      Matthew Might
      Why parsing 
       is hard
      Journal of 

Related reading

  • Realm of Racket is an introduction to programming in Racket with an emphasis on game programming.