The tool: bib2sx
If you just want the tool, there’s no reason to read the full article.
bib2sx is on GitHub.
By default, it takes a .bib file on stdin and outputs S-Expressions on stdout.
Flags unlock more features:
- --inline inlines @string definitions;
- --flatten joins string values together, preserving BibTeX quoting;
- --json outputs JSON;
- --xml outputs XML; and
- --bib outputs canonicalized BibTeX.
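For example, a typical invocation might look like this (refs.bib is just a stand-in for your own bibliography file):

$ bib2sx < refs.bib                              # S-Expressions on stdout
$ bib2sx --inline --json < refs.bib > refs.json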
Output format
The output format for the S-Expressions fits the following grammar:
<bibtex-ast> ::= (<item> ...)

<item>       ::= (<item-type> <id> <attr> ...)

<attr>       ::= (<name> <expr> ...)
               | (<name> . <string>)   ; with --flatten

<expr>       ::= <string>
               | '<expr>
               | '(<expr> ...)
               | <id>

<name>       ::= <id>

<item-type>  ::= inproceedings | article | ...
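For instance (a hand-written illustration rather than captured output), a title containing a brace-protected fragment becomes an attribute whose value is a list of expressions:

(article Might:2015:BibTeX
  (title "Why parsing " '("BibTeX") " is hard")
  (year "2015"))

The Examples section below shows the full output for a complete entry.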
LaTeX resources
If you’d like to learn more about LaTeX (and BibTeX) in general, I recommend The LaTeX Companion.
If you want to get deep into the guts of TeX itself, Knuth’s very own reference, The TeXbook, is one of the few resources out there.
Why write a BibTeX parser?
I’m tired of manually extracting data from my BibTeX file.
In the past year, here’s a list of all the different formats into which I have had to manually transform my bibliographic data:
- generating my CV
- updating publications for my web site
- updating publications for my lab
- uploading publications for the faculty annual review site
- uploading publications produced under an NSF grant (x4)
- uploading publications for a DARPA quarterly report (x9)
- formatting publications for a DoE annual report
- formatting publications for an NSF biosketch (x2)
- formatting publications for an expanded NSF biosketch (x2)
- formatting publications for a DoE proposal
- formatting publications for an NIH biosketch (x2)
It’s insane to keep doing all of this manually.
The data is all inside of a BibTeX file.
It just needs to be moved into a format that’s easy to transform.
Why write yet another BibTeX parser?
Unfortunately, despite its apparent simplicity, the BibTeX format has a few quirks that make it challenging to process correctly with pretty much any tool other than BibTeX itself.
I tried a few of the BibTeX parsing packages out there, only to find that they all had subtle issues with formatting.
For instance, the built-in Racket BibTeX parser in scriblib/bibtex doesn’t preserve nested braces inside strings, and it deletes some whitespace while preserving other whitespace.
As a result, an entry like this:
@Article{Might:2015:BibTeX,
  author  = { Matthew Might },
  title   = { Why parsing {BibTeX} is hard },
  journal = { Journal of {LaTeX} },
  year    = 2015
}
turns into this:
#hash((type . "Article")
      ("author" . " Matthew Might ")
      ("title" . " Why parsing BibTeXis hard ")
      ("journal" . " Journal of LaTeX")
      ("year" . "2015"))
Unfortunately, the nested braces have semantic value to the formatter: they’re telling it not to change capitalization when emitting bibliographic entries.
(One of the odd parts of BibTeX is its aggressively ignorant recapitalization in the service of formatting guidelines, and the nested braces serve as a manual override for that ignorance.)
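For example (a generic illustration, not one of the entries discussed here), a bibliography style that lowercases titles would happily emit "bibtex" unless the fragment is protected:

title = {A Retargetable Backend for {BibTeX}}

The outer braces delimit the field value; the inner braces around {BibTeX} tell the formatter to leave that fragment’s capitalization alone.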
I’ve run this tool through some sizeable BibTeX files from a variety of sources, so it should be able to parse most .bib files.
If you find a file it can’t handle, let me know.
Lexically analyzing BibTeX
As with languages such as Python (indentation sensitivity), JavaScript (regex versus division) and C (typename versus identifier), the complexity in BibTeX is easily handled by granting context-awareness to the lexical analyzer.
In the case of BibTeX, the context of which the lexer needs to be aware is:
- Am I currently inside a string literal value (e.g., "foo bar")?
- What is the current level of curly-brace nesting?
The same lexeme may become different tokens depending on the answers to these questions.
In short, inside a string literal, virtually everything (including whitespace) should be treated as a literal string token.
And the same becomes true once the level of {}-nesting is two deep or more, except that the whitespace inside the second level of braces should still be ignored.
A lexer in Racket
If you are not familiar with the lexical analysis package in Racket, I recommend my post on lexing in Racket.
The first step is to create abbreviations and token types for the lexer:
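(The snippets below assume the standard lexing and parsing libraries are in scope; roughly the following requires, though bib2sx’s actual module header may differ.)

#lang racket

; lexer/parser machinery from the parser-tools collection, plus
; streams for the lazily produced token stream
(require parser-tools/lex
         (prefix-in : parser-tools/lex-sre)
         parser-tools/yacc
         racket/stream)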
(define-lex-abbrev bibtex-id
  (:+ (char-complement (char-set " \t\r\n{}@#,\\\""))))
(define-empty-tokens PUNCT (@ |{| |}| |"| |#| |,| =))
(define-empty-tokens EOF (EOF))
(define-tokens EXPR (ID STRING SPACE))
And, then, the lexer itself:
(define (bibtex-lexer port [nesting 0] [in-quotes? #f])

  ; helpers to recursively call the lexer with defaults:
  (define (lex port)
    (bibtex-lexer port nesting in-quotes?))

  (define (lex+1 port)
    ; increase {}-nesting
    (bibtex-lexer port (+ nesting 1) in-quotes?))

  (define (lex-1 port)
    ; decrease {}-nesting
    (bibtex-lexer port (- nesting 1) in-quotes?))

  (define (lex-quotes port)
    ; toggle inside quotes
    (bibtex-lexer port nesting (not in-quotes?)))

  (define (not-quotable?)
    ; #t iff not inside a string context
    (and (not in-quotes?) (< nesting 2)))

  {(lexer
    [(eof)
     empty-stream]

    [(:+ whitespace)
     (if (not-quotable?)
         (lex input-port)
         (stream-cons (token-SPACE lexeme)
                      (lex input-port)))]

    ["#"
     (stream-cons (if (not-quotable?)
                      (token-#) (token-STRING lexeme))
                  (lex input-port))]

    ["@"
     (stream-cons (if (not-quotable?)
                      (token-@) (token-STRING lexeme))
                  (lex input-port))]

    ["="
     (stream-cons (if (not-quotable?)
                      (token-=) (token-STRING lexeme))
                  (lex input-port))]

    [","
     (stream-cons (if (not-quotable?)
                      (|token-,|) (token-STRING lexeme))
                  (lex input-port))]

    [#\"
     (cond
       [in-quotes?
        ;=>
        ; pretend we're closing a {}-string
        (stream-cons (|token-}|)
                     (lex-quotes input-port))]

       [(and (not in-quotes?) (= nesting 1))
        ;=>
        ; pretend we're opening a {}-string
        (stream-cons (|token-{|)
                     (lex-quotes input-port))]

       [(and (not in-quotes?) (>= nesting 2))
        ;=>
        (stream-cons (token-STRING lexeme)
                     (lex input-port))])]

    ["\\"
     (stream-cons (token-STRING "\\")
                  (lex input-port))]

    ["\\{"
     (stream-cons (token-STRING "{")
                  (lex input-port))]

    ["\\}"
     (stream-cons (token-STRING "}")
                  (lex input-port))]

    [(:: "{" (:* whitespace))
     (begin
       (stream-cons (|token-{|)
                    (if (and (<= nesting 1) (not in-quotes?))
                        (lex+1 input-port)
                        (if (= (string-length lexeme) 1)
                            (lex+1 input-port)
                            (stream-cons (token-SPACE (substring lexeme 1))
                                         (lex+1 input-port))))))]

    [(:: (:* whitespace) "}")
     (begin
       (stream-cons (|token-}|)
                    (if (and (<= nesting 2) (not in-quotes?))
                        (lex-1 input-port)
                        (if (= (string-length lexeme) 1)
                            (lex-1 input-port)
                            (stream-cons (token-SPACE (substring
                                                       lexeme 0
                                                       (- (string-length
                                                           lexeme) 1)))
                                         (lex-1 input-port))))))]

    [(:+ numeric)
     (stream-cons (token-STRING lexeme)
                  (lex input-port))]

    [bibtex-id
     (stream-cons (if (not-quotable?)
                      (token-ID (string->symbol lexeme))
                      (token-STRING lexeme))
                  (lex input-port))])
   port}
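Because the lexer produces a lazy stream of tokens, it is easy to test in isolation. A quick sanity check might look like this (the entry text is a made-up example; it assumes the definitions above are loaded):

; Force the lazy token stream into a list so every token is visible.
(define example-port
  (open-input-string "@article{key, title = {Parsing {BibTeX}}}"))

(stream->list (bibtex-lexer example-port))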
Parsing BibTeX
With most of the complexity moved to the lexer, the grammar for the parser is simplified:
<bibtex>  ::= <item> ...

<item>    ::= @ <id> { <taglist> }

<taglist> ::= <tag> , <taglist>
            | <tag>
            |

<tag>     ::= <id>
            | <id> = <expr>

<expr>    ::= <atom> # <expr>
            | <atom>

<atom>    ::= <id>
            | <string>
            | <space>
            | { <atom> ... }
It is straightforward to transcribe this grammar into Racket’s parser form:
(define bibtex-parse
  (parser
   [grammar
    (itemlist [{item itemlist}  (cons $1 $2)]
              [{}               '()]
              [{ID itemlist}    $2]
              [{|,| itemlist}   $2])
    (item     [{|@| ID |{| taglist |}|}
               ; =>
               (cons (symbol-downcase $2) $4)])
    (tag      [{ID}             $1]
              [{ID = expr}      (cons (symbol-downcase $1)
                                      (flatten+simplify $3))])
    (expr     [{atom |#| expr}  (cons $1 $3)]
              [{atom}           (list $1)])
    (atom     [{ID}             (symbol-downcase $1)]
              [{STRING}         $1]
              [{SPACE}          $1]
              [{ |{| atomlist |}| }  (list 'quote $2)])
    (atomlist [{atom atomlist}  (cons $1 $2)]
              [{}               '()])
    (taglist  [{tag |,| taglist}  (cons $1 $3)]
              [{tag}              (list $1)]
              [{}                 '()])]
   [tokens PUNCT EOF EXPR]
   [start itemlist]
   [end EOF]
   [error (lambda (tok-ok? tok-name tok-value)
            (error (format "parsing error: ~a ~a ~a"
                           tok-ok? tok-name tok-value)))]))
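To hook the lexer and parser together, note that the generated parser pulls tokens by calling a zero-argument procedure rather than by consuming a stream. One possible driver (a sketch, not necessarily how bib2sx itself does it, and assuming helpers like symbol-downcase and flatten+simplify are defined) wraps the token stream in such a thunk:

; Feed the lazy token stream to the yacc-style parser, emitting the
; EOF token once the stream runs dry.
(define (parse-bibtex-port port)
  (define tokens (bibtex-lexer port))
  (bibtex-parse
   (lambda ()
     (if (stream-empty? tokens)
         (token-EOF)
         (let ([tok (stream-first tokens)])
           (set! tokens (stream-rest tokens))
           tok)))))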
Once the parser runs, the BibTeX database is available as a giant, easy-to-manipulate S-Expression.
It is worth noting that, to preserve the structure of BibTeX strings, string values are split into lists of “expressions.”
An expression can be a literal string, an identifier (assignable by @string), a quoted string (indicating {}-wrapping), or a quoted sequence of expressions.
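For a sense of how to consume that structure, a small recursive walk (not part of bib2sx) can collapse a field value back into a plain string, discarding the {}-quoting:

; Flatten a parsed field value into a plain string.
(define (expr->string expr)
  (cond
    [(string? expr) expr]
    [(symbol? expr) (symbol->string expr)]         ; an @string identifier
    [(and (pair? expr) (eq? (car expr) 'quote))    ; a '... ({}-wrapped) form
     (expr->string (cadr expr))]
    [(list? expr) (apply string-append (map expr->string expr))]
    [else ""]))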
The tool bib2sx comes equipped with several useful transformers, including XML, JSON and BibTeX.
Examples
Given the following:
@string{
  latex = "LaTeX"
}

@Article{Might:2015:BibTeX,
  author  = "Matthew Might",
  title   = {Why parsing {{Bib}TeX} is hard},
  journal = "Journal of " # latex,
  year    = 2015
}
bib2sx yields the following:
$ bib2sx
((string (latex "LaTeX"))
 (article
  Might:2015:BibTeX
  (author "Matthew Might")
  (title "Why parsing " '('"Bib" "TeX") " is hard")
  (journal "Journal of " latex)
  (year "2015")))

$ bib2sx --inline

((article
  Might:2015:BibTeX
  (author "Matthew Might")
  (title "Why parsing " '('"Bib" "TeX") " is hard")
  (journal "Journal of " "LaTeX")
  (year "2015")))

$ bib2sx --inline --flatten

((article
  Might:2015:BibTeX
  (author . "Matthew Might")
  (title . "Why parsing {{Bib}TeX} is hard")
  (journal . "Journal of LaTeX")
  (year . "2015")))

$ bib2sx --inline --json
[
  { "author": ["Matthew Might"],
    "title": ["Why parsing ",[["Bib"], "TeX"]," is hard"],
    "journal": ["Journal of ","LaTeX"],
    "year": ["2015"],
    "bibtexKey": "Might:2015:BibTeX" }
]

$ bib2sx --inline --flatten --json

[
  { "author": "Matthew Might",
    "title": "Why parsing {{Bib}TeX} is hard",
    "journal": "Journal of LaTeX",
    "year": "2015",
    "bibtexKey": "Might:2015:BibTeX" }
]

$ bib2sx --inline --bib
@article{Might:2015:BibTeX,
  author = { Matthew Might },
  title = { Why parsing {{Bib}TeX} is hard },
  journal = { Journal of LaTeX },
  year = { 2015 },
}

$ bib2sx --inline --flatten --xml
<bibtex>
  <article>
    <bibtex-key>
      Might:2015:BibTeX
    </bibtex-key>
    <author value="Matthew Might">
    </author>
    <title value="Why parsing {{Bib}TeX} is hard">
    </title>
    <journal value="Journal of LaTeX">
    </journal>
    <year value="2015">
    </year>
  </article>
</bibtex>

$ bib2sx --inline --xml
<bibtex>
  <article>
    <bibtex-key>
      Might:2015:BibTeX
    </bibtex-key>
    <author>
      Matthew Might
    </author>
    <title>
      Why parsing
      <quote>
        <quote>
          Bib
        </quote>
        TeX
      </quote>
      is hard
    </title>
    <journal>
      Journal of
      LaTeX
    </journal>
    <year>
      2015
    </year>
  </article>
</bibtex>
Related reading
- Realm of Racket is an introduction to programming in Racket with an emphasis on game programming.