Friday, October 11, 2019

BibTeX Rework: Syntax Update

BibTeX Rework: Syntax Update

A rework of the BibTeX parser has been on the backlog since at least August 15, 2017, and recently I started working on actually carrying it out — systematically. There were a number of things to be improved:

  1. Complete syntax support: again, supporting BibTeX by looking at examples leads in a lack of support for less seen features like @string, @preamble and parentheses for enclosing entries instead of braces.
  2. More complete mappings: since I did not have any specifications when making the BibTeX parser, I could not find a complete list of fields, hence no complete mapping.
  3. Distinction between BibTeX and BibLaTeX: although there may not be any problems when importing, using year/month/day or date matters a lot if you have to output either.
  4. Proper schema validation: BibTeX defines required fields, but Citation.js does not check if those all exist.

In this blogpost, I will describe how I went about solving the first point: complete syntax support. Part of the problem was that I was running a bad parser, which was difficult to extend and not performing that well.

To improve this, I collected a number of BibTeX parsers to compare them on a number of criteria: performance, syntax support, build steps, and ease of maintaining. I used two single-entry BibTeX files for debugging, and a longer BibTeX file (5.2 MiB, 3345 entries) for some rough performance testing. The outcomes:

Using Time (single entry) Time (3345 entries) Syntax
Current TokenStack ~8ms ~1800ms old
Idea moo, Grammar ~2ms ~1150ms old
Idea (new) moo, Grammar ~3ms ~750ms new
Generator moo, nearley ~20ms N/A new
astrocite PEG.js ~9ms ~1670ms new
fiduswriter biblatex-csl-converter ~160ms ~119000ms new
Zotero BibTeX translator ~180ms ~31000ms old

So, the current parser was performing pretty well actually, especially compared to astrocite which I still consider a good target to aim for. TokenStack, however, was an unnecessarily complex part resulting in poor performance — and poor maintainability.

I had some trouble with PEG.js so I turned to other approaches. One thing I came across was nearley. However, this would introduce both an extra build step and an extra run-time dependency, and as the table shows did not perform very well. I assume that is on me, and my grammar-writing capabilities. One good thing that did come out of it was the use of a tokenizer or lexer, like moo.

After nearly finishing an approach using moo and Grammar, a simplified version of TokenStack with built-in support of rules, something else came up and I dropped the subject for about a year. However, recently I started over, saw my old approach and copied some stuff from there. This resulted in a even more simplified Grammar, with only matchToken, consumeToken and consumeRule support — no backtracking was needed in the end. Also, the performance was pretty good, and it was easier to implement the new syntax.

nearley grammar diagram
nearley grammar diagram

To make sure I had good results, I took some other parsers: Fidus Writer’s biblatex-csl-converter package and the Zotero BibTeX translator. The former was easy to set up, as it was just an npm package, while the latter involved quite some tricks: installing the Translation Server directly from GitHub, pointing an ENV variable to its config directory and running a piece of setup code, collecting all the translators I presume. Neither seemed to perform well in comparison to either my old parser, my new parser or astrocite, and I stress-tested all of them in terms of syntax:

@String {maintainer = "Xavier D\\'ecoret"}

{ "Maintained by " # maintainer }

@String(stefan = "Stefan Swe{\\i}g")
@String(and = " and ")

  Author =	 stefan # and # maintainer,
  title =	 { The {impossible} TEL---book },
  publisher =	 { D\\"ead Po$_{e}$t Society},
  year =	 1942,
  month =        mar

One area of expansion is all the ways BibTeX has to escape Unicode characters. Besides diacritics, which I should support completely, I think Zotero and astrocite are ahead in terms of completeness of symbols like \copyright. Then again, there is a great, really big list of LaTeX symbols, and not everyone needs every symbol — nor is everything represented in Unicode. I think the best way to do this is to expose function in the configuration to expand the default supported macros, but let me know if something else comes to mind.

The new parser, in its current form, has been published as part of Citation.js v0.5.0-alpha.3.