Friday, October 11, 2019

BibTeX Rework: Syntax Update

A rework of the BibTeX parser has been on the backlog since at least August 15, 2017, and recently I started working on actually carrying it out — systematically. There were a number of things to be improved:

  1. Complete syntax support: again, supporting BibTeX by looking at examples leads to a lack of support for less common features like @string, @preamble and parentheses for enclosing entries instead of braces.
  2. More complete mappings: since I did not have any specifications when making the BibTeX parser, I could not find a complete list of fields, hence no complete mapping.
  3. Distinction between BibTeX and BibLaTeX: although there may not be any problems when importing, using year/month/day or date matters a lot if you have to output either.
  4. Proper schema validation: BibTeX defines required fields, but Citation.js does not check if those all exist.

In this blogpost, I will describe how I went about solving the first point: complete syntax support. Part of the problem was that I was running a bad parser, which was difficult to extend and not performing that well.

To improve this, I collected a number of BibTeX parsers to compare them on a number of criteria: performance, syntax support, build steps, and ease of maintenance. I used two single-entry BibTeX files for debugging, and a longer BibTeX file (5.2 MiB, 3345 entries) for some rough performance testing. The outcomes:

Parser       Using                   Time (single entry)  Time (3345 entries)  Syntax
Current      TokenStack              ~8ms                 ~1800ms              old
Idea         moo, Grammar            ~2ms                 ~1150ms              old
Idea (new)   moo, Grammar            ~3ms                 ~750ms               new
Generator    moo, nearley            ~20ms                N/A                  new
astrocite    PEG.js                  ~9ms                 ~1670ms              new
fiduswriter  biblatex-csl-converter  ~160ms               ~119000ms            new
Zotero       BibTeX translator       ~180ms               ~31000ms             old

So, the current parser was performing pretty well actually, especially compared to astrocite which I still consider a good target to aim for. TokenStack, however, was an unnecessarily complex part resulting in poor performance — and poor maintainability.

I had some trouble with PEG.js so I turned to other approaches. One thing I came across was nearley. However, this would introduce both an extra build step and an extra run-time dependency, and as the table shows did not perform very well. I assume that is on me, and my grammar-writing capabilities. One good thing that did come out of it was the use of a tokenizer or lexer, like moo.

After nearly finishing an approach using moo and Grammar, a simplified version of TokenStack with built-in support for rules, something else came up and I dropped the subject for about a year. However, recently I started over, saw my old approach and copied some stuff from there. This resulted in an even more simplified Grammar, with only matchToken, consumeToken and consumeRule support; no backtracking was needed in the end. Also, the performance was pretty good, and it was easier to implement the new syntax.
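To give an idea of what such a Grammar might look like, here is a minimal sketch. Only the method names matchToken, consumeToken and consumeRule come from the post; the token shapes, the toy tokenizer (standing in for a lexer like moo), the two rules and the entry key are assumptions made for illustration, and the real BibTeX syntax is a lot richer than this.

```javascript
// Hypothetical sketch of a backtracking-free grammar helper.
class Grammar {
  constructor (rules, mainRule) {
    this.rules = rules
    this.mainRule = mainRule
  }

  parse (tokens) {
    this.tokens = tokens
    this.index = 0
    return this.consumeRule(this.mainRule)
  }

  // Look at the next token without consuming it
  matchToken (type) {
    const token = this.tokens[this.index]
    return token !== undefined && token.type === type
  }

  // Consume the next token, failing loudly if it is not of the expected type
  consumeToken (type) {
    const token = this.tokens[this.index]
    if (!token || token.type !== type) {
      throw new SyntaxError(`Expected "${type}", got "${token ? token.type : 'EOF'}"`)
    }
    this.index++
    return token
  }

  // Run another rule at the current position
  consumeRule (name) {
    return this.rules[name].call(this)
  }
}

// Toy tokenizer (an assumption); characters that match no pattern are skipped
function tokenize (text) {
  const pattern = /(@)|([A-Za-z_]\w*)|(\{)|(\})|(=)|(,)|"([^"]*)"/g
  const types = ['at', 'identifier', 'lbrace', 'rbrace', 'equals', 'comma', 'string']
  return Array.from(text.matchAll(pattern), match => {
    const group = match.findIndex((value, i) => i > 0 && value !== undefined)
    return { type: types[group - 1], value: match[group] }
  })
}

// Illustrative rules for a heavily simplified entry: @type{label, field = "value"}
const rules = {
  entry () {
    this.consumeToken('at')
    const type = this.consumeToken('identifier').value.toLowerCase()
    this.consumeToken('lbrace')
    const label = this.consumeToken('identifier').value
    const properties = {}
    while (this.matchToken('comma')) {
      this.consumeToken('comma')
      if (this.matchToken('rbrace')) { break } // allow a trailing comma
      const [field, value] = this.consumeRule('field')
      properties[field] = value
    }
    this.consumeToken('rbrace')
    return { type, label, properties }
  },
  field () {
    const name = this.consumeToken('identifier').value.toLowerCase()
    this.consumeToken('equals')
    return [name, this.consumeToken('string').value]
  }
}

const grammar = new Grammar(rules, 'entry')
const entry = grammar.parse(tokenize('@Book{example1, title = "An example", year = "1942"}'))
console.log(entry)
```

Because every rule decides what to do next by peeking at a single token, the parser never has to rewind, which is the property the paragraph above refers to.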

nearley grammar diagram

To make sure I had good results, I took some other parsers: Fidus Writer’s biblatex-csl-converter package and the Zotero BibTeX translator. The former was easy to set up, as it was just an npm package, while the latter involved quite some tricks: installing the Translation Server directly from GitHub, pointing an ENV variable to its config directory and running a piece of setup code, collecting all the translators I presume. Neither seemed to perform well in comparison to either my old parser, my new parser or astrocite, and I stress-tested all of them in terms of syntax:

@String {maintainer = "Xavier D\\'ecoret"}

{ "Maintained by " # maintainer }

@String(stefan = "Stefan Swe{\\i}g")
@String(and = " and ")

  Author =	 stefan # and # maintainer,
  title =	 { The {impossible} TEL---book },
  publisher =	 { D\\"ead Po$_{e}$t Society},
  year =	 1942,
  month =        mar

One area of expansion is all the ways BibTeX has to escape Unicode characters. Besides diacritics, which I should support completely, I think Zotero and astrocite are ahead in terms of completeness of symbols like \copyright. Then again, there is a great, really big list of LaTeX symbols, and not everyone needs every symbol, nor is everything represented in Unicode. I think the best way to do this is to expose a function in the configuration to expand the default supported macros, but let me know if something else comes to mind.
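One possible shape for such a configuration hook is sketched below. None of these names exist in Citation.js: DEFAULT_MACROS, createMacroExpander and the handful of macros in the table are all hypothetical, chosen only to show how user-supplied macros could be layered on top of a default set.

```javascript
// Hypothetical defaults; a real table would cover diacritics and many more symbols
const DEFAULT_MACROS = { copyright: '\u00A9', pounds: '\u00A3', ss: '\u00DF' }

// Build an expander from the defaults plus whatever the user adds or overrides
function createMacroExpander (extraMacros = {}) {
  const macros = { ...DEFAULT_MACROS, ...extraMacros }
  // Replace \name (optionally followed by {}) when the name is known,
  // and leave unknown macros untouched
  return text => text.replace(/\\([A-Za-z]+)(?:\{\})?/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(macros, name) ? macros[name] : match)
}

const expand = createMacroExpander({ texteuro: '\u20AC' })
console.log(expand('\\copyright{} 2019, 5 \\texteuro{}, \\unknown'))
```

Leaving unknown macros as they are keeps the input recoverable, so users can spot what is missing and add exactly the symbols they need.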

The new parser, in its current form, has been published as part of Citation.js v0.5.0-alpha.3.

Friday, August 16, 2019

Debugging the Karmabug

For reasons I will not go through right now, I needed a new library for making synchronous HTTP requests in Node.js. I know what you are saying, “But that’s one of the seven deadly sins of JavaScript!” 1 Well, just know I had my reasons, and I wanted to replace sync-request.

Since I was already using the Fetch API with node-fetch in the async part of my library, I thought: why not build a sync-fetch, using node-fetch under the hood like sync-request uses (then-)request? Two days later, it was actually working, as far as I could tell. However, if I wanted to publish this horror I would need some actual testing.

Luckily, node-fetch has a nice test suite; I only needed to convert 194 test cases to use the synchronous API. Not fun work, but worth its while, maybe. Anyway, the first test cases worked, but then it got stuck on the first actual request.

This is where I have to introduce you to the Karmabug. You see, after some testing I figured out that only the combination of my sync-fetch and the test server just… stopped. The arguments were correct, my sync-fetch worked elsewhere and the test server worked with node-fetch, but this combination simply did not. Investigating either pointed to the other, and I had no idea what to do next.

That would make a good tweet, I thought. “Karma for making sync HTTP requests I guess.” Literally two minutes later it hit me: that was exactly what was going on. The test server could not respond to the requests because the request itself was blocking the event loop.
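The effect is easy to reproduce in isolation. The sketch below is not the sync-fetch code: a busy-wait loop stands in for the blocking request and a zero-delay timer stands in for the test server, which lives in the same process and can only act when the event loop is free. The names blockFor and demo are made up for this illustration.

```javascript
// Busy-wait, standing in for a synchronous HTTP request that blocks the thread
function blockFor (milliseconds) {
  const end = Date.now() + milliseconds
  while (Date.now() < end) { /* nothing else can run in the meantime */ }
}

function demo () {
  let handled = false
  // Stand-in for the in-process test server: its callback needs the event loop
  setTimeout(() => { handled = true }, 0)
  blockFor(50)
  // The timer expired long ago, but its callback has not had a chance to run
  const handledWhileBlocked = handled
  return { handledWhileBlocked, isHandled: () => handled }
}

const result = demo()
console.log(result.handledWhileBlocked) // false
setTimeout(() => console.log(result.isHandled()), 10) // true, once the loop is free
```

With a real blocking request the wait never ends, because the response it is waiting for can only be produced by the callback that is being held up.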

1 Incidentally, the seven deadly sins of JavaScript all happen to be Sloth.

Monday, August 5, 2019

Citation.js: RIS Rework Pt. 2

In the last post I explained how I started implementing the RIS specification that I found in the Internet Archive, only to discover that there is an older specification, which seems to be more common at times.

Now, I have implemented the old spec, which luckily was not nearly as complex. One thing that came to my attention was that there were a lot of redundant tags: for title, there are TI, T1, CT and usually BT; for journal names there’s JO & JF, and JA, J1, & J2 for abbreviated journal names. While I can imagine some nuanced difference in meaning between those tags, those meanings are not documented, and not trivial to figure out either. Anyway, that is not a problem for the implementation.

I also updated the implementation of the new spec, to fix some mistakes and add some more mappings. In addition, because in real life there seem to be some implementations that export a mix of the two specifications, I created an implementation based on the new spec, that if needed can defer to the old one — and some random properties that Wikipedia and Zotero have picked up somewhere, and are not in either spec.

How do the results look? First of all, the example that was giving me issues in the last post looks a lot nicer now:

{ issue: '1',
  page: '230-265',
  type: 'article-journal',
  volume: '47',
  title: 'On computable numbers, with an application to the Entscheidungsproblem',
  author: [ { family: 'Turing', given: 'Alan Mathison' } ],
  issued: { 'date-parts': [ [ 1937 ] ] },
  'container-title': 'Proc. of London Mathematical Society' }

The only thing that was missing when I first tried this out was the end of the page range, because SP in the new spec is the entire page range, while in the old spec you need both SP and EP. I had to fix that manually — not a problem, just something to keep in mind when re-running the scripts.
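The SP/EP difference described above could be handled along these lines. This is a sketch under assumptions: getPageRange is a hypothetical helper, and treating any SP value that contains a hyphen as an already complete range is a heuristic of this example, not something either specification prescribes.

```javascript
// Hypothetical helper: derive a page range from RIS SP (start page) and EP (end page)
function getPageRange (record) {
  const { SP, EP } = record
  if (SP === undefined) { return undefined }
  // New spec: SP may already hold the whole range, so leave it alone
  if (EP === undefined || SP.includes('-')) { return SP }
  // Old spec: the range is split over SP and EP
  return `${SP}-${EP}`
}

console.log(getPageRange({ SP: '230', EP: '265' })) // 230-265
console.log(getPageRange({ SP: '230-265' })) // 230-265
```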

One other thing to check was how the mappings look from above, without all the type-specific shenanigans. I keep a (public) spreadsheet with mappings from CSL-JSON to all kinds of different formats, so I added the RIS mappings. So, a sanity check. Does it make sense?

RIS mappings

No, not at all! The RIS tag SE is mapped to 10 different CSL variables, and the CSL number variable is mapped to 9 different RIS tags. To me, it does not make any sense, even accounting for the fact that I know there is some variation between entry types.

The question that remains is, does it work? Even if it does not look like it makes sense, the output could still make sense, if other implementations follow the specification to a similar degree. I know Zotero does not entirely follow it — all the spec anomalies that are implemented are attributed to weird EndNote quirks, not the weird spec quirks.

That made me wonder to what degree the EndNote implementation follows the specification. However, I do not have EndNote, so this is a call to action! Can you help me clear up the RIS cloud by submitting your RIS exports to a CC0-licensed repo? Preferably with all kinds of reference types: articles, books, chapters, conference papers, webpages, software, patents, bills, maps, artworks, whatever you can find. For legal reasons, please replace abstracts and other copyrightable content by [... omitted].

In the meantime, I will be collecting RIS exports from other sources, like Zotero and Mendeley and websites like Google Scholar, BMC and PubMed Central. If you know of any other sources, please let me know!

Tuesday, July 30, 2019

Citation.js: RIS Rework Pt. 1

So a while ago I was looking around for the RIS specification again. I had not found it earlier, only a reference implementation from Zotero, a surprisingly complete list of tags and types on Wikipedia and some examples from various websites and programs exporting RIS files. They did not seem to go together well, however. There were some slight differences in tags here and there, and a bunch of useful tags listed by Wikipedia were labelled “degenerate” in the Zotero codebase, and only used for imports — implying some sort of problem.

What could be going on? Well, I checked out the references on the Wikipedia page again, to see if there really was no official specification or some other more reliable source where it got its information from. And, suddenly, there was an actual source this time. I do not know how I missed it earlier, but there was a page (archived) that linked to a zip file containing a PDF file with general specifications and an Excel file with sheets with property lists for all different types.

That sounded useful, so I spent waaayy too much time writing a script that turns those sheets (with a bunch of user input) into usable mappings for Citation.js. I just finished that today, apart from some… questionable mappings, but I wanted to at least test the final script with an example. As for the results, well, see for yourself. The example, from the Wikipedia page (CC-BY-SA 3.0 Unported), was

T1  - On computable numbers, with an application to the Entscheidungsproblem
A1  - Turing, Alan Mathison
JO  - Proc. of London Mathematical Society
VL  - 47
IS  - 1
SP  - 230
EP  - 265
Y1  - 1937
ER  -

and my results were

{ issue: 1, page: 230, type: 'article-journal', volume: 47 }

That looked really weird and disappointing. Again, what could possibly be going on here? The example on Wikipedia is using T1, A1, JO and Y1 while the specs say to use TI, AU, T2 and PY here. Where are these differences coming from?
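To make the mismatch concrete, here is a small sketch that reads RIS lines and renames the four old-style tags to the new-style tags mentioned above. It is not the Citation.js implementation: parseRis and TAG_ALIASES are hypothetical names, and the alias table contains only the pairs named in this post.

```javascript
// Old-style tags and the new-style tags named in the post for the same data
const TAG_ALIASES = { T1: 'TI', A1: 'AU', JO: 'T2', Y1: 'PY' }

// Each RIS line is a two-character tag, two spaces, a hyphen and then the value
function parseRis (text) {
  const record = {}
  for (const line of text.split(/\r?\n/)) {
    const match = line.match(/^([A-Z][A-Z0-9])  -(?: (.*))?$/)
    if (!match) { continue }
    const [, tag, value = ''] = match
    if (tag === 'ER') { break } // end of record
    const key = TAG_ALIASES[tag] || tag
    // Tags can repeat (for example one line per author), so collect values in arrays
    record[key] = (record[key] || []).concat(value)
  }
  return record
}

const example = [
  'T1  - On computable numbers, with an application to the Entscheidungsproblem',
  'A1  - Turing, Alan Mathison',
  'JO  - Proc. of London Mathematical Society',
  'VL  - 47',
  'IS  - 1',
  'SP  - 230',
  'EP  - 265',
  'Y1  - 1937',
  'ER  -'
].join('\n')

const record = parseRis(example)
console.log(record.TI, record.AU, record.T2, record.PY)
```

Without the alias step, a mapping that only knows TI, AU, T2 and PY sees nothing but the volume, issue and page tags, which matches the disappointing output above.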

After some digging around on Wikipedia I found a comment saying that there are in fact two specifications: one from 2011 and one from before. The archived spec I checked out was from 2012 (as linked by Wikipedia!), while they use the version from before 2011, which luckily is still available. To be continued.

Wednesday, June 19, 2019

Citation.js: Usability Update

Citation.js just had a bunch of tooling updated, which should make a lot of use cases easier. Let’s go through them:

Replacer (GitHub, Demo)

Replacer demo

Replacer is an HTML API for Citation.js that I recently updated for the new version so that it works with components. Basic usage:

  data-input="Wikidata ID/DOI/ISBN/GitHub repo/..."
  Fallback text for if things go wrong. This is treated
  as input if data-input is omitted.

<script src="replacer.js"></script>

Additional features:

To get a replacer.js with the tools you want, use the Bundle tool:

Bundle Tool (GitHub, Demo)

Bundle tool screenshot

The Bundle tool is a small website where you can compose a bundle of Citation.js components, choosing exactly what you need. You can pick the plugins you want, choose to add Replacer functionality or leave the core out altogether to split some files.

After clicking Create Bundle you are redirected to a page where you can download your live-created bundle. Please do not include that script on your page; there is no caching so every time you load your page it would have to re-run Browserify and that could be a bit problematic for other users and for Glitch, where it is hosted.

For such cases, and because I have to update component versions manually and might not have done so yet when you need them, you can also host it yourself. The code is on GitHub, and npm install && npm start should work perfectly fine. Alternatively, you can remix the project on Glitch, which is probably a lot easier (just one click).

CLI (GitHub)

The CLI is also updated to include some of the new options. Specifically, with some tricks special output options are now supported, as well as input options and plugin configuration. For example:

# Prefer Japanese when picking Wikidata labels
citation-js --plugin-config "@wikidata.langs=[ja,en]"

# Force the parser to assume it is a generic URL (and not a
# specific one)
citation-js --input-force-type "@else/url"

# Load the ISBN plugin (not included by default)
# Note: this still loads the plugins that are included by default
citation-js --plugins isbn

# Do not sort the bibliography with the algorithm specified by the
# style/template, but rather keep the original (input) order.
# Note: you still need "-s citation-apa" et cetera
citation-js --formatter-opts "nosort=false"

So, hopefully that improves the ways people use Citation.js. As always, feedback is welcome on the various GitHub repos or on Gitter.

Saturday, May 25, 2019

Citation.js: Wikidata Update

The new update, v0.4.4, contains a few Wikidata improvements (commit):

  • 22 new mappings
  • 2 fixed mappings (ISSN did not work and publisher-place was mapped to the wrong thing)
  • 2 improved mappings (container-title for chapters and more URL mappings)
  • 1 removed mapping (genre was inconsistent with the intended use, although it followed the specification)

Because most of these mappings require additional resources (recipient has to fetch people, review-* has to fetch the review subject and original-publisher-place has to fetch the country and location of the publisher of the original version of a work) this would have caused a lot more requests to the API. It is a good thing, then, that there is a new resolver in place, which pre-fetches such information for all requested works at once.

This has to be done in levels; you cannot fetch the country if you do not have the location, the location requires the publisher and to find out the publisher you have to have the work already. However, temporarily ignoring the cap of 50 items per request, the number of requests is now only based on those levels (which is at most five), instead of the number of claims requiring additional requests.
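The level-by-level idea can be sketched as follows. This is an illustration under assumptions, not the resolver from v0.4.4: the ITEMS table is an in-memory stand-in for the Wikidata API, the item names are made up, and fetchItems only counts how often it is called.

```javascript
// Hypothetical stand-in for the API: item id -> ids of the items it refers to
const ITEMS = {
  work: ['author1', 'author2', 'publisher'],
  author1: [],
  author2: [],
  publisher: ['city'],
  city: ['country'],
  country: []
}

const MAX_IDS_PER_REQUEST = 50
let requestCount = 0

// Pretend to fetch a batch of items in a single request
function fetchItems (ids) {
  requestCount++
  return ids.map(id => ({ id, references: ITEMS[id] }))
}

// Fetch everything reachable from the starting ids, one level at a time
function resolve (startIds) {
  const fetched = new Map()
  let level = [...new Set(startIds)]
  while (level.length > 0) {
    const next = new Set()
    // One request per chunk of at most 50 ids within this level
    for (let i = 0; i < level.length; i += MAX_IDS_PER_REQUEST) {
      for (const item of fetchItems(level.slice(i, i + MAX_IDS_PER_REQUEST))) {
        fetched.set(item.id, item)
        item.references.forEach(id => next.add(id))
      }
    }
    // Only ids that have not been fetched yet make up the next level
    level = [...next].filter(id => !fetched.has(id))
  }
  return fetched
}

const items = resolve(['work'])
console.log(items.size, requestCount) // 6 4
```

Both authors and the publisher share the second request, so the request count follows the depth of the chain (work, publisher, city, country) rather than the number of referenced items.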

For example, item Q21972834 (Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data) takes:

  • 6 requests in v0.4.0-rc.2, before requests for properties were grouped (so it would make a request for every author)
  • 3 requests in v0.4.3: one for the work, one for the authors, and one for journal
  • 2 requests in v0.4.4, which includes a bunch of new mappings that would have cost a total of 5 requests in the previous version

Grouping the requests from all works has the added benefit of not having to fetch the same author and journal info repeatedly — which can be quite common for bibliographies. For example, the bibliography of Groovy Cheminformatics with the Chemistry Development Kit, currently containing 74 items, takes 11 requests in the new version. For the same bibliography, the previous version would make at least 150 requests — for each item one for all the authors and one for the journal — and probably more than that.

[core] GET {}  
[core] GET {}  
[core] GET {}  
[core] GET {}  
[core] GET  480%7CQ1709878%7CQ900502&format=json&languages=en {}  
[core] GET {}  
[core] GET {}  
[core] GET {}  
[core] GET {}  
[core] GET {}  
[core] GET {}  

However, I do notice slightly more requests with a low number of items than I would expect. I will look into that.

Conference papers in Wikidata

Part of the mappings that were added in the recent update are the event-* properties, i.e. “name of the related event (e.g. the conference name when citing a conference paper)” (spec). Finding out how those are modelled in Wikidata proved a bit of a challenge, as most instances of Q23927052 (conference paper) are published in proceedings with the book type. To get some numbers (query):

type       typeLabel              items
Q571       book                   715
Q5633421   scientific journal     676
Q1143604   proceedings            315
Q16024164  medical journal        53
Q23927052  conference paper       48
Q1002697   periodical literature  32
Q41298     magazine               29
other      other                  61

48 conference papers are published in conference papers? That does not seem right. But maybe they just have the wrong type, but still link to the event? Well no, not really (query).

hasEvent  items
false     1952
true      6

Luckily, the information is not lost: usually, the label contains location and date information about the event, although extracting that would probably be a tedious task. However, perhaps there are a lot of proper proceedings out there, but the articles linked to it just have the wrong type (query)?

type       typeLabel                         items
Q13442814  scholarly article                 3513
Q23927052  conference paper                  315
Q3331189   version, edition, or translation  11
Q1143604   proceedings                       9
Q10885494  scientific conference paper       7
Q1980247   chapter                           6
Q191067    article                           5
Q18918145  academic journal article          4
Q333291    abstract                          3
other      other                             10

That seems to be the case: most works that are published in instances of proceedings are tagged as instances of scholarly articles, and only nine are tagged as both. And: relatively many have events linked to them (query).

hasEvent  items
true      2692
false     1184

That’s better. I’ll look into working together with WikiCite to work on the rest.

Thursday, April 11, 2019

Citation.js Version 0.4: The Catch-up

After about two years, finally v0.4.0 is ready for a release. Whether it is a real milestone given that there have been prereleases for about two years, while 0.3.x only lasted three weeks, is for you to decide but I am glad that everything I planned for this version is implemented now.

Citation.js logo


So, let’s go through the major changes.


Most importantly, the code is fully modularized now. The different components (the core, the CLI and all the individual format plugins) are available on their own now. The citation-js package is now designed as a shell around those components, maintaining full backwards compatibility (apart from mapping changes) until that is no longer required.


A lot of different formats were either improved or introduced. The main newcomer is RIS, still only as output format. Support for NDJSON output was also added. The handling of styling and case-preserving brackets in BibTeX fields was improved, although the method of parsing those files is still lagging behind.

Thanks to (independent) practical testing by Egon Willighagen and Jakob Voss and the wikidata-sdk library by Maxime Lathuilière we were also able to improve the Wikidata mapping quite a lot, although there is still work ahead.


Apart from the potential for customization that comes with the separate modules, there is quite some new configuration available. First of all, input parsing has some options: strict which basically switches between errors and failing silently (not-so-long ago the default behavior); and target which allows the user to specify a certain point at which the parsing should be stopped, mainly useful for debugging.

Furthermore, individual input formats can be configured now too. For example, the default languages used in the Wikidata plugin can be configured now. The methods to add CSL templates and locales, which were already available, use the same mechanism now.

Finally, CSL output with citeproc-js has been amended. First and foremost, support for citation has been added (or rather, the original citation-* has been renamed to bibliography), so it is possible to get actual citations now. Also, the nosort option has been added and the prepend/append options have been improved.

Stability & best practices

Also important, there’s a lot more testing now and the commit messages and code style follow standard guidelines. Making a new repository for the different modules actually helped with that, allowing me to re-evaluate the decisions I made in that aspect.

The full changelog is available here.

Paper & Preprint

In the middle of all that (literally, the development halted for a few months) I wrote a paper about Citation.js, which I am currently revising after review. The preprint is available here.

Further development

There’s still a lot to be done, most of which has been discussed in the “Discussion” section of the preprint. I’ll go into some things that are planned for the (relatively) near future, other than additional mappings of course.

(Another) Wikidata refactor

Part of the recent development to finish the release was a refactoring of the Wikidata parsing code to facilitate some changes and mainly to reduce unnecessary code duplication, originally the result of sync/async variants of almost every function. However, while working on that I came up with an even better refactor which should minimize the problems of under-fetching that I described in the preprint.

To recap, under-fetching is the problem of not being able to fetch enough information in the desired amount of HTTP requests. For example, in Wikidata it’s not possible to get item labels of property values such as P50 (author), and they require another HTTP call. It is not possible to fetch them together with the publication item you are fetching since you do not know who the authors are yet.

So the minimal number of requests (not taking into account the limit of 50 items per request) is one for each “level” you need to go down: author labels would be the second level, and fetching the labels of name items if you decide to use P735 (given name) and P734 (family name) would be the third. That’s what this refactor will try to accomplish, and it shouldn’t be too complex, though it is still non-trivial.

Turning the chain parser into a tree parser

Currently the input is parsed iteratively, in a “chain” of types. For example, a Wikidata ID becomes a Wikidata API URL, which returns JSON, which gets parsed, and the resulting API response gets transformed into CSL-JSON. This also ensures that people could input a Wikidata API URL if they want, and that fetching and parsing JSON from a web resource doesn’t have to be implemented over and over again.
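A stripped-down version of such a chain might look like the sketch below. It rests on several assumptions: the type names, the PARSERS registry and the mocked response are invented for this example, no network request is made, and each step simply declares which type comes next instead of detecting the type of its output.

```javascript
// Mocked API response so the example runs without network access (hypothetical URL)
const MOCK_RESPONSES = {
  'https://example.org/entity/Q1.json': '{"id":"Q1","label":"Example item"}'
}

// Hypothetical registry: every type knows how to get one step closer to CSL-JSON
const PARSERS = {
  'wikidata-id': { next: 'wikidata-url', parse: id => `https://example.org/entity/${id}.json` },
  'wikidata-url': { next: 'json-text', parse: url => MOCK_RESPONSES[url] },
  'json-text': { next: 'wikidata-object', parse: text => JSON.parse(text) },
  'wikidata-object': { next: 'csl-json', parse: entity => ({ id: entity.id, title: entity.label }) }
}

// Follow the chain until the target type is reached
function parseChain (input, type, target = 'csl-json') {
  let data = input
  while (type !== target) {
    const step = PARSERS[type]
    if (!step) { throw new TypeError(`No parser for type "${type}"`) }
    data = step.parse(data)
    type = step.next
  }
  return data
}

console.log(parseChain('Q1', 'wikidata-id'))
// Starting further down the chain works too, as does stopping early at another target
console.log(parseChain('https://example.org/entity/Q1.json', 'wikidata-url'))
console.log(parseChain('Q1', 'wikidata-id', 'json-text'))
```

The reuse is visible in the second call: an API URL enters the same chain one step later, so fetching and parsing the JSON only has to be written once.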

When encountering arrays, if they are not recognized as some special array, each element is parsed until the target (CSL-JSON) is reached, at which point it returns. However, this proved problematic with some new features, like the target option (parse until this format is reached, instead of the default CSL-JSON). Firstly, such options cannot be passed down to the other parsing functions, and secondly if a certain target format is reached in the array elements then the total format is always going to be an array of that, and will never match the target.

A refactor of the parser, branching it instead of whatever is going on currently, should counter this.

More citeproc interactions

citeproc is currently used in an almost stateless manner, which works fine for bibliographies but is quite limiting for working with citations. While users could redirect CSL output to citeproc themselves, why not do it for them with more bibliography management, which Citation.js mostly lacks at the moment?

Ditching CSL-JSON as the central format?

CSL-JSON is great, but also quite limiting considering the available properties, the general problems of which are also discussed in the preprint. While additions could be made to CSL in terms of variables, which Frank Bennett has already done with CSL-M, and which the CSL team is doing now with a new release finally specifying a computer program type, CSL remains intended for generating citations, not for storing bibliographical data. At least, according to a quote that I cannot find anymore.

However, this becomes quite clear practically. I count 10 of the 78 fields being specific to citing the item. An additional 4 are (usually) redundant and 3 poorly defined, one with just a question mark as description. On top of that, 12 could be replaced by allowing references to other publications, which would require just 4 fields instead, and 4 that could be replaced by other entities such as venues requiring no additional properties.

Still, moving to a different format is quite something and this will have to be discussed. Of course, support for CSL-JSON wouldn’t be removed either way, just not exclusively used for storage. Also, these changes would take place way after any pending changes.

And that is it for this post. Any feedback on the release and ideas for the things discussed above is very welcome here, in the GitHub repo or on Twitter (@larswillighagen).

Thursday, January 3, 2019

Citation.js: Wikidata Subclasses

I am in the process of creating a better way to map CSL types to Wikidata types together with Jakob Voß, and while listing all subtypes of web page, I came across a number of these cases (link):

SELECT ?a ?aLabel ?b ?bLabel WHERE {
  wd:Q58803899 wdt:P279* ?a .
  ?a wdt:P279 ?b .
  ?b wdt:P279* wd:Q386724 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

I’m no expert on Wikidata data models, nor on the semantics of each of those items specifically, but I’m pretty sure Count of Barcelos (Q58803899) is not a subclass of web page (Q36774). However, the individual parts seem pretty reasonable, so where does it go wrong then?

Luckily, the description of subclass of (P279) provides some good guidance:

all instances of these items are instances of those items; this item is a class (subset) of that item.

So let’s check those:

  1. All counts of Barcelos are Portuguese counts (conde), as Barcelos lies in Portugal
  2. All condes are counts
  3. Counts are not hereditary titles themselves, so I think this should be an instance of (P31). Also, counts are indirectly subclass of royal or noble rank through hereditary titles, but directly instance of royal or noble rank.
  4. Hereditary titles do not seem like an exact subclass of royal or noble ranks to me, but it passes the basic test
  5. I am unsure about royal or noble rank as subclass of rank, since the latter seems way more abstract, but perhaps that’s the point
  6. (rank -> first-order metaclass) I have no idea what the intention with these metaclasses is, so I’m going to assume this is all correct (bold move)
  7. first-order metaclass -> fixed-order metaclass
  8. fixed-order metaclass -> Wikidata metaclass: “class of classes, class whose instances are classes”
  9. Wikidata metaclass -> class or metaclass of Wikidata ontology (should maybe be P31 as well, still no clue though)
  10. class or metaclass of Wikidata ontology -> Wikidata item: finally back into some form of concrete-ness. Problem being, not all instances of royal or noble ranks are Wikidata items, so something went wrong somewhere in between, maybe even at (6).
  11. Wikidata item -> Wikidata internal entity
  12. Wikidata internal entity -> Wikidata internal item
  13. Wikidata internal item -> MediaWiki page
  14. MediaWiki page -> web page: actually pretty reasonable, instances of Wikidata items are web pages.

There are a little less than 900 instances of web page so the impact is smaller than I had expected, but it’s still annoying. There are 191 subclasses of conde, not to mention the total of 280 royal or noble ranks that have no business being a subclass of web page.