Wednesday, July 18, 2018

Citation.js: Use Case for a Wikidata GraphQL API

Citation.js has supported Wikidata input for a long time. However, I’ve always had some trouble with the API. See, when Citation.js processes Wikidata API output (which looks like this) and gets to, say, the P50 (author) property, it encounters this:

"P50": [
	{
		"mainsnak": {
			"snaktype": "value",
			"property": "P50",
			"hash": "1202966ec4cf715d3b9ff6faba202ac6c6ac3df8",
			"datavalue": {
				"value": {
					"entity-type": "item",
					"numeric-id": 2062803,
					"id": "Q2062803"
				},
				"type": "wikibase-entityid"
			},
			"datatype": "wikibase-item"
		},
		"type": "statement",
		"id": "Q46601020$692cc18d-4f54-eb65-8f0a-2fbb696be564",
		"rank": "normal"
	}
]

The problem with this is that there’s no name string readily available: to get the name of this author, or of any author, journal, publisher, et cetera, Citation.js has to make extra queries to the API.
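To make the extra round trips concrete, here is a minimal sketch of collecting the entity IDs that would each need a follow-up lookup. The `collectEntityIds` helper and the trimmed `claims` sample are illustrative assumptions, not part of Citation.js; only the statement shape is taken from the output above.

```javascript
// Trimmed sample in the shape of the API output shown above.
const claims = {
  P50: [
    {
      mainsnak: {
        snaktype: 'value',
        property: 'P50',
        datavalue: {
          value: { 'entity-type': 'item', 'numeric-id': 2062803, id: 'Q2062803' },
          type: 'wikibase-entityid'
        },
        datatype: 'wikibase-item'
      },
      type: 'statement',
      rank: 'normal'
    }
  ]
}

// Hypothetical helper: list every item ID referenced by the given properties.
function collectEntityIds (claims, properties) {
  const ids = new Set()
  for (const property of properties) {
    // Properties that are absent from the entity are simply skipped.
    for (const statement of claims[property] || []) {
      const { snaktype, datavalue } = statement.mainsnak
      if (snaktype === 'value' && datavalue.type === 'wikibase-entityid') {
        ids.add(datavalue.value.id)
      }
    }
  }
  return [...ids]
}

// P50 is author, P1433 is "published in" (not present in this sample).
const missingLabels = collectEntityIds(claims, ['P50', 'P1433'])
console.log(missingLabels) // every ID listed here costs another lookup
```

Each ID in the result only identifies an item; its label still has to be fetched separately, which is the overhead described above.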

In the case of people, you could then just grab the label, but there’s also P735 (given name) and P734 (family name) in Wikidata. That saves some error-prone name parsing, you might think. However, this is what the API output looks like:


Another two dead ends, another two (one, with some effort) API calls. It would be great if it were possible to get this data with a single API call. I think GraphQL would be a good option here. With GraphQL, you can specify exactly what data you want. I’m not the first one to think of this; in fact, a simple example is already implemented. This is what a query would look like (variables: {"item": "Q30000000"}):

query ($item: ID!) {
  entry: item(id: $item) {
    # get every property
    # to get specific properties, use "statements(properties: [...])"
    claims: statements {
      mainsnak {
        ... on PropertyValueSnak {
          # get property id and English label
          property {
            name: label(language: "en") {
          # get value
          value {
            ... on MonolingualTextValue {
              value: text
            ... on StringValue {
            # if value is an item, get the label too
            ... on Item {
              label(language: "en") {
            ... on QuantityValue {
              unit {
                value: label(language: "en") {
            ... on TimeValue {
              value: time

Another handy thing is that the API output is basically the equivalent of the query in JSON, but with the data filled in. I think a GraphQL API would be really useful for this and similar use cases, and it definitely seems possible given the fact that there is an experimental API available.

Tuesday, July 17, 2018

Journal Metadata: Authors & Institutions

I finished the General Plugin system for Citation.js a few days ago (more on that later), so I could finally publish a new beta release. Now, after that half-finished piece of code had been blocking other work for a long while, I can at last start… fixing bugs, and closing other items in the backlog.

One of the items that has been on the backlog for a long time, and was on the backlog of the previous major version too, was sorting out BibJSON. BibJSON has been “supported” since before CSL-JSON was introduced as the internal standard, but under the name of ContentMine JSON, as I only knew it as the output of ContentMine’s quickscrape tool.

quickscrape output (source, license: MIT)

Since then, I learned it actually was a more standardised format, but I never got around to reading the standard and updating the parser. Today, however, I did. Turns out, it is something in between JSON-LD and BibTeX. While searching around for more comprehensive documentation, I saw the journal-scrapers (used by quickscrape) again, which I used to compile some test cases.

Unfortunately, one of the first examples went wrong already. The meta tags containing the bibliographical data that quickscrape scrapes, specifically the data pertaining to the authors, are not structured in a machine-friendly way, in my opinion. Certainly, quickscrape has trouble with them.

<meta name="citation_author" content="P. Pandikumar"/>
<meta name="citation_author_institution" content="Division of Ethnopharmacology, Entomology Research Institute, Loyola College, Chennai, India"/>
<meta name="citation_author" content="S. Ignacimuthu"/>
<meta name="citation_author_institution" content="Division of Ethnopharmacology, Entomology Research Institute, Loyola College, Chennai, India"/>
<meta name="citation_author_institution" content="International Scientific Partnership Programme, King Saud University, Riyadh, Saudi Arabia"/>
<meta name="citation_author" content="N. A. Al-Dhabi"/>
<meta name="citation_author_institution" content="Addiriyah Chair for Environmental Studies, College of Science, King Saud University, Riyadh, Saudi Arabia"/>

This particular example is from BioMed Central. However, the pattern persists throughout multiple journals: Nature (example), PLOS One (example), PeerJ (example), and probably many more, as these were just the first four I checked.
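As a rough illustration of why this layout is awkward to consume: the only link between an author and an institution is the order of the tags. The sketch below assumes that every citation_author_institution belongs to the most recent citation_author before it. The `groupAuthors` helper is hypothetical (it is not how quickscrape works), and the institution names are shortened from the example above.

```javascript
// Shortened sample of the meta tags shown above.
const html = `
<meta name="citation_author" content="P. Pandikumar"/>
<meta name="citation_author_institution" content="Loyola College, Chennai, India"/>
<meta name="citation_author" content="S. Ignacimuthu"/>
<meta name="citation_author_institution" content="Loyola College, Chennai, India"/>
<meta name="citation_author_institution" content="King Saud University, Riyadh, Saudi Arabia"/>
<meta name="citation_author" content="N. A. Al-Dhabi"/>
<meta name="citation_author_institution" content="King Saud University, Riyadh, Saudi Arabia"/>
`

// Hypothetical helper: attach each institution to the preceding author.
function groupAuthors (html) {
  const tagPattern = /<meta name="(citation_author(?:_institution)?)" content="([^"]*)"\s*\/?>/g
  const authors = []
  for (const [, name, content] of html.matchAll(tagPattern)) {
    if (name === 'citation_author') {
      authors.push({ name: content, institutions: [] })
    } else if (authors.length > 0) {
      // Order is the only thing tying this institution to an author.
      authors[authors.length - 1].institutions.push(content)
    }
  }
  return authors
}

const authors = groupAuthors(html)
console.log(authors.map(author => `${author.name}: ${author.institutions.length}`))
```

If a page reorders the tags, or lists all authors before all institutions, this grouping silently produces the wrong result, which is the fragility described above.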

Prepend view-source: to those example URLs to quickly view the HTML source, with the meta tags.

The pattern is so similar, especially the authors always being after a whole list of citation_references in the case of Nature and BMC, that there must be some sort of library or service that generates these, I thought. This quest first led me to search what kind of tags citation_ are. The fact that the answer wasn’t very easy to find and the number of unanswered questions I found along the way quickly made it clear what kind of quest this was going to be.

First of all, the tags: they’re called HighWire Press tags. Normally I would link a website, but I don’t think there is one. They’re the preferred method of metadata tagging for Google Scholar, which lists 16 tags. They’re also the preferred format of Mendeley, which points to the Google Scholar documentation. Yet the only thing I find when searching for a canonical list is people asking where that list could be, and getting no answers (1, 2).

Even with the 16-tag list, I can find at least two tags, each non-trivial (e.g. citation_reference and citation_author_institution), in any of the examples mentioned above, that aren’t on that list. Not to mention that, again, those examples weren’t cherry-picked; they were chosen semi-randomly.

Luckily, I’m not the first one to run into this problem. Someone previously compiled a list of 39 citation_ tags based on observations, which is very useful if I want to write a crosswalk for Citation.js sometime, but doesn’t really help with finding a generator.

Back to HighWire: they claim Nature is one of their customers. BMC and Springer Open, however, aren’t, and yet they share the same system, or a common standard that can’t be found anywhere else. That they share a system makes sense, but what system and/or standard are they using? I asked, and will report back when I get an answer.

Monday, April 30, 2018

Debugging: Deja Vu

I’m going to share some stories on debugging with you, because I’m proud of them. After writing up the first story, I’m no longer particularly proud, but I still want to share the story. Here’s the first: a bug that seemed quite familiar.

After some trouble with mocking API requests I decided that supporting mocking in the browser isn’t as important as supporting mocking at all, so I installed mock-require (I didn’t get proxyquire to work). Now, to confirm that the test bundle script actually omitted the API mocking code from the bundle, I loaded the test suite in the browser. Guess what? Errors everywhere! Or rather, error everywhere. Every test case gave the same error:

TypeError: Cannot assign to read only property 'length' of function 'function () {
          old.apply(self, arguments);
    at new Assertion (/build/test.citation.js:1810:30)
    at new Assertion (/build/test.citation.js:1799:25)
    at new Assertion (/build/test.citation.js:1799:25)
    at new Assertion (/build/test.citation.js:1799:25)
    at expect (/build/test.citation.js:1775:12)
    at Context._callee2$ (/build/test.citation.js:3582:35)
    at tryCatch (/build/citation.js:7891:17)
    at Generator.invoke [as _invoke] (/build/citation.js:8064:22)
    at Generator.prototype.(anonymous function) [as next] (/build/citation.js:7934:21)
    at step (/build/test.citation.js:3542:221)

Naturally, like a good programmer, I immediately googled the error message in conjunction with the various tools and frameworks I used for this bundle, instead of looking at the stack trace showing that something’s wrong with expect.js. Anyway, after some time, I found a GitHub issue describing exactly the problem I was having. Scrolling through the responses, I was stunned:

larsgw commented on Jul 29, 2017
+1: Same issue, but for all assertions

Somehow, I had reported the same issue 10 months ago, but the site, running a two-month-old bundle, didn’t have the problem. Apparently, I had fixed the issue earlier, but forgotten about the solution, and I couldn’t figure out what that solution was. So I started debugging. First of all, the offending code wasn’t different from any other working bundle. It registered a bunch of functions as properties on a function, and it choked on the length property. Makes sense, the length property isn’t writable on functions. But running some simple test code showed this shouldn’t be the problem:

let f = function () {}
console.log(f.length)	// 0
f.length = 3			// (no error)
console.log(f.length)	// 0

Sure, it didn’t actually do anything, but it didn’t throw an exception either. Besides, these were the exact same lines of code as in the GitHub repo, so how could that be the issue? The only thing I could do was compare with the working examples. Diffing my bundle against the published one, which was difficult, because it’s generated code. Checking out commits until I found out which one worked, which was a pain because I repeatedly forgot to reinstall dependencies, something that could be critical.

Sidebar (minor spoilers): the previous time I ‘solved’ the issue, it was much easier, because it was the first time setting the system up, so instead of referring to older commits as working examples, I looked at the docs.

After a long while I realised what the earlier workaround was: instead of bundling expect.js with Browserify, I included it as a script (as recommended) and created a wrapper that exposed the expect.js module and simply grabbed and exported the global expect variable exposed by the script. I thought this was because the order of requiring the scripts mattered, but some testing proved this wasn’t the case. No, it was actually including expect.js in a Browserify bundle with the Babelify transform on, or even simply running it through the Babel compiler, that caused the error.

Back to diffing bundles I guess: what are the differences between a babel-ed file and its source code if there isn’t really any syntax that needs to be transformed, or APIs that need to be polyfilled?

Turns out, not much. Between those files, the only real difference was the location (or with comments: false the existence) of comments, and some style differences caused by Babel’s code generator.

And 'use strict'.

Apparently, 'use strict' makes assignments that otherwise fail silently throw an error. If I had read that documentation earlier, or if I had paid proper attention to the Function.prototype.length docs (linked above), I would have known. Now it’s just a boring ending to a long journey. But hey, at least I learned some stuff.
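For completeness, the difference can be reproduced in isolation. This is a minimal sketch, not the code from expect.js or the bundle; it uses the Function constructor only so that the first variant is compiled as non-strict code no matter where the snippet itself is run.

```javascript
// The Function constructor compiles its body as sloppy-mode code unless the
// body itself opts in, so both variants behave the same wherever this runs.
const assignLength = 'const f = function () {}; f.length = 3; return f.length'

const sloppy = new Function(assignLength)
const strict = new Function('"use strict"; ' + assignLength)

console.log(sloppy()) // 0: the assignment is silently ignored

let strictError = null
try {
  strict()
} catch (error) {
  strictError = error
}
console.log(strictError instanceof TypeError) // true: strict mode throws
```

Babel adds the 'use strict' directive when it treats a file as a module, which would explain why merely running expect.js through the compiler was enough to trigger the error.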

Solving this issue requires either a big change in the internals of a toolkit that hasn’t had an update in 4(!) years, or a workaround. I don’t want to use the workaround of including two extra files anymore, now that I know what is causing the issue, but the other workaround proves to be a problem itself, involving outdated documentation and a bug in Babelify. More on that later.

On a related note: I’m working on a new release for Citation.js, improving the parsing plugin system. The API should be pretty stable now, apart from the namespaces being prone to change, so I might change the schedule to one update with all current API changes instead.

Tuesday, February 6, 2018

DNA Project: Introduction


I’ve always had an interest in biology. When I was 11, I started collecting all sorts of info on conifers into a “book”. More recent projects include my ContentMine project, and an ongoing project on microscopic photography.

Example photo (detail of the cross section of a basswood stem)

To my regret, however, I only recently got a decent introduction to the chemistry of DNA, and even then, there’s a lot left uncovered—obviously, it’s an introduction after all. It also raised a lot of questions, and gave me some new ideas, like: What if you were to thoroughly analyse an entire genome, figure out what does what, and then just… golfed it? The genome. What if you were to minify a genome? Those things were a bit out of scope of the class, however.

Luckily, there’s a thing called Profielwerkstuk in Dutch high school education. For your Profielwerkstuk you are supposed to spend at least 80 hours on research into a topic of your choice. This includes setting up the project, reading literature, usually performing an experiment, writing the article, and, in the end, presenting your project.

Of course, I could spend that time on Citation.js, or on one of my other projects, but I’m probably going to spend that time on them anyway, and I want to learn something new. A perfect opportunity for me to answer some of those questions, and learn some more about DNA in the process.

Later, I learnt about this paper, which is basically exactly the idea I mentioned above. Seeing the resources involved, it seems a bit out of the scope of my profielwerkstuk as well. However, there are plenty of other interesting things to figure out.

In the next few weeks, I will be reading literature to a) expand my basic knowledge on the subject past introduction-level and b) see what I could learn through experimentation, and which experiments I could perform in this project. My reading material includes:

I will also be putting out a series of blogposts, about inspirations, thoughts that came up while reading, ideas for the rest of the project, and more. Of course, I will also blog about the project itself.

In the meantime, if you are or know someone who can help me with actually performing those experiments (editing/sequencing DNA), please do contact me. General ideas, tips and feedback are of course also welcome.

This post is part of a series.

Sunday, February 4, 2018

Microscopic photography: Part 2

I promised more photos in the previous post, so here they are.



Penicillium with conidiophores
With conidiophores

Penicillium (conidiophore)
Detail of conidiophores

More fungi:
Fungus cells

Fungus cells

Cat brains:
Cat brain cells

This post is part of a series.

Friday, December 22, 2017

Citation.js Version 0.4 Beta: New Docs and Input Plugins

It’s been a while. Really. But now it’s back, with a new release: v0.4.0-0, the v0.4 beta. Below I explain some of the changes in this release, and then the road-map of Citation.js for v0.4 and v0.5. Also, Citation.js has a DOI now:


Also, @jsterlibs tweeted about Citation.js.


Input plugins

The main change in this release is the addition of input plugins. This is the first step towards releasing v0.4, as explained below. Although there’s specific documentation available here and here, I’ll put an example here as well, as the API can be confusing.

The API for registering input plugins will be changed once or twice before the release of v0.4, once to make it less complex and weird, and possibly a second time to incorporate output plugins.

Adding a format

Say you wanted to add the RIS format (I should probably do that sometime). First, let’s define a type, to set things off.

const type = '@ris/text'

I can’t really define the actual parser here, but I’ll add a variable in the example code.

const parse = data => { /* return CSL-JSON or some format supported by any other parser */ }

To test whether a given string is in the RIS format, we’ll use a regex. This regex matches if any line starts with two word characters, two spaces and a hyphen. That’s a pretty fuzzy match, since in actual RIS every line should start with that sequence. However, this is just an example, and a proper regex would be less elegant.

const risRegex = /^\w{2}\ {2}-/gm

Now, let’s define the dataType of RIS input. When using regex to test input, the dataType is automatically determined to be 'String' anyway, but for the sake of being clear:

const dataType = 'String'

Now, to combine it all:

Cite.parse.add(type, {
  dataType,
  parseType: risRegex,
  parse
})

Changing parsers

Now, say someone else wrote that code above, and you need to use it without modifying it, but you want a better regex? That can be arranged:

Cite.parse.add(type, {
  dataType: 'String',
  parseType: /^(?:\w{2}\ {2}-.*\n)+(\w{2}\ {2}-.*)?$/g
})

Because the options dataType, parseType, elementConstraint and propertyConstraint are all treated as one thing, you need to pass every one of those when replacing the type checker. dataType is still not actually mandatory in this example, but is passed to demonstrate this.
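To see what the stricter pattern buys you, here is a small comparison of the two regexes on made-up input. The sample strings and variable names are assumptions for illustration, and the g flag is left off in this sketch because RegExp.prototype.test keeps state between calls when it is set.

```javascript
// Same patterns as above, without the g flag (see the note above).
const fuzzyRegex = /^\w{2}\ {2}-/m
const strictRegex = /^(?:\w{2}\ {2}-.*\n)+(\w{2}\ {2}-.*)?$/

// Made-up samples: one clean RIS record, one with a stray line in front.
const cleanRis = 'TY  - JOUR\nAU  - Doe, Jane\nER  - \n'
const mixedText = 'Some notes\nTY  - JOUR\nER  - \n'

const results = {
  fuzzyClean: fuzzyRegex.test(cleanRis),
  strictClean: strictRegex.test(cleanRis),
  fuzzyMixed: fuzzyRegex.test(mixedText),
  strictMixed: strictRegex.test(mixedText)
}
console.log(results)
```

Both patterns accept the clean record, but only the fuzzy one accepts the text with the stray first line, because it is satisfied as soon as any single line looks like RIS.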

Disabling a format

There is currently no way to remove a format.

In this scenario, some plugin registered a new format to get citation data from URLs. Unfortunately, it doesn’t work. Instead of recognising the URL as a GitHub-specific URL, the type is @else/url, the generic URL type. This is because the generic version was registered earlier, and there is not yet a category for generic types (see #104). If you don’t use the @else/url type checker anyway, you can disable it like this:

Cite.parse.add('@else/url', {dataType: 'String', parseType: () => false})

I’m not sure why the dataType is needed here, as you’re disabling the parser anyway, but it doesn’t seem to work without it.


Proper CLI docs

Recently, I’ve been updating the documentation on Citation.js, and it has been a pleasant experience, despite some hiccups in the documentation engine. There are tutorials now, and a lot of the JSDoc comments have been improved. I still want to improve some things in the theme, like clickable header links (similar to GitHub and NPM behaviour), and showing sub-tutorials in the navigation.

Backwards compatibility

This release should be largely backwards compatible. If there are any regressions, please report them in the bug tracker.



v0.4 “is about making it easier to expand on input and output formats, possibly by creating schemes and methods parsing those schemes that can, say, convert BibJSON to CSL JSON based on a JSON file, something that can be stored independently of implementation.”

The idea is to allow for all kinds of plugins, both in input parsing and output formatting, to be registered on the Cite object, and to treat all internal parsers and formatters the same. Currently, input parser plugins are possible (see above), although they will be improved before v0.4. Output parsers will be made soon too, and will be backwards-incompatible, because of the change in the output option format.


Currently, the plan is to use JSON-LD as the internal format in v0.5, while still keeping CSL-JSON as the internal scheme. Any input should then be converted to a list (or object) of fields, which will be added to the internal data store. Each field should be transformed to the CSL-JSON scheme individually, and added as a copy. As a result, there won’t be data loss when certain fields aren’t available in CSL-JSON, but are in the input and output schemes. It should also make it easier to write parsers for e.g. BibJSON.

Of course, edge cases should be taken care of. For example, there is no Wikidata property for the CSL-JSON field original-title, but that information can still be derived from the labels combined with the P364 (original language) property.

Monday, September 4, 2017

ctj rdf: Relations Between Conifers Mentioned in Articles

Below is part two of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.

Here’s a more detailed example. We are using my dataset available at 10.5281/zenodo.845935. It was generated from 1000 articles that mention ‘Pinus’ somewhere. This one has 15326 statements, of which 8875 (57.9%) can be mapped, taking ~50 seconds. Now, for example, we can list the top 100 most mentioned species. The top 20:

Species                  Hits
Pinus sylvestris         248
Picea abies              177
Pinus taeda              138
Pinus pinaster           120
Pinus contorta           96
Arabidopsis thaliana     91
Picea glauca             77
Pinus radiata            77
Pinus massoniana         72
Pseudotsuga menziesii    65
Oryza sativa             56
Pinus halepensis         56
Pinus ponderosa          55
Pinus banksiana          53
Pinus koraiensis         53
Picea mariana            52
Pinus nigra              51
Pinus strobus            46
Quercus robur            45
Fagus sylvatica          45

The top of the list isn’t surprising: mostly pines, other conifers, other trees, Arabidopsis thaliana which I’ve seen represented in pine literature before, and Oryza sativa or rice, which I haven’t seen before in this context.

Note that only 248 of the 1000 articles mention the top Pinus species. This may be because the query getting the articles was quite broad. Also note that this doesn’t take into account how often an article mentions a species; a caveat of the current rdf output.

Going off this list, we can then look at which non-tree or even non-plant species are named most often in conjunction with a given species, or, in this case, a genus. Top 20:

Species 1               Species 2                Co-occurrences
Picea abies             Pinus sylvestris         98
Picea abies             Pinus taeda              56
Arabidopsis thaliana    Pinus taeda              47
Picea glauca            Pinus taeda              43
Arabidopsis thaliana    Oryza sativa             43
Pinus pinaster          Pinus taeda              41
Pinus pinaster          Pinus sylvestris         41
Picea abies             Picea glauca             41
Arabidopsis thaliana    Picea abies              37
Pinus contorta          Pinus sylvestris         36
Betula pendula          Pinus sylvestris         36
Pinus sylvestris        Pinus taeda              36
Pinus nigra             Pinus sylvestris         35
Pinus contorta          Pseudotsuga menziesii    32
Picea abies             Pinus contorta           31
Picea abies             Pinus pinaster           30
Arabidopsis thaliana    Physcomitrella patens    30
Oryza sativa            Pinus taeda              29
Pinus sylvestris        Quercus robur            29
Picea abies             Picea sitchensis         28

Interesting to see that rice is mostly mentioned with Arabidopsis. Let’s explore that further. Below are species named in conjunction with Oryza sativa.

Species                       Co-occurrences
Arabidopsis thaliana          43
Pinus taeda                   29
Picea abies                   24
Physcomitrella patens         23
Populus trichocarpa           21
Glycine max                   20
Vitis vinifera                17
Picea glauca                  17
Pinus pinaster                16
Selaginella moellendorffii    15
Pinus sylvestris              13
Triticum aestivum             12
Pinus contorta                10
Picea sitchensis              10
Ginkgo biloba                 10
Pinus radiata                 10
Ricinus communis              9
Amborella trichopoda          9
Medicago truncatula           9
Cucumis sativus               8

So attention seems divided between trees and more agriculture-related plants. More to explore for later.
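The co-occurrence counts above boil down to counting, per article, every pair of species mentioned together. The sketch below shows that counting step on made-up data. It is not how ctj rdf or the actual query works (those operate on the RDF output), and `countCoOccurrences` and the sample articles are assumptions for illustration.

```javascript
// Made-up sample: the species mentioned in each article.
const articles = [
  ['Pinus sylvestris', 'Picea abies'],
  ['Pinus sylvestris', 'Picea abies', 'Betula pendula'],
  ['Arabidopsis thaliana', 'Oryza sativa']
]

// Hypothetical helper: count how many articles mention each pair of species.
function countCoOccurrences (articles) {
  const counts = new Map()
  for (const species of articles) {
    // Deduplicate and sort, so each pair is counted once per article
    // and always gets the same key regardless of mention order.
    const unique = [...new Set(species)].sort()
    for (let i = 0; i < unique.length; i++) {
      for (let j = i + 1; j < unique.length; j++) {
        const pair = `${unique[i]} + ${unique[j]}`
        counts.set(pair, (counts.get(pair) || 0) + 1)
      }
    }
  }
  // Most frequent pairs first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1])
}

const ranking = countCoOccurrences(articles)
console.log(ranking)
```

Like the tables above, this counts each pair at most once per article, so it says nothing about how often a species is mentioned within an article.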

View all posts in this series.