Friday, October 28, 2016

Citation.js Version 0.2: NPM package, CSL and more

In the last two weeks I've been busy making version 0.2 of Citation.js. Here I'll explain some of the changes and the reasoning behind them.

In the past months I've updated Citation.js several times; the changes included a Node.js program for the command line and better Wikidata input parsing. While working with the "old" code, I noticed some annoying issues in it.

One of the biggest issues was the internal data format. When Cite(), the main program, parses input, it converts it into JSON with a standardised scheme, which is used everywhere else in the program, e.g. for sorting and output. The scheme I used was something I made up to accommodate the features the program had back when there was next to no input parsing, and you were expected to input JSON directly, either by file or by web form. It wasn't scalable at all, and some of the methods were patched so much they only worked in specific test cases.

Old interface of Citation.js (pre-v0.1). It would fetch the first paragraph of Wikipedia about a certain formatting style. Many of the supporting methods of this version stayed in v0.1.

Now I use CSL-JSON, the scheme used by, among others, citeproc-js, and the standard described by the Citation Style Language. It was designed by professionals, or at least by people more qualified to design standards than I am. It is quite similar to my old scheme, with some exceptions. Big advantages are the ways it stores date and person information. Before, I had to hope users provided names in the correct format; now it doesn't matter, as everything gets parsed to CSL. The same goes for dates. Another advantage is the new output potential: besides outputting CSL-JSON itself, it is now possible to use citeproc-js directly, without extra conversion.
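As a rough sketch of what that buys: names are split into given/family parts and dates into numeric date-parts, so no string-format guessing is needed. For example (illustrative CSL-JSON, using the PeerJ article that appears as example data elsewhere on this blog):

```javascript
// Illustrative CSL-JSON item: note the structured author and issued fields
var item = {
  type: 'article-journal',
  title: 'Weather conditions associated with autumn migration by mule deer in Wyoming',
  'container-title': 'PeerJ',
  DOI: '10.7717/peerj.1045',
  author: [
    // Each name is stored in parts, instead of as one free-form string
    { given: 'Chadwick D.', family: 'Rittenhouse' },
    { given: 'Tony W.', family: 'Mong' },
    { given: 'Thomas', family: 'Hart' }
  ],
  // Dates are numeric "date-parts": [[year, month, day]], shorter arrays allowed
  issued: { 'date-parts': [[2015]] }
}
```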

Using a new data format also meant a lot of cleanup in the code. Almost all methods had to be rewritten to account for it, but that was mostly a chance to write them more properly. Now only Cite() exists in the global scope, which is good, because it means other parts don't take up variable names, etc. The entire program is now optimised for both browser and Node.js use, although it uses synchronous requests. From the program's perspective, synchronous requests are necessary, but users can bypass them for the most part. They are mostly used for Wikidata parsing. An example:

Let's take the input "Q21972834". This is a Wikidata Entity ID, and it points to Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data (10.1093/bioinformatics/btt178). If Cite() only has the ID, it has to fetch the corresponding data (JSON). Because Cite() is called as a function, and is expected to return something, it has to make the request synchronously. However, if the user fetches the data asynchronously and calls Cite() in the callback, that is bypassed.

var xhr = new XMLHttpRequest()

xhr.open(
  /* Method */ 'GET',
  /* URL    */ 'https://www.wikidata.org/wiki/Special:EntityData/Q21972834.json',
  /* Async  */ true
)

xhr.addEventListener('load', function () {
  var data = this.responseText
  var cite = new Cite(data)

  // Etc...
})

xhr.send(null)

The JSON gets to Cite() with only async(hronous) requests. The problem is that this JSON doesn't contain everything: instead of the name of the journal the paper was published in, it contains a numeric reference to another entity. To get that name, I have to make another request, which has to be synchronous as well. I hope there is some way in the Wikidata API to turn references off and names (or "labels") on, but I haven't found one yet. Then again, it took me a long time to find the switch that enables cross-domain requests in the Wikidata API, so it might be hidden somewhere. If so, synchronous requests could be bypassed everywhere, which would be nice, as browsers are on the verge of dropping support for them.
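To illustrate the problem, here is a heavily simplified, partly hypothetical excerpt of such an entity. The property P1433 ("published in") and the numeric ID 123 are assumptions for the example, not values from the real response:

```javascript
// Simplified sketch of an entity's claims: the journal is referenced by
// a numeric entity ID instead of by name (P1433 and 123 are illustrative)
var entity = {
  claims: {
    P1433: [{
      mainsnak: {
        datavalue: {
          type: 'wikibase-entityid',
          value: { 'entity-type': 'item', 'numeric-id': 123 }
        }
      }
    }]
  }
}

// All we can get from this response is another entity ID...
var journalId = 'Q' + entity.claims.P1433[0].mainsnak.datavalue.value['numeric-id']

// ...so resolving the journal's name ("label") takes a second request, e.g.
// https://www.wikidata.org/wiki/Special:EntityData/<journalId>.json
```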

Probably the biggest news is that Citation.js is now an npm (Node Package Manager) package (link). This means you can install the code without having to clone the git repository or copy the source file. It's even available online, although the default npm demo seems to be broken. Luckily, I have a regular demo as well. As of writing, the npm page says the package has been downloaded 234 times already, but that number has been the same for a day, so I guess there is an issue with npm. If not, that's really cool.

Sunday, October 9, 2016

Weekly Report 10: Visualising facts and asking questions

Earlier this week, tarrow published factvis, short for fact visualisation. I decided to have a go at the design, and I made this, in the style of cardlists. Note: if my version and tarrow's version of factvis look very similar, my changes have probably already been pushed to the master branch.

Screenshot of my factvis design

The facts being visualised come from the ContentMine, which publishes facts about things related to Zika, extracted from papers, on Zenodo. A fact has the following structure:

{
  "_index": "facts",
  "_type": "snippet",
  "_id": "AVdDntnH_8VqgcuJwvpW",
  "_score": 1,
  "_source": {
    "prefix": "icle-title>Mosquitos (Diptera: Culicidae) del ",
    "post": "</article-title><source>Entomol Vect</source><y",
    "term": "Uruguay",
    "documentID": "AVdDnq-oJ9hGurOzZIZE",
    "cprojectID": ["PMC4735964"],
    "identifiers": {
      "contentmine": "CM.wikidatacountry8",
      "wikidata": "Q77"
    }
  }
}

As you can see, it has a fact ID, and next to it the actual fact. The fact consists of the found term ("Uruguay"), the text before and after the term (prefix and post), the document it was found in, and identifiers, saying what the term actually means. The identifiers are a ContentMine ID and a Wikidata Entity ID.
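With that structure, linking a term to its Wikidata entity only takes a couple of lines. A sketch (makeLink is a hypothetical helper, not part of factvis):

```javascript
// Sketch: turn a ContentMine fact into an HTML link to its Wikidata entity
function makeLink (fact) {
  var source = fact._source
  return '<a href="https://www.wikidata.org/wiki/' +
    source.identifiers.wikidata + '">' + source.term + '</a>'
}

// Using the relevant parts of the fact shown above
var fact = { _source: { term: 'Uruguay', identifiers: { wikidata: 'Q77' } } }
makeLink(fact) // → '<a href="https://www.wikidata.org/wiki/Q77">Uruguay</a>'
```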

That's all it is, for now. Still pretty cool: distinguishing special words and abbreviations from normal ones, and linking them to established identifiers like those from Wikidata.

Conifers

The second topic today is asking biological questions about conifers. Now that I know most parts of the ContentMine pipeline with all its extensions, I can start to think of what I want to learn about conifers with it. The first questions are simple ones, or at least ones with simple facts as answers. Take "What height does a grown Pinus sylvestris normally have?". I know the answer is the value of the property height of the tree, and that the value is measured in some length unit.

Now all I have to do is search for the answer. Not that easy, but doable. First, I check whether there actually are papers about the height of trees under normal conditions. So let's search EUPMC (Europe PMC) with the following query:

"Pinus sylvestris"[Abstract] AND height

With this, it searches for articles with the exact text "Pinus sylvestris" in the abstract, and with the word "height" anywhere in the article. At first sight, it is a bit unclear whether the first found article has an interesting answer, so let's move on. Remember, we are only taking a peek at what's inside. The second article, however, looks more promising: the first table already contains exactly what we're looking for, and more. Apart from the height of Pinus sylvestris it also has the diameter, and all this for two other conifers as well.

The same goes for the third article. While its first table hasn't got height data, it does have the diameter of several species in separate age groups, not to mention properties I hadn't even thought of, like bark crevice depth and canopy cover.

(I tweeted about the fourth one, as there were some funny stylesheet issues)

And if only three papers yield this much, imagine what can be done with more. The search I showed had 78 results, and combined with searches for all the other species, there should be hundreds of articles with answers to just one simple question. With the ContentMine, I can "read" all those articles, and collect and summarise all those facts, in a matter of hours. Of course, I'll need to make some specialised programs to perform exactly what I want to do, so that's exactly what I'm going to do in the coming months.

Saturday, October 1, 2016

Parsing invalid JSON

JSON (image: source, license)

When developing Citation.js, I needed to get a JavaScript Object from a string representing one. There are several methods for this, and the one that comes to mind first is JSON.parse(). However, this didn't work. Consider the following code:

{
  type: 'article',
  author: ['Chadwick D. Rittenhouse','Tony W. Mong','Thomas Hart'],
  editor: ['Stuart Pimm'],
  year: 2015,
  title: 'Weather conditions associated with autumn migration by mule deer in Wyoming',
  journal: 'PeerJ',
  volume: 3,
  pages: [1,21],
  doi: '10.7717/peerj.1045',
  publisher: 'PeerJ Inc.'
}

It contains data in the standard format Citation.js uses, and it's written as a JavaScript object literal, not JSON. Valid JSON requires double quotes (") around strings and around property names. Valid JSON is valid JavaScript too, but I prefer to write it like this. To accommodate myself and other people preferring the simpler syntax, I had to come up with something else.

Option two is eval(), a function that parses JavaScript in a string and executes it on the fly. However, using eval is usually strongly discouraged, for multiple reasons, one being code injection. Consider two strings: both are valid JavaScript object literals when pasted directly into a script, but only the second is valid JSON.

When the first gets processed by eval, it alerts a string. Any code could be put where that alert() function is called. The first can't be processed by JSON.parse(), so we skip to processing the second with eval. This doesn't alert "Bar", as opposed to the first alerting "Foo". The second can be processed by JSON.parse(), and when it is, it outputs the expected data. As you can see, only JSON.parse() never permits code injection: it throws an error on anything that isn't valid JSON, and valid JSON can't contain code.
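The original post embeds an interactive demo of the two strings. A minimal stand-in might look like this, with a hypothetical alertStub() playing the role of the browser's alert():

```javascript
var injected = false
function alertStub () { injected = true } // stand-in for the browser's alert()

var jsString   = "{ foo: alertStub('Foo') }" // valid JavaScript, invalid JSON
var jsonString = '{ "foo": "Bar" }'          // valid JSON

// eval executes the embedded alertStub('Foo') call: this is the injection risk
eval('(' + jsString + ')')
// injected is now true

// JSON.parse rejects the first string outright...
var rejected = false
try { JSON.parse(jsString) } catch (e) { rejected = true }

// ...and parses the second as plain data, never executing anything
var parsed = JSON.parse(jsonString)
// parsed.foo === 'Bar'
```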

Better use JSON.parse() then. But how are we going to parse invalid JSON without code injection? I hate to say it, but with regex. I know you shouldn't parse anything with regex, but I don't really parse it, and when it fails, JSON.parse() will throw an error anyway. I use the following regex patterns (in this order):

  1. /((?:\[|:|,)\s*)'((?:\\'|[^'])*?[^\\])?'(?=\s*(?:\]|}|,))/g
    Changes single-quoted strings to double-quoted ones. Explanation and example on Regex101
  2. /((?:(?:"|]|}|\/[gmi]|\.|(?:\d|\.|-)*\d)\s*,|{)\s*)(?:"([^":\n]+?)"|'([^":\n]+?)'|([^":\n]+?))(\s*):/g
    Wraps property names in double quotes. Explanation and example on Regex101
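Applied to a shortened version of the object literal from the start of this section, the two patterns (with the replacement strings used in the source code below) turn it into parseable JSON:

```javascript
// Pattern 1: single-quoted strings → double-quoted strings
var rgxStrings = /((?:\[|:|,)\s*)'((?:\\'|[^'])*?[^\\])?'(?=\s*(?:\]|}|,))/g
// Pattern 2: wrap property names in double quotes
var rgxProps = /((?:(?:"|]|}|\/[gmi]|\.|(?:\d|\.|-)*\d)\s*,|{)\s*)(?:"([^":\n]+?)"|'([^":\n]+?)'|([^":\n]+?))(\s*):/g

// A shortened version of the object literal shown earlier
var input = "{\n" +
  "  type: 'article',\n" +
  "  year: 2015,\n" +
  "  pages: [1,21],\n" +
  "  doi: '10.7717/peerj.1045'\n" +
  "}"

var fixed = input
  .replace(rgxStrings, '$1"$2"')      // 'article' → "article"
  .replace(rgxProps, '$1"$2$3$4"$5:') // type: → "type":

var obj = JSON.parse(fixed)
// obj.type === 'article', obj.year === 2015, obj.pages[1] === 21
```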

As I said, this doesn't work perfectly, but it does the trick, and it doesn't seem to be dangerous. When using it on the code-injecting string from before, it produces invalid JSON, so the parser throws an error and the user is kindly asked to input valid JSON. But when using normal JavaScript, with somewhat normal string content, it works just fine. And you can still use normal JSON if you want, of course: it tries whether that works before using the regex, as you can see in the source code here:

case '{':case '[':
  // JSON string (probably)
  var obj;
  try       { obj = JSON.parse(data) }
  catch (e) {
    console.warn('Input was not valid JSON, switching to experimental parser for invalid JSON')
    try {
      obj = JSON.parse(data.replace(this._rgx.json[0],'$1"$2"').replace(this._rgx.json[1],'$1"$2$3$4"$5:'))
    } catch (e) {
      console.warn('Experimental parser failed. Please improve the JSON. If this is not JSON, please re-read the supported formats.')
    }
  }
  var res = new Cite(obj);
  inputFormat = 'string/' + res._input.format;
  formatData = res.data;
  break;