Sunday, March 26, 2017

Citation.js: BibJSON

Citation.js now supports BibJSON. How did I do that without actually updating Citation.js? Well, apparently I supported it all along. I've supported the quickscrape output format since July last year, and that turned out to be BibJSON. How convenient. I'll update the demo and docs to reflect this revelation (currently they just say "quickscrape's JSON scheme"), and, now that I can find actual documentation, make some improvements to the parser. It's a good candidate for a new output format too.
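To give an idea of the format, here is a small, made-up BibJSON record (the field names roughly follow the BibJSON spec; the values are placeholders), passed to Cite() like any other supported input:

// A made-up BibJSON record; field names follow the BibJSON spec
var record = {
  title: 'An example article',
  author: [ { name: 'John Smith' } ],
  year: '2016',
  journal: { name: 'Journal of Examples' },
  identifier: [ { type: 'doi', id: '10.1000/example' } ]
}

var cite = new Cite( record )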

Some side notes on updates v0.3.0-0 to v0.3.0-2: these are prerelease updates, making it possible to use code before I have fixed all the issues and added all the features I promised for version 0.3. These updates fixed a lot of file organization problems; next updates will restructure the Cite object and fix tests.

Sunday, February 26, 2017

SVM: Developing a brand new 3D model language for the Web

When I started programming a few years ago, one of my first projects was linked to a project for school. Our team was making a presentation, and at some point I decided I wanted to program it myself, for the web. This resulted in the first version of a project which, to this day, doesn't have a name (but soon will).

After a few months I wanted to expand this. With some research and trial and error in the area of CSS 3D Transforms, I made a program similar to impress.js. Regular slides, on the web, in 3D, but with some unique things. Of course, it wasn't as refined yet, but this was only the start.

These unique things[1] were 3D models that you can view in your very own web browser. This on its own isn't that unique; there are plenty of libraries that do that, like A-Frame and xml3d. The special thing about my 3D models is that they're pure HTML and CSS. No canvas or WebGL needed, just a relatively modern browser. This opens up a whole range of possibilities: viewing HTML structures and CSS styles in your default debugger; selecting text directly from the model; embedding audio, video, pages and anything else you can put in HTML; and, of course, the possibility to combine these models with the slides, as they use CSS 3D Transforms as well.

There are some disadvantages to this. HTML+CSS only allows for 2D elements, so to make a cube, you'd need 6 elements, a wrapper element and 10+ lines of CSS. And if you want to change the height, you need to change that one value in a lot of places. To solve that, I've started developing SVM, which stands for Scalable Vector (3D) Models. The name is very similar to SVG, and that's how I intended it: XML-based, simple, and with lots of built-in shapes.
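To illustrate the repetition, here is a rough sketch of what a single cube takes with plain DOM scripting and CSS 3D Transforms (this is not how SVM itself works, just the manual approach it is meant to replace); note how the size ends up in every face's transform:

// Rough sketch: a cube built from plain HTML+CSS 3D Transforms.
// An ancestor with a CSS 'perspective' is assumed for the 3D effect.
function makeCube (size) {
  var wrapper = document.createElement('div')
  wrapper.style.position = 'relative'
  wrapper.style.transformStyle = 'preserve-3d'

  var rotations = [
    '',                 // front
    'rotateY(180deg)',  // back
    'rotateY(90deg)',   // right
    'rotateY(-90deg)',  // left
    'rotateX(90deg)',   // top
    'rotateX(-90deg)'   // bottom
  ]

  rotations.forEach(function (rotation) {
    var face = document.createElement('div')
    face.style.position = 'absolute'
    face.style.width = size + 'px'
    face.style.height = size + 'px'
    face.style.background = 'rgba(200, 50, 50, 0.8)'
    // The size shows up yet again, to push each face out of the centre
    face.style.transform = rotation + ' translateZ(' + (size / 2) + 'px)'
    wrapper.appendChild(face)
  })

  return wrapper
}

document.body.appendChild(makeCube(100))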

The first (beta) version is still in development, but I can list some of the features:

  • Built-in solids: Cubes, cuboids, regular prisms, cylinders (to an extent), regular pyramids, regular right frustums, cones (to an extent), irregular convex and concave prisms using SVG Paths (!), and simple spheres
  • Planes, arcs, SVG Path-based curves
  • Groups (with names)
  • Transformation (translation, rotation, scaling) on any element (or at least the currently existing ones)
  • An element to include groups/components

Some of the features in action, including automatic shadow, which will adapt to the orientation of the elements in the future[2]

It may not be very refined yet, but I think there's a lot of potential. Currently, the main hurdle is performance and the accompanying visual issues. For example, on my laptop, Chrome starts lagging and breaking apart[3] at around 450 elements, and keep in mind that even a single cube results in six elements. On my PC, though, it works fine for up to around 1200 elements.


[1]: As far as I know, anyway.
[2]: Currently, it doesn't, as you can see on the yellow cylinder next to the red cube.
[3]:

As you can see, parts of elements just disappear

Saturday, February 4, 2017

Citation.js Version 0.2.11: jQuery and Looking Forward

A few weeks ago I published version 0.2.11 of Citation.js. The main change was the addition of jQuery.Citation.js, updated for version 2 of Citation.js. jQuery.Citation.js is a small jQuery plugin to build simple forms where you can fill in metadata, which gets translated to CSL-JSON. The configuration options are currently quite limited, so it only really works as a demo for Citation.js itself.

Citation.js logo

Since then (the current version is 0.2.15), the improvements have been bug fixes and the addition of (more support for) certain fields in both Wikidata and BibTeX. More interesting is what's going to happen in the next few releases.

Version 0.3

I've been planning version 0.3 for a while now and these things are probably going to be in it:

  • Asynchronous parsing: Parse input asynchronously, to allow asynchronous file requests, mainly for Wikidata.
  • More BibTeX: Publication type-specific fields, for example, should be parsed accordingly. I recently found a new guide to BibTeX, which should help as well.
  • Helper methods: Expose certain sub-parsing methods, like parseName and parseDate (see the sketch after this list).
  • Structure Cite better: De-expose data that shouldn't be changed, add version information to Cite, etc.
  • Deprecate log: It's more or less useless. I'll add an option to enable it.
  • Structure code better: It's a mess, and things are broken. I'll change some file locations and add browserify etc.
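To give an idea of the helper methods, here is a hypothetical example. The method names come from the list above, but the exact placement and signatures are assumptions; the return values follow the CSL-JSON name and date formats.

// Hypothetical usage of exposed helper methods; the exact API may differ
var name = Cite.parseName('von Neumann, John')
// e.g. { family: 'von Neumann', given: 'John' }   (CSL-JSON name)

var date = Cite.parseDate('2017-02-04')
// e.g. { 'date-parts': [ [ 2017, 2, 4 ] ] }        (CSL-JSON date)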

Sunday, November 27, 2016

Weekly Report 12: Including table facts in Wikidata

This week, the big achievement is the addition of a multi-step form to add the semantic triples from last week to Wikidata with QuickStatements, which we talked about before too. The new '+' icon in table rows now links to a page where you can curate the statement and add Wikidata IDs where necessary. At the last step, you get a table of the existing data, the added identifiers and, soon, their Wikidata labels. There, you can open the statement in QuickStatements, where it will be added to Wikidata. This is a delicate procedure; it's important to keep an eye on the validity of the statement.

The multi-step form in action (see tweet)

Take the following table:

Fungus    | Insect    | Host
Fungus A  | Insect 1  | Host 1
Fungus B  | Insect 2  | Host 2

This is currently interpreted (by the program) as "Fungus A has Insect 1 as Insect and Host 1 as Host", while the actual meaning is "Fungus A is spread by the insect Insect 1, of which the host is Host 1" (see this table). While stating that Host 1 is the host of Fungus A is arguably correct, this kind of misinterpretation occurs in much worse forms as well, and it's hard to account for.
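For context, a QuickStatements statement boils down to a tab-separated item-property-value line, so the intended claim from the first row would end up as something like the line below. Every ID is a placeholder here; the angle-bracketed parts stand for actual Wikidata item and property IDs.

Q<Fungus A>    P<is spread by>    Q<Insect 1>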

On top of that, table values often don't contain the exact data, but abbreviations. It shouldn't be too hard to annotate abbreviations with their meaning before parsing the tables, but actually linking them to Wikidata entities is still a problem. This is, obviously, the next step, and I'll continue to work on it in the next few weeks. One thing I could do is search for entities based on their label. A problem with that is that I would have to make thousands of calls to the Wikidata API in a short period of time, so I'll probably create a dictionary with one big call, if the resulting size allows it.
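For illustration, a per-term lookup against the wbsearchentities endpoint of the Wikidata API would look roughly like this; it is exactly the kind of call that would have to happen thousands of times, hence the dictionary plan.

// Sketch: look up a Wikidata ID for a single label via wbsearchentities.
function searchEntity (label, callback) {
  var url = 'https://www.wikidata.org/w/api.php' +
    '?action=wbsearchentities' +
    '&search=' + encodeURIComponent(label) +
    '&language=en&format=json&origin=*'

  var xhr = new XMLHttpRequest()
  xhr.open('GET', url, true)
  xhr.addEventListener('load', function () {
    var results = JSON.parse(this.responseText).search
    callback(results.length ? results[0].id : null)
  })
  xhr.send(null)
}

searchEntity('Pinus sylvestris', function (id) {
  console.log(id) // the matching Wikidata ID, or null
})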

There is a pull request pending that would solve the issue norma has with tables, so I'll incorporate that into the program as well, making it easier for me as some things will already be annotated.

Conclusion: the 'curating' part of adding the statements to Wikidata is currently *very* important, but I'll do as much as possible to make it easier in the next few weeks.

Sunday, November 20, 2016

Citation.js Version 0.2.10: BibTeX and Travis

The new update (Citation.js v0.2.10) doesn't have a big impact on the API, but a lot has changed in the back end ("back end" as in the helper functions that are called when the API is used). First of all, there are Travis build tests now. They don't cover edge cases yet, but they do cover the basics.

Testing the basic specs

The main thing is the new BibTeX I/O system. It isn't completely new, but a lot has changed. Before, BibTeX files were parsed with RegEx, you know, the thing you don't want to be parsing important things with. This worked fine for the type of BibTeX I use in the test cases, but it failed in unexpected ways on differently formatted BibTeX, and it was a tedious piece of code, as RegEx usually is.

Now I have a character-by-character parser (taking escaped characters into consideration) that outputs clear error messages and recovers already parsed data when syntax errors occur. This parser converts BibTeX files into a JSON representation (which I call BibTeX-JSON), which can then be converted to CSL-JSON. There are some improvements here as well, but not as many.

Output works the other way around: CSL-JSON is first converted to BibTeX-JSON, a process that has improved a bit as well, and the BibTeX-JSON is then converted to BibTeX. This step now supports escaping syntax characters (e.g. '|' into '{\textbar}').
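Roughly, the escaping boils down to a lookup table of syntax characters and their BibTeX commands. A simplified sketch follows; only the '|' mapping above is taken from the actual code, the other entries are illustrative.

// Simplified sketch of escaping syntax characters for BibTeX output.
// The real mapping covers more (and possibly different) characters.
var bibtexEscapes = {
  '|': '{\\textbar}',
  '<': '{\\textless}',
  '>': '{\\textgreater}',
  '~': '{\\textasciitilde}',
  '&': '{\\&}',
  '#': '{\\#}'
}

function escapeBibtexValue (value) {
  return value.replace(/[|<>~&#]/g, function (character) {
    return bibtexEscapes[character]
  })
}

escapeBibtexValue('Food & Function') // returns 'Food {\&} Function'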

Sunday, November 13, 2016

Weekly Report 11: Making Object-Property-Value triples

In Weekly Report 10 I talked about searching for answers to the question "What height does a grown Pinus sylvestris normally have?". In the post, I looked at some of the articles returned by the query "Pinus sylvestris"[Abstract] AND height, and found interesting information in the tables. The next step was to extract this information.

So that's what I did. The first step was downloading fulltext XML files of the articles with getpapers. Secondly, I wanted to normalise the XML to sHTML with norma, but it currently has an issue with tables, so I resorted to parsing tables directly from the XML files. This wasn't really a downgrade; the main difference is arbitrary XML tags instead of e.g. <i> and <sup>. Although such markup is mostly about layout, it still conveys meaning: in the case of italics, for example, the text content is probably a species name.

There are more problems, but I will go through those later. First off, how do I get the computer to extract information from tables? Well, generally, tables have the following structure:

Subject Type | Property 1 | Property 2
Object A     | Value A1   | Value A2
Object B     | Value B1   | Value B2

A small program can convert this to the following JSON:

[
  {
    "name": "Object A",
    "type": "Subject Type",
    "props": [
      {
        "name": "Property 1",
        "value": "Value A1"
      },
      {
        "name": "Property 2",
        "value": "Value A2"
      }
    ]
  },
  {
    "name": "Object B",
    "type": "Subject Type",
    "props": [
      {
        "name": "Property 1",
        "value": "Value B1"
      },
      {
        "name": "Property 2",
        "value": "Value B2"
      }
    ]
  }
]
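Roughly, such a small program could look like this. It is a sketch that assumes the table has already been read into an array of rows of cell strings; the real code also has to deal with the XML and with the edge cases described below.

// Sketch: convert a table (array of rows of cell strings) into the JSON
// structure above. Assumes the first row is the head row and the first
// column contains the objects.
function tableToTriples (rows) {
  var head = rows[0]

  return rows.slice(1).map(function (row) {
    return {
      name: row[0],
      type: head[0],
      props: row.slice(1).map(function (value, index) {
        return {
          name: head[index + 1],
          value: value
        }
      })
    }
  })
}

tableToTriples([
  ['Subject Type', 'Property 1', 'Property 2'],
  ['Object A', 'Value A1', 'Value A2'],
  ['Object B', 'Value B1', 'Value B2']
])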

This is pretty easy. Real problems occur when tables don't use this layout. Some things can be accounted for. Take the example (see figure below) where a table doesn't have a head row, which means I can't read the properties. The solution is to look in the previous table and copy its properties. Or take empty values in the left column: usually this implies the row above contains all the necessary information, and it is done as a way to group table rows. Again, the solution is doing what a reader would do: look at the text in the cell above.

Example of split tables. From Ethnoveterinary medicines used for ruminants in British Columbia, Canada (DOI:10.1186/1746-4269-3-11)
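In that representation, the "look at the cell above" fix for empty cells in the left column could be a simple fill-down pass before the conversion (again a sketch, under the same assumptions as above):

// Sketch: fill empty cells in the first column with the value from the row
// above, mirroring what a human reader does with grouped table rows.
function fillDownFirstColumn (rows) {
  for (var i = 1; i < rows.length; i++) {
    if (rows[i][0] === '') {
      rows[i][0] = rows[i - 1][0]
    }
  }
  return rows
}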

But there are things that can't (easily) be accounted for as well. For example, when the table is too long for the page but narrow enough for two halves to fit side by side, the table gets broken apart and the halves end up stuck back together in the wrong place. A very common problem is a transposed table: the properties are in the left column and the objects in the top row. These problems aren't that hard to fix; the hard part is for the computer to figure out which layout is currently used.

JSON processing

But let's drop these issues for a moment. Assume enough data IS valid. We now have the JSON containing triples. I wanted to transform this to a format similar to ContentMine facts, as used by factvis. Because the ContentMine fact format is only used for identified terms for now, I had to come up with some new parts. It currently looks like this:

{
  "_id":"AAAAAAB6H6",
  "_source": {
    "term":"Acer macrophyllum",
    "exact":"Acer macrophyllum Pursh (Aceraceae) JB043",
    "prop": {
      "name":"Local name"
    },
    "value":"Big leaf maple",
    
    "identifiers": {
      "wikidata":"Q599523"
    },
    "documentID":"AAAAAAB6H5",
    "cprojectID":"PMC1831764"
  }
}

We have the fact ID, just as with ContentMine facts, and the _source object, containing the actual data. The fact ID is unique for every object, not every triple. This is to preserve context. The _source has a documentID and a cprojectID. Both say what article the triple was found in. exact has the exact text found in the cell, while term has the extracted plant name, if present. Otherwise, it contains the cell's contents. prop contains the text found in the property cell, and in some cases other data, such as a unit, if found. value contains the text in the value cell.

The property identifiers contains some IDs, if the term is identified. This is done by creating a 1.8-million-entry dictionary of species names linked to Wikidata IDs, built directly from Wikidata, and matching the terms against it. Props, values and other types of terms will hopefully have identifiers soon too.
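The matching step itself is then little more than a dictionary lookup. A sketch, assuming the dictionary has already been built as a plain label-to-ID map; building that map from Wikidata is a separate step.

// Sketch: attach Wikidata identifiers to a fact by looking the extracted
// term up in a label-to-ID dictionary, e.g. { 'Acer macrophyllum': 'Q599523' }
function addIdentifiers (fact, dictionary) {
  var id = dictionary[fact._source.term]

  if (id) {
    fact._source.identifiers = { wikidata: id }
  }

  return fact
}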

Further processing

To quickly see the results, I made some changes to factvis to make it work with my new fact data. Here it is. It currently has three sample data sets, each with 5000 triples. I myself have ten of these sets, and I will release more when the data is more reliable; currently, it's just a demo.

Triple visualisation demo

However, we can already see the outlines of what is possible. Even if the data isn't going to be reliable, I can set up a service (with QuickStatements) for people to curate the data and add it to Wikidata with a few clicks. And if 50000 triples (some of which may be invalid) don't sound like enough to set up such a service, remember that they come from a simple query returning only 174 papers.

Friday, October 28, 2016

Citation.js Version 0.2: NPM package, CSL and more

In the last two weeks I've been busy with making Version 0.2 of Citation.js. Here I'll explain some of the changes and the reasoning behind them.

In the past months I've updated Citation.js several times, and the changes included a Node.js program for the command line and better Wikidata input parsing. While I was working with the "old" code, I noticed some annoying issues in it.

One of the biggest things was the internal data format. When Cite(), the main program, parses input, it turns it into JSON with a standardised scheme, which is used everywhere else in the program, e.g. for sorting and outputting. The scheme I used was something I made up to accommodate the features the program had back when there was next to no input parsing, and you were expected to input JSON directly, either by file or by webform. It wasn't scalable at all, and some of the methods were patched so much they only worked in specific test cases.

Old interface of Citation.js (pre-v0.1). It would fetch the first paragraph of Wikipedia about a certain formatting style. Many of the supporting methods of this version stayed in v0.1.

Now I use CSL-JSON, the scheme used by, among others, citeproc-js, and the standard described by the Citation Style Language. It is designed by professionals, or at least by people more qualified to make standards. It is quite similar to my old scheme, with some exceptions. Big advantages are the way it stores date and person information. Before, I had to hope users provided names in the correct format; now it doesn't matter, as they get parsed to CSL. The same goes for dates. Another advantage is the new output potential. Besides outputting CSL-JSON, it is now possible to use citeproc-js directly, without extra conversion.
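For illustration, a CSL-JSON item with the name and date structures mentioned above looks like this (the values are made up):

[
  {
    "id": "example1",
    "type": "article-journal",
    "title": "An example article",
    "author": [
      {
        "family": "Smith",
        "given": "John"
      }
    ],
    "container-title": "Journal of Examples",
    "issued": {
      "date-parts": [ [ 2016, 10, 28 ] ]
    },
    "DOI": "10.1000/example"
  }
]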

Using a new data format also meant a lot of cleanup in the code. Almost all methods had to be re-written to account for it, but this was mostly a chance to write them more properly. Now, only Cite() exists in the global scope, which is good, because it means other parts don't take up variable names, etc. The entire program is now optimised for both browser and Node.js use, although it uses synchronous requests. From the perspective of the program it is necessary to be able to use synchronous requests, but users can largely bypass this. It is mostly used for Wikidata parsing. An example:

Let's take the input "Q21972834". This is a Wikidata Entity ID, and it points to Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data (10.1093/bioinformatics/btt178). If Cite() only has the ID, it has to fetch the corresponding data (JSON). Because Cite() is called as a function, and is expected to return something, it has to make the request synchronously. However, if the user fetches the data asynchronously and calls Cite() in the callback, that is bypassed.

var xhr = new XMLHttpRequest();

xhr.open(
  /* Method */ 'GET',
  /* URL    */ 'https://www.wikidata.org/wiki/Special:EntityData/Q21972834.json',
  /* Async  */ true
)

xhr.addEventListener( 'load', function () {
  var data = this.responseText
  
  var cite = new Cite( data )
  
  // Etc...
} );

xhr.send( null )

The JSON gets to Cite() with only async(hronous) requests. The problem is that this JSON doesn't contain everything. Instead of the name of the journal it's published in, it contains a numeric reference to it. To get that name, I have to make another request, which has to be sync as well. I hope there is some way in the Wikidata API to turn references off and names (or "labels") on, but I haven't found one yet. That being said, I had to search a long time to find the on-switch for cross-domain requests in the Wikidata API as well, so it might be hidden somewhere. If that's the case then sync requests can be bypassed everywhere, which would be nice, as browsers are on the verge of dropping support.
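If such a switch doesn't exist, the extra request could at least be made asynchronously as well, before calling Cite(), e.g. via the wbgetentities endpoint. This is a sketch, not something Citation.js does yet, and the entity ID below is a placeholder.

// Sketch: resolve the label of a referenced entity (e.g. the journal) with
// a second asynchronous request, instead of a synchronous one.
function getLabel (id, callback) {
  var url = 'https://www.wikidata.org/w/api.php' +
    '?action=wbgetentities&ids=' + id +
    '&props=labels&languages=en&format=json&origin=*'

  var xhr = new XMLHttpRequest()
  xhr.open('GET', url, true)
  xhr.addEventListener('load', function () {
    var entity = JSON.parse(this.responseText).entities[id]
    callback(entity.labels.en.value)
  })
  xhr.send(null)
}

getLabel('Q1234567', function (label) {
  console.log(label) // the referenced entity's name
})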

Probably the biggest news is that it is an npm (Node Package Manager) package (link). This means you can download the code without having to clone the git repository or copy the source file. It's even available online, although the default npm demo seems to be broken. Luckily, I have a regular demo as well. As of writing, the npm page says the package has been downloaded 234 times already, but that number has been the same for a day, so I guess there is an issue with npm. If not, that's really cool.