Thursday, April 11, 2019

Citation.js Version 0.4: The Catch-up

Citation.js Version 0.4: The Catch-up

After about two years, finally v0.4.0 is ready for a release. Whether it is a real milestone given that there have been prereleases for about two years, while 0.3.x only lasted three weeks, is for you to decide but I am glad that everything I planned for this version is implemented now.

Citation.js logo

Changes

So, let’s go through the major changes.

Modules

Most importantly, the code is fully modulated now. The different components, the core, the CLI and all the individual format plugins, are available on their own now. The citation-js is now designed as a shell around those components, maintaining full backwards compatibility (apart from mapping changes) until that is no longer required.

Mappings

A lot of different formats where either improved or introduced. The main newcomer is RIS, still only as output format. Support for NDJSON output was also added. The handling of styling and case-preserving brackets in BibTeX fields was improved, although the method of parsing those files is still lagging behind.

Thanks to (independent) practical testing by Egon Willighagen and Jakob Voss and the wikidata-sdk library by Maxime Lathuilière we were also able to improve the Wikidata mapping quite a lot, although there is still work ahead.

Configuration

Apart from the potential for customization that comes with the separate modules, there is quite some new configuration available. First of all, input parsing has some options: strict which basically switches between errors and failing silently (not-so-long ago the default behavior); and target which allows the user to specify a certain point at which the parsing should be stopped, mainly useful for debugging.

Furthermore, individual input formats can be configured now too. For example, the default languages used in the Wikidata plugin can be configured now. The methods to add CSL templates and locales which was already available is using the same mechanism now.

Finally, CSL output with citeproc-js has been amended. First and foremost, support for citation has been added (or rather, the original citation-* has been renamed to bibliography), so it is possible to get actual citations now. Also, the nosort option has been added and the prepend/append options been improved.

Stability & best practices

Also important, there’s a lot more testing now and the commit messages and code style follow standard guidelines. Making a new repository for the different modules actually helped with that, allowing me to re-evaluate the decisions I made in that aspect.

The full changelog is available here.

Paper & Preprint

In the middle of all that (literally, the development halted for a few months) I wrote a paper about Citation.js, which I am currently revising after review. The preprint is available here.

Further development

There’s still a lot to be done, most of which has been discussed in the “Discussion” section of the preprint. I’ll go into some things that are planned for the (relatively) near future, other than additional mappings of course.

(Another) Wikidata refactor

Part of the recent development to finish the release was a refactoring of the Wikidata parsing code to facilitate some changes and mainly to reduce unnecessary code duplication, originally the result of sync/async variants of almost every function. However, while working on that I came up with an even better refactor which should minimize the problems of under-fetching that I described in the preprint.

To recap, under-fetching is the problem of not being able to fetch enough information in the desired amount of HTTP requests. For example, in Wikidata it’s not possible to get item labels of property values such as P50 (author), and they require another HTTP call. It is not possible to fetch them together with the publication item you are fetching since you do not know who the authors are yet.

So the minimal number of requests (not taking the limit of 50 items per request) is one for each “level” you need to go down: author labels would be the second level, and fetching the labels of name items if you decide to use P735 (given name) and P734 (family name) would be the third. That’s what this refactor will try to accomplish, and it shouldn’t be too complex but still non-trivial.

Turning the chain parser into a tree parser

Currently the input is parsed iteratively, in a “chain” of types. For example, a Wikidata ID becomes a Wikidata API URL, which returns JSON, which gets parsed, and the resulting API response gets transformed into CSL-JSON. This also ensures that people could input a Wikidata API URL if they want, and that fetching and parsing JSON from a web resource doesn’t have to be implemented over and over again.

When encountering arrays, if they are not recognized as some special array, each element is parsed until the target (CSL-JSON) is reached, at which point it returns. However, this proved problematic with some new features, like the target option (parse until this format is reached, instead of the default CSL-JSON). Firstly, such options cannot be passed down to the other parsing functions, and secondly if a certain target format is reached in the array elements then the total format is always going to be an array of that, and will never match the target.

A refactor of the parser, branching it instead of whatever is going on currently, should counter this.

More citeproc interactions

citeproc is currently used in an almost state-less manner, which works fine for bibliographies but is quite limiting for working with citations. While users could redirect CSL output to citeproc themselves, why not do it for them with more bibliography management, which Citation.js mostly lacks at the moment.

Ditching CSL-JSON as the central format?

CSL-JSON is great, but also quite limiting considering the available properties, the general problems of which are also discussed in the preprint. While additions could be made to CSL in terms of variables, which Frank Bennet has already done with CSL-M, and which the CSL team is doing now with a new release finally specifying a computer program type, CSL remains intended for generating citations, not for storing bibliographical data. At least, according to a quote that I cannot find anymore.

However, this becomes quite clear practically. I count 10 of the 78 fields being specific to citing the item. An additional 4 are (usually) redundant and 3 poorly defined, one with just a question mark as description. On top of that, 12 could be replaced by allowing references to other publications, which would require just 4 fields instead, and 4 that could be replaced by other entities such as venues requiring no additional properties.

Still, moving to a different format is quite something and this will have to be discussed. Of course, support for CSL-JSON wouldn’t be removed either way, just not exclusively used for storage. Also, these changes would take place way after any pending changes.


And that is it for this post. Any feedback on the release and ideas for the things discussed above is very welcome here, in the GitHub repo or on Twitter (@larswillighagen).

Thursday, January 3, 2019

Citation.js: Wikidata Subclasses

Citation.js: Wikidata Subclasses

I am in the process of creating a better way to map CSL types to Wikidata types together with Jakob Voß, and while listing all subtypes of web page, I came across a number of these cases (link):

#defaultView:Graph
SELECT ?a ?aLabel ?b ?bLabel WHERE {
  wd:Q58803899 wdt:P279* ?a .
  ?a wdt:P279 ?b .
  FILTER EXISTS {
    ?b wdt:P279* wd:Q386724 .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

I’m no expert on Wikidata data models, nor on the semantics of each of those items in specific, but I’m pretty sure Count of Barcelos (Q58803899) is not a subclass of web page (Q36774). However, the individual parts seem pretty reasonable, so where does it go wrong then?

Luckily, the description of subclass of (P279) provides some good guidance:

all instances of these items are instances of those items; this item is a class (subset) of that item.

So let’s check those:

  1. All counts of Barcelos are Portuguese counts (conde), as Barcelos lies in Portugal
  2. All condes are counts
  3. Counts are not hereditary titles themselves, so I think this should be a instance of (P31). Also, counts are indirectly subclass of royal or noble rank through hereditary titles, but directly instance of royal or noble rank.
  4. Hereditary titles do not seem like an exact subclass of royal or noble ranks to me, but it passes the basic test
  5. I am unsure about royal or noble rank as subclass of rank, since the latter seems way more abstract, but perhaps that’s the point
  6. (rank -> first-order metaclass) I have no idea what the intention with these metaclasses is, so I’m going to assume this is all correct (bold move)
  7. first-order metaclass -> fixed-order metaclass
  8. fixed-order metaclass -> Wikidata metaclass: “class of classes, class whose instances are classes”
  9. Wikidata metaclass -> class or metaclass of Wikidata ontology (should maybe be P31 as well, still no clue though)
  10. class or metaclass of Wikidata ontology -> Wikidata item: finally back into some form of concrete-ness. Problem being, not all instances of royal or noble ranks are Wikidata items, so something went wrong somewhere in between, maybe even at (6).
  11. Wikidata item -> Wikidata internal entity
  12. Wikidata internal entity -> Wikidata internal item
  13. Wikidata internal item -> MediaWiki page
  14. MediaWiki page -> web page: actually pretty reasonable, instances of Wikidata items are web pages.

There are a little less than 900 instances of web page so the impact is smaller than I had expected, but it’s still annoying. There are 191 subclasses of conde, not to mention the total of 280 royal or noble ranks that have no business being a subclass of web page.