Monday, September 4, 2017

ctj rdf: Relations Between Conifers Mentioned in Articles

Below is part two of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.

Here’s a more detailed example. We are using my dataset available at 10.5281/zenodo.845935. It was generated from 1000 articles that mention ‘Pinus’ somewhere. This one has 15326 statements, whereof 8875 (57.9%) can be mapped, taking ~50 seconds. Now, for example, we can list the top 100 most mentioned species. The top 20:

Species Hits
Pinus sylvestris 248
Picea abies 177
Pinus taeda 138
Pinus pinaster 120
Pinus contorta 96
Arabidopsis thaliana 91
Picea glauca 77
Pinus radiata 77
Pinus massoniana 72
Pseudotsuga menziesii 65
Oryza sativa 56
Pinus halepensis 56
Pinus ponderosa 55
Pinus banksiana 53
Pinus koraiensis 53
Picea mariana 52
Pinus nigra 51
Pinus strobus 46
Quercus robur 45
Fagus sylvatica 45

The top of the list isn’t surprising: mostly pines, other conifers, other trees, Arabidopsis thaliana which I’ve seen represented in pine literature before, and Oryza sativa or rice, which I haven’t seen before in this context.

Note that only 248 of the 1000 articles mention the top Pinus species. This may be because the query getting the articles was quite broad. Also note that this doesn’t take into account how often an article mentions a species; a caveat of the current rdf output.

Going of this list, we can then look what non-tree or even non-plant species are named most often in conjunction with a given species, or, in this case, a genus. Top 20:

Species 1 Species 2 Co-occurences
Picea abies Pinus sylvestris 98
Picea abies Pinus taeda 56
Arabidopsis thaliana Pinus taeda 47
Picea glauca Pinus taeda 43
Arabidopsis thaliana Oryza sativa 43
Pinus pinaster Pinus taeda 41
Pinus pinaster Pinus sylvestris 41
Picea abies Picea glauca 41
Arabidopsis thaliana Picea abies 37
Pinus contorta Pinus sylvestris 36
Betula pendula Pinus sylvestris 36
Pinus sylvestris Pinus taeda 36
Pinus nigra Pinus sylvestris 35
Pinus contorta Pseudotsuga menziesii 32
Picea abies Pinus contorta 31
Picea abies Pinus pinaster 30
Arabidopsis thaliana Physcomitrella patens 30
Oryza sativa Pinus taeda 29
Pinus sylvestris Quercus robur 29
Picea abies Picea sitchensis 28

Interesting to see that rice is mostly mentioned with Arabidopsis. Let’s explore that further. Below are species named in conjunction with Oryza sativa.

Species Co-occurrences
Arabidopsis thaliana 43
Pinus taeda 29
Picea abies 24
Physcomitrella patens 23
Populus trichocarpa 21
Glycine max 20
Vitis vinifera 17
Picea glauca 17
Pinus pinaster 16
Selaginella moellendorffii 15
Pinus sylvestris 13
Triticum aestivum 12
Pinus contorta 10
Picea sitchensis 10
Ginkgo biloba 10
Pinus radiata 10
Ricinus communis 9
Amborella trichopoda 9
Medicago truncatula 9
Cucumis sativus 8

So attention seems divided between trees and more agriculture-related plants. More to explore for later.

View all posts in this series.

Sunday, September 3, 2017

ctj rdf: Part One

Below is part one of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.

ctj has been around for longer, and started as a way to learn my way into the ContentMine pipeline of tools, but turned out to uncover a lot of possibilities in further processing the output of this pipeline (1, 2).

The recent addition of ctj rdf expands on this. While there is a lot of data loss between the ContentMine output and the resulting rdf, the possibilities certainly are no less. This is mainly because of SPARQL, which makes it possible to integrate in other databases, such as Wikidata, without many changes in ctj rdf itself.

Here’s a simple demonstration of how this works:

  1. We download 100 articles about aardvark (classic ContentMine example)
  2. We run the ContentMine pipeline (norma, ami2-species, ami2-sequence)
  3. We run ctj rdf

This generates data.ttl, which holds the following information:

  • Common identifier for each article (currently PMC)
  • Matched terms from each article (which terms/names are found in which article)
  • Type of each term (genus, binomial, etc.)
  • Label of each term (matched text)
Example data.ttl contents

Note that there are no links to Wikidata whatsoever. When we list, for instance, how often each term is mentioned in an article (in the dataset), we only have string values, a URI and some custom namespace URIs.

However, in this format, we can easily use the information in these papers in conjunction with the enormous amount of data in Wikidata with Federated Queries.

To accomplish this we first link the identifier in our dataset to the ones in Wikidata; then we link the matched text of the term to the taxon name in species in Wikidata. This alone already gives us a set of semantic triples where both values in every triple are linked to values in the extensive community-driven database that is Wikidata.

Example query, counting how often each species is mentioned, and mapping them to Wikidata
Results of the above query

Now say we want to list the Swedish name of each of those species. Well, we can, because that info probably exists on Wikidata (see stats below). And if we can’t find something, remember that each of those Wikidata values are also linked to numerous other databases.

Again, this is without having to change anything in the rdf output (to be fair, I forgot to list an article identifier in the first version of the program, but that could/should have been anticipated). Not having to add this data to the output has the added benefit of not having to make and maintain local dictionaries and lists of this data.

Some stats:

  • Number of articles: 100 (for reference)
  • Number of ‘term found in article’ statements: 1964
  • Number of those statements that map to Wikidata: 1293 (65.3% of total)
  • Number of mapped statements with Swedish labels: 1056 (81.7% of mapped statements, 53.8% of total)
  • Average number of statements per article: 19.64, 12.93 mapped

Note that not all terms are actually valid. A lot of genus matches are actually just capitalised words, and a lot of common species names are abbreviated, e.g. to E. coli, making it impossible to unambiguously map to Wikidata or any other database. This could explain the difference between found ‘terms’ and mapped terms.

View all posts in this series.

Sunday, August 27, 2017

Citation.js: Endpoint on RunKit

A while back I tweeted about making a simple Citation.js API Endpoint with RunKit.

Using the Express app helper and some type-checking code:

const express = require('@runkit/runkit/express-endpoint/1.0.0')
const app = express(exports)
const Cite = require('citation-js@0.3.0')
const docs = ''

const validStyle = style => ['bibtxt', 'bibtex', 'csl', 'citation'].includes(style) || /^citation-\w+$/.test(style)
const validType = type => ['string', 'html', 'json'].includes(type)

const last = array => array[array.length - 1]

const getOptions = params => {
  const fragments = params.split('/')
  let data, style = 'csl', type = 'html'

  // parse fragments
  // (got pretty complex, as data can contain '/'s)

  return {data, style, type}

.get('/', (_, res) => res.redirect(docs))
.get('/*', ({params: {0: params}}, res) => {
  const {data, style, type} = getOptions(params)
  const cite = new Cite(data)
  const output = cite.get({style, type})

Full code here. Makes an API like this:



  • $CODE is the API id (vf2453q1d6s5 in this case),
  • $DATA is the input data (DOI, Wikidata ID, or even a BibTeX string),
  • $STYLE (optional) is the output style,
  • and $TYPE (optional) is the output type (basically plain text vs html).

This makes it possible to link to a lot of Citation.js outputs:

Citation.js Version 0.3 Released!

It’s been in beta since January 30, but here it is: Citation.js version 0.3. Below I explain the changes since the last blog post, and under that is a more complete change log. Also some upcoming changes.


Recent changes

Custom citation formatting

  • One of the remaining milestones for v0.3.0 was a better API for custom citation formatting, as outlined in the previous post. The exact implementation differs a bit, but is essentially the same. Docs can be found here.


  • Cite#add() and Cite#set() now have asynchronous siblings, Cite#addAsync() and Cite#setAsync().
  • Both async and sync test cases are needed for full coverage, so I won’t just change one into another, as I previously suggested.


  • The BibTeX parser got a big update, improving both the text-parsing code and the JSON-transforming code.
  • The Wikidata issue (sorry for the GIF) is fixed now, and I like the current API and code. The solution is inspired by the recent BibTeX refactoring.

Older changes

Also look at previous blog posts for more info. List goes from newer to older.


  • Code coverage
  • Test case and framework updates
  • DOI support
  • CLI fixes
  • Custom sorting
  • Cite as an Iterator
  • Bib.TXT support
  • Give jQuery.Citation.js its own repository
  • Async support
  • Build size disc & dependency badges
  • Exposing internal functions and splitting up the main file into smaller files
  • Change of some internal Cite properties
  • Using an actual code style and use of ES6+
  • Exposing version numbers


  • CLI bugs
  • Variable name typos
  • Output JSON is now valid JSON (I know, right?)
  • Wikidata ID parsing
  • Cite#getIds()
  • CLI not working
  • Wikidata tests not working
  • CSL not working
  • Browserify not working

Upcoming changes

  • Web scrapers and, Open Graph, etc.
  • Input parsing extensions
  • More coverage
  • More input formats, like better support for BibJSON, and new formats.

Sunday, July 23, 2017

Citation.js: DOI update and more stability

Finally, Citation.js supports DOIs. It took a while, but it’s finally there. One big ‘but’: synchronous fetching doesn’t work in Chrome. I’m still looking into that, but I should be recommending you to use Cite.async() anyway. Also in this blog post: more stability in Cite#get(), a welcome byproduct of using the DOI API, and looking forward (again).

Citation.js logo

DOI support

So, DOIs. That was (and is) a though one. Let me guide you through the process.

Initial development

I have been planning to add support for DOI input since the beginning of this year, going off the original feature request, and the pressure to implement it has grew right along me realising how important they are in the world of citations. Back then I thought it would be smart to query Wikidata for DOIs, as I had just finished some code on Wikidata ID input, that used regular API as well as the query part. More recently however, I learned about the Crossref API and, even more helpfully, the DOI Content Negotiation API, which combines the APIs from Crossref, DataCite and mEDRA. I’ll quote a piece from the docs:

Content negotiation allows a user to request a particular representation of a web resource. DOI resolvers use content negotiation to provide different representations of metadata associated with DOIs.

A content negotiated request to a DOI resolver is much like a standard HTTP request, except server-driven negotiation will take place based on the list of acceptable content types a client provides.


Basically, you just make a request to$DOI (where hopefully obviously $DOI stands for the DOI you’re looking for) with an Accept header set to the format you want your data in. Conveniently, one of those formats is CSL-JSON, here charmingly named application/vnd.citationstyles.csl+json, but nonetheless the direct input format for Citation.js. This would allow me to only have to write code to extract DOIs from a whitespace-separated list (const dois = string.trim().split(/\s+/g)) and fetch them from the server (await Promise.all(, where fetchDoi is a simple function:

async function fetchDoi (doi) {
  const headers = new Headers({
    Accept: 'application/vnd.citationstyles.csl+json'
  const response = await fetch('' + doi, {headers})
  return response.json()

And Headers and fetch are built-in. Note that this uses some advanced syntax. Read more about async/await, Promise.all(), shorthand property names and const.

First problems

It wasn’t that simple. Heck, even getting the API to work took a (long) while and a lot of debugging other people’s code. You see, to make synchronous requests in Node.js I use sync-request, which is a synchronous wrapper around its asynchronous cousin, then-request, which is a wrapper around an also asynchronous but more low-level cousin, http-basic. They all use the same scheme of options. Because of that, sync-request passes its options down all the way to http-basic, so options from http-basic but not in sync-request can be used too. Turns out, http-basic has an option that removes all headers when a request is redirected, but it isn’t documented in sync-request. This removes, among other things, the Accept header, which, on omission, produces an HTTP 501 code not documented in the API.

Let’s just say I filed a feature request to mention this in the docs.

The actual code didn’t need much change, so soon I got the first JSON response, and since the API supports CSL-JSON out of the box, I didn’t need to do much else, except building some input recognition infrastructure.

CSL-JSON problems

However, nothing is perfect, and so is that out-of-the-box support of CSL-JSON. This was actually picked up quite well, and I’m happy with the outcome. Basically, extra code needed to fix this consists of two parts. One part transforms some invalid but essential parts of the API response to its CSL-JSON counterpart. The second part filters invalid and less essential parts out of the data, and basically acts as a safeguard for methods depending on certain variables having certain types. More on this later.

More problems, from an unexpected source

Yep, it isn’t even over yet. This may be the biggest problem of them all, in that it hasn’t got a solution yet, and that I don’t expect it to have one in the near future, and that it is very likely there is no solution. It depends, it really does. Anyway, let’s get to the point.

Chrome doesn’t support synchronous redirects from CORS domains. Or something. Even that isn’t really clear. Basically, the problem is that generally, the DOI content negotiation works like this (assuming we’re in the browser): the user domain requests data from, which redirects to (or a different domain). Both are different domains. Both synchronous and asynchronous requests directly to (or a different domain) work. Asynchronous requests via work too. Even synchronous requests via work, but not in Chrome. I don’t know why, I don’t know when (as it does support normal synchronous CORS requests), and I can’t find any document that says why it does that. The only things that have shed some light on this are a comment I can’t find anymore that only stated you can’t do that at all and this answer, that almost describes the exact problem I’m having, but fails to give any answer other than that it’s weird that Firefox and IE “don’t follow the jQuery spec.” Note that the jQuery spec doesn’t mention anything about this apart from the note that this shouldn’t be done ever anyway.

One obvious answer is that every mayor browser is currently in the process of factoring out all synchronous effect, and that Chrome might have taken this a step further, but even then, there should be some document somewhere that says it does this, otherwise I’d just consider it a bug. I haven’t gotten around to it yet, but I will try to see how this would affect using Citation.js synchronously in Web Workers. If that works, it means that it actually is another rule to prevent synchronous requests on the main thread, which is fine by me, although slightly inconvenient for people who do not care about user experience, such as me when I’m lazy.


DOIs work. Because of all the ranting, I haven’t shown you the api, but it’s pretty familiar. To use the CLI, do this:

> npm i -g citation-js
# to install the new version of the command
# might need to be run with admin rights

> citation-js -t 10.1021/ja01577a030 -f string -s citation-apa
# Fetches data from doi: 10.1021/ja01577a030
# in the format: string; in the formatted style: apa

  Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 54415444.

To use the API, do this:

const Cite = require('citation-js')

// For synchronous use
const data = new Cite('10.1021/ja01577a030')
const output = data.get({
  type: 'string',
  style: 'citation-apa'

/* output =
  Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 5441–5444.

// For asynchronous use
Cite.async('10.1021/ja01577a030').then(data => {
  const output = data.get({
    type: 'string',
    style: 'citation-apa'

  /* output =
    Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 5441–5444.

Stability in Cite#get()

As I mentioned earlier, I had to write some code to filter out invalid but non-essential props from the CSL-JSON. Otherwise certain methods used by Cite#get() throw errors, as they except these props to have certain types. Because of this filtering function, I don’t have to type-check ever again, I hope. The implementation is pretty simple. Specially structured props like author and other names, and issued and other dates, get caught and handled on their own. Other props get checked against a map of expected data types and removed if they don’t match, and if the bestGuessConversions flag is set, if they also can’t be converted reliably. Note that, as of now, all Cite.get.*() functions expect CSL-JSON cleaned by Cite.parse.csl().

Version 0.3.0

Implementing DOI input was one of the big milestones on the way to version 0.3.0, besides exposing internal methods, making async input parsing available and generally making the browser version and the CLI less bad. Now that those three things are done, I think it’s a good moment to see what still needs to be done for the v0.3.0 release.

Better custom formatted citations

Using formatted citations is fine, but when you try to use custom ones, either because the style guide you want or need to follow isn’t built-in, or because you have some special use case, things may get confusing. Currently, the API works (should work) like this: when you pass a template in the Cite#get() options (important: not in the new Cite(<data>, <options>) options), it gets registered in a register of citeproc-js engines. After that, you can use it by referencing the name you used in the regular style option. This would work better with an API similar to this:

const template = '...'
const templateName = 'custom'

// Suggested new API
Cite.CSL.template.register(templateName, template)

// Nothing new, will stay the same
const data = new Cite(...)
  type: 'html',
  style: 'citation-' + templateName

Also, sometimes you just want to append or prepend some text or HTML frames, and implementing that with CSL templates takes time and effort and the result isn’t always that pretty, as templates don’t support direct HTML. There will be a new API for that too, probably like this:

// prepends the id like this: "[$ID]: "
const prepend = ({id}) => `[${id}]: `
// appends an altmetric widget
const append = ({DOI}) => `<span class='altmetric-embed' data-doi='${DOI}'></span>`
// Nothing new, will stay the same
const data = new Cite(...)
  type: 'html',
  style: 'citation-' + templateName,
  // Suggested new API
  // both properties either function or constant string
  append: append,
  prepend: prepend

One thing I still have to look at is extending the wrapping HTML elements outputted by citeproc-js.

Use asynchronism better

Now that we have asynchronous input parsing with Cite.async(), it may be interesting to look at asynchronous output formatting as well. There are also some functions that can use Cite.async() but don’t, like Cite#add() and Cite#set(), and the test cases are generally synchronous, with some special cases for async, while it should probably be the other way around. Adding a coverage tester will assist in determining which synchronous test cases can go.


There are still some things I’d like to refactor, like the Wikidata parser. I don’t like it right now, especially the hack to merge different types of authors.

After v0.3.0

Since we’re almost at the end of v0.3.0, I’ll also outline some of my plans for future versions.

  • BibJSON input (already partly supported)
  • Extensions (input parsing, output, etc.)
  • Zotero input (maybe not worth the work, as they already support export to CSL)
  • Scraper (could just be getting DOI and using my existing work)
  • Coverage testing and an expansion of test cases
  • More work on the dependants:

Sunday, June 4, 2017

Microscopic photography: Part 1

I got to make photographs of pre-made specimens with a microscope, and I wanted to show some of them. This is part one of a longer series; there were a lot of specimens.

Below are details of the cross section of a basswood stem.


Outer layer


Human artery tissue.

Human bone tissue (with some unidentified dark stuff).

More next week!

Monday, May 22, 2017

Citation.js: Async, Showdown and Bib.TXT

I worked on updates for Citation.js in the past few weeks, and I thought I'd go over them in this post.


First of all, Citation.js now has support for asynchronous parsing, so it won't lag your app as much when it uses e.g. the Wikidata API. This is good, as synchronous requests are not only blocking your app, but also deprecated in most major browsers. This allows for more development on input formats where additional data needs to be fetched, like a DOI as input. I could have done this before, but this order made more sense to me. The API is simple:

// With callback

Cite.async($INPUT, $OPTIONS, function (data) {
  data // instance of Cite
  // Further manipulations...

// Or with a promise, like this:

Cite.async($INPUT, $OPTIONS).then(function (data) {
  data // instance of Cite
  // Further manipulations...

// Where $INPUT is input data and $OPTIONS is options
// $OPTIONS is optional in both cases

The promise-returning part is good when using the async await syntax, i.e. var data = await Cite.async($INPUT, $OPTIONS).


Unfortunately, no actual extensions to the Cite object. (However, extensions in input parsing is planned in v0.4, and the API for output formatting is getting a rework in v0.3 (this and more)).

I'm talking about citation.js-showdown, citation.js-form and citation.js-replacer (coming soon). Citation.js-form is just jquery.citation.js in its own repository, citation.js-replacer will be a small tool to easily put references on your site, without having to bother with writing scripts, but, as a consequence, with fewer options. Citation.js-showdown, however, is a (functioning) Showdown plugin, that makes it easier to make references and bibliographies in a markdown document.


A few days ago I heard about Bib.TXT, and since implementing it in Citation.js only meant parsing a relatively simple text format (all the fields are the same as the already supported BibTeX), I thought: Why not? So there you have it, Bib.TXT support in Citation.js. That's good, because, among other things, you don't have to bother with weird (La?)TeX Unicode workarounds. And with Citation.js, you can easily convert Bib.TXT to BibTeX. Below as a command:

$ npm i -g citation-js@0.3.0-7
$ citation-js --input inputFile.txt --output-format string --output-style bibtex > out.bib

Homepage improvement

I changed some things on the Citation.js homepage. It should now be more responsive and have more room for expansion, and I added a demo and a small list of the "extensions" mentioned above. Oh, and the README has the same banner as the page now.

Citation.js banner