Syntaxus baccata: 2017

Friday, December 22, 2017

Citation.js Version 0.4 Beta: New Docs and Input Plugins

It’s been a while. Really. But now it’s back, with a new release: v0.4.0-0, the v0.4 beta. Below I explain some of the changes in this release, and then the road-map of Citation.js for v0.4 and v0.5. Also, Citation.js has a DOI now:

Citation.js is now on @ZENODO_ORG, and has a DOI! https://t.co/X2ThPA2S1I
— Lars Willighagen (@larswillighagen) October 11, 2017

Also, @jsterlibs tweeted about Citation.js.

Release

Input plugins

The main change in this release, is the addition of input plugins. This is the first step towards releasing v0.4, as explained below. Although there’s specific documentation available here and here, I’ll put an example here as well, as the API can be confusing.

The API for registering input plugins will be changed once or twice before the release of v0.4, once to make it less complex and weird, and possibly a second time to incorporate output plugins.

Adding a format

Say you wanted to add the RIS format (I should probably do that sometime). First, let’s define a type, to set things off.

const type = '@ris/text'

I can’t really define the actual parser here, but I’ll add a variable in the example code.

const parse = data => { /* return CSL-JSON or some format supported by any other parser */ }

Testing if a given string is in the RIS format, we’ll use regex. This regex matches if any line starts with two alphanumerical characters, two spaces and a hyphen. That’s a pretty fuzzy match; all lines should start with that sequence. However, this is just an example, and proper regex would be less elegant.

const risRegex = /^\w{2}\ {2}-/gm

Now, let’s define the dataType of RIS input. When using regex to test input, the dataType is automatically determined to be 'String' anyway, but for the sake of being clear:

const dataType = 'String'

Now, to combine it all:

Cite.parse.add(type, {
  parse,
  parseType: risRegex,
  dataType
})

Changing parsers

Now, say someone else wrote that code above, and you need to use it without modifying it, but you want a better regex? That can be arranged:

Cite.parse.add(type, {
  dataType: 'String',
  parseType: /^(?:\w{2}\ {2}-.*\n)+(\w{2}\ {2}-.*)?$/g
})

Because the options dataType, parseType, elementConstraint and propertyConstraint are all treated as one thing, you need to pass everyone of those when replacing the type checker. dataType is still not actually mandatory in this example, but is passed to demonstrate this.

Disabling a format

There is currently no way to remove a format.

In this scenario, some plugin registered a new format to get citation data from github.com URLs. Unfortunately, it doesn’t work. Instead of recognising the URL as a GitHub-specific URL, the type is @else/url, the generic URL type. This is because the generic version was registered earlier, and there is not yet a category for generic types (see #104). If you don’t use the @else/url type checker anyway, you can disable it like this:

Cite.parse.add('@else/url', {dataType: 'String', parseType: () => false})

I’m not sure why the dataType is needed here, as you’re disabling the parser anyway, but it doesn’t seem to work without it.

Docs

Proper CLI docs

Recently, I’ve been updating the documentation on Citation.js, and it has been a pleasant experience, despite some hiccups in the documentation engine. There are tutorials now, and a lot of the JSDoc comments have been improved. I still want to improve some things in the theme, like clickable header links (similar to GitHub and NPM behaviour), and showing sub-tutorials in the navigation.

Backwards compatibility

This release should be largely backwards compatible. If there are any regressions, please report them in the bug tracker.

Road-map

`v0.4`

v0.4 “is about making it easier to expand on input and output formats, possibly by creating schemes and methods parsing those schemes that can, say, convert BibJSON to CSL JSON based on a JSON file, something that can be stored independently of implementation.”

The idea is to allow for all kinds of plugins, both in input parsing and output formatting, to be registered on the Cite object, and to treat all internal parsers and formatters the same. Currently, input parser plugins are possible (see above), although they will be improved before v0.4. Output parsers will be made soon too, and will be backwards-incompatible, because of the change in the output option format.

`v0.5`

Currently, the plan is to use JSON-LD as the internal format in v0.5, while still keeping CSL-JSON as the internal scheme. Any input should then be converted to a list (or object) of fields, which will be added to the internal data store. Each field should be transformed to the CSL-JSON scheme individually, and added as a copy. As a result, there won’t be data loss when certain fields aren’t available in CSL-JSON, but are in the input and output schemes. It should also make it easier to write parsers for e.g. BibJSON.

Of course, edge cases should be taken care of. For example, there is no Wikidata property for the CSL-JSON field original-title, but that information can still be derived from the labels combined with the P364 (original language) property.

Monday, September 4, 2017

ctj rdf: Relations Between Conifers Mentioned in Articles

Below is part two of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.

Here’s a more detailed example. We are using my dataset available at 10.5281/zenodo.845935. It was generated from 1000 articles that mention ‘Pinus’ somewhere. This one has 15326 statements, whereof 8875 (57.9%) can be mapped, taking ~50 seconds. Now, for example, we can list the top 100 most mentioned species. The top 20:

Species	Hits
Pinus sylvestris	248
Picea abies	177
Pinus taeda	138
Pinus pinaster	120
Pinus contorta	96
Arabidopsis thaliana	91
Picea glauca	77
Pinus radiata	77
Pinus massoniana	72
Pseudotsuga menziesii	65
Oryza sativa	56
Pinus halepensis	56
Pinus ponderosa	55
Pinus banksiana	53
Pinus koraiensis	53
Picea mariana	52
Pinus nigra	51
Pinus strobus	46
Quercus robur	45
Fagus sylvatica	45
…	…

The top of the list isn’t surprising: mostly pines, other conifers, other trees, Arabidopsis thaliana which I’ve seen represented in pine literature before, and Oryza sativa or rice, which I haven’t seen before in this context.

Note that only 248 of the 1000 articles mention the top Pinus species. This may be because the query getting the articles was quite broad. Also note that this doesn’t take into account how often an article mentions a species; a caveat of the current rdf output.

Going of this list, we can then look what non-tree or even non-plant species are named most often in conjunction with a given species, or, in this case, a genus. Top 20:

Species 1	Species 2	Co-occurences
Picea abies	Pinus sylvestris	98
Picea abies	Pinus taeda	56
Arabidopsis thaliana	Pinus taeda	47
Picea glauca	Pinus taeda	43
Arabidopsis thaliana	Oryza sativa	43
Pinus pinaster	Pinus taeda	41
Pinus pinaster	Pinus sylvestris	41
Picea abies	Picea glauca	41
Arabidopsis thaliana	Picea abies	37
Pinus contorta	Pinus sylvestris	36
Betula pendula	Pinus sylvestris	36
Pinus sylvestris	Pinus taeda	36
Pinus nigra	Pinus sylvestris	35
Pinus contorta	Pseudotsuga menziesii	32
Picea abies	Pinus contorta	31
Picea abies	Pinus pinaster	30
Arabidopsis thaliana	Physcomitrella patens	30
Oryza sativa	Pinus taeda	29
Pinus sylvestris	Quercus robur	29
Picea abies	Picea sitchensis	28
…	…	…

Interesting to see that rice is mostly mentioned with Arabidopsis. Let’s explore that further. Below are species named in conjunction with Oryza sativa.

Species	Co-occurrences
Arabidopsis thaliana	43
Pinus taeda	29
Picea abies	24
Physcomitrella patens	23
Populus trichocarpa	21
Glycine max	20
Vitis vinifera	17
Picea glauca	17
Pinus pinaster	16
Selaginella moellendorffii	15
Pinus sylvestris	13
Triticum aestivum	12
Pinus contorta	10
Picea sitchensis	10
Ginkgo biloba	10
Pinus radiata	10
Ricinus communis	9
Amborella trichopoda	9
Medicago truncatula	9
Cucumis sativus	8
…	…

So attention seems divided between trees and more agriculture-related plants. More to explore for later.

View all posts in this series.

Sunday, September 3, 2017

ctj rdf: Part One

Below is part one of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.

ctj has been around for longer, and started as a way to learn my way into the ContentMine pipeline of tools, but turned out to uncover a lot of possibilities in further processing the output of this pipeline (1, 2).

The recent addition of ctj rdf expands on this. While there is a lot of data loss between the ContentMine output and the resulting rdf, the possibilities certainly are no less. This is mainly because of SPARQL, which makes it possible to integrate in other databases, such as Wikidata, without many changes in ctj rdf itself.

Here’s a simple demonstration of how this works:

We download 100 articles about aardvark (classic ContentMine example)
We run the ContentMine pipeline (norma, ami2-species, ami2-sequence)
We run ctj rdf

This generates data.ttl, which holds the following information:

Common identifier for each article (currently PMC)
Matched terms from each article (which terms/names are found in which article)
Type of each term (genus, binomial, etc.)
Label of each term (matched text)

Example data.ttl contents

Note that there are no links to Wikidata whatsoever. When we list, for instance, how often each term is mentioned in an article (in the dataset), we only have string values, a identifiers.org URI and some custom namespace URIs.

However, in this format, we can easily use the information in these papers in conjunction with the enormous amount of data in Wikidata with Federated Queries.

To accomplish this we first link the identifier in our dataset to the ones in Wikidata; then we link the matched text of the term to the taxon name in species in Wikidata. This alone already gives us a set of semantic triples where both values in every triple are linked to values in the extensive community-driven database that is Wikidata.

Example query, counting how often each species is mentioned, and mapping them to Wikidata

Results of the above query

Now say we want to list the Swedish name of each of those species. Well, we can, because that info probably exists on Wikidata (see stats below). And if we can’t find something, remember that each of those Wikidata values are also linked to numerous other databases.

Again, this is without having to change anything in the rdf output (to be fair, I forgot to list an article identifier in the first version of the program, but that could/should have been anticipated). Not having to add this data to the output has the added benefit of not having to make and maintain local dictionaries and lists of this data.

Some stats:

Number of articles: 100 (for reference)
Number of ‘term found in article’ statements: 1964
Number of those statements that map to Wikidata: 1293 (65.3% of total)
Number of mapped statements with Swedish labels: 1056 (81.7% of mapped statements, 53.8% of total)
Average number of statements per article: 19.64, 12.93 mapped

Note that not all terms are actually valid. A lot of genus matches are actually just capitalised words, and a lot of common species names are abbreviated, e.g. to E. coli, making it impossible to unambiguously map to Wikidata or any other database. This could explain the difference between found ‘terms’ and mapped terms.

View all posts in this series.

Sunday, August 27, 2017

Citation.js: Endpoint on RunKit

A while back I tweeted about making a simple Citation.js API Endpoint with RunKit.

Made a quick Citation.js API endpoint with @runkitdev : https://t.co/cdIc9bAEVX. Example https://t.co/GDSxaxTw4W
— Lars Willighagen (@larswillighagen) 23 juli 2017

Using the Express app helper and some type-checking code:

const express = require('@runkit/runkit/express-endpoint/1.0.0')
const app = express(exports)
const Cite = require('citation-js@0.3.0')
const docs = 'https://gist.github.com/larsgw/ada240ded0a78d5a6ee2a864fbcb8640'

const validStyle = style => ['bibtxt', 'bibtex', 'csl', 'citation'].includes(style) || /^citation-\w+$/.test(style)
const validType = type => ['string', 'html', 'json'].includes(type)

const last = array => array[array.length - 1]

const getOptions = params => {
  const fragments = params.split('/')
  let data, style = 'csl', type = 'html'

  // parse fragments
  // (got pretty complex, as data can contain '/'s)

  return {data, style, type}
}

app
.get('/', (_, res) => res.redirect(docs))
.get('/*', ({params: {0: params}}, res) => {
  const {data, style, type} = getOptions(params)
  const cite = new Cite(data)
  const output = cite.get({style, type})
  res.send(output)
})

Full code here. Makes an API like this:

https://$CODE.runkit.sh/$DATA[/$STYLE[/$TYPE]]

Where

$CODE is the API id (vf2453q1d6s5 in this case),
$DATA is the input data (DOI, Wikidata ID, or even a BibTeX string),
$STYLE (optional) is the output style,
and $TYPE (optional) is the output type (basically plain text vs html).

This makes it possible to link to a lot of Citation.js outputs:

Bib.TXT output from DOI: https://vf2453q1d6s5.runkit.sh/10.5281/zenodo.845934/bibtex
Generated citation from Wikidata: https://vf2453q1d6s5.runkit.sh/Q30000000/citation-apa
BibTeX output from Wikidata: https://vf2453q1d6s5.runkit.sh/Q30000000/bibtex
CSL-JSON output from Wikidata: https://vf2453q1d6s5.runkit.sh/Q30000000/csl

Citation.js Version 0.3 Released!

It’s been in beta since January 30, but here it is: Citation.js version 0.3. Below I explain the changes since the last blog post, and under that is a more complete change log. Also some upcoming changes.

Citation.js

Recent changes

Custom citation formatting

One of the remaining milestones for v0.3.0 was a better API for custom citation formatting, as outlined in the previous post. The exact implementation differs a bit, but is essentially the same. Docs can be found here.

Async

Cite#add() and Cite#set() now have asynchronous siblings, Cite#addAsync() and Cite#setAsync().
Both async and sync test cases are needed for full coverage, so I won’t just change one into another, as I previously suggested.

Parsers

The BibTeX parser got a big update, improving both the text-parsing code and the JSON-transforming code.
The Wikidata issue (sorry for the GIF) is fixed now, and I like the current API and code. The solution is inspired by the recent BibTeX refactoring.

Older changes

Also look at previous blog posts for more info. List goes from newer to older.

Features

Code coverage
Test case and framework updates
DOI support
CLI fixes
Custom sorting
Cite as an Iterator
Bib.TXT support
Give jQuery.Citation.js its own repository
Async support
Build size disc & dependency badges
Exposing internal functions and splitting up the main file into smaller files
Change of some internal Cite properties
Using an actual code style and use of ES6+
Exposing version numbers

Bugs

CLI bugs
Variable name typos
Output JSON is now valid JSON (I know, right?)
Wikidata ID parsing
Cite#getIds()
CLI not working
Wikidata tests not working
CSL not working
Browserify not working

Upcoming changes

Web scrapers and schema.org, Open Graph, etc.
Input parsing extensions
More coverage
More input formats, like better support for BibJSON, and new formats.

Saturday, August 26, 2017

Citation.js: DOI update and more stability

Finally, Citation.js supports DOIs. It took a while, but it’s finally there. One big ‘but’: synchronous fetching doesn’t work in Chrome. I’m still looking into that, but I should be recommending you to use Cite.async() anyway. Also in this blog post: more stability in Cite#get(), a welcome byproduct of using the DOI API, and looking forward (again).

DOI support

So, DOIs. That was (and is) a though one. Let me guide you through the process.

Initial development

I have been planning to add support for DOI input since the beginning of this year, going off the original feature request, and the pressure to implement it has grew right along me realising how important they are in the world of citations. Back then I thought it would be smart to query Wikidata for DOIs, as I had just finished some code on Wikidata ID input, that used regular API as well as the query part. More recently however, I learned about the Crossref API and, even more helpfully, the DOI Content Negotiation API, which combines the APIs from Crossref, DataCite and mEDRA. I’ll quote a piece from the docs:

Content negotiation allows a user to request a particular representation of a web resource. DOI resolvers use content negotiation to provide different representations of metadata associated with DOIs.

A content negotiated request to a DOI resolver is much like a standard HTTP request, except server-driven negotiation will take place based on the list of acceptable content types a client provides.

source

Basically, you just make a request to https://doi.org/$DOI (where hopefully obviously $DOI stands for the DOI you’re looking for) with an Accept header set to the format you want your data in. Conveniently, one of those formats is CSL-JSON, here charmingly named application/vnd.citationstyles.csl+json, but nonetheless the direct input format for Citation.js. This would allow me to only have to write code to extract DOIs from a whitespace-separated list (const dois = string.trim().split(/\s+/g)) and fetch them from the server (await Promise.all(dois.map(fetchDoi))), where fetchDoi is a simple function:

async function fetchDoi (doi) {
  const headers = new Headers({
    Accept: 'application/vnd.citationstyles.csl+json'
  })
  const response = await fetch('https://doi.org/' + doi, {headers})
  return response.json()
}

And Headers and fetch are built-in. Note that this uses some advanced syntax. Read more about async/await, Promise.all(), shorthand property names and const.

First problems

It wasn’t that simple. Heck, even getting the API to work took a (long) while and a lot of debugging other people’s code. You see, to make synchronous requests in Node.js I use sync-request, which is a synchronous wrapper around its asynchronous cousin, then-request, which is a wrapper around an also asynchronous but more low-level cousin, http-basic. They all use the same scheme of options. Because of that, sync-request passes its options down all the way to http-basic, so options from http-basic but not in sync-request can be used too. Turns out, http-basic has an option that removes all headers when a request is redirected, but it isn’t documented in sync-request. This removes, among other things, the Accept header, which, on omission, produces an HTTP 501 code not documented in the API.

Let’s just say I filed a feature request to mention this in the docs.

The actual code didn’t need much change, so soon I got the first JSON response, and since the API supports CSL-JSON out of the box, I didn’t need to do much else, except building some input recognition infrastructure.

CSL-JSON problems

However, nothing is perfect, and so is that out-of-the-box support of CSL-JSON. This was actually picked up quite well, and I’m happy with the outcome. Basically, extra code needed to fix this consists of two parts. One part transforms some invalid but essential parts of the API response to its CSL-JSON counterpart. The second part filters invalid and less essential parts out of the data, and basically acts as a safeguard for methods depending on certain variables having certain types. More on this later.

Conclusion

DOIs work. Because of all the ranting, I haven’t shown you the api, but it’s pretty familiar. To use the CLI, do this:

> npm i -g citation-js
# to install the new version of the command
# might need to be run with admin rights

> citation-js -t 10.1021/ja01577a030 -f string -s citation-apa
# Fetches data from doi: 10.1021/ja01577a030
# in the format: string; in the formatted style: apa
 
  Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 5441–5444. https://doi.org/10.1021/ja01577a030

To use the API, do this:

const Cite = require('citation-js')

// For synchronous use
const data = new Cite('10.1021/ja01577a030')
const output = data.get({
  type: 'string',
  style: 'citation-apa'
})

/* output =
  Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 5441–5444. https://doi.org/10.1021/ja01577a030
*/

// For asynchronous use
Cite.async('10.1021/ja01577a030').then(data => {
  const output = data.get({
    type: 'string',
    style: 'citation-apa'
  })
  
  /* output =
    Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 5441–5444. https://doi.org/10.1021/ja01577a030
  */
})

Stability in `Cite#get()`

As I mentioned earlier, I had to write some code to filter out invalid but non-essential props from the CSL-JSON. Otherwise certain methods used by Cite#get() throw errors, as they except these props to have certain types. Because of this filtering function, I don’t have to type-check ever again, I hope. The implementation is pretty simple. Specially structured props like author and other names, and issued and other dates, get caught and handled on their own. Other props get checked against a map of expected data types and removed if they don’t match, and if the bestGuessConversions flag is set, if they also can’t be converted reliably. Note that, as of now, all Cite.get.*() functions expect CSL-JSON cleaned by Cite.parse.csl().

Version 0.3.0

Implementing DOI input was one of the big milestones on the way to version 0.3.0, besides exposing internal methods, making async input parsing available and generally making the browser version and the CLI less bad. Now that those three things are done, I think it’s a good moment to see what still needs to be done for the v0.3.0 release.

Better custom formatted citations

Using formatted citations is fine, but when you try to use custom ones, either because the style guide you want or need to follow isn’t built-in, or because you have some special use case, things may get confusing. Currently, the API works (should work) like this: when you pass a template in the Cite#get() options (important: not in the new Cite(<data>, <options>) options), it gets registered in a register of citeproc-js engines. After that, you can use it by referencing the name you used in the regular style option. This would work better with an API similar to this:

const template = '...'
const templateName = 'custom'

// Suggested new API
Cite.CSL.template.register(templateName, template)

// Nothing new, will stay the same
const data = new Cite(...)
data.get({
  type: 'html',
  style: 'citation-' + templateName
})

Also, sometimes you just want to append or prepend some text or HTML frames, and implementing that with CSL templates takes time and effort and the result isn’t always that pretty, as templates don’t support direct HTML. There will be a new API for that too, probably like this:

// prepends the id like this: "[$ID]: "
const prepend = ({id}) => `[${id}]: `
// appends an altmetric widget
const append = ({DOI}) => `<span class='altmetric-embed' data-doi='${DOI}'></span>`

// Nothing new, will stay the same
const data = new Cite(...)
data.get({
  type: 'html',
  style: 'citation-' + templateName,
  // Suggested new API
  // both properties either function or constant string
  append: append,
  prepend: prepend
})

One thing I still have to look at is extending the wrapping HTML elements outputted by citeproc-js.

Use asynchronism better

Now that we have asynchronous input parsing with Cite.async(), it may be interesting to look at asynchronous output formatting as well. There are also some functions that can use Cite.async() but don’t, like Cite#add() and Cite#set(), and the test cases are generally synchronous, with some special cases for async, while it should probably be the other way around. Adding a coverage tester will assist in determining which synchronous test cases can go.

Refactoring

There are still some things I’d like to refactor, like the Wikidata parser. I don’t like it right now, especially the hack to merge different types of authors.

After v0.3.0

Since we’re almost at the end of v0.3.0, I’ll also outline some of my plans for future versions.

BibJSON input (already partly supported)
Extensions (input parsing, output, etc.)
Zotero input (maybe not worth the work, as they already support export to CSL)
Scraper (could just be getting DOI and using my existing work)
Coverage testing and an expansion of test cases
More work on the dependants:
- citation.js-replacer with shadow DOM
- Async citation.js-showdown & markdown extensions for other engines
- citation.js-form v0.3

Sunday, June 4, 2017

Microscopic photography: Part 1

I got to make photographs of pre-made specimens with a microscope, and I wanted to show some of them. This is part one of a longer series; there were a lot of specimens.

Below are details of the cross section of a basswood stem.

Detail

Outer layer

Center

Human artery tissue.

Human bone tissue (with some unidentified dark stuff).

More next week!

Monday, May 22, 2017

Citation.js: Async, Showdown and Bib.TXT

I worked on updates for Citation.js in the past few weeks, and I thought I'd go over them in this post.

Async

First of all, Citation.js now has support for asynchronous parsing, so it won't lag your app as much when it uses e.g. the Wikidata API. This is good, as synchronous requests are not only blocking your app, but also deprecated in most major browsers. This allows for more development on input formats where additional data needs to be fetched, like a DOI as input. I could have done this before, but this order made more sense to me. The API is simple:

// With callback

Cite.async($INPUT, $OPTIONS, function (data) {
  data // instance of Cite
  
  // Further manipulations...
  console.log(data.get())
})

// Or with a promise, like this:

Cite.async($INPUT, $OPTIONS).then(function (data) {
  data // instance of Cite
  
  // Further manipulations...
  console.log(data.get())
})

// Where $INPUT is input data and $OPTIONS is options
// $OPTIONS is optional in both cases

The promise-returning part is good when using the async await syntax, i.e. var data = await Cite.async($INPUT, $OPTIONS).

Extensions

Unfortunately, no actual extensions to the Cite object. (However, extensions in input parsing is planned in v0.4, and the API for output formatting is getting a rework in v0.3 (this and more)).

I'm talking about citation.js-showdown, citation.js-form and citation.js-replacer (coming soon). Citation.js-form is just jquery.citation.js in its own repository, citation.js-replacer will be a small tool to easily put references on your site, without having to bother with writing scripts, but, as a consequence, with fewer options. Citation.js-showdown, however, is a (functioning) Showdown plugin, that makes it easier to make references and bibliographies in a markdown document.

Bib.TXT

A few days ago I heard about Bib.TXT, and since implementing it in Citation.js only meant parsing a relatively simple text format (all the fields are the same as the already supported BibTeX), I thought: Why not? So there you have it, Bib.TXT support in Citation.js. That's good, because, among other things, you don't have to bother with weird (La?)TeX Unicode workarounds. And with Citation.js, you can easily convert Bib.TXT to BibTeX. Below as a command:

$ npm i -g citation-js@0.3.0-7
$ citation-js --input inputFile.txt --output-format string --output-style bibtex > out.bib

Homepage improvement

I changed some things on the Citation.js homepage. It should now be more responsive and have more room for expansion, and I added a demo and a small list of the "extensions" mentioned above. Oh, and the README has the same banner as the page now.

Citation.js banner

Saturday, April 29, 2017

New homepage design: New content, Material Design, Angularjs and more

My new homepage is finally finished (for now). I say finally, because it has taken a while before it actually felt done. You know how the first 90% of the work takes 90% of the time, but the last 10% of the work takes another 90% of the time? I just had to finish writing a single, small piece of text and it would be complete, but it took about 2 weeks. Anyway, that's all done now, and the result can be seen at larsgw.github.io.

Homepage

The design is based on the principles of Material Design using MDL, which is a CSS+JS framework for making Material Design sites. Most of the content is displayed with AngularJS, which uses runtime Pug-like HTML templating and more.

Extra content is added as well, including more projects and descriptions. Unfortunately, I had to get rid of my Twitter feed, as it didn't fit in the new design, but there are several sites to view my tweets, including Twitter and this exact site. Also, the old site (with my feed) is still available, and displays the new content as well: larsgw.github.io/old.

There are some other things hidden here and there, some really obvious, some not. That's basically it. Hope you like it.

Wednesday, April 26, 2017

Final Report: Analysing and visualising data from papers about conifers

Originally posted on the ContentMine blog.

Lars Willighagen, orcid:0000-0002-4751-4637

Final Report of my fellowship at the ContentMine.

Proposal

My proposal was to extract facts about various conifer species by analysing text from papers with software suited for analysing text and the tools provided by the ContentMine. These facts were then to be converted into JSON, and then viewable with an HTML (+CSS/JS) interface. Expected statements were like: 'Picea glauca is a species of the genus Picea', which could be parsed to the triple:Picea glauca; property:genus; subject:Picea.

Work

The main outcome of this project is a series of programmes converting tables from research articles into Wikidata statements. The workflow is as follows. First, papers matching a user-provided query are fetched by the ContentMine's getpapers. Second, the tables are extracted from the fetched papers and converted to assertions. This is done by filling empty cells in tables and then treating each row as an object, the first column being the name and the others property-value pairs. Different table designs are currently parsed in the same way, resulting in incorrect extraction of data, something that can be accommodated for by normalising the table structure beforehand. The resulting assertions are then converted to JSON, currently in a custom scheme, to allow the next steps.

Finally, the JSON assertions are visualized in an HTML GUI. This includes a stepper form (see picture) where you can curate the assertion, link identifiers, and add it to Wikidata.

Stepper form for curating assertions

Source code: https://github.com/larsgw/ctj-factvis
Demo: https://larsgw.github.io/ctj-factvis

Getting these assertions from text, as I proposed, was harder. Tools I expected to find included in ContentMine software were nowhere to be found, but were planned, so actually implementing them myself did not seem a good use of my time. Luckily, the literature corpus does not actually contain that many statements about physical properties of conifers in plain text as I originally expected: most are in tables, figures or in supplementary files, leading me to using those instead. The nice thing is that one of the main focuses of the ContentMine is parsing tables from PDF, so this will definitely be of general use.

Other work

During the project and to explore the design of the ContentMine, additional related components were developed:

ctj: program to convert and re-order AMI data to JSON, making it easier to read in JavaScript (mainly good for web applications);
ctj-cardlists: program to view AMI JSON (see above) in a Web GUI (demo); and
Citation.js: added functionality to parse BibJSON (used for quickscrape output) into CSL, for further formatting. See blog post.

These first two simplified handing AMI output in the browser, while the third makes it easier to display references in common formats.

Dissemination

All source code of the project outcomes is available on GitHub:

Progress was communicated during the project via the ContentMine Discourse page, on my personal blog (~20 posts), and on the general ContentMining blog (2 long posts).

Future work

The developed pipeline works but is not perfect.The pipeline to parse tables mentioned above requires further generalisation. This defines some logical next steps: fixes:

Finally adding it as an NPM module, making it (way) easier for people to use it;
Making searching easier in the HTML GUI (will need work further upstream too). Currently the list of assertions are split into pieces, making it hard to find anything. This can be fixed with a search index;
Normalising table structures to support more designs, rendering assertion extraction more reliable;
Making the process of curating assertions and linking identifiers easier by linking more identifiers, and showing context, i.e. the original tables; and
Some small performance and UX things.

Another important thing that is too big for a single bullet point, is annotating abbreviations and references in the document before extracting the tables. It's easier to curate statements like '[1] says this and this' when you know '[1]' references some known article. Another example: while a statement containing 'P. glauca' says nothing (there are 66+ species using that abbreviation), the article probably says which one it is somewhere outside the table, something that can be picked up if you annotate these before taking them out of context. This makes the interactive stepper form currently a necessity.

Evaluation

As noted, the work is far from done. Currently, it mainly shows a glimpse of what is possible had I spent more time on writing code. Short conclusions: CTJ is unpolished and slow. Because of a lack of customisation options, such as what data to use, you will almost always need to write custom code to not have to include tons of unnecessary data in your resulting JSON.

CTJ-Cardlists is actually pretty nice. It is slow, and it does not really show relations, but it does show an interesting overview of the literature corpus, like how often species are mentioned and with what they are mentioned together most of the time. You can easily draw reasonable conclusions like how often species names are misspelled. However, it would be more useful for this to have SQL queries or something similar. CTJ-Factvis shows even more potential, with the Wikidata integration. I do need to pay more attention to the fact that those assertions are alleged facts, and not regular ones, as I called them in earlier blog posts.

Fellowship

In general, the fellowship went pretty well for me. In retrospect, I did a lot of the things I wanted to do, even though that throughout the project it felt like there was so much left to do, and there is! I am really excited about the possibilities that emerged during the fellowship, and even in the last weeks. How cool would it be to extend this project with entire Web API's and more? This is, for a big part, thanks to the support, feedback, and input of the amazing ContentMine team during the regular meeting, and the quick responses to various software issues. I also enjoyed blogging about my progress on my own blog and on the ContentMine blog.

Sunday, March 26, 2017

Citation.js: BibJSON

Citation.js now supports BibJSON. How I did that without actually updating Citation.js? Well, apparently I supported it all along. I've supported the quickscrape output format since July last year, and that turned out to be BibJSON. How convenient. I'll update the demo and docs to reflect this revelation (currently it just says "quickscrape's JSON scheme"), and, now that I can find actual documentation, some improvements to the parser. It's a good candidate for a new output format too.

Some side notes on updates v0.3.0-0 to v0.3.0-2: these are prerelease updates, making it possible to use code before I have fixed all the issues and added all the features I promised for version 0.3. These updates fixed a lot of file organization problems; next updates will restructure the Cite object and fix tests.

Sunday, February 26, 2017

SVM: Developing a brand new 3D model language for the Web

When I started programming a few years ago, one of my first projects was linked to a project for school. Our team was making a presentation, and at some point I decided I wanted to program it myself, for the web. This resulted in the version of a project which, to this day, doesn't have a name (but soon will have).

After a few months I wanted to expand this. With some research and trial and error in the area of CSS 3D Transforms, I made a program similar to impress.js. Regular slides, in the web, in 3D, but with some unique things. Of course, it wasn't as refined yet, but this was only the start.

These unique things^[1] were 3D models, that you can view in your very own web browser. This on its own isn't that unique; There are plenty of libraries that do that, like A-Frame and xml3d. The special thing of my 3D models are that they're pure HTML and CSS. No canvas or WebGL needed, just a relatively modern browser. This opens a whole range of possibilities: viewing HTML structures and CSS styles in your default debugger; selecting text directly from the model; embedding audio, video, pages and anything else you can put in HTML; and, of course, the possibility to combine these models and the slides, as they use CSS 3D Transforms as well.

There are some disadvantages to this. HTML+CSS only allows for 2D elements, so to make a cube, you'd need 6 elements, a wrapper element and 10+ lines of CSS. And if you want to change the height, you need to change that one value in a lot of places. To solve that, I've started developing SVM, which stands for Scalable Vector (3D) Models. The name is very similar to SVG, and that's how I intended it: XML-based, simple, and with lots of built-in shapes.

The first (beta) version is still in development, but I can list some of the features:

Built-in solids: Cubes, cuboids, regular prisms, cylinders (to an extent), regular pyramids, regular right frustums, cones (to an extent), irregular convex and concave prisms using SVG Paths (!), and simple spheres
Planes, arcs, SVG Path-based curves
Groups (with names)
Transformation (translation, rotation, scaling) on any element (or at least the currently existing ones)
An element to include groups/components

Some of the features in action, including automatic shadow, which will adapt to the orientation of the elements in the future^[2]

It may not be as refined yet, but I think there's a lot of potential. Currently, the main hurdle is performance and the accompanying visual issues. For example, on my laptop, Chrome starts lagging and breaking apart^[3] at around 450 elements, and keep in mind that even a single cube results in six elements. Also keep in mind: it works fine for up to around 1200 elements on my PC.

^[1]: As far as I know, anyway.
^[2]: Currently, it doesn't, as you can see on the yellow cylinder next to the red cube.
^[3]:

As you can see, parts of elements just disappear

Saturday, February 4, 2017

Citation.js Version 0.2.11: jQuery and Looking Forward

A few weeks ago I published version 0.2.11 of Citation.js. The main change was the addition of jQuery.Citation.js, updated for version 2 of Citation.js. jQuery.Citation.js is a small jQuery plugin to build simple forms where you can fill in metadata, which gets translated to CSL-JSON. The configuration options are currently quite limited, so it only really works as a demo for Citation.js itself.

Citation.js logo

Since then (the current version is 0.2.15), the improvements have been bug fixes and the addition of (more support for) certain fields in both Wikidata and BibTeX. More interesting is what's going to happen in the next few releases.

Version 0.3

I've been planning version 0.3 for a while now and these things are probably going to be in it:

Asynchronous parsing: Parse input asynchronously, to allow asynchronous file requests, mainly for Wikidata.
More BibTeX: Publication type-specific fields, for example, should be parsed accordingly. I recently found a new guide to BibTeX, which should help as well.
Helper methods: Expose certain sub-parsing methods, like parseName and parseDate.
Structure Cite better: De-expose data that shouldn't be changed, add version information to Cite, etc.
Deprecate log: It's more or less useless. I'll add an option to enable it.
Structure code better: It's a mess, and things are broken. I'll change some file locations and add browserify etc.

Friday, December 22, 2017

Release

Input plugins

Adding a format

Changing parsers

Disabling a format

Docs

Backwards compatibility

Road-map

v0.4

v0.5

Monday, September 4, 2017

Sunday, September 3, 2017

Sunday, August 27, 2017

Recent changes

Custom citation formatting

Async

Parsers

Older changes

Features

Bugs

Upcoming changes

Saturday, August 26, 2017

DOI support

Initial development

First problems

CSL-JSON problems

More problems, from an unexpected source

Conclusion

Stability in Cite#get()

Version 0.3.0

Better custom formatted citations

Use asynchronism better

Refactoring

After v0.3.0

Sunday, June 4, 2017

Monday, May 22, 2017

Async

Extensions

Bib.TXT

Homepage improvement

Saturday, April 29, 2017

Wednesday, April 26, 2017

Lars Willighagen, orcid:0000-0002-4751-4637

Proposal

Work

Other work

Dissemination

Future work

Evaluation

Fellowship

Sunday, March 26, 2017

Sunday, February 26, 2017

Saturday, February 4, 2017

`v0.4`

`v0.5`

Stability in `Cite#get()`