Monday, April 30, 2018

Debugging: Deja Vu

I’m going to share some stories on debugging with you, because I’m proud of them. After writing up the first story, I’m no longer particularly proud, but I still want to share the story. Here’s the first: a bug that seemed quite familiar.

After some trouble with mocking API requests I decided that supporting mocking in the browser isn’t as important as supporting mocking at all, so I installed mock-require (I didn’t get proxyquire to work). Now, to confirm that the test bundle script actually omitted the API mocking code from the bundle, I loaded the test suite in the browser. Guess what? Errors everywhere! Or rather, error everywhere. Every test case gave the same error:

TypeError: Cannot assign to read only property 'length' of function 'function () {
          old.apply(self, arguments);
        }'
    at new Assertion (/build/test.citation.js:1810:30)
    at new Assertion (/build/test.citation.js:1799:25)
    at new Assertion (/build/test.citation.js:1799:25)
    at new Assertion (/build/test.citation.js:1799:25)
    at expect (/build/test.citation.js:1775:12)
    at Context._callee2$ (/build/test.citation.js:3582:35)
    at tryCatch (/build/citation.js:7891:17)
    at Generator.invoke [as _invoke] (/build/citation.js:8064:22)
    at Generator.prototype.(anonymous function) [as next] (/build/citation.js:7934:21)
    at step (/build/test.citation.js:3542:221)

Naturally, like a good programmer, I immediately googled the error message in conjunction with the various tools and frameworks I used for this bundle, instead of looking at the stack trace showing that something’s wrong with expect.js. Anyway, after some time, I found a GitHub issue describing exactly the problem I was having. Scrolling through the responses, I was stunned:

larsgw commented on Jul 29, 2017
+1: Same issue, but for all assertions

Somehow, I reported the same issue 10 months ago, but the site, running a two-month-old bundle, didn’t have the problem. Apparently, I had fixed the issue earlier, but forgotten about the solution, and I couldn’t figure out what that solution was. So I started debugging. First of all, the offending code wasn’t different from the code in any other working bundle. It registered a bunch of functions as properties of a function, and it choked on the length property. Makes sense, the length property isn’t writable on functions. But running some simple test code showed this shouldn’t be the problem:

let f = function () {}
console.log(f.length)	// 0
f.length = 3			// (no error)
console.log(f.length)	// 0

Sure, it didn’t actually do anything, but it didn’t throw an exception either. Besides, these were the exact same lines of code as in the GitHub repo, so how could that be the issue? The only thing I could do was compare with the working examples: diffing my bundle against the published one, which was difficult, because it’s generated code, and checking out commits until I found out which one worked, which was a pain because I repeatedly forgot to reinstall dependencies, something that could be critical.

Sidebar (minor spoilers): the previous time I ‘solved’ the issue, it was much easier, because it was the first time setting the system up, so instead of referring to older commits as working examples, I looked at the docs.

After a long while I realised what the earlier workaround had been: instead of bundling expect.js with Browserify, I included it as a script (as recommended) and created a wrapper that exposed the expect.js module and simply grabbed and exported the global expect variable exposed by the script. I thought this was because the order of requiring the scripts mattered, but some testing proved this wasn’t the case. No, actually including expect.js in a Browserify bundle with the Babelify transform on, or even simply running it through the Babel compiler, caused the error.
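A rough sketch of what such a wrapper can look like is below. It is not the actual file from the repository: the file name and the getGlobalExpect helper are made up for illustration, and it assumes expect.js was loaded through a script tag first, so that a global expect function exists.

```javascript
// expect-wrapper.js (hypothetical file name)
// Assumption: expect.js was loaded earlier through a <script> tag,
// which defines a global `expect` function.
const getGlobalExpect = (root = globalThis) => {
  if (typeof root.expect !== 'function') {
    throw new Error('Load expect.js as a script before the test bundle')
  }
  return root.expect
}

// The bundled wrapper would then end with something like:
// module.exports = getGlobalExpect()
```

The point of the indirection is that the bundler only ever sees this tiny wrapper, so expect.js itself is never run through Babel.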

Back to diffing bundles I guess: what are the differences between a babel-ed file and its source code if there isn’t really any syntax that needs to be transformed, or APIs that need to be polyfilled?

Turns out, not much. Between those files, the only real difference was the location (or with comments: false the existence) of comments, and some style differences caused by Babel’s code generator.

And 'use strict'.

Apparently, 'use strict' makes assignments that otherwise fail silently throw an error. If I had read that documentation earlier, or if I had paid proper attention to the Function.prototype.length docs (linked above), I would have known. Now it’s just a boring ending to a long journey. But hey, at least I learned some stuff.
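For reference, here is a minimal sketch of that difference. It is not the expect.js code. It runs the same function body twice through the Function constructor, once without and once with the 'use strict' directive. The constructor is used only so the outcome doesn’t depend on whether the surrounding file is itself loaded in strict mode.

```javascript
// The same test as before, wrapped so that an exception becomes a return value
const body = `
  const f = function () {}
  try {
    f.length = 3
    return 'no error, length is ' + f.length
  } catch (e) {
    return e.name
  }
`

// Functions created with `new Function` are sloppy unless the body opts in
const sloppyResult = new Function(body)()
const strictResult = new Function('"use strict";' + body)()

console.log(sloppyResult) // no error, length is 0
console.log(strictResult) // TypeError
```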


Solving this issue requires either a big change in the internals of a toolkit that hasn’t had an update in 4(!) years, or a workaround. I don’t want to use the workaround of including two extra files anymore, now that I know what is causing the issue, but the other workaround proves to be a problem itself, involving outdated documentation and a bug in Babelify. More on that later.

On a related note: I’m working on a new release for Citation.js, improving the parsing plugin system. The API should be pretty stable now, apart from the namespaces being prone to change, so I might change the schedule to one update with all current API changes instead.

Tuesday, February 6, 2018

DNA Project: Introduction

Introduction

I’ve always had an interest in biology. When I was 11, I started collecting all sorts of info on conifers into a “book”. More recent projects include my ContentMine project, and an ongoing project on microscopic photography.

Example photo (detail of the cross section of a basswood stem)

To my regret, however, I only recently got a decent introduction to the chemistry of DNA, and even then, there’s a lot left uncovered—obviously, it’s an introduction after all. It also raised a lot of questions, and gave me some new ideas, like: What if you were to thoroughly analyse an entire genome, figure out what does what, and then just… golfed it? The genome. What if you were to minify a genome? Those things were a bit out of scope of the class, however.

Luckily, there’s a thing called Profielwerkstuk in Dutch high school education. For your Profielwerkstuk you are supposed to spend at least 80 hours on research into a topic of your choice. This includes setting up the project, reading literature, usually performing an experiment, writing the article, and, in the end, presenting your project.

Of course, I could spend that time on Citation.js, or on one of my other projects, but I’m probably going to spend that time on them anyway, and I want to learn something new. A perfect opportunity for me to answer some of those questions, and learn some more about DNA in the process.

Later, I learnt about this paper, which is basically exactly the idea I mentioned above. Seeing the resources involved, it seems a bit out of the scope of my profielwerkstuk as well. However, there are plenty of other interesting things to figure out.

In the next few weeks, I will be reading literature to a) expand my basic knowledge on the subject past introduction-level and b) see what I could learn through experimentation, and which experiments I could perform in this project. My reading material includes:

I will also be putting out a series of blogposts, about inspirations, thoughts that came up while reading, ideas for the rest of the project, and more. Of course, I will also blog about the project itself.

In the meantime, if you are or know someone who can help me with actually performing those experiments (editing/sequencing DNA), please do contact me. General ideas, tips and feedback are of course also welcome.


This post is part of a series.

Sunday, February 4, 2018

Microscopic photography: Part 2

I promised more photos in the previous post, so here they are.

Penicillium:

Penicillium
Detail

Penicillium with conidiophores
With conidiophores

Penicillium (conidiophore)
Detail of conidiophores

More fungi:
Fungus cells

Fungus cells

Cat brains:
Cat brain cells


This post is part of a series.

Friday, December 22, 2017

Citation.js Version 0.4 Beta: New Docs and Input Plugins

It’s been a while. Really. But now it’s back, with a new release: v0.4.0-0, the v0.4 beta. Below I explain some of the changes in this release, and then the road-map of Citation.js for v0.4 and v0.5. Also, Citation.js has a DOI now:

DOI

Also, @jsterlibs tweeted about Citation.js.

Release

Input plugins

The main change in this release is the addition of input plugins. This is the first step towards releasing v0.4, as explained below. Although there’s specific documentation available here and here, I’ll put an example here as well, as the API can be confusing.

The API for registering input plugins will be changed once or twice before the release of v0.4, once to make it less complex and weird, and possibly a second time to incorporate output plugins.

Adding a format

Say you wanted to add the RIS format (I should probably do that sometime). First, let’s define a type, to set things off.

const type = '@ris/text'

I can’t really define the actual parser here, but I’ll add a variable in the example code.

const parse = data => { /* return CSL-JSON or some format supported by any other parser */ }

To test whether a given string is in the RIS format, we’ll use a regex. This regex matches if any line starts with two alphanumerical characters, two spaces and a hyphen. That’s a pretty fuzzy match; ideally, all lines should start with that sequence. However, this is just an example, and a proper regex would be less elegant.

const risRegex = /^\w{2}\ {2}-/gm
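To get a feel for what that regex accepts, here is a quick check against a few made-up inputs (none of them come from a real RIS export). The first part leaves out the g flag on purpose: with g, test() remembers its position between calls, as the last lines show. Whether that matters for how Citation.js calls the type checker is not something this sketch tries to answer.

```javascript
// Same pattern as above, minus the g flag
const fuzzyRis = /^\w{2}\ {2}-/m

// Hypothetical inputs, only for illustration
const risLike = 'TY  - JOUR\nTI  - Some title\nER  - '
const mixed = 'Dear reader,\nTY  - JOUR\nthis is not RIS'
const bibtexLike = '@article{key, title = {Some title}}'

console.log(fuzzyRis.test(risLike))    // true
console.log(fuzzyRis.test(mixed))      // true: one matching line is enough
console.log(fuzzyRis.test(bibtexLike)) // false

// With the g flag, test() is stateful: lastIndex moves after every match
const stateful = /^\w{2}\ {2}-/gm
const results = [stateful.test('TY  - JOUR'), stateful.test('TY  - JOUR')]
console.log(results) // [ true, false ]
```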

Now, let’s define the dataType of RIS input. When using regex to test input, the dataType is automatically determined to be 'String' anyway, but for the sake of being clear:

const dataType = 'String'

Now, to combine it all:

Cite.parse.add(type, {
  parse,
  parseType: risRegex,
  dataType
})

Changing parsers

Now, say someone else wrote that code above, and you need to use it without modifying it, but you want a better regex? That can be arranged:

Cite.parse.add(type, {
  dataType: 'String',
  parseType: /^(?:\w{2}\ {2}-.*\n)+(\w{2}\ {2}-.*)?$/g
})

Because the options dataType, parseType, elementConstraint and propertyConstraint are all treated as one thing, you need to pass every one of those when replacing the type checker. dataType is still not actually mandatory in this example, but is passed to demonstrate this.
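To see what the stricter pattern buys you, the sketch below runs both regexes on the same made-up inputs. Both are written without the g flag here, so that repeated test() calls are not affected by lastIndex. The check helper and the sample strings are only there for illustration.

```javascript
// Fuzzy: any line starting with the tag sequence is enough
const fuzzyRis = /^\w{2}\ {2}-/m
// Strict: every line has to start with the tag sequence
const strictRis = /^(?:\w{2}\ {2}-.*\n)+(\w{2}\ {2}-.*)?$/

// Hypothetical inputs
const risLike = 'TY  - JOUR\nTI  - Some title\nER  - '
const mixed = 'Dear reader,\nTY  - JOUR\nthis is not RIS'

const check = input => ({
  fuzzy: fuzzyRis.test(input),
  strict: strictRis.test(input)
})

console.log(check(risLike)) // { fuzzy: true, strict: true }
console.log(check(mixed))   // { fuzzy: true, strict: false }
```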

Disabling a format

There is currently no way to remove a format.

In this scenario, some plugin registered a new format to get citation data from github.com URLs. Unfortunately, it doesn’t work. Instead of recognising the URL as a GitHub-specific URL, the type is @else/url, the generic URL type. This is because the generic version was registered earlier, and there is not yet a category for generic types (see #104). If you don’t use the @else/url type checker anyway, you can disable it like this:

Cite.parse.add('@else/url', {dataType: 'String', parseType: () => false})

I’m not sure why the dataType is needed here, as you’re disabling the parser anyway, but it doesn’t seem to work without it.

Docs

Proper CLI docs

Recently, I’ve been updating the documentation on Citation.js, and it has been a pleasant experience, despite some hiccups in the documentation engine. There are tutorials now, and a lot of the JSDoc comments have been improved. I still want to improve some things in the theme, like clickable header links (similar to GitHub and NPM behaviour), and showing sub-tutorials in the navigation.

Backwards compatibility

This release should be largely backwards compatible. If there are any regressions, please report them in the bug tracker.

Road-map

v0.4

v0.4 “is about making it easier to expand on input and output formats, possibly by creating schemes and methods parsing those schemes that can, say, convert BibJSON to CSL JSON based on a JSON file, something that can be stored independently of implementation.”

The idea is to allow for all kinds of plugins, both in input parsing and output formatting, to be registered on the Cite object, and to treat all internal parsers and formatters the same. Currently, input parser plugins are possible (see above), although they will be improved before v0.4. Output plugins will be made soon too, and will be backwards-incompatible, because of the change in the output option format.

v0.5

Currently, the plan is to use JSON-LD as the internal format in v0.5, while still keeping CSL-JSON as the internal scheme. Any input should then be converted to a list (or object) of fields, which will be added to the internal data store. Each field should be transformed to the CSL-JSON scheme individually, and added as a copy. As a result, there won’t be data loss when certain fields aren’t available in CSL-JSON, but are in the input and output schemes. It should also make it easier to write parsers for e.g. BibJSON.

Of course, edge cases should be taken care of. For example, there is no Wikidata property for the CSL-JSON field original-title, but that information can still be derived from the labels combined with the P364 (original language) property.

Monday, September 4, 2017

ctj rdf: Relations Between Conifers Mentioned in Articles

Below is part two of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.


Here’s a more detailed example. We are using my dataset available at 10.5281/zenodo.845935. It was generated from 1000 articles that mention ‘Pinus’ somewhere. This one has 15326 statements, of which 8875 (57.9%) can be mapped, taking ~50 seconds. Now, for example, we can list the top 100 most mentioned species. The top 20:

Species                Hits
Pinus sylvestris       248
Picea abies            177
Pinus taeda            138
Pinus pinaster         120
Pinus contorta         96
Arabidopsis thaliana   91
Picea glauca           77
Pinus radiata          77
Pinus massoniana       72
Pseudotsuga menziesii  65
Oryza sativa           56
Pinus halepensis       56
Pinus ponderosa        55
Pinus banksiana        53
Pinus koraiensis       53
Picea mariana          52
Pinus nigra            51
Pinus strobus          46
Quercus robur          45
Fagus sylvatica        45

The top of the list isn’t surprising: mostly pines, other conifers, other trees, Arabidopsis thaliana which I’ve seen represented in pine literature before, and Oryza sativa or rice, which I haven’t seen before in this context.

Note that only 248 of the 1000 articles mention the top Pinus species. This may be because the query getting the articles was quite broad. Also note that this doesn’t take into account how often an article mentions a species; a caveat of the current rdf output.

Going off this list, we can then look at which non-tree or even non-plant species are named most often in conjunction with a given species, or, in this case, a genus. Top 20:

Species 1             Species 2              Co-occurrences
Picea abies           Pinus sylvestris       98
Picea abies           Pinus taeda            56
Arabidopsis thaliana  Pinus taeda            47
Picea glauca          Pinus taeda            43
Arabidopsis thaliana  Oryza sativa           43
Pinus pinaster        Pinus taeda            41
Pinus pinaster        Pinus sylvestris       41
Picea abies           Picea glauca           41
Arabidopsis thaliana  Picea abies            37
Pinus contorta        Pinus sylvestris       36
Betula pendula        Pinus sylvestris       36
Pinus sylvestris      Pinus taeda            36
Pinus nigra           Pinus sylvestris       35
Pinus contorta        Pseudotsuga menziesii  32
Picea abies           Pinus contorta         31
Picea abies           Pinus pinaster         30
Arabidopsis thaliana  Physcomitrella patens  30
Oryza sativa          Pinus taeda            29
Pinus sylvestris      Quercus robur          29
Picea abies           Picea sitchensis       28
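The counting behind a table like this is simple to sketch. The version below is not the query used for the numbers above. It is a small JavaScript illustration on a hypothetical three-article input, where each article is reduced to the set of species it mentions, so repeated mentions within one article count once.

```javascript
// Hypothetical input: article id -> species mentioned in it
const articles = {
  A1: ['Pinus sylvestris', 'Picea abies', 'Pinus sylvestris'],
  A2: ['Picea abies', 'Pinus sylvestris', 'Betula pendula'],
  A3: ['Pinus taeda', 'Picea abies']
}

const countCooccurrences = articles => {
  const counts = new Map()
  for (const mentions of Object.values(articles)) {
    // De-duplicate and sort, so every pair has one canonical order
    const species = [...new Set(mentions)].sort()
    for (let i = 0; i < species.length; i++) {
      for (let j = i + 1; j < species.length; j++) {
        const pair = species[i] + ' + ' + species[j]
        counts.set(pair, (counts.get(pair) || 0) + 1)
      }
    }
  }
  // Most frequent pairs first
  return [...counts.entries()].sort((a, b) => b[1] - a[1])
}

console.log(countCooccurrences(articles)[0]) // [ 'Picea abies + Pinus sylvestris', 2 ]
```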

Interesting to see that rice is mostly mentioned with Arabidopsis. Let’s explore that further. Below are species named in conjunction with Oryza sativa.

Species                     Co-occurrences
Arabidopsis thaliana        43
Pinus taeda                 29
Picea abies                 24
Physcomitrella patens       23
Populus trichocarpa         21
Glycine max                 20
Vitis vinifera              17
Picea glauca                17
Pinus pinaster              16
Selaginella moellendorffii  15
Pinus sylvestris            13
Triticum aestivum           12
Pinus contorta              10
Picea sitchensis            10
Ginkgo biloba               10
Pinus radiata               10
Ricinus communis            9
Amborella trichopoda        9
Medicago truncatula         9
Cucumis sativus             8

So attention seems divided between trees and more agriculture-related plants. More to explore for later.


View all posts in this series.

Sunday, September 3, 2017

ctj rdf: Part One

Below is part one of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.


ctj has been around for longer, and started as a way to learn my way into the ContentMine pipeline of tools, but turned out to uncover a lot of possibilities in further processing the output of this pipeline (1, 2).

The recent addition of ctj rdf expands on this. While there is a lot of data loss between the ContentMine output and the resulting rdf, the possibilities certainly are no less. This is mainly because of SPARQL, which makes it possible to integrate with other databases, such as Wikidata, without many changes in ctj rdf itself.

Here’s a simple demonstration of how this works:

  1. We download 100 articles about aardvark (classic ContentMine example)
  2. We run the ContentMine pipeline (norma, ami2-species, ami2-sequence)
  3. We run ctj rdf

This generates data.ttl, which holds the following information:

  • Common identifier for each article (currently PMC)
  • Matched terms from each article (which terms/names are found in which article)
  • Type of each term (genus, binomial, etc.)
  • Label of each term (matched text)
Example data.ttl contents

Note that there are no links to Wikidata whatsoever. When we list, for instance, how often each term is mentioned in an article (in the dataset), we only have string values, an identifiers.org URI and some custom namespace URIs.

However, in this format, we can easily use the information in these papers in conjunction with the enormous amount of data in Wikidata with Federated Queries.

To accomplish this we first link the identifier in our dataset to the ones in Wikidata; then we link the matched text of the term to the taxon name of species in Wikidata. This alone already gives us a set of semantic triples where both values in every triple are linked to values in the extensive community-driven database that is Wikidata.

Example query, counting how often each species is mentioned, and mapping them to Wikidata
Results of the above query

Now say we want to list the Swedish name of each of those species. Well, we can, because that info probably exists on Wikidata (see stats below). And if we can’t find something, remember that each of those Wikidata values is also linked to numerous other databases.

Again, this is without having to change anything in the rdf output (to be fair, I forgot to list an article identifier in the first version of the program, but that could/should have been anticipated). Not having to add this data to the output has the added benefit of not having to make and maintain local dictionaries and lists of this data.

Some stats:

  • Number of articles: 100 (for reference)
  • Number of ‘term found in article’ statements: 1964
  • Number of those statements that map to Wikidata: 1293 (65.8% of total)
  • Number of mapped statements with Swedish labels: 1056 (81.7% of mapped statements, 53.8% of total)
  • Average number of statements per article: 19.64, 12.93 mapped
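The percentages and averages follow directly from the raw counts. As a small sketch for recomputing them (the variable names and the percent helper are my own, only the four counts come from the list above):

```javascript
// Raw counts from the list above
const articles = 100
const statements = 1964
const mapped = 1293
const withSwedishLabel = 1056

// Hypothetical helper: share of `part` in `whole`, rounded to one decimal
const percent = (part, whole) => (100 * part / whole).toFixed(1) + '%'

const stats = {
  mappedOfTotal: percent(mapped, statements),
  swedishOfMapped: percent(withSwedishLabel, mapped),
  swedishOfTotal: percent(withSwedishLabel, statements),
  statementsPerArticle: statements / articles,
  mappedPerArticle: mapped / articles
}

console.log(stats)
```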

Note that not all terms are actually valid. A lot of genus matches are actually just capitalised words, and a lot of common species names are abbreviated, e.g. to E. coli, making it impossible to unambiguously map to Wikidata or any other database. This could explain the difference between found ‘terms’ and mapped terms.


View all posts in this series.

Sunday, August 27, 2017

Citation.js: Endpoint on RunKit

A while back I tweeted about making a simple Citation.js API Endpoint with RunKit.

Using the Express app helper and some type-checking code:

const express = require('@runkit/runkit/express-endpoint/1.0.0')
const app = express(exports)
const Cite = require('citation-js@0.3.0')
const docs = 'https://gist.github.com/larsgw/ada240ded0a78d5a6ee2a864fbcb8640'

const validStyle = style => ['bibtxt', 'bibtex', 'csl', 'citation'].includes(style) || /^citation-\w+$/.test(style)
const validType = type => ['string', 'html', 'json'].includes(type)

const last = array => array[array.length - 1]

const getOptions = params => {
  const fragments = params.split('/')
  let data, style = 'csl', type = 'html'

  // parse fragments
  // (got pretty complex, as data can contain '/'s)

  return {data, style, type}
}

app
.get('/', (_, res) => res.redirect(docs))
.get('/*', ({params: {0: params}}, res) => {
  const {data, style, type} = getOptions(params)
  const cite = new Cite(data)
  const output = cite.get({style, type})
  res.send(output)
})

Full code here. Makes an API like this:

https://$CODE.runkit.sh/$DATA[/$STYLE[/$TYPE]]

Where

  • $CODE is the API id (vf2453q1d6s5 in this case),
  • $DATA is the input data (DOI, Wikidata ID, or even a BibTeX string),
  • $STYLE (optional) is the output style,
  • and $TYPE (optional) is the output type (basically plain text vs html).
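The fragment parsing that was left out of the snippet above could look roughly like the sketch below. This is not the code from the full version. It assumes that the optional $STYLE and $TYPE can be recognised by peeling valid values off the end of the path, so that everything before them (slashes included) is treated as the data. The inputs in the usage lines are arbitrary examples.

```javascript
const validStyle = style => ['bibtxt', 'bibtex', 'csl', 'citation'].includes(style) || /^citation-\w+$/.test(style)
const validType = type => ['string', 'html', 'json'].includes(type)

const last = array => array[array.length - 1]

// Sketch only: peel the optional trailing fragments off the path
const getOptions = params => {
  const fragments = params.split('/')
  let style = 'csl'
  let type = 'html'

  if (fragments.length > 2 && validType(last(fragments)) && validStyle(fragments[fragments.length - 2])) {
    // .../$STYLE/$TYPE
    type = fragments.pop()
    style = fragments.pop()
  } else if (fragments.length > 1 && validStyle(last(fragments))) {
    // .../$STYLE
    style = fragments.pop()
  }

  // Whatever is left is the data, which may itself contain slashes (e.g. a DOI)
  return { data: fragments.join('/'), style, type }
}

console.log(getOptions('10.5281/zenodo.845935'))
// { data: '10.5281/zenodo.845935', style: 'csl', type: 'html' }
console.log(getOptions('Q1/bibtex/string'))
// { data: 'Q1', style: 'bibtex', type: 'string' }
```

One consequence of this approach is that data which happens to end in a valid style name, such as a URL ending in /csl, would be misread. That is probably part of why the real parsing got more complex.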

This makes it possible to link to a lot of Citation.js outputs: