Sunday, November 27, 2016

Weekly Report 12: Including table facts in Wikidata

This week, the big achievement is the addition of a multi-step form for adding last week's semantic triples to Wikidata with QuickStatements, which we have talked about before as well. The new '+' icon in table rows now links to a page where you can curate the statement and add Wikidata IDs where necessary. At the last step, you get a table of the existing data, the added identifiers and, soon, their Wikidata labels. From there, you can open the statement in QuickStatements, where it will be added to Wikidata. This is a delicate procedure; it's important to keep an eye on the validity of the statement.

The multi-step form in action (see tweet)
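
At that last step, each curated triple has to be serialised into something QuickStatements understands. A minimal sketch of that serialisation step, assuming the tab-separated QuickStatements syntax (the property and item IDs below are placeholders, not what the form actually produces):

// Build a QuickStatements line from a curated triple.
// Item targets stay as Q-ids; plain text values are double-quoted.
function toQuickStatement (subject, property, object) {
  var value = /^Q\d+$/.test(object) ? object : '"' + object + '"'
  return [subject, property, value].join('\t')
}

// Placeholder IDs, purely for illustration
console.log(toQuickStatement('Q1234', 'P2975', 'Q5678'))
// → "Q1234	P2975	Q5678" (tab-separated)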

Take the following table:

Fungus Insect Host
Fungus A Insect 1 Host 1
Fungus B Insect 2 Host 2

This is currently interpreted (by the program) as "Fungus A has Insect 1 as Insect and Host 1 as Host", while the actual meaning is "Fungus A is spread by the insect Insect 1, whose host is Host 1" (see this table). While stating that Host 1 is the host of Fungus A is arguably correct, the same misinterpretation occurs in much worse forms as well, and it's hard to account for.

On top of that, table values often don't contain the exact data, but abbreviations. It shouldn't be too hard to annotate abbreviations with their meaning before parsing the tables, but actually linking them to Wikidata entities is still a problem. This is, obviously, the next step, and I'll continue to work on it over the next few weeks. One thing I could do is search for entities based on their label. A problem with that is that I would have to make thousands of calls to the Wikidata API in a short period of time, so I'll probably create a dictionary with one big call, if the resulting size allows it.
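
A per-term lookup could look something like the sketch below, using the wbsearchentities action of the Wikidata API (a rough sketch; the parameters and error handling I end up needing may well differ):

var https = require('https')

// Look up one label via the Wikidata API. One request per term is
// exactly what makes this approach expensive for thousands of terms.
function searchEntity (label, callback) {
  var url = 'https://www.wikidata.org/w/api.php' +
            '?action=wbsearchentities&format=json&language=en' +
            '&search=' + encodeURIComponent(label)

  https.get(url, function (res) {
    var body = ''
    res.on('data', function (chunk) { body += chunk })
    res.on('end', function () {
      var hits = JSON.parse(body).search || []
      callback(hits.length ? hits[0].id : null)
    })
  })
}

searchEntity('Pinus sylvestris', function (id) {
  console.log(id) // a Q-id, if a match was found
})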

There is a pull request pending that would solve the issue norma has with tables, so I'll incorporate that into the program as well, making it easier for me as some things will already be annotated.

Conclusion: the 'curating' part of adding the statements to Wikidata is currently *very* important, but I'll do as much as possible to make it easier in the next few weeks.

Sunday, November 20, 2016

Citation.js Version 0.2.10: BibTeX and Travis

The new update (Citation.js v0.2.10) doesn't have a big impact on the API, but a lot has changed in the back end ("back end" as in the helper functions that are called when the API is used). First of all, there are Travis build tests now. They don't cover edge cases yet, but they cover the basics.

Testing the basic specs

The main thing is the new BibTeX I/O system. It isn't completely new, but a lot has changed. Before, BibTeX files were parsed with RegEx, you know, the thing you don't want to be parsing important things with. This worked fine for the kind of BibTeX I use in the test cases, but it failed in unexpected ways on different syntax, and it was a tedious piece of code, as RegEx usually is.

Now I have a character-by-character parser (taking escaped characters into consideration) that outputs clear error messages and recovers already parsed data when syntax errors occur. This parser converts BibTeX files into a JSON representation (which I call BibTeX-JSON), which can then be converted to CSL-JSON. There are some improvements there as well, but not as many.
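
The real parser is more involved, but the core idea of walking over the string character by character, tracking brace depth and escaped characters, is roughly this (a simplified sketch, not the actual Citation.js code):

// Read one brace-delimited BibTeX value starting at index `start`
// (which should point at the opening '{'), honouring nested braces
// and backslash-escaped characters.
function readBracedValue (str, start) {
  var depth = 0
  var value = ''

  for (var i = start; i < str.length; i++) {
    var char = str[i]

    if (char === '\\') {        // escaped character: copy it verbatim
      value += char + str[i + 1]
      i++
    } else if (char === '{') {
      if (depth++ > 0) value += char
    } else if (char === '}') {
      if (--depth === 0) return { value: value, end: i }
      value += char
    } else {
      value += char
    }
  }

  throw new SyntaxError('Unexpected end of input while reading value')
}

console.log(readBracedValue('{Parsing {BibTeX} with \\{escapes\\}}', 0).value)
// → "Parsing {BibTeX} with \{escapes\}"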

The output is done by first converting CSL-JSON to BibTeX-JSON; this process is improved a bit as well. The BibTeX-JSON can then be converted to BibTeX. This now supports escaping of syntax characters (e.g. '|' into '{\textbar}').
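
The escaping step itself is little more than a lookup table for the syntax characters (a trimmed-down sketch; the real list in Citation.js covers more characters than the few shown here):

// Map a few BibTeX/LaTeX syntax characters to safe commands.
// Illustrative subset only.
var syntaxChars = {
  '|': '{\\textbar}',
  '<': '{\\textless}',
  '>': '{\\textgreater}',
  '~': '{\\textasciitilde}'
}

function escapeBibtexValue (value) {
  return value.replace(/[|<>~]/g, function (char) {
    return syntaxChars[char]
  })
}

console.log(escapeBibtexValue('cats | dogs'))
// → "cats {\textbar} dogs"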

Sunday, November 13, 2016

Weekly Report 11: Making Object-Property-Value triples

In Weekly Report 10 I talked about searching for answers to the question "What height does a grown Pinus sylvestris normally have?". In the post, I looked at some of the articles returned by the query "Pinus sylvestris"[Abstract] AND height, and found interesting information in the tables. The next step was to extract this information.

So that's what I did. The first step was downloading full-text XML files of the articles with getpapers. Secondly, I wanted to normalise the XML to sHTML with norma, but it currently has an issue with tables, so I resorted to parsing the tables directly from the XML files. This wasn't really a downgrade. The main difference is arbitrary XML tags instead of e.g. <i> and <sup>. Although such markup is mostly about layout, it still conveys meaning: in the case of italics, for example, the text content is probably a species name.

There are more problems, but I will go through those later. First off, how do I get the computer to extract information from tables? Well, generally, tables have the following structure:

Subject Type Property 1 Property 2
Object A Value A1 Value A2
Object B Value B1 Value B2

A small program can convert this to the following JSON:

[
  {
    "name": "Object A",
    "type": "Subject Type",
    "props": [
      {
        "name": "Property 1",
        "value": "Value A1"
      },
      {
        "name": "Property 2",
        "value": "Value A2"
      }
    ]
  },
  {
    "name": "Object B",
    "type": "Subject Type",
    "props": [
      {
        "name": "Property 1",
        "value": "Value B1"
      },
      {
        "name": "Property 2",
        "value": "Value B2"
      }
    ]
  }
]
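
The conversion itself is mostly a matter of mapping the head row and the body rows onto that structure. A sketch of the idea, assuming the table has already been read into an array of rows of cell strings:

// Convert a table (array of rows, each row an array of cell strings)
// with a head row into the object/property structure shown above.
function tableToTriples (rows) {
  var head = rows[0]

  return rows.slice(1).map(function (row) {
    return {
      name: row[0],
      type: head[0],
      props: head.slice(1).map(function (propName, index) {
        return { name: propName, value: row[index + 1] }
      })
    }
  })
}

var table = [
  ['Subject Type', 'Property 1', 'Property 2'],
  ['Object A', 'Value A1', 'Value A2'],
  ['Object B', 'Value B1', 'Value B2']
]

console.log(JSON.stringify(tableToTriples(table), null, 2))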

This is pretty easy. Real problems occur when tables don't use this layout. Some things can be accounted for. Take the example (see figure below) where a table doesn't have a head row, so I can't read the properties. The solution is to look at the previous table and copy its properties. Or when there are empty values in the left column: usually this implies the row above contains all the necessary information, and it is done as a way to group table rows. Again, the solution is to do what a reader would do: look at the text in the cell above.

Example of split tables. From Ethnoveterinary medicines used for ruminants in British Columbia, Canada (DOI:10.1186/1746-4269-3-11)

But there are also things that can't (easily) be accounted for. For example, when the table is too long for the page but narrow enough for two halves to fit side by side: instead of aligning the two halves beside each other, they break the table apart and stick it back together, but in the wrong place. Another very common problem is a transposed table: the properties are in the left column, and the objects in the top row. These problems aren't that hard to fix; the hard part is for the computer to find out which table layout is currently being used.

JSON processing

But let's drop these issues for a moment, and assume enough data IS valid. We now have the JSON containing triples. I wanted to transform this into a format similar to ContentMine facts, as used by factvis. Because the ContentMine fact format is, for now, only used for identified terms, I had to come up with some new parts. It currently looks like this:

{
  "_id":"AAAAAAB6H6",
  "_source": {
    "term":"Acer macrophyllum",
    "exact":"Acer macrophyllum Pursh (Aceraceae) JB043",
    "prop": {
      "name":"Local name"
    },
    "value":"Big leaf maple",
    
    "identifiers": {
      "wikidata":"Q599523"
    },
    "documentID":"AAAAAAB6H5",
    "cprojectID":"PMC1831764"
  }
}

We have the fact ID, just as with ContentMine facts, and the _source object, containing the actual data. The fact ID is unique for every object, not for every triple; this is to preserve context. The _source has a documentID and a cprojectID, both of which say what article the triple was found in. exact has the exact text found in the cell, while term has the extracted plant name, if present; otherwise, it contains the cell's contents. prop contains the text found in the property cell and, in some cases, other data such as a unit, if found. value contains the text in the value cell.

The property identifiers contains some IDs, if the term is identified. This is done by creating a dictionary of 1.8 million species names linked to Wikidata IDs, directly from Wikidata, and matching these against the terms. Props, values and other types of terms will hopefully get identifiers soon too.
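
One way to build such a dictionary in a single big call is to ask the Wikidata SPARQL endpoint for everything with a taxon name (P225). A sketch of the idea, with a LIMIT because the full 1.8-million-entry result is obviously too large for a casual request (not the script I actually used):

var https = require('https')
var url = require('url')

// Ask the Wikidata SPARQL endpoint for (taxon name, Q-id) pairs.
// P225 is the "taxon name" property; LIMIT keeps this sketch small.
var query = 'SELECT ?item ?name WHERE { ?item wdt:P225 ?name } LIMIT 100'

var options = url.parse(
  'https://query.wikidata.org/sparql?format=json&query=' +
  encodeURIComponent(query)
)
options.headers = { 'User-Agent': 'species-dictionary-sketch' }

https.get(options, function (res) {
  var body = ''
  res.on('data', function (chunk) { body += chunk })
  res.on('end', function () {
    var dictionary = {}

    JSON.parse(body).results.bindings.forEach(function (row) {
      // row.item.value is a full entity URI; keep only the Q-id
      dictionary[row.name.value] = row.item.value.split('/').pop()
    })

    console.log(Object.keys(dictionary).length + ' names mapped')
  })
})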

Further processing

To quickly see the results, I made some changes to factvis so that it works with my new fact data. Here it is. It currently has three sample data sets, each with 5000 triples. I have ten of these sets myself, and I will release more when the data is more reliable; currently, it's just a demo.

Triple visualisation demo

However, we can already see the outlines of what is possible. Even if the data isn't going to be reliable, I can set up a service (with QuickStatements) for people to curate the data and add it to Wikidata with a few clicks. And if 50,000 triples (some of which may be invalid) don't sound like enough to set up such a service, remember that they come from a simple query returning only 174 papers.

Friday, October 28, 2016

Citation.js Version 0.2: NPM package, CSL and more

In the last two weeks I've been busy with making Version 0.2 of Citation.js. Here I'll explain some of the changes and the reasoning behind them.

In the past months I've updated Citation.js several times, and the changes included a Node.js program for the commandline and better Wikidata input parsing. While I was working with the "old" code, I noticed some annoying issues in it.

One of the biggest things was the internal data format. When Cite(), the main function, parses input, it converts it into JSON with a standardised scheme, which is used everywhere else in the program, e.g. for sorting and outputting. The scheme I used was something I made up to accommodate the features the program had back when there was next to no input parsing, and you were expected to input JSON directly, either by file or by webform. It wasn't scalable at all, and some of the methods were patched so much they only worked in specific test cases.

Old interface of Citation.js (pre-v0.1). It would fetch the first paragraph of Wikipedia about a certain formatting style. Many of the supporting methods of this version stayed in v0.1.

Now I use CSL-JSON, the scheme used by, among others, citeproc-js, and the standard described by the Citation Style Language. It is designed by professionals, or at least by people more qualified to make standards. It is quite similar to my old scheme, with some exceptions. Big advantages are the way it stores date and person information. Before, I had to hope users provided names in the correct format; now it doesn't matter, as they get parsed to CSL. The same goes for dates. Another advantage is the new output potential: besides outputting CSL-JSON, it is now possible to use citeproc-js directly, without extra conversion.

Using a new data format also meant a lot of cleanup in the code. Almost all methods had to be rewritten to account for it, but this was mostly a chance to write them more properly. Now only Cite() exists in the global scope, which is good, because it means other parts don't take up variable names, etc. The entire program is now optimised for both browser and Node.js use, although it uses synchronous requests. From the perspective of the program, synchronous requests are necessary; however, users can bypass this for a big part. It is mostly used for Wikidata parsing. An example:

Let's take the input "Q21972834". This is a Wikidata Entity ID, and it points to Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data (10.1093/bioinformatics/btt178). If Cite() only has the ID, it has to fetch the corresponding data (JSON). Because Cite() is called as a function and is expected to return something, it has to make the request synchronously. However, if the user fetches the data asynchronously and calls Cite() in the callback, that is bypassed:

var xhr = new XMLHttpRequest();

xhr.open(
  /* Method */ 'GET',
  /* URL    */ 'https://www.wikidata.org/wiki/Special:EntityData/Q21972834.json',
  /* Async  */ true
)

xhr.addEventListener( 'load', function () {
  var data = this.responseText
  
  var cite = new Cite( data )
  
  // Etc...
} );

xhr.send( null )

The JSON gets to Cite() with only async(hronous) requests. The problem is that this JSON doesn't contain everything: instead of the name of the journal it's published in, it contains a numeric reference to it. To get that name, I have to make another request, which has to be synchronous as well. I hope there is some way in the Wikidata API to turn references off and names (or "labels") on, but I haven't found one yet. That being said, I had to search a long time to find the on-switch for cross-domain requests in the Wikidata API as well, so it might be hidden somewhere. If that's the case, synchronous requests can be bypassed everywhere, which would be nice, as browsers are on the verge of dropping support for them.

Probably the biggest news is that it is now an npm (Node Package Manager) package (link). This means you can download the code without having to clone the git repository or copy the source file. It's even available online, although the default npm demo seems to be broken. Luckily, I have a regular demo as well. As of writing, the npm page says the package has been downloaded 234 times already, but that number has been the same for a day, so I guess there is an issue with npm. If not, that's really cool.

Sunday, October 9, 2016

Weekly Report 10: Visualising facts and asking questions

Earlier this week, tarrow published factvis, short for fact visualisation. I decided to have a go with the design, and I made this, in the style of cardlists. Note: If my version and tarrow's version of factvis look very similar, my changes are probably pushed to the master branch already.

Screenshot of my factvis design

The facts being visualised come from the ContentMine. It publishes facts about things related to zika, extracted from papers, on Zenodo. A fact has the following structure:

{
  "_index": "facts",
  "_type": "snippet",
  "_id": "AVdDntnH_8VqgcuJwvpW",
  "_score": 1,
  "_source": {
    "prefix": "icle-title>Mosquitos (Diptera: Culicidae) del ",
    "post": "</article-title><source>Entomol Vect</source><y",
    "term": "Uruguay",
    "documentID": "AVdDnq-oJ9hGurOzZIZE",
    "cprojectID": ["PMC4735964"],
    "identifiers": {
      "contentmine": "CM.wikidatacountry8",
      "wikidata": "Q77"
    }
  }
}

As you can see, it has a fact ID, and next to it the actual fact. The fact consists of the found term ("Uruguay"), the text before and after the term (prefix and post), the document it was found in, and identifiers, saying what the term actually means. The identifiers are a ContentMine ID and a Wikidata Entity ID.

That's all it is, for now. Still pretty cool, to distinguish special words and abbreviations from normal ones, and to link them with established identifiers like those from Wikidata.

Conifers

The second topic today is asking biological questions about conifers. Now that I know most parts of the ContentMine pipeline with all its extensions, I can start to think of what I want to learn about conifers with it. The first questions are simple ones, or at least ones with simple facts as answers. Take "What height does a grown Pinus sylvestris normally have?". I know the answer is the value of the property height of the tree, and that the value is measured in some length unit.

Now all I have to do is search for the answer. Not that easy, but doable. First, I see if there actually are papers about the height of trees under normal conditions. So let's search EUPMC with the following query:

"Pinus sylvestris"[Abstract] AND height

With this, it searches for articles with the exact text "Pinus sylvestris" in the abstract, and with the word "height" anywhere in the article. The first article found is, at first sight, a bit unclear in whether it has an interesting answer, so let's move on to the second one. Remember, we are only taking a peek at what's inside. The second article, however, looks more promising: the first table already contains exactly what we're looking for, and more than that. Apart from the height of Pinus sylvestris it also has the diameter, and all this for two other conifer species as well.

The same goes for the third article. While its first table doesn't have height data, it does have the diameter of several species in separate age groups, not to mention properties I hadn't even thought of, like bark crevice depth and canopy cover.

(I tweeted about the fourth one, as there were some funny stylesheet issues)

And if only three papers yield so much, imagine what can be done with more. The search I showed had 78 results, and when combined with searches for all the other species, there should be hundreds of articles with answers to just one simple question. And with the ContentMine, I can "read" all those articles and collect and summarise all these facts in a matter of hours. Of course, I'll need to make some specialised programs to do exactly what I want, so that's exactly what I'm going to do over the next months.

Saturday, October 1, 2016

Parsing invalid JSON

When developing Citation.js, I needed to get a JavaScript Object from a string representing one. There are several methods for this, and the one that comes to mind first is JSON.parse(). However, this didn't work. Consider the following code:

{
  type: 'article',
  author: ['Chadwick D. Rittenhouse','Tony W. Mong','Thomas Hart'],
  editor: ['Stuart Pimm'],
  year: 2015,
  title: 'Weather conditions associated with autumn migration by mule deer in Wyoming',
  journal: 'PeerJ',
  volume: 3,
  pages: [1,21],
  doi: '10.7717/peerj.1045',
  publisher: 'PeerJ Inc.'
}

It contains data in the standard format Citation.js uses, and it's written in JavaScript, not JSON. Valid JSON requires double quotes (") for all strings, and property names wrapped in, again, double quotes. Valid JSON is valid JavaScript too, but I prefer to write it like this. To accommodate myself and other people preferring the simpler syntax, I had to come up with something else.

Option two is eval(), a function that parses JavaScript in strings and executes it on the fly. However, using eval is usually strongly discouraged, for multiple reasons, one being code injection. Here are two strings. Both are valid JavaScript objects when pasted directly into a script; only the second is valid JSON.
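
Two such strings could look like this (a hypothetical reconstruction for illustration; the embedded demo uses its own examples):

// Valid JavaScript, but not valid JSON: unquoted property names,
// single quotes, and a function call that eval() will happily execute.
var invalidString = "{ title: alert('Foo'), year: 2015 }"

// Valid JSON (and therefore also valid JavaScript): data only.
var validString = '{ "title": "Bar", "year": 2015 }'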

When the first gets processed by eval, it alerts a string (which may be suppressed in the demo above). Any code can be put where the alert() function is called. The first can't be processed by JSON.parse(), so we skip to processing the second with eval. This doesn't alert "Bar", as opposed to the first alerting "Foo". The second can be processed by JSON.parse(), and when it is, it outputs the expected data. As you can see, only JSON.parse() never permits code injection, as it throws an error when the input isn't valid JSON, and valid JSON can't contain code.

Better use JSON.parse() then. But how are we going to parse invalid JSON without code injection? I hate to say it, but with regex. I know you shouldn't parse anything with regex, but I don't really parse it, and when it fails, JSON.parse() will throw an error anyway. I use the following regex patterns (in this order):

  1. /((?:\[|:|,)\s*)'((?:\\'|[^'])*?[^\\])?'(?=\s*(?:\]|}|,))/g
    Changes single-quoted strings to double-quoted ones. Explanation and example on Regex101
  2. /((?:(?:"|]|}|\/[gmi]|\.|(?:\d|\.|-)*\d)\s*,|{)\s*)(?:"([^":\n]+?)"|'([^":\n]+?)'|([^":\n]+?))(\s*):/g
    Wraps property names in double quotes. Explanation and example on Regex101

As I said, this doesn't work perfectly, but it does the trick and it doesn't seem to be dangerous. When using it on the invalidString, it produces invalid JSON, the parser throws an error and the user is kindly asked to input valid JSON. But when using normal JavaScript with somewhat normal string content, it works just fine. And you can still use normal JSON if you want, of course. It first tries whether plain JSON.parse() works before falling back to the regex, as you can see in the source code here:

case '{':case '[':
  // JSON string (probably)
  var obj;
  try       { obj = JSON.parse(data) }
  catch (e) {
    console.warn('Input was not valid JSON, switching to experimental parser for invalid JSON')
    try {
      obj = JSON.parse(data.replace(this._rgx.json[0],'$1"$2"').replace(this._rgx.json[1],'$1"$2$3$4"$5:'))
    } catch (e) {
      console.warn('Experimental parser failed. Please improve the JSON. If this is not JSON, please re-read the supported formats.')
    }
  }
  var res = new Cite(obj);
  inputFormat = 'string/' + res._input.format;
  formatData = res.data;
  break;

Monday, September 26, 2016

Citation.js on the Command Line

Last week I made Citation.js usable on the command line, with Node.js. First of all, the main file determines whether it is being run with Node.js or in the browser, and switches to Browser or Node.js mode based on that. This way, it knows when it can use methods like document.createElement('p') and XMLHttpRequest and when not to.
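
The check itself doesn't have to be anything fancy; a common pattern (a sketch of the idea, not necessarily the exact check Citation.js uses) is:

// Decide between Browser and Node.js mode based on the environment:
// in the browser `window` and `document` exist, in Node.js they don't.
var browserMode = typeof window !== 'undefined' &&
                  typeof window.document !== 'undefined'

if (browserMode) {
  // document.createElement(), XMLHttpRequest, etc. are available
} else {
  // use Node.js equivalents instead
}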

Secondly, I created node.Citation.js. It is a file specialised for Node.js that loads Citation.js as a package and feeds it command line arguments. It also, of course, saves the output to a specified file. This way, you can easily process lots of files at once, instead of having to make some sort of I/O system in your browser. Now you can, for example, transform a list of Wikidata URLs or Q-numbers into a large BibTeX file. Currently, the input list needs to be JSON (although a single URL will work), but that is being worked on. The parser for Wikidata properties (on my side) is being extended as well.

Overview of the use of node.Citation.js

That's about it. I'll add some better bypasses for browser-native methods in Nodejs mode and expand node.Citation.js to be able to process more input formats.

Monday, September 19, 2016

Weekly Report 9: More topics

This week I wanted to extend my program with the lists of cards containing information. In the past few weeks, I made examples with topics such as conifers and zika. In the previous report I explained how I got the facts, and how other people could get them as well. But I felt it was a bit too complex, and mainly too messy: executing nine previously unused commands while keeping an eye on the file paths, and changing the scripts as well, is not user-friendly.

So I made a GitHub repository, ctj-cardlists, where people can submit topics with pull requests, and where I can improve the HTML, CSS and JavaScript without having to update every single HTML file. For getting facts I made a Bash script, and a guide on how to use it and how to submit a topic once you have the facts.

Cardlist page with the topic conifers (code co1)

I expect any interested person to be able to add a dataset to view with the page. Any part of the page is linkable, and it is a nice way to quickly glance at a lot of articles, and see common recurrences. Even if it currently mainly means "most misspelled plant names" (Pinus masson... uh... massonsana... no, I got it. Pinus massoniaia!). It will become more interesting with new columns like in the example below.

I'll improve the page in the next weeks, for example with search auto-completion and shorter loading times, and eventually extra features like new columns about e.g. popular genus-word combinations. One problem is that you currently can't scroll down much anymore. This is because it shortens browser rendering times a LOT, and a fix on my side isn't in place yet. Search results and subpages focusing on a specific article/genus/species work as expected (actually better, because the shorter loading times allow more cards in general). I'll try to fix the scrolling problem soon.

Sunday, September 11, 2016

Weekly Report 8: Visualising Zika articles

Last week I wanted to look into extracting more facts, and the relations between found species and compounds. This would be done by extending ami. However, it became clear there will be big improvements to ami in the future, and things like ChemicalTagger and OSCAR are planned to be implemented anyway. It's better to wait for those to be completed before extending it for my own purposes.

Instead I improved the card page for future use. I didn't have too much time to do stuff this week, so I mainly wanted to demonstrate how you could use it with other data.

Article page

Here it is. It's very similar, of course: it has the same design and a comparable data structure. Word clouds now work. You can view them by opening an article and clicking "Click to load word cloud". It uses a custom API built on cloud.js (repo, license). It works by providing the URL parameters file (URL of a file with the ctj output structure containing word data) and pmcid (PubMed Central ID).

I'll talk a bit more about the process of getting data to display in a similar manner. Below is a command dump, but it doesn't cover the custom programs. First you get papers, with getpapers. I used the query 'ABSTRACT:zika OR ABSTRACT:dengue OR ABSTRACT:spondweni'. There is nothing really special about this: ABSTRACT: helps assure the article remotely covers the topic, and the other parts are just topics. You can replace this with anything you want. You can use a limit of 500 for now.

Then, you take it through the ContentMine pipeline (i.e. norma and ami), using the ami plugins ami2-species, ami2-words and ami2-sequence. This gives a file system as output, which you can convert to JSON with ctj. Then you minify the file size by removing all data you don't use with c05.js, which I'll document later. The file paths are hard-coded, but if you stick to the file structure I've used in the command dump it should work. Finally, you change the file paths in card_c05.html to what you want.

To make the wordcloud API work, you use c05-words.js. The file paths are hard-coded in this file as well, so look out for that; it may try to save a file in a directory that doesn't exist. I'll solve this sometime. Change the file path at line 208 to the output of c05-words.js, and you should be done... Note that you can't load files over the file:// protocol, so you may have to host it somewhere.

Commands used

Next week I'll probably add a better search function and similar things, and see if I can help with extending ami.

Sunday, September 4, 2016

Weekly Report 7: Interactive ContentMine output

This weekly report covers the past two weeks. I blogged twice last week, and I figured that was enough.

Last week I blogged about word clouds from ContentMine output. I also blogged about ctj. This week, I have combined both into interactive lists, as seen here and in the images below.

List overview. From left to right: articles, and genus/genera
and species that were mentioned in the articles.

Search results. Here one paper (doi:10.1186/1471-2164-13-589)

I made a NodeJS JSON-to-JSON converter (here). It takes the ctj output, strips all the information I don't use, generates some lists, and outputs a minified JSON file. I load this in the HTML file and generate the "cards". I'll probably move that process to a NodeJS file as well. This will cause a larger file size, but hopefully a shorter loading time. I also need to make the scrolling more efficient; I don't need to load cards people don't view.

The "generate word cloud" button doesn't work yet, because it currently needs to load data from a file that's to big to put on GitHub efficiently. I'll fix this later.

In the next few weeks I'll fix the issues above and start to see how I can extract more "facts". Currently I only know where what is mentioned, where "what" is limited to species, genera, words, human genes, and regex matches. In the future I want to find metabolites, chemicals, and the relations between these and conifer species.

Conifer taxonomy - Part 2

Continuation of this post.

I got an answer quite quickly (but after posting the previous post):

The Plant List marks which species are in which genus and family, and groups families in Major Groups, e.g. gymnosperms. It also marks synonyms. With a list of conifer species and the ContentMine output, I can determine which species are not conifers, and find how they interact with each other. Right now I only have a list of species and genera, without any context.

The site does not really answer the question we asked in the previous post, of how families are grouped inside the gymnosperm group, but I seem to have been mistaken on that part anyway. The page about gymnosperms does have some different ideas about how the families are divided, but the Pinophyta page does not state that it isn't part of the gymnosperms; it just says it in different words. It seems to say that it is part of one of the two groups, gymnosperms and angiosperms, not of a group containing both.

Pine (own photo)

Anyway, that is not really important. I now have the information I want, on how species are grouped in genera and families. That is what I need, at least for now. I don't have to focus on things outside the conifer families, so I do not think I really need to know how those are divided. This concludes our quest for data for now; see my other blog posts for progress on the project.

Monday, August 29, 2016

Word Clouds

Yesterday I published a blogpost where I talked about ctj, and about how and why to convert ContentMine's CProjects to JSON. At the end, I mentioned this post, where I would talk about how to use the output in different programs, and with d3.js. So here we go. For starters, let's make the data about word frequencies look nice. Not readable (then we would use a table), but visually pleasing: let's make a word cloud. Skip to the part where I talk about converting the data.

Figure 1: Word Cloud (see text below)

Most Google results for "d3 js word cloud" point to cloud.js (repo, license). The problem was, I could not find source code that worked in the browser. Both index.js and build/d3.layout.cloud.js use require() in one of the first lines, and I therefore assumed they were intended for NodeJS.

Figure 2: Different font size scalings: log n, √n, n
(where n is the number of occurrences of a word)

So I started looking for a web version. I decided to copy cloud.min.js from the demo page, unminify it, and customise it for my own use, which was a huge pain. Almost all the variable names were shortened to single letters, and the whole code was a mess due to the minification process. Figuring out where to put parameters like the formula for scaling font size (see figure 2), the possible angles for words, etc. was the most troublesome, as I wanted constants and the code took them from the form at the bottom of the demo page.

Here is the static result. For the live demo you need the input data and a localhost. Here is a guide on how to get the input data. To apply it, change the path on line 17 and change the PMCID on line 19 to the one you want to show. Of course, this needs to be an article that exists in your data. For jQuery to be able to fetch the JSON, you need something server-like, because a local file fetching another local file in the same location still doesn't count as the same domain.

Figure 3: See paragraph to the right

After a while I noticed that build/d3.layout.cloud.js works in the browser as well, so I used that. Here is the static result. As you can see, I need to tweak the parameters a bit, which is more difficult than it seems. I'll optimise this and post about it later.

Now, the interesting part. When you finish making a design, you want to feed it words. We fetch the words from the output of ctj. I did it with jQuery, as that is what I normally use. In the callback function, we get the word frequencies of a certain article (data.articles[ "PMC3830773" ].AMIResults.frequencies) and change the format to one cloud.js can handle more easily. This can be anything, but you need to specify it, e.g. here, and it is probably better to remove all data that will not be used. Then we add the words to the cloud (layout) and start generating the visual output with .start().
$.get( 'path/to/data.json', function( data ){
   
 var frequencies = data.articles[ "PMC3830773" ].AMIResults.frequencies
   , words       = frequencies.map( function ( v ) {
                     return { text: v.word, size: v.count }
                   } )
  
 layout.words(words).start();
  
});

Now that it is generated, we can see which words, stripped of common words such as prepositions and pronouns, are used most. The articles are about pines, so we see lots of words confirming that: "conifers", "wood", etc. We also notice some errors, like "./--" and "[].", punctuation not being recognised ("wood", "Wood", "wood." and "wood," counted separately), and CSS (?!): {background, #ffffff;} and px;}. These are all problems with ContentMine's ami2-word plugin and will be fixed. No worries.

More examples on how to use CProject data as JSON coming soon. Perhaps popular word combinations.


Example articles used:

  • Figure 1 and 2: Sequencing of the needle transcriptome from Norway spruce (Picea abies Karst L.) reveals lower substitution rates, but similar selective constraints in gymnosperms and angiosperms (10.1186/1471-2164-13-589)
  • Figure 3 and code block: The Transcriptomics of Secondary Growth and Wood Formation in Conifers (10.1155/2013/974324)

Sunday, August 28, 2016

CProject to JSON

I changed the "JSON maker" I used to convert CProjects to JSON last week to be useful for a more general use case, being more user-friendly and having more options, like grouping by species and minifying. It's now called ctj (CProject to JSON), although that name may be changed to something more clear or appropriate. The GitHub repository can be found here.

ctj

CProjects are the output of the ContentMine tools. The output is a directory of directories with XML files, JSON files, and a directory with more XML files, some of which may be empty. ctj converts all of this into one JSON file, or several if you want. Here is a guide on how to use it.

Because it is JSON now, it can easily be used by different programs, or by websites. It is easy to use with e.g. d3.js, which I did, and I will blog more about that soon. This will probably be the link.

Sunday, August 21, 2016

Weekly Report 6: ContentMine output to JSON to HTML

The "small program" proved more of a challenge than it seemed. Making a program to generate the JSON (link) was fairly easy. Loop through directories, find files, loop through files, collect XML data, save all collected data as JSON in a file. It took a while, but I think I spent the most time of it setting up the logistics, i.e. a nice logger, a file system reader and an argument processor.

The generated JSON was around 11 MB for 250 papers, so I didn't put it on GitHub, but it's fairly easy to reproduce. Here's a step-by-step guide. After you generate the data, put the JSON file and html/card_c03.html on your localhost (the HTML can't load the JSON otherwise) and open the latter in a browser, preferably Chrome/Chromium (I haven't tested it in other browsers). You may need to change the file path at line 459 to the place where you stored the JSON file, but this shouldn't be too much of a problem. Also, the content in the columns is capped at 50 items per column. You can change this for papers, genera and species at lines 327, 369 and 419 respectively.

If you don't have the time to reproduce the data, here is a static demo (GUI under development). Click to expand the "cards" (items). The items are again capped at 50 per column. The papers are sorted by PMCID, the genera and species in order of appearance in the papers. The extra information at the bottom of the cards in the species and genus columns is in which papers they are mentioned, and how often. The info at the bottom of the article cards should be self-explanatory.

Current GUI

Finishing the GUI will take longer than making the JSON, mostly because CSS can be pretty annoying when you're trying to make nice things without too much JavaScript. I'll have to rethink the design of the cards because things don't fit now, find a way to display the columns more nicely, and much more. All this might take a while, as there are lots of features I would like to try to implement.

The blogpost about Activation of defence pathways in Scots pine bark after feeding by pine weevil (Hylobius abietis) is postponed. I'll work on it when I'm done with the project above.

Friday, August 19, 2016

Conifer taxonomy

Recently, I tried to find out the exact taxonomy of conifers. I knew that a few years earlier, when I was actively working with it, there were a few issues on Wikipedia concerning the grouping of the main conifer families, namely Araucariaceae, Cephalotaxaceae, Cupressaceae, Pinaceae, Podocarpaceae, Sciadopityaceae and Taxaceae, and actually the grouping of genera in families as well. Guess what changed: not much, not on Wikipedia anyway. The disagreement between Wikipedia pages in different languages is one thing, but the English pages were contradicting each other pretty heavily, not even mentioning the position of gymnosperms in the plant hierarchy, which looks even worse. Some examples:

Subclass Cycadidae

  • Order Cycadales
    • Family Cycadaceae: Cycas
    • Family Zamiaceae: Dioon, Bowenia, Macrozamia, Lepidozamia, Encephalartos, Stangeria, Ceratozamia, Microcycas, Zamia.

Subclass Ginkgoidae

Subclass Gnetidae

Subclass Pinidae

  • Order Pinales
    • Family Pinaceae: Cedrus, Pinus, Cathaya, Picea, Pseudotsuga, Larix, Pseudolarix, Tsuga, Nothotsuga, Keteleeria, Abies
  • Order Araucariales
    • Family Araucariaceae: Araucaria, Wollemia, Agathis
    • Family Podocarpaceae: Phyllocladus, Lepidothamnus, Prumnopitys, Sundacarpus, Halocarpus, Parasitaxus, Lagarostrobos, Manoao, Saxegothaea, Microcachrys, Pherosphaera, Acmopyle, Dacrycarpus, Dacrydium, Falcatifolium, Retrophyllum, Nageia, Afrocarpus, Podocarpus
  • Order Cupressales
    • Family Sciadopityaceae: Sciadopitys
    • Family Cupressaceae: Cunninghamia, Taiwania, Athrotaxis, Metasequoia, Sequoia, Sequoiadendron, Cryptomeria, Glyptostrobus, Taxodium, Papuacedrus, Austrocedrus, Libocedrus, Pilgerodendron, Widdringtonia, Diselma, Fitzroya, Callitris (incl. Actinostrobus and Neocallitropsis), Thujopsis, Thuja, Fokienia, Chamaecyparis, Callitropsis, Cupressus, Juniperus, Xanthocyparis, Calocedrus, Tetraclinis, Platycladus, Microbiota
    • Family Taxaceae: Austrotaxus, Pseudotaxus, Taxus, Cephalotaxus, Amentotaxus, Torreya

Subclasses of the division gymnosperms, according to the Wikipedia page of gymnosperms

Ginkgo biloba leaves (source: inbetweenbays, CC-BY 4.0, via iNaturalist)

In the text above, we see that one of the subclasses of gymnosperms contains the "traditional" conifers, and three contain related species. A different page, however, talks about a division called Pinophyta, and a class called Pinopsida, where all "traditional" conifers are located. Recapping: the division gymnosperms contains conifers AND some other things like the Ginkgo and cycads, while the division Pinophyta contains ONLY conifers. This could be possible if Pinophyta were a part of the gymnosperms, but they're both divisions, so, as far as I know, they should be on the same taxon level. And the Wikipedia pages do not indicate anything else.

Now, my knowledge of taxonomy may not be perfect, but this doesn't seem right. So I tweeted Ross Mounce, who has been busy making phylogenetic trees.

To be continued...

Thursday, August 18, 2016

Citation.js

Today I updated my GitHub repository of Citation.js. The main difference in the source files was the correction of a typo, but I updated the docs and restructured the directories as well. Not really important, but it is a nice opportunity to talk about it on my new blog (this one).

So, Citation.js is a JavaScript library for converting things like BibTeX, Wikidata JSON and ContentMine JSON to a standardised format, and for converting that standardised format to citations in styles like APA and Vancouver, and to BibTeX. And you know the part of Microsoft Word where you can fill in a form to get a reference list? It does that too (with the jQuery plugin).

Screenshots from the demo page

I made it so my classmates could enjoy the comfort of not having to bother with style guides, as the software does that for them, while also being able to use a custom style, the one our school requires. In the end, I wanted to make this a public library, so I rewrote the whole program to be user-friendly (to programmers) and way more stable.

The fun thing was, I made a bunch of other things in the process: a syntax highlighter for JavaScript and BibTeX, an alternative to eval for 'relaxed' JSON (JSON with syntax that works for JavaScript but not for JSON.parse()), a BibTeX parser and much more. These things made the project nice and interesting.

Wednesday, August 17, 2016

Weekly Report 5: This is what pine literature looks like

(yes, three days late)

This week I wanted to catch up on all the things that had happened while I was on holiday. I have finished my introductory blogpost on ContentMine's blog, and I made this blog and transferred all the weekly reports from the GitHub Wiki to here.

Next week will be more interesting. Firstly, I will publish a blogpost containing a short analysis of the article I mentioned in an earlier blog, namely Activation of defence pathways in Scots pine bark after feeding by pine weevil (Hylobius abietis). There I'll talk about this article as an example of how similar articles could be used to extract data and create a network of facts.

Large Pine Weevil (Hylobius abietis), mentioned in the article (source: Stanislav Snäll, CC BY 3.0, via Wikimedia Commons)

Secondly, I will make a small program that visualises data outputted by the ContentMine pipeline. This will basically be a small version of the program to make the database, omitting the most important and difficult part, finding relations between facts. This is mainly to test the infrastructure and see what things need to be improved before starting to build the big program.

Friday, August 12, 2016

Introducing Fellow Lars Willighagen: Constructing and visualising a network-based conifer database

Originally posted on the ContentMine blog

I am Lars, and I am from the Netherlands, where I currently live. I applied to this fellowship to learn new things and combine the ContentMine with two previous projects I never got to finish, and I got really excited by the idea and the ContentMine at large.

A part of the project is a modification of a project I started a while ago, for visualising tweets with certain hashtags and extracting data like links and ORCIDs with JavaScript to display in feeds. It was a side project, so it died out after a while. It is also a continuation of an interest of mine, one in conifers. A few years ago, I tried to combine facts about conifers from several sources into a (LaTeX) book, but I quit this early as well.

Practically, it is about collecting data about conifers and visualising it in a dynamic HTML page. This is done in three parts. The first part is to fetch, normalise and index papers with the ContentMine tools, and automatically process them to find relations between data, probably by analysing sentences with tools such as (a modified) OSCAR, (a modified) ChemicalTagger, simply RegEx or, if it proves necessary, a more advanced NLP tool like SyntaxNet.

For example, an article that would work well here is Activation of defence pathways in Scots pine bark after feeding by pine weevil (Hylobius abietis). It shows a nice interaction network between Pinus sylvestris, Hylobius abietis and chemicals in the former related to attacks by the latter.

The second part is to write a program to convert all data to a standardised format. The third part is to use the data to make a database. Because the relations between the found data are known, it will have a structure comparable to Wikidata and similar databases. This will be shown on a dynamic website, and when the data is reliable and the error rate is small enough, it may be exported to Wikidata.

ContentMine Fellowship

So, I am a ContentMine fellow now. Actually, I have been for a few weeks already. That's why you see the weekly reports: those are reports about the project I am doing for my fellowship. I do not have much else to say about it. For more information, please take a look at the links below.

Some links:

Sunday, August 7, 2016

Weekly Report 4: Learning about text mining

I was on holiday for the last two weeks, so I was not able to do all too much and I missed the second webinar, but I did read some on-topic papers about NLP. Finding the occurrence of certain species and chemicals seems fairly easy to me, with the right dictionaries of course, but finding the relation between a species and a chemical in an unstructured sentence can't be done without understanding the sentence, and without a program to do this, you would have to extract the data manually. I first wanted to do this with RegExp, but now that I have read the OSCAR4 and ChemicalTagger papers, I know there are other, better options.

OSCAR especially is said to be highly domain-independent. My conclusion is that OSCAR, and perhaps with it a modified version of ChemicalTagger, can be used in my project to find the relations between occurring species, chemicals and diseases.

If this doesn't work for some reason (e.g. it takes too long to build dictionaries, or the grammar is too complex), there is always Google's SyntaxNet and its English parser Parsey McParseface. Be that as it may, this seems a bit over the top. They are made to figure out the grammar of less strict sentences, and therefore they have to use more complex systems, such as a "globally normalized transition-based neural network". This is needed to choose the best option in situations like the following:

One of the main problems that makes parsing so challenging is that human languages show remarkable levels of ambiguity. It is not uncommon for moderate length sentences - say 20 or 30 words in length - to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives, and find the most plausible structure given the context. As a very simple example, the sentence Alice drove down the street in her car has at least two possible dependency parses:



The first corresponds to the (correct) interpretation where Alice is driving in her car; the second corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition in can either modify drove or street; this example is an instance of what is called prepositional phrase attachment ambiguity.

If it's necessary, however, it's always an option.

Sunday, July 17, 2016

Weekly Report 3: Playing with NodeJS

I made a test JavaScript file for NodeJS to practice working with it, and prepare more for the project.

The result (code here) is a small program with some options:


Options:

  -h, --help     output usage information
  -V, --version  output the version number
  -o <path>      File to write data to
  -c <path>      File to print
  -t <string>    Text to print
  -d <string>    Data to write to file
  -v <string>    Verbosity: DEBUG, INFO, LOG, WARN or ERROR

I made a custom logger and a program to parse the passed arguments myself. Parsing the options with the commander module, as seen here:
var program = require('commander');

program.parse(process.argv);
which is used in the quickscrape code as well, didn't really work for me. Maybe I did something wrong. When I run the following code
var program = require('commander');

program
  .option('-t',
      'Test A')
  .option('-T',
      'Test B')
 .parse(process.argv);

console.log(program);
like this
node file.js -t 'Test_A' -T 'Test_B'
the console outputs this:
{ commands: [],
  options: 
  [ { flags: '-t',
required: 0,
optional: 0,
bool: true,
long: '-t',
description: 'Test A' },
    { flags: '-T',
  required: 0,
  optional: 0,
  bool: true,
  long: '-T',
  description: 'Test B' } ],
  _execs: {},
  _allowUnknownOption: false,
  _args: [],
  _name: 'file',
  Command: [Function: Command],
  Option: [Function: Option],
  _events: { '-t': [Function], '-T': [Function] },
  rawArgs: 
  [ 'node',
    '/home/larsw/public_html/cm-repo/js/file.js',
    '-t',
    'Test_A',
    '-T',
    'Test_B' ],
  T: true,
  args: [ 'Test_A', 'Test_B' ] }
This doesn't seem right to me. If I understand correctly, 'Test_A' is passed to -t and 'Test_B' to -T. Instead, commander seems to say -t and/or -T were passed (T: true is present when you pass only -t too), and two strings were passed without context. Maybe there's something wrong with the program.option() calls, or maybe my knowledge of command line arguments is incorrect, but I don't think to this extent. Just the same, I made a parser myself. Not sure if it follows any standard, but it works for me, for now. I will update the documentation shortly.
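
For what it's worth, the core of such a parser can stay quite small; a minimal sketch of the approach (my actual implementation differs and handles more cases):

// Pair every "-x" flag with the argument that follows it, e.g.
// node file.js -t 'Test_A' -T 'Test_B'  →  { t: 'Test_A', T: 'Test_B' }
function parseArgs (argv) {
  var args = {}

  // skip the "node" executable and the script path
  argv.slice(2).forEach(function (arg, index, rest) {
    if (arg[0] === '-') {
      var next = rest[index + 1]
      // a following non-flag argument becomes this option's value
      args[arg.replace(/^-+/, '')] = (next && next[0] !== '-') ? next : true
    }
  })

  return args
}

console.log(parseArgs(process.argv))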

Next week I'll look into norma, to see how it changes the format of papers from XML/HTML to sHTML, and into ChemicalTagger, to see how it recognises sentences.