In Weekly Report 10 I talked about searching for answers to the question "What height does a grown Pinus sylvestris normally have?". In the post, I looked at some of the articles returned by the query "Pinus sylvestris"[Abstract] AND height, and found interesting information in the tables. The next step was to extract this information.
So that's what I did. The first step was downloading fulltext XML files of the articles with getpapers. Secondly, I wanted to normalise the XML to sHTML with norma, but it currently has an issue with tables, so I resorted to parsing tables directly from the XML files. This wasn't really a downgrade. The main difference is that the XML uses arbitrary tags instead of e.g. <i> and <sup>. Although such markup is mostly about layout, it still conveys meaning: text in italics, for example, is probably a species name.
There are more problems, but I will go through these later. First off, how do I get the computer to gain information from tables? Well, generally, tables have the following structure:
Subject Type | Property 1 | Property 2 | … |
Object A | Value A1 | Value A2 | … |
Object B | Value B1 | Value B2 | … |
… | … | … | … |
A small program can convert this to the following JSON:
[
  {
    "name": "Object A",
    "type": "Subject Type",
    "props": [
      { "name": "Property 1", "value": "Value A1" },
      { "name": "Property 2", "value": "Value A2" }
    ]
  },
  {
    "name": "Object B",
    "type": "Subject Type",
    "props": [
      { "name": "Property 1", "value": "Value B1" },
      { "name": "Property 2", "value": "Value B2" }
    ]
  }
]
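Such a program could look roughly like the Python sketch below. It assumes the table has already been read into a list of rows, with the head row first and the object name in the left column. The function name `table_to_objects` is made up for illustration; this is not the actual code behind this post.

```python
def table_to_objects(rows):
    """Turn a list of rows (head row first) into a list of object dicts."""
    header, *body = rows
    # The top-left cell names the subject type, the rest are property names.
    subject_type, *prop_names = header
    objects = []
    for row in body:
        name, *values = row
        objects.append({
            "name": name,
            "type": subject_type,
            # Pair each property name with the value in the same column.
            "props": [
                {"name": prop, "value": value}
                for prop, value in zip(prop_names, values)
            ],
        })
    return objects


rows = [
    ["Subject Type", "Property 1", "Property 2"],
    ["Object A", "Value A1", "Value A2"],
    ["Object B", "Value B1", "Value B2"],
]
objects = table_to_objects(rows)
```

Because `zip` stops at the shorter of the two lists, a row with fewer cells than the head row simply yields fewer props instead of raising an error.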
This is pretty easy. Real problems occur when tables don't use this layout. Some things can be accounted for. Take the example (see figure below) where a table doesn't have a head row, so I can't read the properties. The solution is to look at the previous table and copy its properties. Another case is empty values in the left column. Usually this implies that the row above contains the missing information, and it is done as a way to group table rows. Again, the solution is doing what the reader would do: look at the text in the cell above.
Example of split tables. From Ethnoveterinary medicines used for ruminants in British Columbia, Canada (DOI:10.1186/1746-4269-3-11)
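The two fixes for missing head rows and empty left cells can be sketched in Python as follows. This is only an illustration under assumptions: each table is represented as a dict with a `header` (or `None` when the head row is missing) and a list of `rows`, and the names `inherit_headers` and `fill_down_first_column` are hypothetical.

```python
def inherit_headers(tables):
    """Give tables without a head row the header of the previous table."""
    last_header = None
    result = []
    for table in tables:
        # Fall back to the most recent header seen so far.
        header = table["header"] or last_header
        last_header = header
        result.append({"header": header, "rows": table["rows"]})
    return result


def fill_down_first_column(rows):
    """Replace empty left cells with the text from the nearest cell above."""
    filled = []
    previous = ""
    for row in rows:
        if row[0].strip():
            previous = row[0]
        filled.append([previous, *row[1:]])
    return filled


tables = inherit_headers([
    {"header": ["Subject Type", "Property 1"], "rows": [["Object A", "Value A1"]]},
    {"header": None, "rows": [["Object B", "Value B1"]]},
])
grouped = fill_down_first_column([
    ["Object A", "Value A1"],
    ["", "Value A2"],
    ["Object B", "Value B1"],
])
```

If the very first table has no head row there is nothing to copy, so its header stays `None` and it needs to be handled separately.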
But there are things that can't (easily) be accounted for as well. For example, when the table is too long for the page but narrow enough for two tables to fit side by side. Instead of aligning the two halves beside each other, they break the table apart and stick it back together, but in the wrong place. Another very common problem is a transposed table: the properties are now in the left column, and the objects in the top row. These problems aren't that hard to fix; the hard thing is for the computer to find out which table layout is used.
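To show why the fix itself is the easy part, here is a minimal Python sketch that flips a transposed table back to the normal layout. It assumes the table is a list of rows and that we already know it is transposed; detecting that is the hard part and is not attempted here. The name `transpose` is just for illustration.

```python
from itertools import zip_longest


def transpose(rows):
    """Swap rows and columns, padding short rows with empty strings."""
    return [list(column) for column in zip_longest(*rows, fillvalue="")]


# Properties in the left column, objects in the top row.
transposed = [
    ["Subject Type", "Object A", "Object B"],
    ["Property 1", "Value A1", "Value B1"],
    ["Property 2", "Value A2", "Value B2"],
]
normal = transpose(transposed)
```

After this step the first row of `normal` is the head row again, so the table can be processed like any other.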
JSON processing
But let's drop these issues for a moment. Assume enough data IS valid. We now have the JSON containing triples. I wanted to transform this to a format similar to ContentMine facts, as used by factvis. Because the ContentMine fact format is only used for identified terms for now, I had to come up with some new parts. It currently looks like this:
{
  "_id": "AAAAAAB6H6",
  "_source": {
    "term": "Acer macrophyllum",
    "exact": "Acer macrophyllum Pursh (Aceraceae) JB043",
    "prop": {
      "name": "Local name"
    },
    "value": "Big leaf maple",
    "identifiers": {
      "wikidata": "Q599523"
    },
    "documentID": "AAAAAAB6H5",
    "cprojectID": "PMC1831764"
  }
}
We have the fact ID, just as with ContentMine facts, and the _source object, containing the actual data. The fact ID is unique for every object, not every triple. This is to preserve context. The _source has a documentID and a cprojectID. Both say what article the triple was found in. exact has the exact text found in the cell, while term has the extracted plant name, if present. Otherwise, it contains the cell's contents. prop contains the text found in the property cell, and in some cases other data such as the unit, if found. value contains the text in the value cell.
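As a rough illustration of this structure, the Python sketch below turns one parsed object into facts, one per property, all sharing the same fact ID. Several things here are assumptions: the function name `objects_to_facts` is hypothetical, the ID scheme (document ID plus a counter) is a placeholder and not how the real IDs are generated, and term is simply set to the cell's contents because no plant name extraction is done.

```python
def objects_to_facts(objects, document_id, cproject_id):
    """Create one fact per property; facts of one object share an _id."""
    facts = []
    for index, obj in enumerate(objects):
        # Placeholder ID scheme (assumption): one ID per object, not per triple.
        fact_id = f"{document_id}-{index}"
        for prop in obj["props"]:
            facts.append({
                "_id": fact_id,
                "_source": {
                    "term": obj["name"],   # no name extraction in this sketch
                    "exact": obj["name"],  # exact text of the cell
                    "prop": {"name": prop["name"]},
                    "value": prop["value"],
                    "identifiers": {},     # filled in when the term is identified
                    "documentID": document_id,
                    "cprojectID": cproject_id,
                },
            })
    return facts


sample = [{
    "name": "Object A",
    "type": "Subject Type",
    "props": [
        {"name": "Property 1", "value": "Value A1"},
        {"name": "Property 2", "value": "Value A2"},
    ],
}]
facts = objects_to_facts(sample, "AAAAAAB6H5", "PMC1831764")
```

Sharing the ID between the two facts is what keeps the context: everything that came from the same table row can be found again through one ID.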
The property identifiers contains some IDs, if the term is identified. This is done by building a dictionary of 1.8 million species names linked to Wikidata IDs, directly from Wikidata, and matching these with terms. Props, values and other types of terms will hopefully have identifiers soon too.
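One way such matching could work is sketched below in Python. Note that the matching strategy is my assumption for this example (try the longest leading run of words in the cell that appears in the dictionary), the function name `identify_species` is hypothetical, and the one-entry dictionary is only a stand-in for the full one.

```python
def identify_species(cell_text, species_ids):
    """Return (term, Wikidata ID) for the longest leading match, if any."""
    words = cell_text.split()
    # Try the whole cell first, then drop one trailing word at a time.
    for end in range(len(words), 0, -1):
        candidate = " ".join(words[:end])
        if candidate in species_ids:
            return candidate, species_ids[candidate]
    # No match: fall back to the cell's contents, without an identifier.
    return cell_text, None


# Tiny stand-in for the dictionary of species names and Wikidata IDs.
species_ids = {"Acer macrophyllum": "Q599523"}
term, wikidata_id = identify_species(
    "Acer macrophyllum Pursh (Aceraceae) JB043", species_ids
)
```

A plain dict keeps each lookup cheap, which matters with millions of names. A species name that is not at the start of the cell would be missed by this sketch.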
Further processing
To quickly see the results, I made some changes to factvis so that it works with my new fact data. Here it is. It currently has three sample data sets, each with 5000 triples. I myself have ten of these sets, and I will release more when the data is more reliable; currently, it's just a demo.
Triple visualisation demo
However, we can already see the outlines of what is possible. Even if the data isn't going to be reliable, I can set up a service (with QuickStatements) for people to curate the data and add it to Wikidata with a few clicks. And if 50000 triples (some of which may be invalid) don't sound like enough to set up such a service, remember that they come from a simple query returning only 174 papers.