Syntaxus baccata: ContentMine

Tuesday, July 17, 2018

Journal Metadata: Authors & Institutions

I finished the General Plugin system for Citation.js a few days ago (more on that later), so I could finally publish a new beta release. Now, after that half-finished piece of code had been blocking other work for a long while, I can at last start… fixing bugs, and closing other items in the backlog.

One of the items that has been on the backlog for a long time, and was on the backlog of the previous major version too, was sorting out BibJSON. BibJSON has been “supported” since before CSL-JSON was introduced as the internal standard, but under the name of ContentMine JSON, as I only knew it as the output of ContentMine’s quickscrape tool.

quickscrape output (source, license: MIT)

Since then, I learned it actually was a more standardised format, but never got around to reading the standard and updating the parser. Today, however, I did. Turns out, it is something in between JSON-LD and BibTeX. While searching around for more comprehensive documentation, I saw the journal-scrapers (used by quickscrape) again, which I used to compile some test cases.

Unfortunately, one of the first examples went wrong already. The meta tags, containing the bibliographical data that quickscrape scrapes, specifically data pertaining to the authors, are not structured in a machine-friendly way, in my opinion. Certainly, quickscrape has trouble with it.

...
<meta name="citation_author" content="P. Pandikumar"/>
<meta name="citation_author_institution" content="Division of Ethnopharmacology, Entomology Research Institute, Loyola College, Chennai, India"/>
<meta name="citation_author" content="S. Ignacimuthu"/>
<meta name="citation_author_institution" content="Division of Ethnopharmacology, Entomology Research Institute, Loyola College, Chennai, India"/>
<meta name="citation_author_institution" content="International Scientific Partnership Programme, King Saud University, Riyadh, Saudi Arabia"/>
<meta name="citation_author" content="N. A. Al-Dhabi"/>
<meta name="citation_author_institution" content="Addiriyah Chair for Environmental Studies, College of Science, King Saud University, Riyadh, Saudi Arabia"/>
...
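To illustrate why this layout is awkward for machines: the only link between an institution and its author is the order of the tags. Below is a minimal Python sketch, not what quickscrape or Citation.js actually does, that attaches every citation_author_institution tag to the most recent citation_author tag. The names AuthorMetaParser and parse_authors are made up for this example; the sample tags are copied from the snippet above.

```python
from html.parser import HTMLParser

# Two of the authors from the snippet above, copied verbatim.
SAMPLE = """
<meta name="citation_author" content="S. Ignacimuthu"/>
<meta name="citation_author_institution" content="Division of Ethnopharmacology, Entomology Research Institute, Loyola College, Chennai, India"/>
<meta name="citation_author_institution" content="International Scientific Partnership Programme, King Saud University, Riyadh, Saudi Arabia"/>
<meta name="citation_author" content="N. A. Al-Dhabi"/>
<meta name="citation_author_institution" content="Addiriyah Chair for Environmental Studies, College of Science, King Saud University, Riyadh, Saudi Arabia"/>
"""


class AuthorMetaParser(HTMLParser):
    """Hypothetical helper: groups institutions under the preceding author."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name == "citation_author":
            self.authors.append({"name": content, "institutions": []})
        elif name == "citation_author_institution" and self.authors:
            # Assumption: an institution belongs to the last author seen.
            self.authors[-1]["institutions"].append(content)


def parse_authors(html):
    parser = AuthorMetaParser()
    parser.feed(html)
    return parser.authors


authors = parse_authors(SAMPLE)
for author in authors:
    print(author["name"], len(author["institutions"]))
```

The sketch depends entirely on tag order: if a page listed all authors first and all institutions afterwards, every institution would end up attached to the last author.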

This particular example is from BioMed Central. However, the pattern persists throughout multiple journals: Nature (example), PLOS One (example), PeerJ (example), and probably many more, as these were just the first four I checked.

Prepend view-source: to those example URLs to quickly view the HTML source, with the meta tags.

The pattern is so similar, especially the authors always coming after a whole list of citation_references in the case of Nature and BMC, that there must be some sort of library or service that generates these, I thought. This quest first led me to look up what kind of tags these citation_ tags are. The fact that the answer wasn’t very easy to find, and the number of unanswered questions I found along the way, quickly made it clear what kind of quest this was going to be.

First of all, the tags: they’re called HighWire Press tags. Normally I would link a website, but I don’t think there is one. They’re the preferred method of metadata tagging of Google Scholar, which lists 16 tags. They’re also the preferred format of Mendeley, which points to the Google Scholar documentation. Yet the only thing I find when searching for some canonical list is people asking where that list could be, and getting no answers (1, 2).

Even with the 16-tag list, I can find at least two tags, each non-trivial (e.g. citation_reference and citation_author_institution), in any of the examples mentioned above, that aren’t on that list. Not to mention that, again, those examples weren’t cherry-picked; they were picked semi-randomly.

Luckily, I’m not the first one to run into this problem. Someone previously compiled a list of 39 citation_ tags based on observations, which is very useful if I want to write a crosswalk for Citation.js sometime, but doesn’t really help with finding a generator.

Back to HighWire: they claim Nature is one of their customers. BMC and Springer Open, however, aren’t, and yet they share the same system, or a common standard that can’t be found anywhere else. That they share a system makes sense, but what system and/or standard are they using? I asked, and will report back when I get an answer.

@BioMedCentral, @SpringerOpen: how do you generate metadata tags on your web pages? cc @samofthedamned https://t.co/EkjLVOf68t
— Lars Willighagen (@larswillighagen) July 14, 2018

Monday, September 4, 2017

ctj rdf: Relations Between Conifers Mentioned in Articles

Below is part two of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.

Here’s a more detailed example. We are using my dataset available at 10.5281/zenodo.845935. It was generated from 1000 articles that mention ‘Pinus’ somewhere. This one has 15326 statements, of which 8875 (57.9%) can be mapped, taking ~50 seconds. Now, for example, we can list the top 100 most mentioned species. The top 20:

Species	Hits
Pinus sylvestris	248
Picea abies	177
Pinus taeda	138
Pinus pinaster	120
Pinus contorta	96
Arabidopsis thaliana	91
Picea glauca	77
Pinus radiata	77
Pinus massoniana	72
Pseudotsuga menziesii	65
Oryza sativa	56
Pinus halepensis	56
Pinus ponderosa	55
Pinus banksiana	53
Pinus koraiensis	53
Picea mariana	52
Pinus nigra	51
Pinus strobus	46
Quercus robur	45
Fagus sylvatica	45
…	…

The top of the list isn’t surprising: mostly pines, other conifers, other trees, Arabidopsis thaliana which I’ve seen represented in pine literature before, and Oryza sativa or rice, which I haven’t seen before in this context.

Note that only 248 of the 1000 articles mention the top Pinus species. This may be because the query getting the articles was quite broad. Also note that this doesn’t take into account how often an article mentions a species; a caveat of the current rdf output.

Going off this list, we can then look at what non-tree or even non-plant species are named most often in conjunction with a given species, or, in this case, a genus. Top 20:

Species 1	Species 2	Co-occurrences
Picea abies	Pinus sylvestris	98
Picea abies	Pinus taeda	56
Arabidopsis thaliana	Pinus taeda	47
Picea glauca	Pinus taeda	43
Arabidopsis thaliana	Oryza sativa	43
Pinus pinaster	Pinus taeda	41
Pinus pinaster	Pinus sylvestris	41
Picea abies	Picea glauca	41
Arabidopsis thaliana	Picea abies	37
Pinus contorta	Pinus sylvestris	36
Betula pendula	Pinus sylvestris	36
Pinus sylvestris	Pinus taeda	36
Pinus nigra	Pinus sylvestris	35
Pinus contorta	Pseudotsuga menziesii	32
Picea abies	Pinus contorta	31
Picea abies	Pinus pinaster	30
Arabidopsis thaliana	Physcomitrella patens	30
Oryza sativa	Pinus taeda	29
Pinus sylvestris	Quercus robur	29
Picea abies	Picea sitchensis	28
…	…	…
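For reference, both kinds of counts can be sketched in a few lines of Python. This is only an illustration of the counting, not the SPARQL query that produced the tables; the article IDs and species sets below are made-up placeholders. As noted above, each article counts a species at most once.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: article ID -> set of species mentioned in that article.
articles = {
    "article-1": {"Pinus sylvestris", "Picea abies"},
    "article-2": {"Pinus sylvestris", "Picea abies", "Pinus taeda"},
    "article-3": {"Pinus taeda"},
}

hits = Counter()   # number of articles mentioning a species
pairs = Counter()  # number of articles mentioning both species of a pair

for species in articles.values():
    hits.update(species)
    # Sorting makes (A, B) and (B, A) count as the same pair.
    pairs.update(combinations(sorted(species), 2))

for (first, second), count in pairs.most_common(3):
    print(first, second, count, sep="\t")
```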

Interesting to see that rice is mostly mentioned with Arabidopsis. Let’s explore that further. Below are species named in conjunction with Oryza sativa.

Species	Co-occurrences
Arabidopsis thaliana	43
Pinus taeda	29
Picea abies	24
Physcomitrella patens	23
Populus trichocarpa	21
Glycine max	20
Vitis vinifera	17
Picea glauca	17
Pinus pinaster	16
Selaginella moellendorffii	15
Pinus sylvestris	13
Triticum aestivum	12
Pinus contorta	10
Picea sitchensis	10
Ginkgo biloba	10
Pinus radiata	10
Ricinus communis	9
Amborella trichopoda	9
Medicago truncatula	9
Cucumis sativus	8
…	…

So attention seems divided between trees and more agriculture-related plants. More to explore for later.

View all posts in this series.

Sunday, September 3, 2017

ctj rdf: Part One

Below is part one of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.

ctj has been around for longer, and started as a way to learn my way into the ContentMine pipeline of tools, but turned out to uncover a lot of possibilities in further processing the output of this pipeline (1, 2).

The recent addition of ctj rdf expands on this. While there is a lot of data loss between the ContentMine output and the resulting rdf, the possibilities certainly are no less. This is mainly because of SPARQL, which makes it possible to integrate with other databases, such as Wikidata, without many changes in ctj rdf itself.

Here’s a simple demonstration of how this works:

We download 100 articles about aardvark (classic ContentMine example)
We run the ContentMine pipeline (norma, ami2-species, ami2-sequence)
We run ctj rdf

This generates data.ttl, which holds the following information:

Common identifier for each article (currently PMC)
Matched terms from each article (which terms/names are found in which article)
Type of each term (genus, binomial, etc.)
Label of each term (matched text)

Example data.ttl contents

Note that there are no links to Wikidata whatsoever. When we list, for instance, how often each term is mentioned in an article (in the dataset), we only have string values, an identifiers.org URI and some custom namespace URIs.

However, in this format, we can easily use the information in these papers in conjunction with the enormous amount of data in Wikidata with Federated Queries.

To accomplish this we first link the identifier in our dataset to the ones in Wikidata; then we link the matched text of the term to the taxon name of species in Wikidata. This alone already gives us a set of semantic triples where both values in every triple are linked to values in the extensive community-driven database that is Wikidata.

Example query, counting how often each species is mentioned, and mapping them to Wikidata

Results of the above query

Now say we want to list the Swedish name of each of those species. Well, we can, because that info probably exists on Wikidata (see stats below). And if we can’t find something, remember that each of those Wikidata values is also linked to numerous other databases.

Again, this is without having to change anything in the rdf output (to be fair, I forgot to list an article identifier in the first version of the program, but that could/should have been anticipated). Not having to add this data to the output has the added benefit of not having to make and maintain local dictionaries and lists of this data.

Some stats:

Number of articles: 100 (for reference)
Number of ‘term found in article’ statements: 1964
Number of those statements that map to Wikidata: 1293 (65.8% of total)
Number of mapped statements with Swedish labels: 1056 (81.7% of mapped statements, 53.8% of total)
Average number of statements per article: 19.64, 12.93 mapped

Note that not all terms are actually valid. A lot of genus matches are actually just capitalised words, and a lot of common species names are abbreviated, e.g. to E. coli, making it impossible to unambiguously map to Wikidata or any other database. This could explain the difference between found ‘terms’ and mapped terms.

View all posts in this series.

Wednesday, April 26, 2017

Final Report: Analysing and visualising data from papers about conifers

Originally posted on the ContentMine blog.

Lars Willighagen, orcid:0000-0002-4751-4637

Final Report of my fellowship at the ContentMine.

Proposal

My proposal was to extract facts about various conifer species by analysing text from papers with software suited for analysing text and the tools provided by the ContentMine. These facts were then to be converted into JSON, and then made viewable with an HTML (+CSS/JS) interface. Expected statements were like: 'Picea glauca is a species of the genus Picea', which could be parsed to the triple: Picea glauca; property: genus; subject: Picea.

Work

The main outcome of this project is a series of programmes converting tables from research articles into Wikidata statements. The workflow is as follows. First, papers matching a user-provided query are fetched by the ContentMine's getpapers. Second, the tables are extracted from the fetched papers and converted to assertions. This is done by filling empty cells in tables and then treating each row as an object, the first column being the name and the others property-value pairs. Different table designs are currently parsed in the same way, resulting in incorrect extraction of data, something that can be accounted for by normalising the table structure beforehand. The resulting assertions are then converted to JSON, currently in a custom scheme, to allow the next steps.

Finally, the JSON assertions are visualized in an HTML GUI. This includes a stepper form (see picture) where you can curate the assertion, link identifiers, and add it to Wikidata.

Stepper form for curating assertions

Source code: https://github.com/larsgw/ctj-factvis
Demo: https://larsgw.github.io/ctj-factvis

Getting these assertions from text, as I proposed, was harder. Tools I expected to find included in ContentMine software were nowhere to be found, but were planned, so actually implementing them myself did not seem a good use of my time. Luckily, the literature corpus does not actually contain as many statements about physical properties of conifers in plain text as I originally expected: most are in tables, figures or in supplementary files, leading me to use those instead. The nice thing is that one of the main focuses of the ContentMine is parsing tables from PDF, so this will definitely be of general use.

Other work

During the project and to explore the design of the ContentMine, additional related components were developed:

ctj: program to convert and re-order AMI data to JSON, making it easier to read in JavaScript (mainly good for web applications);
ctj-cardlists: program to view AMI JSON (see above) in a Web GUI (demo); and
Citation.js: added functionality to parse BibJSON (used for quickscrape output) into CSL, for further formatting. See blog post.

The first two simplified handling AMI output in the browser, while the third makes it easier to display references in common formats.

Dissemination

All source code of the project outcomes is available on GitHub:

Progress was communicated during the project via the ContentMine Discourse page, on my personal blog (~20 posts), and on the general ContentMining blog (2 long posts).

Future work

The developed pipeline works but is not perfect. The pipeline to parse tables mentioned above requires further generalisation. This defines some logical next steps and fixes:

Finally adding it as an NPM module, making it (way) easier for people to use it;
Making searching easier in the HTML GUI (will need work further upstream too). Currently the list of assertions is split into pieces, making it hard to find anything. This can be fixed with a search index;
Normalising table structures to support more designs, rendering assertion extraction more reliable;
Making the process of curating assertions and linking identifiers easier by linking more identifiers, and showing context, i.e. the original tables; and
Some small performance and UX things.

Another important thing, too big for a single bullet point, is annotating abbreviations and references in the document before extracting the tables. It's easier to curate statements like '[1] says this and this' when you know '[1]' references some known article. Another example: while a statement containing 'P. glauca' says nothing (there are 66+ species using that abbreviation), the article probably says which one it is somewhere outside the table, something that can be picked up if you annotate these before taking them out of context. Until that is in place, the interactive stepper form remains a necessity.
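A rough Python sketch of the abbreviation idea, assuming the article text is available as a plain string: collect the full binomials mentioned in the text, then expand an abbreviated name in a table cell only when exactly one full name fits. The function name expand_abbreviations and the sample sentences are made up for illustration, and the regular expressions are deliberately naive (they will also pick up ordinary capitalised words, which is harmless here because those rarely match the abbreviation).

```python
import re


def expand_abbreviations(article_text, cell):
    """Expand names like 'P. glauca' using full names found in article_text."""
    # Naive pattern: capitalised word followed by a lowercase word.
    full_names = re.findall(r"\b([A-Z][a-z]+) ([a-z]+)\b", article_text)

    def replace(match):
        initial, epithet = match.group(1), match.group(2)
        candidates = {
            f"{genus} {species}"
            for genus, species in full_names
            if genus.startswith(initial) and species == epithet
        }
        # Only expand when the article leaves no ambiguity.
        if len(candidates) == 1:
            return candidates.pop()
        return match.group(0)

    return re.sub(r"\b([A-Z])\. ([a-z]+)\b", replace, cell)


text = "We sampled Picea glauca and Pinus sylvestris stands."
print(expand_abbreviations(text, "P. glauca"))
```

When the text mentions two candidates with the same initial and epithet, or none at all, the cell is left unchanged, so the curator still has to decide.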

Evaluation

As noted, the work is far from done. Currently, it mainly shows a glimpse of what would be possible had I spent more time on writing code. Short conclusions: CTJ is unpolished and slow. Because of a lack of customisation options, such as what data to use, you will almost always need to write custom code to not have to include tons of unnecessary data in your resulting JSON.

CTJ-Cardlists is actually pretty nice. It is slow, and it does not really show relations, but it does show an interesting overview of the literature corpus, like how often species are mentioned and with what they are mentioned together most of the time. You can easily draw reasonable conclusions like how often species names are misspelled. However, it would be more useful for this to have SQL queries or something similar. CTJ-Factvis shows even more potential, with the Wikidata integration. I do need to pay more attention to the fact that those assertions are alleged facts, and not regular ones, as I called them in earlier blog posts.

Fellowship

In general, the fellowship went pretty well for me. In retrospect, I did a lot of the things I wanted to do, even though throughout the project it felt like there was so much left to do, and there is! I am really excited about the possibilities that emerged during the fellowship, and even in the last weeks. How cool would it be to extend this project with entire Web APIs and more? This is, for a big part, thanks to the support, feedback, and input of the amazing ContentMine team during the regular meetings, and the quick responses to various software issues. I also enjoyed blogging about my progress on my own blog and on the ContentMine blog.

Sunday, March 26, 2017

Citation.js: BibJSON

Citation.js now supports BibJSON. How did I do that without actually updating Citation.js? Well, apparently I supported it all along. I've supported the quickscrape output format since July last year, and that turned out to be BibJSON. How convenient. I'll update the demo and docs to reflect this revelation (currently it just says "quickscrape's JSON scheme"), and, now that I can find actual documentation, make some improvements to the parser. It's a good candidate for a new output format too.

Some side notes on updates v0.3.0-0 to v0.3.0-2: these are prerelease updates, making it possible to use code before I have fixed all the issues and added all the features I promised for version 0.3. These updates fixed a lot of file organization problems; next updates will restructure the Cite object and fix tests.

Sunday, November 27, 2016

Weekly Report 12: Including table facts in Wikidata

This week, the big achievement is the addition of a multi-step form to add the semantic triples from last week to Wikidata with QuickStatements, which we talked about before too. The new '+' icon in table rows now links to a page where you can curate the statement and add Wikidata IDs where necessary. At the last step, you get a table of the existing data, the added identifiers and soon their Wikidata label. There, you can open the statement in QuickStatements, where it will be added to Wikidata. This is a delicate procedure; it's important to keep an eye on the validity of the statement.

The multi-step form in action (see tweet)

Take the following table:

Fungus	Insect	Host	…
Fungus A	Insect 1	Host 1	…
Fungus B	Insect 2	Host 2	…
…	…	…	…

This is currently interpreted (by the program) as "Fungus A has Insect 1 as Insect and Host 1 as Host", while the actual meaning is "Fungus A is spread by Insect Insect 1, of which the Host is Host 1" (see this table). While stating Host 1 is the Host of Fungus A is arguably correct, this occurs in much worse ways as well, and it's hard to account for it.

On top of that, often table values don't have the exact data, but abbreviations. It shouldn't be too hard to annotate abbreviations with their meaning before parsing the tables, but actually linking them to Wikidata Entities is still a problem. This is, obviously, the next step, and I'll continue to work on it the next few weeks. One thing I could do is search for entities based on their label. A problem with that is that I would have to make thousands of calls to the Wikidata API in a short period of time, so I'll probably create a dictionary with one, big call, if the resulting size allows it.

There is a pull request pending that would solve the issue norma has with tables, so I'll incorporate that into the program as well, making it easier for me as some things will already be annotated.

Conclusion: the 'curating' part of adding the statements to Wikidata is currently *very* important, but I'll do as much as possible to make it easier in the next few weeks.

Sunday, November 13, 2016

Weekly Report 11: Making Object-Property-Value triples

In Weekly Report 10 I talked about searching for answers to the question "What height does a grown Pinus sylvestris normally have?". In the post, I looked at some of the articles returned by the query "Pinus sylvestris"[Abstract] AND height, and found interesting information in the tables. The next step was to extract this information.

So that's what I did. The first step was downloading fulltext XML files from the articles with getpapers. Secondly, I wanted to normalise the XML to sHTML with norma, but it currently has an issue with tables, so I resorted to parsing tables directly from the XML files. This wasn't really a downgrade. The main difference would be arbitrary XML tags instead of e.g. <i> and <sup>. Although it is more about layout, it still conveys meaning, e.g. the text content is probably a species name in the case of italics.

There are more problems, but I will go through these later. First off, how do I get the computer to gain information from tables? Well, generally, tables have the following structure:

Subject Type	Property 1	Property 2	…
Object A	Value A1	Value A2	…
Object B	Value B1	Value B2	…
…	…	…	…

A small program can convert this to the following JSON:

[
  {
    "name": "Object A",
    "type": "Subject Type",
    "props": [
      {
        "name": "Property 1",
        "value": "Value A1"
      },
      {
        "name": "Property 2",
        "value": "Value A2"
      }
    ]
  },
  {
    "name": "Object B",
    "type": "Subject Type",
    "props": [
      {
        "name": "Property 1",
        "value": "Value B1"
      },
      {
        "name": "Property 2",
        "value": "Value B2"
      }
    ]
  }
]

This is pretty easy. Real problems occur when tables don't use this layout. Some things can be accounted for. Take the example (see figure below) where tables don't have a head row. Because of this, I can't read the properties. The solution is to look in the previous table and copy the properties. Another case is when there are empty values in the left column. Usually, this implies that the row above contains all necessary information, and it is done as a way to group table rows. Again, the solution is doing what the reader would do: look at the text in the cell above.
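A minimal Python sketch of this conversion, including the two fixes just described. This is not the actual program; table_to_objects and its previous_header parameter are names invented for the example, and a table is assumed to arrive as a list of rows of cell strings.

```python
def table_to_objects(rows, previous_header=None):
    """Convert table rows to objects in the JSON layout shown above.

    If previous_header is given, the table is assumed to have no head row
    and the properties are copied from the previous table.
    """
    if previous_header is None:
        header, body = rows[0], rows[1:]
    else:
        header, body = previous_header, rows

    objects = []
    last_name = None
    for row in body:
        # Empty left cell: reuse the name from the row above.
        name = row[0].strip() or last_name
        last_name = name
        objects.append({
            "name": name,
            "type": header[0],
            "props": [
                {"name": prop, "value": value}
                for prop, value in zip(header[1:], row[1:])
            ],
        })
    return objects


rows = [
    ["Subject Type", "Property 1", "Property 2"],
    ["Object A", "Value A1", "Value A2"],
    ["", "Value A3", "Value A4"],  # illustrative grouped row
    ["Object B", "Value B1", "Value B2"],
]
objects = table_to_objects(rows)
```

In this sketch a row with an empty left cell becomes a separate object with the same name as the row above; merging its properties into the previous object would be an equally reasonable choice.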

Example of split tables. From Ethnoveterinary medicines used for ruminants in British Columbia, Canada (DOI:10.1186/1746-4269-3-11)

But there are things that can't (easily) be accounted for as well. For example when the table is too long for the page, but narrow enough for two tables to fit. Instead of aligning two tables beside each other, they break the table apart and stick it back together, but in the wrong place. A very common problem is when they transpose the table. The properties are now in the left column, and the objects in the top row. These problems aren't that hard to fix; the hard thing is for the computer to find out what table layout is currently used.
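To show that the fix itself is the easy part: once you know a table is transposed, flipping it back takes one line in Python. The sketch below assumes a rectangular table given as a list of rows; detecting that the table is transposed in the first place is not covered.

```python
def transpose(rows):
    """Swap rows and columns of a rectangular table."""
    return [list(column) for column in zip(*rows)]


# Transposed layout: properties in the left column, objects in the top row.
transposed = [
    ["Subject Type", "Object A", "Object B"],
    ["Property 1", "Value A1", "Value B1"],
    ["Property 2", "Value A2", "Value B2"],
]
normal = transpose(transposed)
for row in normal:
    print("\t".join(row))
```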

JSON processing

But let's drop these issues for a moment. Assume enough data IS valid. We now have the JSON containing triples. I wanted to transform this to a format similar to ContentMine facts, as used by factvis. Because the ContentMine fact format is only used for identified terms for now, I had to come up with some new parts. It currently looks like this:

{
  "_id":"AAAAAAB6H6",
  "_source": {
    "term":"Acer macrophyllum",
    "exact":"Acer macrophyllum Pursh (Aceraceae) JB043",
    "prop": {
      "name":"Local name"
    },
    "value":"Big leaf maple",
    
    "identifiers": {
      "wikidata":"Q599523"
    },
    "documentID":"AAAAAAB6H5",
    "cprojectID":"PMC1831764"
  }
}

We have the fact ID, just as with ContentMine facts, and the _source object, containing the actual data. The fact ID is unique for every object, not every triple. This is to preserve context. The _source has a documentID and a cprojectID. Both say what article the triple was found in. exact has the exact text found in the cell, while term has the extracted plant name, if present. Otherwise, it contains the cell's contents. prop contains the text found in the property cell, and in some cases other data such as the unit, if found. value contains the text in the value cell.

The property identifiers contains some IDs, if the term is identified. This is done by creating a dictionary of 1.8 million species names linked to Wikidata IDs, directly from Wikidata, and matching these with terms. Props, values and other types of terms will hopefully have identifiers soon too.
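As an illustration of the matching step, here is a Python sketch that builds a fact in the format above. It is not the actual implementation: make_fact is a made-up name, the one-entry dictionary stands in for the full species dictionary, and the assumption that a plant name is the binomial at the start of the cell is a simplification.

```python
import re

# Tiny stand-in for the dictionary of species names and Wikidata IDs.
species_ids = {"Acer macrophyllum": "Q599523"}


def make_fact(fact_id, exact, prop, value, document_id, cproject_id, dictionary):
    """Build a fact; term is the matched plant name or else the cell text."""
    term, identifiers = exact, {}
    # Assumption: a species name, if any, is the leading binomial.
    match = re.match(r"[A-Z][a-z]+ [a-z]+", exact)
    if match and match.group(0) in dictionary:
        term = match.group(0)
        identifiers = {"wikidata": dictionary[term]}
    return {
        "_id": fact_id,
        "_source": {
            "term": term,
            "exact": exact,
            "prop": {"name": prop},
            "value": value,
            "identifiers": identifiers,
            "documentID": document_id,
            "cprojectID": cproject_id,
        },
    }


fact = make_fact(
    "AAAAAAB6H6",
    "Acer macrophyllum Pursh (Aceraceae) JB043",
    "Local name",
    "Big leaf maple",
    "AAAAAAB6H5",
    "PMC1831764",
    species_ids,
)
```

Looking up one candidate string per cell keeps the matching cheap even with a very large dictionary, at the cost of missing names that appear elsewhere in the cell.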

Further processing

To quickly see the results, I made some changes to factvis to make it able to work for my new fact data. Here it is. It currently has three sample data sets, each with 5000 triples. I myself have ten of these sets, and I will release more when the data is more reliable; currently, it's just a demo.

Triple visualisation demo

However, we can already see the outlines of what is possible. Even if the data isn't going to be reliable, I can set up a service (with QuickStatements) for people to curate the data and add it to Wikidata with a few clicks. And if 50000 triples (some of which may be invalid) don't sound like enough to set up such a service, remember that they come from a simple query returning only 174 papers.

Sunday, October 9, 2016

Weekly Report 10: Visualising facts and asking questions

Earlier this week, tarrow published factvis, short for fact visualisation. I decided to have a go with the design, and I made this, in the style of cardlists. Note: If my version and tarrow's version of factvis look very similar, my changes are probably pushed to the master branch already.

Screenshot of my factvis design

The facts being visualised come from the ContentMine. It publishes facts about things related to zika, extracted from papers, on Zenodo. A fact has the following structure:

{
  "_index": "facts",
  "_type": "snippet",
  "_id": "AVdDntnH_8VqgcuJwvpW",
  "_score": 1,
  "_source": {
    "prefix": "icle-title>Mosquitos (Diptera: Culicidae) del ",
    "post": "</article-title><source>Entomol Vect</source><y",
    "term": "Uruguay",
    "documentID": "AVdDnq-oJ9hGurOzZIZE",
    "cprojectID": ["PMC4735964"],
    "identifiers": {
      "contentmine": "CM.wikidatacountry8",
      "wikidata": "Q77"
    }
  }
}

As you can see, it has a fact ID, and next to it the actual fact. The fact consists of the found term ("Uruguay"), the text before and after the term (prefix and post), the document it was found in, and identifiers, saying what the term actually means. The identifiers are a ContentMine ID and a Wikidata Entity ID.
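To make the structure concrete, here is a small Python sketch that reads the fact shown above and puts the term back into its surrounding text. The summarise function and the keys of its return value are names chosen for this example only.

```python
import json

FACT_JSON = """
{
  "_index": "facts",
  "_type": "snippet",
  "_id": "AVdDntnH_8VqgcuJwvpW",
  "_score": 1,
  "_source": {
    "prefix": "icle-title>Mosquitos (Diptera: Culicidae) del ",
    "post": "</article-title><source>Entomol Vect</source><y",
    "term": "Uruguay",
    "documentID": "AVdDnq-oJ9hGurOzZIZE",
    "cprojectID": ["PMC4735964"],
    "identifiers": {
      "contentmine": "CM.wikidatacountry8",
      "wikidata": "Q77"
    }
  }
}
"""


def summarise(fact):
    """Return the term, its Wikidata ID and the text it was found in."""
    source = fact["_source"]
    return {
        "term": source["term"],
        "wikidata": source["identifiers"].get("wikidata"),
        "articles": source["cprojectID"],
        # prefix and post are raw snippets, so XML fragments remain.
        "context": source["prefix"] + source["term"] + source["post"],
    }


summary = summarise(json.loads(FACT_JSON))
print(summary["term"], summary["wikidata"])
```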

That's all it is, for now. Still, it's pretty cool to distinguish special words and abbreviations from normal ones, and to link them with established identifiers like those from Wikidata.

Conifers

The second topic today is asking biological questions about conifers. Now that I know most parts of the ContentMine pipeline with all its extensions, I can start to think of what I want to learn about conifers with it. The first questions are simple ones, or at least ones with simple facts as answers. Take "What height does a grown Pinus sylvestris normally have?". I know the answer is the value of the property height of the tree, and that the value is measured in some length unit.

Now all I have to do is search for the answer. Not that easy, but doable. First, I see if there actually are papers about the height of trees under normal conditions. So let's search EUPMC with the following query:

"Pinus sylvestris"[Abstract] AND height

With this, it searches for articles with the exact text "Pinus sylvestris" in the abstract, and with the word "height" anywhere in the article. The first found article is, at first sight, a bit unclear as to whether it has an interesting answer, so let's move on to the second one. Remember, we are only taking a peek at what's inside. The second article, however, looks more promising. The first table already contains exactly what we're looking for, and more than that. Apart from the height of Pinus sylvestris species it also has the diameter, and all this for two other conifers as well.

The same goes for the third article. While the first table hasn't got height data, it does have the diameter of several species in separated age groups, not to mention the properties I hadn't even thought of, like bark crevice depth, and canopy cover.

(I tweeted about the fourth one, as there were some funny stylesheet issues)

And if only three papers yield so much, imagine what can be done with more papers. The search I showed had 78 results, and when combined with searches for all the other species, there should be hundreds of articles having answers to just one, simple question. And with the ContentMine, I can "read" all those articles, and collect and summarise all these facts, in a matter of hours. Of course, I'll need to make some specialised programs to perform exactly what I want to do, so that's exactly what I'm going to do the next months.

Monday, September 19, 2016

Weekly Report 9: More topics

This week I wanted to extend my program, with the lists of cards containing information. In the past few weeks, I made examples with topics such as conifers and zika. In the previous report I explained how I got the facts, and how other people could get them as well. But I felt it was a bit too complex, and mainly too messy. Executing nine previously unused commands while keeping an eye on the file paths, and changing the scripts as well: not user-friendly.

So I made a GitHub repository ctj-cardlists, where people can submit topics with pull requests, and where I can improve the HTML, CSS and JavaScript without having to update every single HTML file. For getting facts I made a Bash script, and a guide on how to use it and how to submit a topic when you have the facts.

Cardlist page with the topic conifers (code co1)

I expect any interested person to be able to add a dataset to view with the page. Any part of the page is linkable, and it is a nice way to quickly glance at a lot of articles, and see common recurrences. Even if it currently mainly means "most misspelled plant names" (Pinus masson... uh... massonsana... no, I got it. Pinus massoniaia!). It will become more interesting with new columns like in the example below.

I'll improve the page in the next weeks, for example search auto-completion and shorter loading times, and in the end extra features like new columns about e.g. popular genus-word combinations. One problem is that you currently can't scroll down much anymore. This is because it shortens browser rendering times a LOT, and a fix on my side isn't in place yet. Search results and subpages focusing on a specific article/genus/species work as expected (actually better, because the shorter loading times allow more cards in general). I'll try to fix the scrolling problem soon.

Sunday, September 11, 2016

Weekly Report 8: Visualising Zika articles

Last week I wanted to look into extracting more facts, and the relation between found species and compounds. This would be done by extending ami. However, it became clear there will be big improvements to ami in the future, and things like ChemicalTagger and OSCAR are planned to be implemented anyway. It's better to wait for those things to complete before extending it for my own purposes.

Instead I improved the card page for future use. I didn't have too much time to do stuff this week, so I mainly wanted to demonstrate how you could use it with other data.

Article page

Here it is. It's very similar, of course. It has the same design, and comparable data structure. Word clouds now work. You can view them by opening an article and clicking on "Click to load word cloud". It uses a custom API using cloud.js (repo, license). It works by providing URL parameters file (URL of file with ctj output structure containing word data) and pmcid (PubMed Central ID).
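As a small illustration of the two URL parameters, this Python sketch assembles such a request URL. The base URL and the data file URL are placeholders, not the real endpoint; only the parameter names file and pmcid come from the description above, and the PMCID is just an example value.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def wordcloud_url(base_url, data_file_url, pmcid):
    """Build a word cloud URL with the file and pmcid parameters."""
    query = urlencode({"file": data_file_url, "pmcid": pmcid})
    return f"{base_url}?{query}"


# Placeholder URLs; replace with wherever the API and data are hosted.
url = wordcloud_url(
    "https://example.org/wordcloud",
    "https://example.org/data/words.json",
    "PMC4735964",
)
params = parse_qs(urlsplit(url).query)
print(url)
```

urlencode takes care of escaping the nested URL, which would otherwise break the query string.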

I'll talk more about the process of getting data to display in a similar manner. Below is a command dump, but this doesn't cover custom programs. First you get papers, with getpapers. I used the query 'ABSTRACT:zika OR ABSTRACT:dengue OR ABSTRACT:spondweni'. There is nothing really special to this. ABSTRACT: helps ensure the article at least remotely covers the topic, and the other parts are just topics. You can replace this with anything you want. You can use the limit 500 for now.

Then, you take it through the ContentMine pipeline (i.e. norma and ami). You use the ami plugins ami2-species, ami2-words and ami2-sequence. This gives a file system as output, which you can convert to JSON with ctj. Now you minify the file size by removing all data you don't use with c05.js, which I'll document later. The file paths are hard-coded but if you stick to the file structure I've used in the command dump it should work. Finally, you change the file paths in card_c05.html to what you want.
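The minifying step can be pictured roughly as follows. This is not c05.js itself, just a minimal sketch of the idea, assuming the ctj output has the shape data.articles[pmcid].AMIResults; the metadata and sequence fields in the sample and the stripUnused helper are hypothetical.

```javascript
// Sketch: keep only the AMI result types a page actually uses.
// Field names other than articles / AMIResults / frequencies are assumptions.
function stripUnused(data, keepResults) {
  var out = { articles: {} };
  Object.keys(data.articles).forEach(function (pmcid) {
    var results = data.articles[pmcid].AMIResults || {};
    var kept = {};
    keepResults.forEach(function (name) {
      if (results[name] !== undefined) {
        kept[name] = results[name];
      }
    });
    out.articles[pmcid] = { AMIResults: kept };
  });
  return out;
}

var input = {
  articles: {
    PMC3830773: {
      metadata: { title: 'Example title' },
      AMIResults: {
        frequencies: [{ word: 'wood', count: 12 }],
        sequence: [{ exact: 'ACGT' }]
      }
    }
  }
};

// JSON.stringify without an indent argument gives the most compact text form.
var minified = JSON.stringify(stripUnused(input, ['frequencies']));
console.log(minified);
```

Listing what to keep, rather than what to drop, means new fields in the input never end up in the output by accident.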

To get the word cloud API working, you use c05-words.js. The file paths are hard-coded in this file as well, so look out for that. It may try to save a file in a directory that doesn't exist. I'll solve this sometime. Change the file path at line 208 to the output of c05-words.js, and you should be done... Note that you can't load files with a file:// protocol, so you may have to host it somewhere.

Commands used

Next week I'll probably add a better search function and similar things, and see if I can help with extending ami.

Sunday, September 4, 2016

Weekly Report 7: Interactive ContentMine output

This weekly report covers the past two weeks. I blogged twice last week, and I figured that was enough.

Last week I blogged about word clouds from ContentMine output. I also blogged about ctj. This week, I have combined both into interactive lists, as seen here and in the images below.

List overview. From left to right: articles, and genus/genera
and species that were mentioned in the articles.

Search results. Here one paper (doi:10.1186/1471-2164-13-589)

I made a NodeJS JSONtoJSON converter (here). It takes the ctj output, strips all information that I don't use, generates some lists, and outputs a minified JSON file. I load this in the HTML file and generate the "cards". I'll probably move that process to a NodeJS file as well. This will cause a larger file size, but hopefully a shorter loading time. I also need to make the scrolling more efficient; I don't need to load cards people don't view.

The "generate word cloud" button doesn't work yet, because it currently needs to load data from a file that's too big to put on GitHub efficiently. I'll fix this later.

In the next few weeks I'll fix the issues above and start to see how I can extract more "facts". Currently I only know where what is mentioned, where "what" is limited to species, genus, words, human genes, and regex matches. In the future I want to find metabolites, chemicals and the relation between these and conifer species.

Conifer taxonomy - Part 2

Continuation of this post.

I got an answer quite quickly (but after posting the previous post):

@larswillighagen @Wikipedia try The Plant List. That's widely accepted: https://t.co/5J0AKPsCzD
— R⓪ss Mounce (@rmounce) 20 augustus 2016

The Plant List marks what species are in what genus and family, and groups families in Major Groups, e.g. gymnosperms. It also marks synonyms. With a list of conifer species and the ContentMine output, I can determine which species are not conifers, and find how they interact with each other. Now I only have a list of species and genera, without any context.
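A minimal sketch of that filtering step could look like this. The genus lists are a small excerpt of the Pinidae listing from the previous post; a real run would load the full genus-to-family data from The Plant List, and would also need to resolve synonyms, which this sketch ignores. The names familyOf and splitByConifer are hypothetical.

```javascript
// Small excerpt of conifer genera per family, for illustration only.
var coniferGenera = {
  Pinaceae: ['Pinus', 'Picea', 'Abies', 'Larix'],
  Cupressaceae: ['Juniperus', 'Thuja', 'Sequoia'],
  Taxaceae: ['Taxus', 'Torreya']
};

// Invert to genus -> family for quick lookups.
var familyOf = {};
Object.keys(coniferGenera).forEach(function (family) {
  coniferGenera[family].forEach(function (genus) {
    familyOf[genus] = family;
  });
});

// Split binomial names into conifers and everything else,
// based on the genus (the first word of the name).
function splitByConifer(speciesNames) {
  var conifers = [];
  var others = [];
  speciesNames.forEach(function (name) {
    var genus = name.split(' ')[0];
    var isConifer = Object.prototype.hasOwnProperty.call(familyOf, genus);
    (isConifer ? conifers : others).push(name);
  });
  return { conifers: conifers, others: others };
}

var split = splitByConifer(['Pinus sylvestris', 'Hylobius abietis', 'Picea abies']);
console.log(split);
```

In this example the two trees end up in conifers and the pine weevil in others, which is the kind of split needed before looking at how the two groups interact.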

The site does not really answer the question we asked in the previous post, of how families are grouped inside the gymnosperm group, but I seem to have been mistaken on that part anyway. The page about gymnosperms does have some different ideas about how families are divided, but the Pinophyta page does not state that it isn't part of the gymnosperms. It just says it with different words. It seems to say that it is part of one of gymnosperms and angiosperms, not the group containing both.

Pine (own photo)

Anyway, that is not really important. I now have the information I want, on how species are grouped in genera and families. That is what I need, at least for now. I don't have to focus on things outside the conifer families, so I do not think I really have to know how they are divided. This concludes our quest for data for now, see my other blog posts for progress with the project.

Monday, August 29, 2016

Word Clouds

Yesterday I published a blogpost, where I talked about ctj and how and why to convert ContentMine's CProjects to JSON. At the end, I mentioned this post, where I would talk about how to use it in different programs, and with d3.js. So here we go. For starters, let's make the data about word frequencies look nice. Not readable (then we would use a table), but visually pleasing. Let's make a word cloud. Skip to the part where I talk about converting the data.

Figure 1: Word Cloud (see text below)

Most Google results for "d3 js word cloud" point to cloud.js (repo, license). The problem was, I could not find the source code. Both index.js and build/d3.layout.cloud.js use require() in one of the first lines, and therefore I assumed it was intended for NodeJS.

Figure 2: Different font size scalings: log n, √n, n
(where n is the number of occurrences of a word)

So I started looking for a web version. I decided to copy cloud.min.js from the demo page, unminify it, and try to customise it for my own use. Which was a huge pain. Almost all the variable names were shortened to single letters, and the whole code was a mess, due to the minifying process. Figuring out where to put parameters like formulas on how to scale font size (see figure 2), possible angles for words, etc. was the most troubling, as I wanted constants and the code took them from the form at the bottom of the demo page.
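The three font size scalings from figure 2 can be written as plain functions of the number of occurrences n. The sketch below is only meant to show how they differ; the multipliers are arbitrary illustration values, not the ones used for the figures, and scalings and fontSizes are hypothetical names.

```javascript
// Three ways to turn a word count n into a font size (cf. figure 2).
// The multipliers are arbitrary illustration values.
var scalings = {
  log: function (n) { return 10 * Math.log(n + 1); },
  sqrt: function (n) { return 5 * Math.sqrt(n); },
  linear: function (n) { return n; }
};

// Apply one scaling to a list of { text, count } objects.
function fontSizes(words, scaling) {
  return words.map(function (w) {
    return { text: w.text, size: scaling(w.count) };
  });
}

var sample = [{ text: 'wood', count: 100 }, { text: 'bark', count: 4 }];
console.log(fontSizes(sample, scalings.sqrt));
```

With counts of 100 and 4, the linear scaling makes the first word 25 times as large as the second, the square root 5 times, and the logarithm less than 3 times, which is why the logarithmic cloud looks the most even.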

Here is the static result. For the live demo you need the input data and a localhost. Here is a guide on how to get the input data. To apply it, change the path on line 17 and change the PMCID on line 19 to the one you want to show. Of course, this needs to be one of an article that exists in your data. For jQuery to be able to fetch the JSON, you need a server-like thing, because local files fetching local files on the same location still counts as not the same domain.

Figure 3: See paragraph right

After a while I noticed build/d3.layout.cloud.js works in the web as well, so I used that. Here is the static result. As you can see I need to tweak a few parameters, which is more difficult than it seems. I'll optimise this and post about it later.

Now, the interesting part. When you finish making a design, you want to feed it words. We fetch the words from the output of ctj. I did it with jQuery as that is what I normally use. In the callback function, we get the word frequencies of a certain article (data.articles[ "PMC3830773" ].AMIResults.frequencies) and change the format to one cloud.js can handle more easily. This can be anything, but you need to specify what (e.g. here), and it is probably better to remove all data that will not be used. Then we add the words to the cloud (layout) and start generating the visual output with .start().

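$.get( 'path/to/data.json', function( data ){

  // Word frequencies of one article, as found in the ctj output
  var frequencies = data.articles[ "PMC3830773" ].AMIResults.frequencies
      // Convert to the { text, size } objects used for the cloud
    , words       = frequencies.map( function ( v ) {
                      return { text: v.word, size: v.count }
                    } )

  // Add the words to the layout and start generating the cloud
  layout.words(words).start();

});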

Now that it is generated, we can see what words, stripped from common words such as prepositions and pronouns, are used most. The articles are about pines, so we see lots of words confirming that. "conifers", "wood", etc. We also notice some errors, like "./--" and "[].", not recognising punctuation ("wood", "Wood", "wood." and "wood,"), and CSS (?!): {background, #ffffff;} and px;}. These are all problems with ContentMine's ami2-word plugin and will be fixed. No worries.
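Until that is fixed upstream, some of these duplicates could be merged on the client before building the cloud. The following is a minimal sketch of such a workaround, not part of ami or ctj; normaliseFrequencies is a hypothetical helper, and it only handles ASCII letters and digits.

```javascript
// Sketch: merge counts of tokens that differ only in case or
// surrounding punctuation, and drop tokens without any letters.
function normaliseFrequencies(frequencies) {
  var totals = Object.create(null); // no inherited keys such as "constructor"
  frequencies.forEach(function (entry) {
    var word = entry.word
      .toLowerCase()
      .replace(/^[^a-z0-9]+|[^a-z0-9]+$/g, ''); // trim non-alphanumerics at both ends
    if (!/[a-z]/.test(word)) { return; }         // skips tokens like "./--" and "[]."
    totals[word] = (totals[word] || 0) + entry.count;
  });
  return Object.keys(totals).map(function (word) {
    return { word: word, count: totals[word] };
  });
}

var merged = normaliseFrequencies([
  { word: 'wood', count: 3 },
  { word: 'Wood', count: 2 },
  { word: 'wood.', count: 1 },
  { word: 'wood,', count: 1 },
  { word: './--', count: 4 },
  { word: '[].', count: 2 }
]);
console.log(merged);
```

This merges the four spellings of "wood" into one entry and drops the punctuation-only tokens. It does not catch CSS residue such as px;}, which would simply be trimmed to "px".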

More examples on how to use CProject data as JSON coming soon. Perhaps popular word combinations.

Example articles used:

Figure 1 and 2: Sequencing of the needle transcriptome from Norway spruce (Picea abies Karst L.) reveals lower substitution rates, but similar selective constraints in gymnosperms and angiosperms (10.1186/1471-2164-13-589)
Figure 3 and code block: The Transcriptomics of Secondary Growth and Wood Formation in Conifers (10.1155/2013/974324)

Sunday, August 28, 2016

CProject to JSON

I changed the "JSON maker" I used to convert CProjects to JSON last week to be useful for a more general use case, being more user-friendly and having more options, like grouping by species and minifying. It's now called ctj (CProject to JSON), although that name may be changed to something more clear or appropriate. The GitHub repository can be found here.

ctj

CProjects are the output of the ContentMine tools. The output is a directory of directories with XML files, JSON files, and a directory with more XML files, some of which may be empty. ctj converts all this into one JSON file, or several, if you want. Here is a guide on how to use it.

Because it is JSON now, it can easily be used in different programs, or by websites. It is easy to use with e.g. d3.js, which I did, and I will blog more about it soon. This will probably be the link.

Sunday, August 21, 2016

Weekly Report 6: ContentMine output to JSON to HTML

The "small program" proved more of a challenge than it seemed. Making a program to generate the JSON (link) was fairly easy. Loop through directories, find files, loop through files, collect XML data, save all collected data as JSON in a file. It took a while, but I think I spent most of the time setting up the logistics, i.e. a nice logger, a file system reader and an argument processor.
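The loop described above can be sketched in a few lines of NodeJS. This is a simplified illustration and not the actual program: it stores the raw text of each file instead of parsing the XML, and the collect function and the paths in the usage comment are hypothetical.

```javascript
var fs = require('fs');
var path = require('path');

// Sketch: walk a directory tree and collect the raw text of every
// .xml and .json file into one nested object mirroring the tree.
function collect(dir) {
  var result = {};
  fs.readdirSync(dir).forEach(function (name) {
    var full = path.join(dir, name);
    if (fs.statSync(full).isDirectory()) {
      result[name] = collect(full);              // recurse into subdirectories
    } else if (/\.(xml|json)$/.test(name)) {
      result[name] = fs.readFileSync(full, 'utf8');
    }
  });
  return result;
}

// Usage (paths are placeholders):
// fs.writeFileSync('output.json', JSON.stringify(collect('path/to/cproject')));
```

The synchronous fs calls keep the sketch short; for a few hundred papers that is usually fine, but a larger project might call for the asynchronous variants.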

The generated JSON was around 11 MB for 250 papers, so I didn't put it on GitHub, but it's fairly easy to reproduce. Here's a step-by-step guide. After you generate the data, put the JSON file and html/card_c03.html in your localhost (the HTML can't load the JSON if you don't) and open the latter in a browser, preferably Chrome/Chromium (I haven't tested it in other browsers). You may need to change the file path at line 459 to the place where you stored the JSON file, but this shouldn't be too much of a problem. Also, the content in the columns is capped to 50 items per column. You can change this for papers, genus and species respectively at lines 327, 369, and 419.

If you don't have the time to reproduce the data, here is a static demo (GUI under development). Click to expand the "cards" (items). The items are again capped at 50 per column. The papers are sorted on PMCID, the genus and species in order of appearance in the papers. The extra information at the bottom of the cards in the column of species and genus are in what papers they are mentioned, and how often. The info at the bottom of the article cards should be self-explanatory.
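The ordering and the "mentioned in which papers, and how often" information could be derived along these lines. This is a sketch under assumptions, not the code behind the demo: the per-article species array is an assumed input shape, and sortedPmcids and mentionIndex are hypothetical helpers.

```javascript
var CAP = 50; // items shown per column, as in the demo

// Order PMCIDs by their numeric part, so PMC999 comes before PMC1000.
function sortedPmcids(articles) {
  return Object.keys(articles).sort(function (a, b) {
    return Number(a.slice(3)) - Number(b.slice(3));
  });
}

// Build species -> { pmcid: mention count }, keeping order of first appearance.
// The per-article "species" array is an assumed input shape.
function mentionIndex(articles) {
  var index = new Map();
  sortedPmcids(articles).forEach(function (pmcid) {
    articles[pmcid].species.forEach(function (name) {
      if (!index.has(name)) { index.set(name, {}); }
      var counts = index.get(name);
      counts[pmcid] = (counts[pmcid] || 0) + 1;
    });
  });
  return index;
}

var demo = {
  PMC1000: { species: ['Picea abies'] },
  PMC999: { species: ['Pinus sylvestris', 'Picea abies', 'Pinus sylvestris'] }
};
var index = mentionIndex(demo);
var firstColumn = Array.from(index.keys()).slice(0, CAP);
console.log(firstColumn);
```

A Map is used because it remembers insertion order, which gives the "order of appearance in the papers" for free.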

Current GUI

Finishing the GUI will take longer than making the JSON, mostly since CSS can be pretty annoying when you're trying to make nice things without too much JavaScript. I'll have to rethink the design of the cards because things don't fit now, find a way to display the columns more nicely, and much more. All this might take a while, as there are lots of features I would like to try to implement.

The blogpost about Activation of defence pathways in Scots pine bark after feeding by pine weevil (Hylobius abietis) is postponed. I'll work on it when I'm done with the project above.

Friday, August 19, 2016

Conifer taxonomy

Recently, I tried to find out the exact taxonomy of conifers. I knew that a few years earlier, when I was actively working with it, there were a few issues on Wikipedia concerning the grouping of the main conifer families, namely Araucariaceae, Cephalotaxaceae, Cupressaceae, Pinaceae, Podocarpaceae, Sciadopityaceae, Taxaceae, and actually the grouping of genera in families as well. Guess what changed: not much, not on Wikipedia anyway. The disagreement between Wikipedia pages in different languages is one thing, but the English pages were contradicting each other pretty heavily. Not to mention the position of gymnosperms in the plant hierarchy, which looks even worse. Some examples:

Subclass Cycadidae

Order Cycadales

Family Cycadaceae: Cycas

Family Zamiaceae: Dioon, Bowenia, Macrozamia, Lepidozamia, Encephalartos, Stangeria, Ceratozamia, Microcycas, Zamia.

Subclass Ginkgoidae

Order Ginkgoales

Family Ginkgoaceae: Ginkgo

Subclass Gnetidae

Order Welwitschiales

Family Welwitschiaceae: Welwitschia

Order Gnetales

Family Gnetaceae: Gnetum

Order Ephedrales

Family Ephedraceae: Ephedra

Subclass Pinidae

Order Pinales

Family Pinaceae: Cedrus, Pinus, Cathaya, Picea, Pseudotsuga, Larix, Pseudolarix, Tsuga, Nothotsuga, Keteleeria, Abies

Order Araucariales

Family Araucariaceae: Araucaria, Wollemia, Agathis

Family Podocarpaceae: Phyllocladus, Lepidothamnus, Prumnopitys, Sundacarpus, Halocarpus, Parasitaxus, Lagarostrobos, Manoao, Saxegothaea, Microcachrys, Pherosphaera, Acmopyle, Dacrycarpus, Dacrydium, Falcatifolium, Retrophyllum, Nageia, Afrocarpus, Podocarpus

Order Cupressales

Family Sciadopityaceae: Sciadopitys

Family Cupressaceae: Cunninghamia, Taiwania, Athrotaxis, Metasequoia, Sequoia, Sequoiadendron, Cryptomeria, Glyptostrobus, Taxodium, Papuacedrus, Austrocedrus, Libocedrus, Pilgerodendron, Widdringtonia, Diselma, Fitzroya, Callitris (incl. Actinostrobus and Neocallitropsis), Thujopsis, Thuja, Fokienia, Chamaecyparis, Callitropsis, Cupressus, Juniperus, Xanthocyparis, Calocedrus, Tetraclinis, Platycladus, Microbiota

Family Taxaceae: Austrotaxus, Pseudotaxus, Taxus, Cephalotaxus, Amentotaxus, Torreya

Subclasses of the division gymnosperms, according to the Wikipedia page of gymnosperms

Ginkgo biloba leaves (source: inbetweenbays, CC-BY 4.0, via iNaturalist)

In the text above, we see three orders (all in the subclass Pinidae) containing "traditional" conifers, and three other subclasses containing related species. A different page, however, talks about a division called Pinophyta, and a class called Pinopsida, where all "traditional" conifers are located. Recapping, the division gymnosperms contains conifers AND some other things like the Ginkgo and cycads. The division Pinophyta contains ONLY conifers. This could be possible, if Pinophyta was a part of gymnosperms, but they're both divisions, so, as far as I know, they should be on the same taxon level. And the Wikipedia pages do not indicate anything else.

Now, my knowledge of taxonomy may not be perfect, but this doesn't seem right. So, I tweeted Ross Mounce, who has been busy with making phylogenetic trees.

.@rmounce The English @Wikipedia is pretty ambiguous about conifer taxonomy. Where can I find the internationally most accepted standard?
— Lars Willighagen (@larswillighagen) 19 augustus 2016

To be continued...

Wednesday, August 17, 2016

Weekly Report 5: This is what pine literature looks like

(yes, three days late)

This week I wanted to catch up on all the things that had happened while I was on holiday. I have finished my introductory blogpost on ContentMine's blog, and I made this blog and transferred all the weekly reports from the GitHub Wiki to here.

Next week will be more interesting. Firstly, I will publish a blogpost containing a short analysis of the article I mentioned in an earlier blogpost, namely Activation of defence pathways in Scots pine bark after feeding by pine weevil (Hylobius abietis). There I'll talk about this article as an example of how similar articles could be used to extract data, and create a network of facts.

Large Pine Weevil (Hylobius abietis), mentioned in the article (source: Stanislav Snäll, CC BY 3.0, via Wikimedia Commons)

Secondly, I will make a small program that visualises data outputted by the ContentMine pipeline. This will basically be a small version of the program to make the database, omitting the most important and difficult part, finding relations between facts. This is mainly to test the infrastructure and see what things need to be improved before starting to build the big program.