Syntaxus baccata: September 2017

Monday, September 4, 2017

ctj rdf: Relations Between Conifers Mentioned in Articles

Below is part two of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.

Here’s a more detailed example. We are using my dataset, available at 10.5281/zenodo.845935. It was generated from 1000 articles that mention ‘Pinus’ somewhere. This one has 15326 statements, of which 8875 (57.9%) can be mapped, taking ~50 seconds. Now, for example, we can list the top 100 most mentioned species. The top 20:

Species	Hits
Pinus sylvestris	248
Picea abies	177
Pinus taeda	138
Pinus pinaster	120
Pinus contorta	96
Arabidopsis thaliana	91
Picea glauca	77
Pinus radiata	77
Pinus massoniana	72
Pseudotsuga menziesii	65
Oryza sativa	56
Pinus halepensis	56
Pinus ponderosa	55
Pinus banksiana	53
Pinus koraiensis	53
Picea mariana	52
Pinus nigra	51
Pinus strobus	46
Quercus robur	45
Fagus sylvatica	45
…	…

The top of the list isn’t surprising: mostly pines, other conifers, other trees, Arabidopsis thaliana which I’ve seen represented in pine literature before, and Oryza sativa or rice, which I haven’t seen before in this context.

Note that only 248 of the 1000 articles mention the top Pinus species. This may be because the query getting the articles was quite broad. Also note that this doesn’t take into account how often an article mentions a species; a caveat of the current rdf output.
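To make that caveat concrete, here is a minimal sketch of how such a hit count behaves. It is not the actual query behind the table: the `articles` mapping and the function name are hypothetical, and the point is only that repeated mentions inside one article collapse to a single hit.

```python
from collections import Counter


def count_article_hits(article_species):
    """Count, per species, the number of articles that mention it at least once."""
    hits = Counter()
    for species in article_species.values():
        # A set collapses repeated mentions within one article to a single hit.
        hits.update(set(species))
    return hits


# Hypothetical toy input: article ID -> species names found in that article.
articles = {
    "PMC0000001": ["Pinus sylvestris", "Pinus sylvestris", "Picea abies"],
    "PMC0000002": ["Pinus sylvestris"],
    "PMC0000003": ["Picea abies", "Pinus taeda"],
}

hits = count_article_hits(articles)
for name, n in hits.most_common():
    print(name, n)
```

Counting raw mentions instead would mean dropping the `set(...)` call, but that only helps if the underlying data records every mention, which the current rdf output does not.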

Going off this list, we can then look at which non-tree or even non-plant species are named most often in conjunction with a given species, or, in this case, a genus. Top 20:

Species 1	Species 2	Co-occurrences
Picea abies	Pinus sylvestris	98
Picea abies	Pinus taeda	56
Arabidopsis thaliana	Pinus taeda	47
Picea glauca	Pinus taeda	43
Arabidopsis thaliana	Oryza sativa	43
Pinus pinaster	Pinus taeda	41
Pinus pinaster	Pinus sylvestris	41
Picea abies	Picea glauca	41
Arabidopsis thaliana	Picea abies	37
Pinus contorta	Pinus sylvestris	36
Betula pendula	Pinus sylvestris	36
Pinus sylvestris	Pinus taeda	36
Pinus nigra	Pinus sylvestris	35
Pinus contorta	Pseudotsuga menziesii	32
Picea abies	Pinus contorta	31
Picea abies	Pinus pinaster	30
Arabidopsis thaliana	Physcomitrella patens	30
Oryza sativa	Pinus taeda	29
Pinus sylvestris	Quercus robur	29
Picea abies	Picea sitchensis	28
…	…	…
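A co-occurrence count like the one in this table can be sketched in a few lines. This is an illustration under assumptions, not the SPARQL query that was actually used: the input mapping and the function name are hypothetical. Each unordered pair is stored in alphabetical order, which is also how the pairs in the table happen to be listed.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(article_species):
    """Count in how many articles each unordered pair of species appears together."""
    pairs = Counter()
    for species in article_species.values():
        # Sorting gives every unordered pair one canonical (alphabetical) form,
        # so (A, B) and (B, A) are never counted separately.
        for a, b in combinations(sorted(set(species)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical toy input: article ID -> species names found in that article.
articles = {
    "PMC0000001": ["Pinus sylvestris", "Picea abies", "Betula pendula"],
    "PMC0000002": ["Picea abies", "Pinus sylvestris"],
    "PMC0000003": ["Pinus taeda", "Arabidopsis thaliana"],
}

pairs = count_cooccurrences(articles)
for (a, b), n in pairs.most_common():
    print(a, b, n, sep="\t")
```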

Interesting to see that rice is mostly mentioned with Arabidopsis. Let’s explore that further. Below are species named in conjunction with Oryza sativa.

Species	Co-occurrences
Arabidopsis thaliana	43
Pinus taeda	29
Picea abies	24
Physcomitrella patens	23
Populus trichocarpa	21
Glycine max	20
Vitis vinifera	17
Picea glauca	17
Pinus pinaster	16
Selaginella moellendorffii	15
Pinus sylvestris	13
Triticum aestivum	12
Pinus contorta	10
Picea sitchensis	10
Ginkgo biloba	10
Pinus radiata	10
Ricinus communis	9
Amborella trichopoda	9
Medicago truncatula	9
Cucumis sativus	8
…	…
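Restricting the count to one species of interest is a small variation on the same idea. The sketch below is again hypothetical (toy data, made-up function name) and only shows the shape of the computation: take the articles that mention the target, then count everything else found in them.

```python
from collections import Counter


def partners_of(target, article_species):
    """Count, per species, the articles in which it appears alongside `target`."""
    partners = Counter()
    for species in article_species.values():
        found = set(species)
        if target in found:
            # Every other species in this article co-occurs with the target once.
            partners.update(found - {target})
    return partners


# Hypothetical toy input: article ID -> species names found in that article.
articles = {
    "PMC0000001": ["Oryza sativa", "Arabidopsis thaliana", "Pinus taeda"],
    "PMC0000002": ["Oryza sativa", "Arabidopsis thaliana"],
    "PMC0000003": ["Pinus taeda", "Picea abies"],
}

partners = partners_of("Oryza sativa", articles)
for name, n in partners.most_common():
    print(name, n, sep="\t")
```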

So attention seems divided between trees and more agriculture-related plants. More to explore for later.

View all posts in this series.

Sunday, September 3, 2017

ctj rdf: Part One

Below is part one of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.

ctj has been around for longer, and started as a way to learn my way into the ContentMine pipeline of tools, but turned out to uncover a lot of possibilities in further processing the output of this pipeline (1, 2).

The recent addition of ctj rdf expands on this. While there is a lot of data loss between the ContentMine output and the resulting rdf, the possibilities certainly are no less. This is mainly because of SPARQL, which makes it possible to integrate with other databases, such as Wikidata, without many changes in ctj rdf itself.

Here’s a simple demonstration of how this works:

We download 100 articles about aardvark (classic ContentMine example)
We run the ContentMine pipeline (norma, ami2-species, ami2-sequence)
We run ctj rdf

This generates data.ttl, which holds the following information:

Common identifier for each article (currently PMC)
Matched terms from each article (which terms/names are found in which article)
Type of each term (genus, binomial, etc.)
Label of each term (matched text)

Example data.ttl contents

Note that there are no links to Wikidata whatsoever. When we list, for instance, how often each term is mentioned in an article (in the dataset), we only have string values, an identifiers.org URI and some custom namespace URIs.

However, in this format, we can easily use the information in these papers in conjunction with the enormous amount of data in Wikidata with Federated Queries.

To accomplish this, we first link the identifier in our dataset to the ones in Wikidata; then we link the matched text of the term to the taxon name of the species in Wikidata. This alone already gives us a set of semantic triples where both values in every triple are linked to values in the extensive community-driven database that is Wikidata.
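As a rough illustration of the second link (matched text to taxon name), here is a sketch that only builds a federated query string. The `ex:` namespace and the `ex:mentions` and `ex:label` predicates are placeholders invented for this example and are not the vocabulary ctj rdf actually emits; the Wikidata side uses `wdt:P225` (taxon name) and the public query service endpoint. The article-identifier link would work analogously and is left out here.

```python
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_species_query(endpoint=WIKIDATA_ENDPOINT, limit=100):
    """Return a federated SPARQL query that maps matched labels to Wikidata taxa.

    The ex: namespace and its predicates are hypothetical stand-ins for
    whatever the local rdf really uses.
    """
    return f"""
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX ex: <http://example.org/ctj#>

SELECT ?taxon ?label (COUNT(DISTINCT ?article) AS ?articles)
WHERE {{
  # Local data: which article mentions which term (hypothetical predicates).
  ?article ex:mentions ?term .
  ?term ex:label ?label .
  # Remote data: match the label against Wikidata's taxon name (P225).
  SERVICE <{endpoint}> {{
    ?taxon wdt:P225 ?label .
  }}
}}
GROUP BY ?taxon ?label
ORDER BY DESC(?articles)
LIMIT {limit}
"""


query = build_species_query(limit=20)
print(query)
```

The string would be run against whichever SPARQL endpoint holds data.ttl; the `SERVICE` clause is what hands the taxon-name lookup over to Wikidata.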

Example query, counting how often each species is mentioned, and mapping them to Wikidata

Results of the above query

Now say we want to list the Swedish name of each of those species. Well, we can, because that info probably exists on Wikidata (see stats below). And if we can’t find something, remember that each of those Wikidata values is also linked to numerous other databases.

Again, this is without having to change anything in the rdf output (to be fair, I forgot to list an article identifier in the first version of the program, but that could/should have been anticipated). Not having to add this data to the output has the added benefit of not having to make and maintain local dictionaries and lists of this data.

Some stats:

Number of articles: 100 (for reference)
Number of ‘term found in article’ statements: 1964
Number of those statements that map to Wikidata: 1293 (65.8% of total)
Number of mapped statements with Swedish labels: 1056 (81.7% of mapped statements, 53.8% of total)
Average number of statements per article: 19.64, 12.93 mapped
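
The percentages and averages in this list follow directly from the four counts. A minimal sketch of the arithmetic, with a made-up helper name, for anyone who wants to rerun it on their own dataset:

```python
def share(part, whole):
    """Return `part` as a percentage of `whole`, rounded to one decimal."""
    return round(100 * part / whole, 1)


# Counts taken from the list above.
articles = 100
statements = 1964
mapped = 1293
swedish = 1056

print("mapped, % of total:", share(mapped, statements))
print("Swedish label, % of mapped:", share(swedish, mapped))
print("Swedish label, % of total:", share(swedish, statements))
print("statements per article:", statements / articles)
print("mapped statements per article:", mapped / articles)
```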

Note that not all terms are actually valid. A lot of genus matches are actually just capitalised words, and a lot of common species names are abbreviated, e.g. to E. coli, making it impossible to unambiguously map to Wikidata or any other database. This could explain the difference between found ‘terms’ and mapped terms.
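
To get a feel for how many terms fall into these problem categories, the labels could be sorted into rough buckets before mapping. The sketch below is an assumption-laden illustration, not part of ctj rdf: the regular expressions are deliberately simple and the category names are made up.

```python
import re

# Abbreviated binomial such as "E. coli": one capital, a period, then an epithet.
ABBREVIATED = re.compile(r"^[A-Z]\.\s?[a-z]+$")
# Full binomial such as "Pinus sylvestris".
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]+$")
# Single capitalised word: may be a genus, may be an ordinary word.
CAPITALISED = re.compile(r"^[A-Z][a-z]+$")


def classify_term(label):
    """Put a matched label into a rough, hypothetical category."""
    if ABBREVIATED.match(label):
        return "abbreviated"
    if BINOMIAL.match(label):
        return "binomial"
    if CAPITALISED.match(label):
        return "capitalised word"
    return "other"


for label in ["E. coli", "Pinus sylvestris", "Results", "Pinus"]:
    print(label, "->", classify_term(label))
```

Note that "Results" and "Pinus" land in the same bucket: from the label alone there is no way to tell a genus from any other capitalised word, which is exactly the ambiguity described above.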

View all posts in this series.