Below is part two of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.
Here’s a more detailed example, using my dataset available at 10.5281/zenodo.845935. It was generated from 1000 articles that mention ‘Pinus’ somewhere. It contains 15326 statements, of which 8875 (57.9%) can be mapped, which takes ~50 seconds. Now, for example, we can list the top 100 most mentioned species. The top 20:
Species | Hits |
---|---|
Pinus sylvestris | 248 |
Picea abies | 177 |
Pinus taeda | 138 |
Pinus pinaster | 120 |
Pinus contorta | 96 |
Arabidopsis thaliana | 91 |
Picea glauca | 77 |
Pinus radiata | 77 |
Pinus massoniana | 72 |
Pseudotsuga menziesii | 65 |
Oryza sativa | 56 |
Pinus halepensis | 56 |
Pinus ponderosa | 55 |
Pinus banksiana | 53 |
Pinus koraiensis | 53 |
Picea mariana | 52 |
Pinus nigra | 51 |
Pinus strobus | 46 |
Quercus robur | 45 |
Fagus sylvatica | 45 |
… | … |
The top of the list isn’t surprising: mostly pines, other conifers, other trees, Arabidopsis thaliana, which I’ve seen represented in pine literature before, and Oryza sativa, or rice, which I haven’t seen before in this context.
Note that only 248 of the 1000 articles mention the top Pinus species. This may be because the query getting the articles was quite broad. Also note that the hit counts don’t take into account how often an article mentions a species: each article counts once per species, however many times it names it. This is a caveat of the current rdf output.
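To make that caveat concrete, here is a minimal Python sketch of article-level counting. It is not the query used for the table above; the article IDs, the mention lists and the `article_hits` helper are all made up for illustration, and the only assumption is that a hit means "one article mentions the species at least once".

```python
from collections import Counter

# Hypothetical toy data: article ID -> every species mention found in it,
# repeats included. IDs and mentions are invented for illustration.
mentions = {
    "article-1": ["Pinus sylvestris", "Pinus sylvestris", "Picea abies"],
    "article-2": ["Pinus sylvestris", "Arabidopsis thaliana"],
    "article-3": ["Picea abies"],
}


def article_hits(mentions):
    """Count, per species, the number of distinct articles mentioning it."""
    hits = Counter()
    for species_list in mentions.values():
        # set() collapses repeated mentions within a single article,
        # so each article contributes at most one hit per species.
        hits.update(set(species_list))
    return hits


hits = article_hits(mentions)
for species, count in hits.most_common():
    print(species, count)
```

In this toy data Pinus sylvestris is mentioned three times in total but only gets two hits, because two of those mentions are in the same article.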
Going off this list, we can then look at which non-tree or even non-plant species are named most often in conjunction with a given species, or, in this case, a genus. Top 20:
Species 1 | Species 2 | Co-occurrences |
---|---|---|
Picea abies | Pinus sylvestris | 98 |
Picea abies | Pinus taeda | 56 |
Arabidopsis thaliana | Pinus taeda | 47 |
Picea glauca | Pinus taeda | 43 |
Arabidopsis thaliana | Oryza sativa | 43 |
Pinus pinaster | Pinus taeda | 41 |
Pinus pinaster | Pinus sylvestris | 41 |
Picea abies | Picea glauca | 41 |
Arabidopsis thaliana | Picea abies | 37 |
Pinus contorta | Pinus sylvestris | 36 |
Betula pendula | Pinus sylvestris | 36 |
Pinus sylvestris | Pinus taeda | 36 |
Pinus nigra | Pinus sylvestris | 35 |
Pinus contorta | Pseudotsuga menziesii | 32 |
Picea abies | Pinus contorta | 31 |
Picea abies | Pinus pinaster | 30 |
Arabidopsis thaliana | Physcomitrella patens | 30 |
Oryza sativa | Pinus taeda | 29 |
Pinus sylvestris | Quercus robur | 29 |
Picea abies | Picea sitchensis | 28 |
… | … | … |
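For readers who want to see what a co-occurrence count like this amounts to, below is a rough Python sketch, again on made-up data rather than the actual dataset or query. It assumes a co-occurrence means "both species are mentioned in the same article"; `co_occurrences` and `partners` are hypothetical helper names. Pairs are stored in alphabetical order, which matches how the rows in the table above happen to be ordered.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: article ID -> set of species mentioned in it.
articles = {
    "article-1": {"Pinus taeda", "Arabidopsis thaliana", "Oryza sativa"},
    "article-2": {"Arabidopsis thaliana", "Oryza sativa"},
    "article-3": {"Pinus taeda", "Picea abies"},
}


def co_occurrences(articles):
    """Count, per unordered species pair, the articles mentioning both."""
    pairs = Counter()
    for species in articles.values():
        # sorted() gives every unordered pair a single canonical form,
        # so (A, B) and (B, A) are never counted separately.
        pairs.update(combinations(sorted(species), 2))
    return pairs


def partners(pairs, target):
    """Reduce the pair counts to the species co-occurring with `target`."""
    result = Counter()
    for (first, second), count in pairs.items():
        if first == target:
            result[second] += count
        elif second == target:
            result[first] += count
    return result


pairs = co_occurrences(articles)
for (first, second), count in pairs.most_common():
    print(first, "|", second, "|", count)

print(partners(pairs, "Oryza sativa").most_common())
```

The `partners` helper shows how a single-species view, such as a list of everything named together with Oryza sativa, is just a filter on the pair counts.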
Interesting to see that rice is mostly mentioned with Arabidopsis. Let’s explore that further. Below are species named in conjunction with Oryza sativa.
Species | Co-occurrences |
---|---|
Arabidopsis thaliana | 43 |
Pinus taeda | 29 |
Picea abies | 24 |
Physcomitrella patens | 23 |
Populus trichocarpa | 21 |
Glycine max | 20 |
Vitis vinifera | 17 |
Picea glauca | 17 |
Pinus pinaster | 16 |
Selaginella moellendorffii | 15 |
Pinus sylvestris | 13 |
Triticum aestivum | 12 |
Pinus contorta | 10 |
Picea sitchensis | 10 |
Ginkgo biloba | 10 |
Pinus radiata | 10 |
Ricinus communis | 9 |
Amborella trichopoda | 9 |
Medicago truncatula | 9 |
Cucumis sativus | 8 |
… | … |
So attention seems divided between trees and more agriculture-related plants. More to explore for later.