Originally posted on the ContentMine blog
Lars, and I am from the Netherlands, where I currently live. I applied to this fellowship to learn new things and combine the ContentMine with two previous projects I never got to finish, and I got really excited by the idea and the ContentMine at large.
Practically, it is about collecting data about conifers and visualise it in a dynamic HTML page. This is done in three parts. The first part is to fetch, normalise, index papers with the ContentMine tools, and automatically process it to find relations between data, probably by analysing sentences with tools such as (a modified) OSCAR, (a modified) ChemicalTagger, simply RegEx, or, if it proves necessary, a more advanced NLP-tool like SyntaxNet.
For example, an article that would work well here is Activation of defence pathways in Scots pine bark after feeding by pine weevil (Hylobius abietis). It shows a nice interaction network between Pinus sylvestris, Hylobius abietis and chemicals in the former related to attacks by the latter.
The second part is to write a program to convert all data to a standardised format. The third part is to use the data to make a database. Because the relation between found data is known, it will have a structure comparable to Wikidata and similar databases. This will be shown on a dynamic website, and when the data is reliable and the error rate is small enough, it may be exported to Wikidata.