Extended proposal (3 June 2016, GitHub, with comments): Analyzer and data visualizer for papers about conifers
Description
A program that visualizes and lists data extracted with ContentMine software from a list of papers resulting from a query. Different types of data (authors, genes, molecules, etc.) will be listed in different columns.
My proposal
My proposal is to collect data about conifers and visualize it in a dynamic HTML page. It consists of three parts.
The first part is to make and/or generate the RegEx after analyzing which sentences occur often. Examples of such sentences are
<species> is <property> <value>,
<species> has <property> <value> and
After <process>, <species> displays <something>.
The last example is from PMC4351838 (original sentence: 'Conversely, after girdling, Pinus canariensis displays an active growth from the upper edge, being often able to reconnect the phloem and surmount the injury if the removed ring is not too wide.').
Sentence structures will be selected from the text manually, as writing a program for this purpose would take far more time and work. Actual expressions can then be generated for specific species by combining a RegEx that matches the species name with a RegEx for each sentence structure found.
See also the comment about the first part of the process in the more detailed description.
The second part is to write a program to convert all data to JSON format. The structure of the JSON will, if possible, follow certain guidelines. An example of a guideline is the structure Wikidata uses. If not possible, guidelines will be made. The data will periodically be converted by hand, and put on a server.
This will probably be done by a custom NodeJS program.
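The conversion step could look roughly like the following minimal sketch. It is written in Python for brevity, although the proposal anticipates a NodeJS program. The helper `to_entity` and the field names `type`, `label`, `claims`, `value` and `source` are assumptions for illustration; they are loosely inspired by the way Wikidata groups statements per property and are not Wikidata's actual schema.

```python
import json


def to_entity(species, facts):
    """Group (property, value, source) facts under one species entry.

    The layout is hypothetical: one list of claims per property name.
    """
    claims = {}
    for prop, value, source in facts:
        claims.setdefault(prop, []).append({"value": value, "source": source})
    return {"type": "species", "label": species, "claims": claims}


# Facts taken from the metabolite sentence in PMC4422480 (Scots pine).
facts = [
    ("produces", "flavonoids", "PMC4422480"),
    ("produces", "stilbenes", "PMC4422480"),
]

entity = to_entity("Pinus sylvestris", facts)
print(json.dumps(entity, indent=2))
```

Keeping the source paper next to every value means a card on the website can always link back to where a property was found.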
The third part is to use the data to make a database. This will be shown on a dynamic website, which will probably be made with HTML, CSS and jQuery or AngularJS and will be put online. Its main structure is explained here:
- It will have a search function. The results will be a list of objects found in the database, and all related items.
- The object will be an expandable card containing information like type of object (species, gene, etc.), graphs and properties found with the RegEx.
- The information will link cards of the corresponding objects.
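The search behaviour described in the list above can be sketched as follows, again in Python rather than the JavaScript the site would probably use. The card fields (`id`, `type`, `label`, `links`) and the function `search` are assumptions for illustration, since the real data layout is still to be decided.

```python
def search(database, term):
    """Return cards whose label contains the term, plus the cards they link to."""
    term = term.lower()
    hits = [card for card in database.values() if term in card["label"].lower()]
    hit_ids = {card["id"] for card in hits}
    # Collect linked cards that are not already direct hits.
    related_ids = {rid for card in hits for rid in card["links"]} - hit_ids
    related = [database[rid] for rid in sorted(related_ids) if rid in database]
    return {"results": hits, "related": related}


# A tiny hypothetical database with two linked cards.
database = {
    "sp1": {"id": "sp1", "type": "species", "label": "Pinus sylvestris", "links": ["mol1"]},
    "mol1": {"id": "mol1", "type": "molecule", "label": "flavonoids", "links": ["sp1"]},
}

found = search(database, "pinus")
print([card["label"] for card in found["results"]])
print([card["label"] for card in found["related"]])
```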
The following part is most of the extension.
A more detailed description of the process:
Input
Papers will be fetched manually with a getpapers query. This query will look for the occurrence of several conifer species in the abstract, to further ensure the paper is about the species and not just mentioning it. The full text, if available, will be downloaded as XML, because this is easier to analyse. Example of the query:
getpapers -q 'ABSTRACT:"Pinus sylvestris"' -o Pinus_sylvestris -x
This downloads the XML files (-x) and metadata of papers with the word combination "Pinus sylvestris" in the abstract (-q 'ABSTRACT:"Pinus sylvestris"') into the output directory named Pinus_sylvestris (-o Pinus_sylvestris).
When the papers are downloaded, they will be converted, again manually, to sHTML (scholarly HTML) with the norma command. Example of the command:
norma --project Pinus_sylvestris -i fulltext.xml -o scholarly.html --transform nlm2html
Then, the formatted HTML is analysed with the ami package, and the data is stored. For example, the binomial species names are extracted like this:
ami2-species --project Pinus_sylvestris -i scholarly.html --sp.species --sp.type binomial
Not that interesting or special, so this bit was shortened in the normal proposal.
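Since the same three commands are repeated for every species, they could be generated rather than typed. The sketch below is a Python illustration that only builds and prints the command lines; it uses exactly the flags shown in the examples above and does not run anything. The function name `build_commands` and the convention of replacing spaces with underscores for the project directory are assumptions.

```python
import shlex


def build_commands(species):
    """Build the getpapers, norma and ami2-species commands for one species."""
    project = species.replace(" ", "_")  # assumed naming convention
    return [
        ["getpapers", "-q", f'ABSTRACT:"{species}"', "-o", project, "-x"],
        ["norma", "--project", project, "-i", "fulltext.xml",
         "-o", "scholarly.html", "--transform", "nlm2html"],
        ["ami2-species", "--project", project, "-i", "scholarly.html",
         "--sp.species", "--sp.type", "binomial"],
    ]


for species in ["Pinus sylvestris", "Pinus canariensis"]:
    for command in build_commands(species):
        # shlex.join quotes arguments so the line can be pasted into a shell.
        print(shlex.join(command))
```

Keeping each command as a list of arguments avoids quoting mistakes in the query if the lines are later passed to `subprocess.run` instead of being printed.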
Further processing
The next step is to automatically find out the relation between the found data and the species. For this to work, you need RegEx, or an entire program, that can recognize frequently occurring sentences that say something about the species. For example, a sentence like this (from PMC4422480):
...it produces a number of important metabolites, e.g. flavonoids, anthocyanins, stilbenes, condensed tannins and phenolics.
can be analysed with RegEx. This uses sentences with certain structures to find properties of, in this case, Scots pine. Both the property name, and possibly the value, are stored, and used in further processing. Other examples of such sentences are
<species> has <property>,
<species> is <property> <value> and
After <process>, <species> displays <something>.
The last example is from PMC4351838; original sentence: '… after girdling, Pinus canariensis displays an active growth from the upper edge ...'
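To make this concrete for the metabolite sentence from PMC4422480 quoted above, here is a minimal Python sketch of one possible expression. The pattern `LIST_PATTERN` and the function `extract_list` are assumptions written for this one sentence shape; a real set of expressions would need many more variants.

```python
import re

SENTENCE = (
    "it produces a number of important metabolites, e.g. flavonoids, "
    "anthocyanins, stilbenes, condensed tannins and phenolics."
)

# Hypothetical pattern: "produces ... <property>, e.g. <list of values>."
LIST_PATTERN = re.compile(
    r"produces (?:a number of )?(?:important )?(?P<property>\w+), "
    r"e\.g\. (?P<values>[^.]+)\."
)


def extract_list(sentence):
    """Return (property, [values]) for a matching sentence, else None."""
    match = LIST_PATTERN.search(sentence)
    if match is None:
        return None
    # Split the enumeration on commas and on the final "and".
    values = re.split(r",\s*|\s+and\s+", match.group("values"))
    return match.group("property"), [value for value in values if value]


print(extract_list(SENTENCE))
```

The subject "it" still has to be resolved to Scots pine from the surrounding text, which a single expression like this cannot do.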
Sentence structures will be selected from the text manually, as writing a program for this purpose would take far more time and work. Actual expressions can then be generated for specific species by combining a RegEx that matches the species name (for example P(inus|\.) sylvestris) with a RegEx for each sentence structure found.
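A rough Python sketch of this combination step follows. The template string, the `{species}` placeholder and the function names `species_regex` and `compile_template` are all assumptions for illustration, and the name pattern uses a non-capturing group so that it does not interfere with the named groups of the template.

```python
import re


def species_regex(genus, epithet):
    """Match the full name or the abbreviated genus ('Pinus x' or 'P. x')."""
    return rf"{re.escape(genus[0])}(?:{re.escape(genus[1:])}|\.) {re.escape(epithet)}"


# Hypothetical template for "After <process>, <species> displays <something>".
AFTER_PROCESS = (
    r"after (?P<process>\w+), (?P<species>{species}) "
    r"displays (?P<something>[^,.]+)"
)


def compile_template(template, genus, epithet):
    """Fill the species placeholder and compile a case-insensitive pattern."""
    pattern = template.replace("{species}", species_regex(genus, epithet))
    return re.compile(pattern, re.IGNORECASE)


SENTENCE = (
    "Conversely, after girdling, Pinus canariensis displays an active growth "
    "from the upper edge, being often able to reconnect the phloem and "
    "surmount the injury if the removed ring is not too wide."
)

pattern = compile_template(AFTER_PROCESS, "Pinus", "canariensis")
match = pattern.search(SENTENCE)
print(match.groupdict())
# {'process': 'girdling', 'species': 'Pinus canariensis',
#  'something': 'an active growth from the upper edge'}
```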
As I said here, RegEx is not the only option, but at the time it seemed the most viable one. However, since I read the papers about OSCAR, ChemicalTagger, and SyntaxNet, such programs seem to be a better option, as described in Weekly Report 4.
Output
The properties of the species and other objects found in the papers will be stored on a webserver as JSON. This way, a database for conifers can be generated which can be searched by users. Search results may include tables, graphs and Wikidata-like pages.
HTML page
The exact structure of the page is still to be discussed, but the page will probably be made with HTML, CSS and jQuery or AngularJS. It will fetch the JSON data from the server and, when the user searches, generate graphs and display data.
Example of a column, here with tweets.
A more recent design concept can be viewed in Weekly Report 2.
Outreach
I will blog regularly about what I've accomplished and how. This will be done with reports, probably every one or two weeks, containing examples and references to code. I'll use this GitHub repository to keep track of my own tasks and resulting code, and the wiki for reports.
And on the blog now as well.
This is the proposal I applied with.