Sunday, July 17, 2016

Weekly Report 3: Playing with NodeJS

I made a test JavaScript file for NodeJS to practice working with it and to prepare further for the project.

The result (code here) is a small program with some options:


Options:

  -h, --help     output usage information
  -V, --version  output the version number
  -o <path>      File to write data to
  -c <path>      File to print
  -t <string>    Text to print
  -d <string>    Data to write to file
  -v <string>    Verbosity: DEBUG, INFO, LOG, WARN or ERROR
I made a custom logger and a small parser for the passed arguments. Parsing with the commander module, as seen here:
var program = require('commander');

program.parse(process.argv);
which is used in the quickscrape code as well, didn't really work for me. Maybe I did something wrong. When I run the following code
var program = require('commander');

program
  .option('-t', 'Test A')
  .option('-T', 'Test B')
  .parse(process.argv);

console.log(program);
like this
node file.js -t 'Test_A' -T 'Test_B'
the console outputs this:
{ commands: [],
  options:
   [ { flags: '-t',
       required: 0,
       optional: 0,
       bool: true,
       long: '-t',
       description: 'Test A' },
     { flags: '-T',
       required: 0,
       optional: 0,
       bool: true,
       long: '-T',
       description: 'Test B' } ],
  _execs: {},
  _allowUnknownOption: false,
  _args: [],
  _name: 'file',
  Command: [Function: Command],
  Option: [Function: Option],
  _events: { '-t': [Function], '-T': [Function] },
  rawArgs:
   [ 'node',
     '/home/larsw/public_html/cm-repo/js/file.js',
     '-t',
     'Test_A',
     '-T',
     'Test_B' ],
  T: true,
  args: [ 'Test_A', 'Test_B' ] }
This doesn't seem right to me. If I understand correctly, 'Test_A' is passed to -t and 'Test_B' to -T. Instead, commander only seems to register that -t and/or -T were passed (T: true is present when you pass only -t too), and the two strings end up in args without any context. Maybe there's something wrong with my program.option() calls. Maybe my knowledge of command line arguments is incorrect, but I don't think it's off to this extent. Either way, I made a parser myself. I'm not sure if it follows any standard, but it works for me, for now. I will update the documentation shortly.
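As far as I can tell, a likely explanation is that options declared without a value placeholder are treated by commander as boolean flags, so the values end up in args. A sketch of what the declarations could look like with placeholders (just a guess, I haven't reworked my file around it):

var program = require('commander');

program
  // <value> tells commander the flag expects an argument
  .option('-t, --test-a <value>', 'Test A')
  .option('-T, --test-b <value>', 'Test B')
  .parse(process.argv);

console.log(program.testA); // should now contain 'Test_A'
console.log(program.testB); // should now contain 'Test_B'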

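For reference, a minimal sketch of the kind of level-based logger the -v option drives (a simplified version, not my actual code):

// Only print messages at or above the chosen verbosity level
var LEVELS = ['DEBUG', 'INFO', 'LOG', 'WARN', 'ERROR'];

function makeLogger (verbosity) {
  var threshold = LEVELS.indexOf(verbosity);
  if (threshold === -1) { threshold = LEVELS.indexOf('LOG'); }

  return function (level, message) {
    if (LEVELS.indexOf(level) >= threshold) {
      console.log('[' + level + '] ' + message);
    }
  };
}

var log = makeLogger('INFO');
log('DEBUG', 'not printed at INFO verbosity');
log('WARN', 'printed');
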
Next week I'll look into norma to see how it changes the format of papers from XML/HTML to sHTML, and ChemicalTagger to see how it recognises sentences.

Sunday, July 10, 2016

Weekly Report 2: Final project prototype

I updated Citation.js to accept quickscrape's results.json. Results can be viewed here.

I made a card prototype for the database HTML page, using data extracted with quickscrape. It's mainly a design thing, and not very advanced, but it's nice to see what information I can fit. Results are here.

[Screenshot: card prototype with opened paper]
[Screenshot: card prototype without opened paper]

I have taken a look at quickscrape. What I'd like to do is to practise building a similar program myself, first just outputting strings to the command line, later processing arguments, to eventually use in my project.

I'll probably try to do that next week.

Sunday, July 3, 2016

Research project proposals

Extended proposal (3 June 2016, GitHub, with comments): Analyzer and data visualizer for papers about conifers

Description

A program visualizing and listing data extracted with ContentMine software from a list of papers resulting from a query. Different types of data (authors, genes, molecules etc.) will be listed in different columns.

My proposal

My proposal is to collect data about conifers and visualize it in a dynamic HTML page. It consists of three parts.

The first part is to make and/or generate the RegEx after analyzing which sentences occur often. Examples of such sentences are <species> is <property> <value>, <species> has <property> <value> and After <process>, <species> displays <something>. The last example is from PMC4351838 (original sentence: 'Conversely, after girdling, Pinus canariensis displays an active growth from the upper edge, being often able to reconnect the phloem and surmount the injury if the removed ring is not too wide.').

Selecting sentence structures from the text will be done manually, as it would take way more time and work to write a program for this purpose. Then, actual expressions can be generated to match for certain species, by combining RegEx to match a certain name and RegEx for the found sentence structures.

See also the comment about the first part of the process in the more detailed description.

The second part is to write a program to convert all data to JSON format. The structure of the JSON will, if possible, follow certain guidelines. An example of a guideline is the structure Wikidata uses. If not possible, guidelines will be made. The data will periodically be converted by hand, and put on a server.

This will probably be done by a custom NodeJS program.
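A rough sketch of what that program could look like, assuming the extracted facts have already been reduced to one facts.json per paper (a hypothetical intermediate format, not the actual ami output):

var fs = require('fs');
var path = require('path');

// Hypothetical layout: <project>/<paper-id>/facts.json with { "species": [...] }
var projectDir = process.argv[2] || 'Pinus_sylvestris';
var database = {};

fs.readdirSync(projectDir).forEach(function (paper) {
  var factsFile = path.join(projectDir, paper, 'facts.json');
  if (!fs.existsSync(factsFile)) { return; }

  var facts = JSON.parse(fs.readFileSync(factsFile, 'utf8'));

  (facts.species || []).forEach(function (name) {
    if (!database[name]) {
      database[name] = { type: 'species', label: name, papers: [] };
    }
    database[name].papers.push(paper);
  });
});

fs.writeFileSync('database.json', JSON.stringify(database, null, 2));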

The third part is to use the data to make a database. This will be shown on a dynamic website, which will probably be made with HTML, CSS and jQuery or AngularJS and will be put online. Its main structure is explained here:

  • It will have a search function. The results will be a list of objects found in the database, and all related items.
  • The object will be an expandable card containing information like type of object (species, gene, etc.), graphs and properties found with the RegEx.
  • The information will link to the cards of the corresponding objects.


The following part makes up most of the extension.

A more detailed description of the process:

Input

Papers will be fetched manually with a getpapers query. This query will look for the occurrence of several conifer species in the abstract, to further ensure the paper is about the species, and not just mentioning it. The full text, if available, will be downloaded as XML, because this is easier to analyse. Example of the query:

getpapers -q 'ABSTRACT:"Pinus sylvestris"' -o Pinus_sylvestris -x

This downloads the XML files (-x) and metadata of papers with the word combination "Pinus sylvestris" in the abstract (-q 'ABSTRACT:"Pinus sylvestris"') into the output directory named Pinus_sylvestris (-o Pinus_sylvestris).

When the papers are downloaded they will be, again manually, formatted to sHTML (scholarly HTML) with the norma command. Example of the command:

norma --project Pinus_sylvestris -i fulltext.xml -o scholarly.html --transform nlm2html

Then, the formatted HTML is analysed with the ami package, and the data is stored. For example, binomial species names are extracted like this:

ami2-species --project Pinus_sylvestris -i scholarly.html --sp.species --sp.type binomial

Not that interesting or special, so this bit was shortened in the normal proposal.

Further processing

The next step is to automatically find out the relation between the found data and the species. For this to work, you need RegEx, or an entire program, that can recognize frequently occurring sentences that say something about the species. For example, a sentence like this (from PMC4422480):

...it produces a number of important metabolites, e.g. flavonoids, anthocyanins, stilbenes, condensed tannins and phenolics.

can be analysed with RegEx. The idea is to use sentences with certain structures to find properties of, in this case, Scots pine. Both the property name, and possibly the value, are stored and used in further processing. Other examples of such sentences are <species> has <property>, <species> is <property> <value> and After <process>, <species> displays <something>. The last example is from PMC4351838, original sentence: '… after girdling, Pinus canariensis displays an active growth from the upper edge ...'

Selecting sentence structures from the text will be done manually, as it would take way more time and work to write a program for this purpose. Then, actual expressions can be generated to match for certain species, by combining RegEx to match a certain name (for example P(inus|\.) sylvestris) and RegEx for the found sentence structures.
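A sketch of how combining the two could work in JavaScript (the patterns are only illustrations of the idea, not the final expressions):

// RegEx for the species name, matching both 'Pinus canariensis' and 'P. canariensis'
var species = 'P(?:inus|\\.) canariensis';

// RegEx for a sentence structure of the form '<species> displays <something>'
var displays = new RegExp(species + ' displays (an? [\\w ]+)', 'i');

var sentence = 'After girdling, Pinus canariensis displays an active growth from the upper edge.';
var match = sentence.match(displays);

if (match) {
  console.log(match[1]); // 'an active growth from the upper edge'
}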

As I said here, RegEx is not the only option, but at the time it seemed the most viable one. However, since I read the papers about OSCAR, ChemicalTagger, and SyntaxNet, such programs seem to be a better option, as described in Weekly Report 4.

Output

The properties of the species and other objects found in the papers will be stored on a webserver as JSON. This way, a database for conifers can be generated which can be searched by users. Search results may include tables, graphs and Wikidata-like pages.
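Purely as an illustration (the actual structure still has to be decided), a record in that JSON could look something like this:

{
  "id": "pinus-sylvestris",
  "type": "species",
  "label": "Pinus sylvestris",
  "properties": [
    {
      "property": "metabolites",
      "value": ["flavonoids", "anthocyanins", "stilbenes"],
      "source": "PMC4422480"
    }
  ]
}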

HTML page

The exact structure of the page is still to be discussed, but it will probably be made with HTML, CSS and jQuery or AngularJS. It will fetch the JSON data from the server and, when the user searches, generate graphs and display data.
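As a rough sketch of that flow with jQuery (hypothetical file name, element IDs and fields, and without the graphs):

// Fetch the database once, then filter it whenever the user types in the search box
$.getJSON('database.json', function (database) {
  $('#search').on('input', function () {
    var query = $(this).val().toLowerCase();
    var $results = $('#results').empty();

    Object.keys(database).forEach(function (id) {
      var entry = database[id];
      if (entry.label.toLowerCase().indexOf(query) === -1) { return; }

      $('<div class="card">')
        .append($('<h2>').text(entry.label))
        .append($('<p>').text(entry.type))
        .appendTo($results);
    });
  });
});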

[Image: example of a column, here with tweets]

A more recent design concept can be viewed in Weekly Report 2.

Outreach

I will blog regularly about what I've accomplished and how. This will be done with reports, probably every one or two weeks, containing examples and references to code. I'll use this GitHub repository to keep track of my own tasks and resulting code, and the wiki for reports.

And on the blog now as well.

Original proposal


This is the proposal I applied with.

Weekly Report 1: Start of my ContentMine fellowship

Start of the project. What I am going to do:
  • update Citation.js to accept quickscrape's results.json
  • look into quickscrape and the other programs, mainly to see how I can make a similar command for my own project
  • perhaps add/improve some scrapers
  • try to make a card prototype for the database
No real work yet, just some orientation and preparation until I have more information.

Citation.js