Monday, August 29, 2016

Word Clouds

Yesterday I published a blogpost where I talked about ctj, and how and why to convert ContentMine's CProjects to JSON. At the end, I promised a post about how to use the resulting JSON in different programs, for example with d3.js. So here we go. For starters, let's make the data about word frequencies look nice. Not readable (then we would use a table), but visually pleasing. Let's make a word cloud. Skip to the part where I talk about converting the data.

Figure 1: Word Cloud (see text below)

Most Google results for "d3 js word cloud" point to cloud.js (repo, license). The problem was that I could not find usable source code: both index.js and build/ call require() in one of the first lines, so I assumed they were intended for Node.js.

Figure 2: Different font size scalings: log n, √n, n
(where n is the number of occurrences of a word)
So I started looking for a web version. I decided to copy cloud.min.js from the demo page, unminify it, and customise it for my own use. That was a huge pain. Almost all the variable names had been shortened to single letters, and the minifying process had turned the code into a mess. Figuring out where to put parameters, such as the formula for scaling font size (see figure 2) and the possible angles for words, was the most troublesome part: I wanted constants, while the code took those values from the form at the bottom of the demo page.
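To illustrate the difference between the scalings in figure 2, here is a small sketch. The maxSize cap and the counts are made-up values for this post; in the actual code, a function like one of these is plugged in as the font size setting.

```javascript
// Three candidate scalings for the font size of a word occurring n
// times, where nMax is the count of the most common word.
// maxSize is a made-up cap for the largest word.
var maxSize = 60

function linearScale ( n, nMax ) {
  return maxSize * n / nMax
}

function sqrtScale ( n, nMax ) {
  return maxSize * Math.sqrt( n / nMax )
}

function logScale ( n, nMax ) {
  return maxSize * Math.log( n + 1 ) / Math.log( nMax + 1 )
}

// A rare word (n = 2) next to the most common one (n = 50): the
// linear scale makes it nearly invisible, the log scale keeps it
// readable.
console.log( linearScale( 2, 50 ) ) // 2.4
console.log( sqrtScale( 2, 50 ) )   // 12
console.log( logScale( 2, 50 ) )    // ≈ 16.8
```

This is also why the log scale in figure 2 looks the most balanced: it compresses the huge gap between the top few words and the long tail.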

Here is the static result. For the live demo you need the input data and a localhost. Here is a guide on how to get the input data. To apply it, change the path on line 17 and change the PMCID on line 19 to the one you want to show. Of course, this needs to be the PMCID of an article that exists in your data. For jQuery to be able to fetch the JSON, you need something server-like: a local file fetching another local file from the same location still counts as a cross-origin request.

Figure 3: See paragraph right
After a while I noticed build/ works in the browser as well, so I used that. Here is the static result. As you can see, I still need to tweak the parameters a bit, which is more difficult than it seems. I'll optimise this and post about it later.

Now, the interesting part. When the design is finished, you want to feed it words. We fetch those from the output of ctj; I did it with jQuery, as that is what I normally use. In the callback function, we take the word frequencies of a certain article (data.articles[ "PMC3830773" ].AMIResults.frequencies) and change the format to one cloud.js handles more easily. This format can be anything, as long as you tell cloud.js what to expect (e.g. here), and it is probably better to remove all data that will not be used. Then we add the words to the cloud layout and start generating the visual output with .start().
$.get( 'path/to/data.json', function ( data ) {
  // Word frequencies of one article, as produced by ctj
  var frequencies = data.articles[ "PMC3830773" ].AMIResults.frequencies
    // Keep only what cloud.js needs: a text and a size per word
    , words       = frequencies.map( function ( v ) {
                      return { text: v.word, size: v.count }
                    } )

  layout.words( words ).start()
} )

Now that it is generated, we can see which words, stripped of common words such as prepositions and pronouns, are used most. The articles are about pines, so we see lots of words confirming that: "conifers", "wood", etc. We also notice some errors, like "./--" and "[].", unrecognised punctuation ("wood", "Wood", "wood." and "wood," all count separately), and even CSS (?!): {background, #ffffff;} and px;}. These are all problems with ContentMine's ami2-word plugin and will be fixed. No worries.
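Until the plugin is fixed, a quick client-side workaround is to normalise the words before handing them to the cloud. A rough sketch (the regular expression is my own guess at what covers the cases above):

```javascript
// Lowercase each word and strip leading/trailing punctuation, so
// "Wood", "wood." and "wood," all collapse into "wood", and junk
// like "./--" and "[]." disappears entirely.
function normalise ( word ) {
  return word.toLowerCase().replace( /^[^a-z0-9]+|[^a-z0-9]+$/g, '' )
}

// Re-count the frequencies after normalising
function mergeFrequencies ( frequencies ) {
  var counts = {}

  frequencies.forEach( function ( v ) {
    var word = normalise( v.word )
    if ( word ) counts[ word ] = ( counts[ word ] || 0 ) + v.count
  } )

  return counts
}

mergeFrequencies( [
  { word: 'Wood',  count: 2 },
  { word: 'wood.', count: 3 },
  { word: './--',  count: 1 }
] ) // → { wood: 5 }
```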

More examples on how to use CProject data as JSON coming soon. Perhaps popular word combinations.

Example articles used:

  • Figures 1 and 2: Sequencing of the needle transcriptome from Norway spruce (Picea abies Karst L.) reveals lower substitution rates, but similar selective constraints in gymnosperms and angiosperms (10.1186/1471-2164-13-589)
  • Figure 3 and code block: The Transcriptomics of Secondary Growth and Wood Formation in Conifers (10.1155/2013/974324)

Sunday, August 28, 2016

CProject to JSON

I changed the "JSON maker" I used last week to convert CProjects to JSON, making it suit a more general use case: it is now more user-friendly and has more options, like grouping by species and minifying. It's now called ctj (CProject to JSON), although that name may still change to something clearer or more appropriate. The GitHub repository can be found here.


CProjects are the output of the ContentMine tools: a directory of directories with XML files, JSON files, and a subdirectory with more XML files, some of which may be empty. ctj converts all this into one JSON file, or several if you want. Here is a guide on how to use it.

Because it is JSON now, it can easily be used by other programs, or by websites. It is easy to use with e.g. d3.js, which I did; I will blog more about that soon. This will probably be the link.

Sunday, August 21, 2016

Weekly Report 6: ContentMine output to JSON to HTML

The "small program" proved more of a challenge than it seemed. Making a program to generate the JSON (link) was fairly easy: loop through directories, find files, loop through the files, collect the XML data, and save everything as JSON in one file. It took a while, but I think I spent most of that time setting up the logistics, i.e. a nice logger, a file system reader and an argument processor.

The generated JSON was around 11 MB for 250 papers, so I didn't put it on GitHub, but it's fairly easy to reproduce. Here's a step-by-step guide. After you generate the data, put the JSON file and html/card_c03.html on your localhost (the HTML can't load the JSON otherwise) and open the latter in a browser, preferably Chrome/Chromium (I haven't tested other browsers). You may need to change the file path at line 459 to wherever you stored the JSON file, but this shouldn't be too much of a problem. Also, the content is capped at 50 items per column; you can change this for papers, genus and species at lines 327, 369 and 419 respectively.

If you don't have the time to reproduce the data, here is a static demo (GUI under development). Click to expand the "cards" (items). The items are again capped at 50 per column. The papers are sorted by PMCID, the genera and species by order of appearance in the papers. The extra information at the bottom of the cards in the genus and species columns shows in which papers they are mentioned, and how often. The info at the bottom of the article cards should be self-explanatory.

Current GUI

Finishing the GUI will take longer than making the JSON, mostly because CSS can be pretty annoying when you're trying to make nice things without too much JavaScript. I'll have to rethink the design of the cards, because things don't fit now, find a way to display the columns more nicely, and much more. All this might take a while, as there are lots of features I would like to try to implement.

The blogpost about Activation of defence pathways in Scots pine bark after feeding by pine weevil (Hylobius abietis) is postponed. I'll work on it when I'm done with the project above.

Friday, August 19, 2016

Conifer taxonomy

Recently, I tried to find out the exact taxonomy of conifers. I knew that a few years earlier, when I was actively working with it, there were a few issues on Wikipedia concerning the grouping of the main conifer families, namely Araucariaceae, Cephalotaxaceae, Cupressaceae, Pinaceae, Podocarpaceae, Sciadopityaceae, and Taxaceae, and actually the grouping of genera into families as well. Guess what changed: not much, not on Wikipedia anyway. Disagreement between Wikipedia pages in different languages is one thing, but the English pages contradict each other pretty heavily too. And that's not even mentioning the position of gymnosperms in the plant hierarchy, which looks even worse. Some examples:

Subclass Cycadidae

  • Order Cycadales
    • Family Cycadaceae: Cycas
    • Family Zamiaceae: Dioon, Bowenia, Macrozamia, Lepidozamia, Encephalartos, Stangeria, Ceratozamia, Microcycas, Zamia.

Subclass Ginkgoidae

Subclass Gnetidae

Subclass Pinidae

  • Order Pinales
    • Family Pinaceae: Cedrus, Pinus, Cathaya, Picea, Pseudotsuga, Larix, Pseudolarix, Tsuga, Nothotsuga, Keteleeria, Abies
  • Order Araucariales
    • Family Araucariaceae: Araucaria, Wollemia, Agathis
    • Family Podocarpaceae: Phyllocladus, Lepidothamnus, Prumnopitys, Sundacarpus, Halocarpus, Parasitaxus, Lagarostrobos, Manoao, Saxegothaea, Microcachrys, Pherosphaera, Acmopyle, Dacrycarpus, Dacrydium, Falcatifolium, Retrophyllum, Nageia, Afrocarpus, Podocarpus
  • Order Cupressales
    • Family Sciadopityaceae: Sciadopitys
    • Family Cupressaceae: Cunninghamia, Taiwania, Athrotaxis, Metasequoia, Sequoia, Sequoiadendron, Cryptomeria, Glyptostrobus, Taxodium, Papuacedrus, Austrocedrus, Libocedrus, Pilgerodendron, Widdringtonia, Diselma, Fitzroya, Callitris (incl. Actinostrobus and Neocallitropsis), Thujopsis, Thuja, Fokienia, Chamaecyparis, Callitropsis, Cupressus, Juniperus, Xanthocyparis, Calocedrus, Tetraclinis, Platycladus, Microbiota
    • Family Taxaceae: Austrotaxus, Pseudotaxus, Taxus, Cephalotaxus, Amentotaxus, Torreya

Subclasses of the division gymnosperms, according to the Wikipedia page of gymnosperms

Ginkgo biloba leaves (source)

In the text above, we see that three of the subclasses of gymnosperms contain "traditional" conifers, and three contain related species. A different page, however, talks about a division called Pinophyta, and a class called Pinopsida, where all "traditional" conifers are located. Recapping: the division of gymnosperms contains conifers AND some other things like the Ginkgo and the cycads, while the division Pinophyta contains ONLY conifers. This could be possible if Pinophyta were a part of the gymnosperms, but they're both divisions, so, as far as I know, they should be on the same taxon level. And the Wikipedia pages do not indicate anything else.

Now, my knowledge of taxonomy may not be perfect, but this doesn't seem right. So, I tweeted Ross Mounce, who has been busy with making phylogenetic trees.

To be continued...

Thursday, August 18, 2016


Today I updated my GitHub repository of Citation.js. The main difference in the source files was the correction of a typo, but I updated the docs and restructured the directories as well. Not really important, but it is a nice opportunity to talk about it on my new blog (this one).

So, Citation.js is a JavaScript library that converts input like BibTeX, Wikidata JSON, and ContentMine JSON to a standardised format, and converts that standardised format to citations in style guides like APA and Vancouver, or back to BibTeX. And you know the part of Microsoft Word where you can fill in a form to get a reference list? It does that too (with the jQuery plugin).

Screenshots from the demo page

I made it so my classmates could enjoy the comfort of not having to bother with style guides, as the software handles those, while still being able to use a custom style, the one our school requires. In the end, I wanted to make this a public library, so I rewrote the whole program to be user-friendly (to programmers) and much more stable.

The fun thing was, I made a bunch of other things in the process: a syntax highlighter for JavaScript and BibTeX, an alternative to eval() for 'relaxed' JSON (JSON with syntax that works in JavaScript but not in JSON.parse()), a BibTeX parser and much more. These things kept the project nice and interesting.
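The idea behind the eval() alternative fits in a few lines. This is only a toy version I wrote for this post (it mishandles quotes inside strings, for one), not the actual parser:

```javascript
// Rewrite JavaScript-flavoured 'relaxed' JSON into strict JSON, then
// let JSON.parse() do the heavy lifting instead of calling eval().
function parseRelaxed ( string ) {
  var strict = string
    // quote bare object keys: { size: 1 } -> { "size": 1 }
    .replace( /([{,]\s*)([A-Za-z_$][\w$]*)\s*:/g, '$1"$2":' )
    // turn single-quoted strings into double-quoted ones
    .replace( /'([^'\\]*)'/g, '"$1"' )

  return JSON.parse( strict )
}

parseRelaxed( "{ text: 'wood', size: 42 }" )
// → { text: 'wood', size: 42 }
```

The nice part of this approach is that all the actual parsing, and the safety, comes from JSON.parse(); the relaxed layer is just a preprocessor.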

Wednesday, August 17, 2016

Weekly Report 5: This is what pine literature looks like

(yes, three days late)

This week I wanted to catch up on all the things that had happened while I was on holiday. I have finished my introductory blogpost on ContentMine's blog, and I made this blog and transferred all the weekly reports from the GitHub Wiki to here.

Next week will be more interesting. Firstly, I will publish a blogpost containing a short analysis of the article I mentioned in an earlier blog, namely Activation of defence pathways in Scots pine bark after feeding by pine weevil (Hylobius abietis). There I'll discuss this article as an example of how similar articles could be used to extract data and create a network of facts.

Large Pine Weevil (Hylobius abietis), mentioned in the article (source)

Secondly, I will make a small program that visualises data output by the ContentMine pipeline. This will basically be a small version of the program that makes the database, omitting the most important and difficult part: finding relations between facts. This is mainly to test the infrastructure and see what needs to be improved before starting to build the big program.

Friday, August 12, 2016

Introducing Fellow Lars Willighagen: Constructing and visualising a network-based conifer database

Originally posted on the ContentMine blog

I am Lars, and I am from the Netherlands, where I currently live. I applied to this fellowship to learn new things and combine the ContentMine with two previous projects I never got to finish, and I got really excited by the idea and the ContentMine at large.

A part of the project is a modification of a project I started a while ago, for visualising tweets with certain hashtags, and extracting data like links and ORCIDs with JavaScript to display in feeds. It was a side project, so it died out after a while. It is also a continuation of an interest of mine, one in conifers. A few years ago, I tried to combine facts about conifers from several sources into a (LaTeX) book, but I quit this early as well.

Practically, the project is about collecting data about conifers and visualising it on a dynamic HTML page. This is done in three parts. The first part is to fetch, normalise and index papers with the ContentMine tools, and to automatically process them to find relations between data, probably by analysing sentences with tools such as (a modified) OSCAR, (a modified) ChemicalTagger, or simply RegEx, or, if it proves necessary, a more advanced NLP tool like SyntaxNet.

For example, an article that would work well here is Activation of defence pathways in Scots pine bark after feeding by pine weevil (Hylobius abietis). It shows a nice interaction network between Pinus sylvestris, Hylobius abietis and chemicals in the former related to attacks by the latter.

The second part is to write a program to convert all data to a standardised format. The third part is to use the data to make a database. Because the relation between found data is known, it will have a structure comparable to Wikidata and similar databases. This will be shown on a dynamic website, and when the data is reliable and the error rate is small enough, it may be exported to Wikidata.

ContentMine Fellowship

So, I am a ContentMine fellow now. Actually, I have been for a few weeks already. That's why you see the weekly reports: those are about the project I am doing for my fellowship. I do not have much else to say about it. For more information, please take a look at the links below.

Some links:

Tuesday, August 9, 2016


Welcome to my new blog! Here, I will be blogging about my programming projects, ContentMine, photography and other things.

The reason this isn't the oldest post is that the others were taken from GitHub.

Links (see also "Other links" in the sidebar)

Sunday, August 7, 2016

Weekly Report 4: Learning about text mining

I was on holiday for the last two weeks, so I was not able to do all that much, and I missed the second webinar, but I did read some on-topic papers about NLP. Finding occurrences of certain species and chemicals seems fairly easy to me, with the right dictionaries of course, but finding the relation between a species and a chemical in an unstructured sentence can't be done without understanding the sentence, and without a program to do this, you would have to extract the data manually. I first wanted to do this with RegExp, but now that I have read the OSCAR4 and ChemicalTagger papers, I know there are other, better options.
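To show why I considered RegExp first, here is a toy version of that approach. The dictionaries and the verb list are made-up placeholders, and it is exactly this rigidity, matching only fixed patterns, that the tools from those papers improve on:

```javascript
// Naive RegExp relation finder: a species, a verb, and a chemical,
// in that order, with anything in between. Placeholder dictionaries.
var species   = [ 'Pinus sylvestris', 'Hylobius abietis' ]
  , chemicals = [ 'pinosylvin', 'taxifolin' ]
  , verbs     = [ 'accumulates', 'produces' ]

var pattern = new RegExp(
  '(' + species.join( '|' ) + ').*?' +
  '(' + verbs.join( '|' ) + ').*?' +
  '(' + chemicals.join( '|' ) + ')'
)

function findRelation ( sentence ) {
  var match = sentence.match( pattern )
  return match && { subject: match[ 1 ], relation: match[ 2 ], object: match[ 3 ] }
}

findRelation( 'Pinus sylvestris accumulates pinosylvin after feeding.' )
// → { subject: 'Pinus sylvestris', relation: 'accumulates', object: 'pinosylvin' }
```

Anything phrased differently, passively, negated, or spread over clauses, slips through, which is where sentence-level tools come in.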

OSCAR especially is said to be highly domain-independent. My conclusion is that OSCAR, and with it perhaps a modified version of ChemicalTagger, can be used in my project to find the relations between occurring species, chemicals and diseases.

If this doesn't work for some reason (e.g. it takes too long to build dictionaries, or the grammar is too complex), there is always Google's SyntaxNet and its English parser, Parsey McParseface. Be that as it may, this seems a bit over the top. They are made to figure out the grammar of less strict sentences, and therefore have to use more complex systems, such as a "globally normalized transition-based neural network". This is needed to choose the best option in situations like the following.

One of the main problems that makes parsing so challenging is that human languages show remarkable levels of ambiguity. It is not uncommon for moderate length sentences - say 20 or 30 words in length - to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives, and find the most plausible structure given the context. As a very simple example, the sentence Alice drove down the street in her car has at least two possible dependency parses:

The first corresponds to the (correct) interpretation where Alice is driving in her car; the second corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition in can either modify drove or street; this example is an instance of what is called prepositional phrase attachment ambiguity.

If it's necessary, however, it's always an option.