Monday, August 29, 2016

Word Clouds

Yesterday I published a blogpost, where I talked about ctj and how and why to convert ContentMine's CProjects to JSON. At the end, I mentioned this post, where I would talk about how to use it in different programs, and with d3.js. So here we go. For starters, let's make the data about word frequencies look nice. Not readable (then we would use a table), but visually pleasing. Let's make a word cloud. Skip to the part where I talk about converting the data.

Figure 1: Word Cloud (see text below)

Most Google results for "d3 js word cloud" point to cloud.js (repo, license). The problem was, I could not find the source code. Both index.js and build/ use require() in one of the first lines, and therefore I assumed it was intended for NodeJS.

Figure 2: Different font size scalings: log n, √n, n
(where n is the number of occurrences of a word)
So I started looking for a web version. I decided to copy cloud.min.js from the demo page, unminify it, and tried to customise it for my own use. Which was an huge pain. Almost all the variable names were shortened to single letters, and the whole code was a mess, due to the minifying process. Figuring out where to put parameters like formulas on how to scale font size (see figure 2), possible angles for words, etc. was the most troubling, as I wanted constants and the code took them from the form at the bottom of the demo page.

Here is the static result. For the live demo you need the input data and a localhost. Here is a guide on how to get the input data. To apply it, change the path on line 17 and change the PMCID on line 19 to the one you want to show. Of course, this needs to be one of an article that exists in your data. For jQuery to be able to fetch the JSON, you need a server-like thing, because local files fetching local files on the same location still counts as not the same domain.

Figure 3: See paragraph right
After a while I noticed build/ works in the web as well, so I used that. Here is the static result. As you can see I need to tweak a parameters a bit, which is more difficult than it seems. I'll optimise this and post about it later.

Now, the interesting part. When you finish making a design, you want to feed it words. We fetch the words from the output of ctj. I did it with jQuery as that is what I normally use. In the callback function, we get the word frequencies of a certain article (data.articles[ "PMC3830773" ].AMIResults.frequencies) and change the format to one cloud.js can handle easier. This can be anything, but you need to specify what e.g. here, and it is probably better to remove all data that will not be used. Then we add the words to the cloud (layout) and start generating the visual output with .start().
$.get( 'path/to/data.json', function( data, err ){
 var frequencies = data.articles[ "PMC3830773" ].AMIResults.frequencies
   , words       = function ( v ) {
                     return { text: v.word, size: v.count }
                   } )

Now that it is generated, we can see what words, stripped from common words such as prepositions and pronouns, are used most. The articles are about pines, so we see lots of words confirming that. "conifers", "wood", etc. We also notice some errors, like "./--" and "[].", not recognising punctuation ("wood", "Wood", "wood." and "wood,"), and CSS (?!): {background, #ffffff;} and px;}. These are all problems with ContentMine's ami2-word plugin and will be fixed. No worries.

More examples on how to use CProject data as JSON coming soon. Perhaps popular word combinations.

Example articles used:

  • Figure 1 and 2: Sequencing of the needle transcriptome from Norway spruce (Picea abies Karst L.) reveals lower substitution rates, but similar selective constraints in gymnosperms and angiosperms (10.1186/1471-2164-13-589)
  • Figure 3 and code block: The Transcriptomics of Secondary Growth and Wood Formation in Conifers (10.1155/2013/974324)

No comments:

Post a Comment