Monday, September 26, 2016

Citation.js on the Command Line

Last week I made Citation.js usable from the command line, with Node.js. First of all, the main file determines whether it is being run with Node.js or in the browser, and switches to Browser or Nodejs mode accordingly. This way, it knows when to use methods like document.createElement('p') and XMLHttpRequest and when not to.
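
The check itself can be as simple as testing for browser globals. A minimal sketch of the idea (this is an illustration, not Citation.js's exact code):

```js
// Sketch: pick a mode based on whether browser globals exist.
// Not the actual Citation.js source.
var mode;

if (typeof window !== 'undefined' && typeof window.document !== 'undefined') {
  mode = 'Browser'; // document.createElement(), XMLHttpRequest, etc. are available
} else {
  mode = 'Nodejs';  // fall back to Node.js alternatives
}
```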

Secondly, I created node.Citation.js, a file specialised for Node.js that loads Citation.js as a package and feeds it command line arguments. It also, of course, saves the output to a specified file. This way, you can easily process lots of files in one go, instead of having to build some sort of I/O system in your browser. You can now, for example, transform a list of Wikidata URLs or Q-numbers into a large BibTeX file. Currently, the input list needs to be JSON (although a single URL will work), but that is being worked on. The parser for Wikidata properties (on my side) is being extended as well.
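
In outline, the wrapper is not much more than the following sketch. The argument handling and the exact Citation.js calls here are assumptions for illustration, not the real interface:

```js
// Hypothetical sketch of a node.Citation.js-style wrapper; the actual
// file and the Citation.js API may differ.
// Usage: node node.Citation.js input.json output.bib

var fs = require('fs');
var Cite = require('./citation.js'); // load Citation.js as a package (path may differ)

// Input: a JSON list, e.g. of Wikidata URLs or Q-numbers
var input = JSON.parse(fs.readFileSync(process.argv[2], 'utf8'));
var data = new Cite(input);

// Save the output (here assumed to be BibTeX; option names are assumptions)
fs.writeFileSync(process.argv[3], data.get({ type: 'string', style: 'bibtex' }));
```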

Overview of the use of node.Citation.js

That's about it. I'll add some better workarounds for browser-native methods in Nodejs mode and expand node.Citation.js to handle more input formats.

Monday, September 19, 2016

Weekly Report 9: More topics

This week I wanted to extend my program, the lists of cards containing article information. In the past few weeks, I made examples with topics such as conifers and zika. In the previous report I explained how I got the facts, and how other people could get them as well. But I felt it was a bit too complex, and above all too messy. Executing nine previously unused commands while keeping an eye on the file paths, and changing the scripts as well: not user-friendly.

So I made a GitHub repository, ctj-cardlists, where people can submit topics with pull requests, and where I can improve the HTML, CSS and JavaScript without having to update every single HTML file. For getting the facts I made a Bash script, along with a guide on how to use it and how to submit a topic once you have the facts.

Cardlist page with the topic conifers (code co1)

Any interested person should now be able to add a dataset to view with the page. Every part of the page is linkable, and it is a nice way to quickly glance at a lot of articles and spot common recurrences. Even if that currently mainly means "most misspelled plant names" (Pinus masson... uh... massonsana... no, I got it. Pinus massoniaia!). It will become more interesting with new columns like in the example below.

I'll improve the page in the coming weeks, for example with search auto-completion and shorter loading times, and eventually extra features like new columns on e.g. popular genus-word combinations. One problem is that you currently can't scroll down very far anymore. This is because that limitation shortens browser rendering times a lot, and a proper fix on my side isn't in place yet. Search results and subpages focusing on a specific article/genus/species work as expected (actually better, because the shorter loading times allow more cards in general). I'll try to fix the scrolling problem soon.

Sunday, September 11, 2016

Weekly Report 8: Visualising Zika articles

Last week I wanted to look into extracting more facts, and the relations between found species and compounds. This would be done by extending ami. However, it became clear that there will be big improvements to ami in the future, and that things like ChemicalTagger and OSCAR are planned to be implemented anyway. It's better to wait for that work to complete before extending ami for my own purposes.

Instead, I improved the card page for future use. I didn't have much time this week, so I mainly wanted to demonstrate how you can use the page with other data.

Article page

Here it is. It's very similar, of course: it has the same design and a comparable data structure. Word clouds now work. You can view them by opening an article and clicking "Click to load word cloud". They use a custom API built on cloud.js (repo, license), which works by providing the URL parameters file (the URL of a file with the ctj output structure containing word data) and pmcid (the PubMed Central ID).
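
A request could be constructed like this; the endpoint here is a placeholder, and only the two parameters come from the actual API:

```js
// Hypothetical request URL for the word cloud API; the endpoint is a
// placeholder, only the 'file' and 'pmcid' parameters are real.
var endpoint = 'https://example.org/wordcloud';
var params = {
  file: 'https://example.org/data/words.json', // ctj output with word data
  pmcid: 'PMC1234567'                          // PubMed Central ID of the article
};

var url = endpoint +
  '?file=' + encodeURIComponent(params.file) +
  '&pmcid=' + encodeURIComponent(params.pmcid);
```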

I'll talk some more about the process of getting data to display in a similar manner. Below is a command dump, but it doesn't cover custom programs. First you get papers, with getpapers. I used the query 'ABSTRACT:zika OR ABSTRACT:dengue OR ABSTRACT:spondweni'. There is nothing really special about it: ABSTRACT: helps ensure an article actually covers the topic, and the other parts are just topics. You can replace it with anything you want. A limit of 500 will do for now.

Then you take it through the ContentMine pipeline (i.e. norma and ami), using the ami plugins ami2-species, ami2-words and ami2-sequence. This gives a file structure as output, which you can convert to JSON with ctj. Next, you shrink the file by removing all data you don't use with c05.js, which I'll document later. Its file paths are hard-coded, but if you stick to the file structure I've used in the command dump it should work. Finally, you change the file paths in card_c05.html to what you want.

To get the word cloud API working, you use c05-words.js. The file paths are hard-coded in this file as well, so watch out for that: it may try to save a file in a directory that doesn't exist. I'll solve this sometime. Change the file path at line 208 to the output of c05-words.js, and you should be done... Note that you can't load files over the file:// protocol, so you may have to host them somewhere.

Commands used

Next week I'll probably add a better search function and similar things, and see if I can help with extending ami.

Sunday, September 4, 2016

Weekly Report 7: Interactive ContentMine output

This weekly report covers the past two weeks. I blogged twice last week, and I figured that was enough.

Last week I blogged about word clouds from ContentMine output. I also blogged about ctj. This week, I have combined both into interactive lists, as seen here and in the images below.

List overview. From left to right: articles, and the genera
and species that were mentioned in the articles.

Search results, here for one paper (doi:10.1186/1471-2164-13-589)

I made a Node.js JSON-to-JSON converter (here). It takes the ctj output, strips all information that I don't use, generates some lists, and outputs a minified JSON file. I load this in the HTML file and generate the "cards". I'll probably move that process to a Node.js file as well. This will result in a larger file size, but hopefully a shorter loading time. I also need to make the scrolling more efficient; I don't need to load cards people don't view.
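
The idea of the converter, as a sketch (the field names here are made up for illustration; the actual ctj output structure differs):

```js
// Sketch of the converter's idea, not the actual script: read ctj
// output, keep only the fields the cards use, write minified JSON.
var fs = require('fs');

var input = JSON.parse(fs.readFileSync('ctj-output.json', 'utf8'));

// Keep only what the page displays (field names are illustrative)
var cards = input.articles.map(function (article) {
  return {
    pmcid: article.pmcid,
    title: article.title,
    species: article.species
  };
});

// JSON.stringify without an indent argument yields minified output
fs.writeFileSync('cards.min.json', JSON.stringify(cards));
```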

The "generate word cloud" button doesn't work yet, because it currently needs to load data from a file that's to big to put on GitHub efficiently. I'll fix this later.

In the next few weeks I'll fix the issues above and start to see how I can extract more "facts". Currently I only know where what is mentioned, where "what" is limited to species, genera, words, human genes, and regex matches. In the future I want to find metabolites and chemicals, and the relations between these and conifer species.

Conifer taxonomy - Part 2

Continuation of this post.

I got an answer quite quickly (but after posting the previous post):

The Plant List marks which species are in which genus and family, and groups families into Major Groups, e.g. gymnosperms. It also marks synonyms. With a list of conifer species and the ContentMine output, I can determine which species are not conifers, and find out how they interact with each other. Currently, I only have a list of species and genera, without any context.
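
As a sketch of that check, with made-up lists (any species in the ContentMine output that is not on The Plant List's conifer list gets flagged):

```js
// Hypothetical check: which species found by ContentMine are not conifers?
// Both lists below are made up for illustration.
var conifers = new Set([ // species marked as conifers by The Plant List
  'Pinus massoniana',
  'Picea abies'
]);

var found = ['Pinus massoniana', 'Arabidopsis thaliana']; // from ContentMine output

var nonConifers = found.filter(function (species) {
  return !conifers.has(species);
});
// -> ['Arabidopsis thaliana']
```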

The site does not really answer the question we asked in the previous post, of how families are grouped inside the gymnosperm group, but I seem to have been mistaken on that part anyway. The page about gymnosperms does have some different ideas about how the families are divided, but the Pinophyta page does not state that Pinophyta isn't part of the gymnosperms; it just says it in different words. It seems to say that it belongs to either the gymnosperms or the angiosperms, not to a group containing both.

Pine (own photo)

Anyway, that is not really important. I now have the information I want, on how species are grouped into genera and families. That is what I need, at least for now. I don't have to focus on things outside the conifer families, so I do not think I really need to know how those are divided. This concludes our quest for data for now; see my other blog posts for progress on the project.