Sunday, December 31, 2023

Citation.js: 2023 in review

This past year was relatively quiet for Citation.js, as changes to the more prominent components (BibTeX, RIS, Wikidata) start to slow down. I believe this is a good sign, and that it indicates the quality of the mappings is high. Nonetheless, following the reworks of BibTeX and RIS in the past couple of years, some room for improvement still came up.


Tytthaspis sedecimpunctata, observed May 9th, 2021, Sint-Oedenrode, The Netherlands.

Changes

  • BibTeX: The mappings of the fields howpublished, langid, and addendum are improved. Plain BibTeX now allows doi, url, and more (see below).
  • RIS: PY (publication year) is now always exported, and imported more resiliently.
  • Wikidata: Names of institutions, and author names that differ from the person name, are now handled better.
  • CSL: a single citation entry can now be exported as documented.

New plugins

New users

  • I started working on a publicly available API running Citation.js on Wikimedia Toolforge. This API can be used to extract bibliographical data from Wikidata items, or to import bibliographical data into Wikidata. Available at https://citation-js.toolforge.org/.
  • I found out that Codeberg uses Citation.js for the “Cite this repository” feature.

New Year’s Eve tradition

After the releases on New Year’s Eve of 2016, 2017, 2021, and 2022, this New Year’s Eve also brings the new v0.7.5 release. It turns out that plain BibTeX has more fields than documented in the manual! At some point, natbib introduced the plainnat styles which include doi, eid, isbn, issn, and url. These are now supported in bibtex export, as well as strict BibTeX import. (The default, non-strict BibTeX import is basically just BibLaTeX, for which these fields were already supported.) Thank you to @jheer and @kc9jud for bringing this up!

Happy New Year!

Saturday, August 12, 2023

Finding shield bug nymphs on iNaturalist

Working on translating a key to the European shield bug nymphs (Puchkov, 1961) I thought I would look for pictures of the earlier life stages (nymphs, Fig. 1) of shield bugs (Pentatomoidea) on iNaturalist and found few observations actually had the life stage annotation. I do not have the exact numbers of Europe as a whole at that point in time, but Denmark currently has around 19.8% and the United Kingdom has around 29.4% of the observations annotated (GBIF.org, 2023).

Figure 1: Fourth instar nymph of Nezara viridula (Linnaeus, 1758). 2023.vi.22, Bad Bellingen, Germany.

So I set out to add those annotations myself instead, starting with the Netherlands, followed by the rest of the Benelux, Germany, and Ireland. Last Monday, I finished the annotating the observations of France. These regions total to about 80 000 observations, of which I annotated a bit more than 40 000 (again, I do not have the exact numbers from before I started).

Methods

I made these annotations on the iNaturalist Identify tool which has plenty of keyboard shortcuts that I found after using the mouse for 2000 observations. This allowed me to develop some muscle memory, and I ended up annotating a single page of 30 observations in around 60 seconds, so 2 seconds per observation. Most of that time was usually spent waiting for the images to load, and there were plenty of small glitches in the interface to further slow me down (including a memory leak requiring me to reload every 10-ish pages).

I was not able to annotate 715 of the verifiable observations (i.e. those with pictures, a location, and a time). In some cases, the pictures were simply not clear enough (or taken too closely) for me to determine with certainty the life stage. Another issue I had to work around were observations of multiple individuals at different life stages. Common were observations of egg clusters and just-hatched nymphs of Halyomorpha halys (Stål, 1855); the “parent bug” Elasmucha grisea (Linnaeus, 1758) doing parenting; kale plants infested with adults and nymphs of Eurydema; and adults of various species in the process of laying eggs. However, there were also many observations containing multiple pictures where one was of an adult and a second of a nymph, with no indication that it was the same individual at different times. There is currently no way to annotate multiple life stages on a single observation on iNaturalist except through non-standard observation fields, which are a lot more laborious to use and can be disabled by users.

Results

Coloring the observations by life stage on a map clearly shows the effect of the work, with the aforementioned countries covered in red; and the most of the rest of Europe in blue (Fig. 2). (There are two other notable red patches, in Abruzzo, Italy and in Granada, Spain. These are not my doing, and seem to be caused by two prolific observers annotating their own observations, respectively esant and aggranada.)

Figure 2: Map of research-grade iNaturalist observations of Pentatomoidea in Europe, colored by whether or not they have a life stage annotation.

These annotations mean additional data is available on the seasonality of these species. For example, looking at the four most observed species already reveals that Pentatoma rufipes (Linnaeus, 1758) overwinters as nymphs, whereas the other three species overwinter as adults (Fig. 3). The larger volume of data also means that more detailed analyses with more explanatory variables can be carried out. For example, the effect of climate change on the life cycle of invasive species like H. halys.

Figure 3: Seasonality of nymphs and adults of the four most of observed shield bug species.

In addition, for less common species the classification of life stages makes it possible to find more about the morphology of the earlier life stages of these species. This is useful for individuals who are working on keys (such as me), but perhaps also for computer vision models. Classifying the not-yet identified observations of nymphs as such also allows for more targeted searches by identifiers, potentially leading to even more research-grade observations of rarer species.

It should be said though, that even Chlorochroa pinicola (Mulsant & Rey, 1852), which is not particularly common in West Europe, still has many more validated pictures on Waarneming.nl than on iNaturalist. In fact, nearly half (43.2%) of all observations with images of Pentatomoidea in Europe are in the Netherlands. These are not all annotated with a life stage though, and the Observation.org platform (which Waarneming.nl is part of) seemingly only allows curators and observers add life stage annotations to an observation.

Luckily, iNaturalist does allow for this and enables me to contribute hopefully valuable data to GBIF for further analysis, by myself or by others. I will continue adding annotations — I have now started on the observations from Switzerland, luckily a lot fewer than those from France. At the same time, I am maintaining the high rate of annotation in the countries I have already annotated. In August, this means annotating about 200 observations per day (10–15 minutes) which is entirely doable. It does quickly start to add up if you are on holiday for a week, as you do in August, but that is still fewer observations than the entirety of France. Still, for this reason I hope other identifiers (or even better, observers) start annotating more as well.

References


Written with StackEdit.

Monday, May 8, 2023

Ingesting Structured Descriptive Data into Pandoc

Ever since I found out about Structured Descriptive Data (SDD) I keep coming back to it. Right now I am planning to make two identification keys, one translated and modernized from a 1961 publication for which I have not found a good alternative, and one completely de novo (well, nearly). In both cases, I started writing out the key in a word processor with a numbered list and tabs and manual alignment etc., but in both cases that experience was pretty bad so far.

If the keys were interactive, multi-access keys there would be plenty of alternatives to word processors, including Xper3. However, these are both single-access, simple, dichotomous keys. There are CTAN packages to display such keys in LaTeX (biokey, dichokey), but those packages seem limited in several ways, and given how difficult LaTeX can be to parse, I would be locked into that choice. Normally I would encode the key data in some simple JSON format and generate HTML followed by a PDF, but at that point it makes as much sense to encode the data in SDD instead. One problem: how do I go from an SDD file to a PDF? Or HTML for a webpage, for that matter?

Pandoc

That is were Pandoc comes in. By creating a custom reader that transforms SDD files into an Pandoc AST, files can be generated as PDF, HTML, LaTeX, Word, ODT, and more. These custom readers allow for external dependencies so a pure-Lua XML parser library like xml2lua is easily imported. The challenge then becomes twofold:

  1. ‘Implementing’ the SDD format itself, and
  2. Building a layout that looks good in the various different formats, like HTML and PDF (which is LaTeX-based in Pandoc).

Firstly, SDD is in many places under-specified, showing only some suggested ways of encoding data, and not showing any examples in other places. This means that I personally need to decide on how I want certain data to be encoded in SDD (for the Pandoc reader), if multiple ways are possible.

Secondly, where in HTML (+ CSS) I could use float: right; or display: grid; to make pretty, dichotomous keys, or in LaTeX and for the PDF the environment longtblr from tabularray, a single design that works for both is more difficult. This is especially the case if you want to include numbered back-references, as it eliminates the possibility of using simple numbered lists.

Well, it turns out there is a trick for that. The AST supports RawInline, with which a reader can insert markup specific to the output format. pandoc.RawInline('html', '<span style="float: right;">) for HTML and pandoc.RawInline('tex', '\hfill') for LaTeX (and PDF), et cetera. It is still better to use regular Pandoc features as much as possible to allow for more flexibility in selecting output formats, but I feel like progressive enhancements like these are definitely “okay” to apply.

SDD limitations

In spite of the freedom and under-specification that SDD gives in some places, the format (as documented) is limited in other places. Apart from not being able to incorporate related formats from TDWG such as Darwin Core (other than through non-standard extensions), some of features are missing:

  • Endpoints of identification keys (<Lead>) can only have a <TaxonName> or a <Subkey>, not both, meaning that the many publications that list the key to genera separate to each genus subkey to species, cannot be encoded in an optimal way.
  • Endpoints of identification keys (<Lead>) are only shown to have one <TaxonName>, so keys leading to a species pair (or a differently-sized, non-taxonomic group of species) cannot be encoded. (At the same time, descriptions can be linked to multiple taxa with <Scope>.)
  • Descriptions (<NaturalLanguageData>) and identification keys are not shown to allow markup. Confusingly, the term “markup” is used, but as far as I can tell only for annotating the text with concepts and publications; not for putting taxon names in italics or other styling.

Results

After quite some time I was finally content with the key that I was testing it with (Fig. 1). The resulting filter is available at https://github.com/larsgw/pandoc-reader-sdd, and will probably see continued development as I try to use it for other keys. A major bottleneck is still getting the SDD in the first place, but I hope that becomes easier as I continue developing scripts and tooling. Overall though, the Pandoc reader has been a great success.

Screenshot of page 17 and 18 of a document showing a key to the genera of Scutelleridae, a drawing of Odontoscelis fuliginosus, and the start of a key to Odontoscelis.
Figure 1: Page 17 and 18 of the PDF, made with Pandoc from a Structured Descriptive Data XML file. Translated, original text and figures from Puchkov, 1961 (Fauna of Ukraine, vol. 21 (1), http://www.izan.kiev.ua/fau-ukr/vol21iss1.htm).

Wednesday, February 1, 2023

On the modelling of the content of identification keys

Last november I wrote a blog post about how to model the taxonomic coverage of identification keys. I wanted to model this coverage to be able to determine to what extent an identification key applies to a given observation or specimen, for use in my Library of Identification Resources project. For the same project I also find it useful to be able to archive identification keys. Many keys can be downloaded as PDFs or saved in the Internet Archive. However, some keys cannot be archived so easily:

  • Matrix-type keys (also known as multi-access keys) are usually presented as interactive apps. Sometimes these work from the Wayback Machine, but sometimes they dynamically load additional web resources in which case they do not work. Although this is less common, it also occurs for single-access keys.
  • Even non-interactive dichotomous keys can be spread out over multiple web pages. For example, in B477 each step has its own page. This can be crawled normally (although it is not yet in this example) but the key still is not in one place.
  • Some keys only exist in obscure print (or CD) sources or outdated or otherwise obscure file formats, where it may be useful to have a more standard format to archive or re-publish the data in (taking copyright into account of course).

In these cases it may be useful to archive the keys in a different format, especially if conserving the source is not feasible. In this blog post I will examine how I think I can best express these keys in more standard formats.

Metallic blue-green moth with feathered antennae, sitting on a vertical flower stem. In the out-of-focus background there is grass, other stems, and pink flowers.
Adscita statices, observed June 28th, 2020

The Library of Identification Resources supports 7 types of resource types: collection, checklist, gallery, reference, key, matrix, and supplement. The first one, collection, is used for pages that link to a variety of different resource that are not all in the same series, and can be modelled with schema:Collection or something similar. checklist is also relatively simple, and can be modelled with a list of entities dwc:Taxon, which could be linked to a schema:ItemList.

gallery, reference, key, and matrix are a bit more difficult to model. gallery and reference are both supersets of checklists, that respectively have just an image, or a description and optionally an image, a distribution range, and/or more info for each taxon; key are single-access keys; and matrix are multi-access keys. Regarding gallery: sure, Schema.org has a type for schema:ImageGallery but that is specifically for websites, whereas gallery also includes booklets, flyers, and CD-based programs, and the schema:Collection type from before does not capture the cohesiveness in my opinion. reference, key, and matrix are even more difficult to capture accurately.

For these types, I would like to introduce Structured Descriptive Data (SDD)¹, a standard from TDWG from 2005 that defines an format to encode descriptions and single-access and multi-access keys in XML files. gallery, reference, key, and matrix can all be expressed in SDD files. However, I think it is missing some features.

  1. It is not linked data, so it is difficult to relate the content of the key with the metadata.
  2. SDD has a <Taxon> element but it would be great if SDD files were instead interoperable with the definitions from dwc:Taxon.
  3. For reference and gallery resources, it should be possible to specify the relation between an image and a taxon.
  4. For matrix keys, SDD has qualitative/discrete characters and quantitative/continuous characters. Characters relating to temporal, seasonal, and spatial distributions are common in matrix keys, but are difficult to express in SDD. I suggest temporal, seasonal, and spatial characters for both discrete and continuous characters. The former because old keys may for example have divided the continent into regions, but the user should be able to enter coordinates (so a regular discrete character does not suffice); the latter because newer keys may use the future to include more detailed distribution ranges.
  5. For matrix keys it may be useful to assign probabilities to discrete character states per taxon.

Lastly, supplement remains. This is a difficult one, even when just listing the taxonomic coverage. The goal when modelling this would be to list the differences between the original source and the “fixed” version in a standard format. If all original sources (i.e., the above types) could be expressed with linked data, the standard format could be the Linked Data Patch Format. Of course this would mean that unlike the other types, supplement would not be expressed in linked data, which is a bit unfortunate.


¹ At the time of writing, TDWG only hosts SDD Primer in the twiki source format on GitHub. I have taken the liberty to parse the twiki source and produce HTML: https://larsgw.github.io/tdwg-wiki-archive/SDD/Primer/SddIntroduction.html