Monday, May 8, 2023

Ingesting Structured Descriptive Data into Pandoc

Ever since I found out about Structured Descriptive Data (SDD) I keep coming back to it. Right now I am planning to make two identification keys, one translated and modernized from a 1961 publication for which I have not found a good alternative, and one completely de novo (well, nearly). In both cases, I started writing out the key in a word processor with a numbered list and tabs and manual alignment etc., but in both cases that experience was pretty bad so far.

If the keys were interactive, multi-access keys there would be plenty of alternatives to word processors, including Xper3. However, these are both single-access, simple, dichotomous keys. There are CTAN packages to display such keys in LaTeX (biokey, dichokey), but those packages seem limited in several ways, and given how difficult LaTeX can be to parse, I would be locked into that choice. Normally I would encode the key data in some simple JSON format and generate HTML followed by a PDF, but at that point it makes as much sense to encode the data in SDD instead. One problem: how do I go from an SDD file to a PDF? Or HTML for a webpage, for that matter?

Pandoc

That is were Pandoc comes in. By creating a custom reader that transforms SDD files into an Pandoc AST, files can be generated as PDF, HTML, LaTeX, Word, ODT, and more. These custom readers allow for external dependencies so a pure-Lua XML parser library like xml2lua is easily imported. The challenge then becomes twofold:

  1. ‘Implementing’ the SDD format itself, and
  2. Building a layout that looks good in the various different formats, like HTML and PDF (which is LaTeX-based in Pandoc).

Firstly, SDD is in many places under-specified, showing only some suggested ways of encoding data, and not showing any examples in other places. This means that I personally need to decide on how I want certain data to be encoded in SDD (for the Pandoc reader), if multiple ways are possible.

Secondly, where in HTML (+ CSS) I could use float: right; or display: grid; to make pretty, dichotomous keys, or in LaTeX and for the PDF the environment longtblr from tabularray, a single design that works for both is more difficult. This is especially the case if you want to include numbered back-references, as it eliminates the possibility of using simple numbered lists.

Well, it turns out there is a trick for that. The AST supports RawInline, with which a reader can insert markup specific to the output format. pandoc.RawInline('html', '<span style="float: right;">) for HTML and pandoc.RawInline('tex', '\hfill') for LaTeX (and PDF), et cetera. It is still better to use regular Pandoc features as much as possible to allow for more flexibility in selecting output formats, but I feel like progressive enhancements like these are definitely “okay” to apply.

SDD limitations

In spite of the freedom and under-specification that SDD gives in some places, the format (as documented) is limited in other places. Apart from not being able to incorporate related formats from TDWG such as Darwin Core (other than through non-standard extensions), some of features are missing:

  • Endpoints of identification keys (<Lead>) can only have a <TaxonName> or a <Subkey>, not both, meaning that the many publications that list the key to genera separate to each genus subkey to species, cannot be encoded in an optimal way.
  • Endpoints of identification keys (<Lead>) are only shown to have one <TaxonName>, so keys leading to a species pair (or a differently-sized, non-taxonomic group of species) cannot be encoded. (At the same time, descriptions can be linked to multiple taxa with <Scope>.)
  • Descriptions (<NaturalLanguageData>) and identification keys are not shown to allow markup. Confusingly, the term “markup” is used, but as far as I can tell only for annotating the text with concepts and publications; not for putting taxon names in italics or other styling.

Results

After quite some time I was finally content with the key that I was testing it with (Fig. 1). The resulting filter is available at https://github.com/larsgw/pandoc-reader-sdd, and will probably see continued development as I try to use it for other keys. A major bottleneck is still getting the SDD in the first place, but I hope that becomes easier as I continue developing scripts and tooling. Overall though, the Pandoc reader has been a great success.

Screenshot of page 17 and 18 of a document showing a key to the genera of Scutelleridae, a drawing of Odontoscelis fuliginosus, and the start of a key to Odontoscelis.
Figure 1: Page 17 and 18 of the PDF, made with Pandoc from a Structured Descriptive Data XML file. Translated, original text and figures from Puchkov, 1961 (Fauna of Ukraine, vol. 21 (1), http://www.izan.kiev.ua/fau-ukr/vol21iss1.htm).