Wednesday, February 1, 2023

On the modelling of the content of identification keys

Last November I wrote a blog post about how to model the taxonomic coverage of identification keys. I wanted to model this coverage to determine to what extent an identification key applies to a given observation or specimen, for use in my Library of Identification Resources project. For the same project I also find it useful to be able to archive identification keys. Many keys can be downloaded as PDFs or saved in the Internet Archive. However, some keys cannot be archived so easily:

  • Matrix-type keys (also known as multi-access keys) are usually presented as interactive apps. Sometimes these work from the Wayback Machine, but when they dynamically load additional web resources they do not. Although less common, this also occurs for single-access keys.
  • Even non-interactive dichotomous keys can be spread out over multiple web pages. For example, in B477 each step has its own page. This can be crawled normally (although it is not yet in this example) but the key still is not in one place.
  • Some keys only exist in obscure print (or CD) sources, or in outdated or otherwise obscure file formats. For these it may be useful to have a more standard format in which to archive or re-publish the data (taking copyright into account, of course).

In these cases it may be useful to archive the keys in a different format, especially if conserving the source is not feasible. In this blog post I will examine how I think I can best express these keys in more standard formats.

Metallic blue-green moth with feathered antennae, sitting on a vertical flower stem. In the out-of-focus background there is grass, other stems, and pink flowers.
Adscita statices, observed June 28th, 2020

The Library of Identification Resources supports seven resource types: collection, checklist, gallery, reference, key, matrix, and supplement. The first one, collection, is used for pages that link to a variety of different resources that are not all in the same series, and can be modelled with schema:Collection or something similar. checklist is also relatively simple, and can be modelled as a list of dwc:Taxon entities, which could be linked to a schema:ItemList.
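
As a minimal sketch of that last idea, a checklist could be written as JSON-LD along these lines. The function name, the list title, and the choice of properties (schema:itemListElement, dwc:scientificName) are my assumptions for illustration, not a settled model:

```python
import json

# Prefixes for Schema.org and Darwin Core terms.
CONTEXT = {
    "schema": "https://schema.org/",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
}


def checklist_to_jsonld(name, scientific_names):
    """Build a schema:ItemList whose items are dwc:Taxon entities."""
    return {
        "@context": CONTEXT,
        "@type": "schema:ItemList",
        "schema:name": name,
        "schema:itemListElement": [
            {
                "@type": "schema:ListItem",
                "schema:position": position,
                "schema:item": {
                    "@type": "dwc:Taxon",
                    "dwc:scientificName": scientific_name,
                },
            }
            for position, scientific_name in enumerate(scientific_names, start=1)
        ],
    }


# Illustrative input only: two forester moth species.
checklist = checklist_to_jsonld(
    "Example checklist", ["Adscita statices", "Adscita geryon"]
)
print(json.dumps(checklist, indent=2))
```

Wrapping each taxon in a schema:ListItem keeps the order of the checklist explicit, which plain JSON-LD arrays do not guarantee.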

gallery, reference, key, and matrix are a bit more difficult to model. gallery and reference are both supersets of checklist: for each taxon, a gallery adds just an image, whereas a reference adds a description and optionally an image, a distribution range, and/or more information. key covers single-access keys, and matrix covers multi-access keys. Regarding gallery: sure, Schema.org has the schema:ImageGallery type, but that is specifically for websites, whereas gallery also includes booklets, flyers, and CD-based programs, and in my opinion the schema:Collection type from before does not capture the cohesiveness. reference, key, and matrix are even more difficult to capture accurately.

For these types, I would like to introduce Structured Descriptive Data (SDD)¹, a TDWG standard from 2005 that defines a format to encode descriptions as well as single-access and multi-access keys in XML files. gallery, reference, key, and matrix can all be expressed in SDD files. However, I think it is missing some features.

  1. It is not linked data, so it is difficult to relate the content of the key with the metadata.
  2. SDD has a <Taxon> element but it would be great if SDD files were instead interoperable with the definitions from dwc:Taxon.
  3. For reference and gallery resources, it should be possible to specify the relation between an image and a taxon.
  4. For matrix keys, SDD has qualitative/discrete characters and quantitative/continuous characters. Characters relating to temporal, seasonal, and spatial distributions are common in matrix keys, but are difficult to express in SDD. I suggest adding temporal, seasonal, and spatial variants of both discrete and continuous characters. The former because old keys may, for example, have divided the continent into regions, while the user should still be able to enter coordinates (so a regular discrete character does not suffice); the latter because newer keys may use the feature to include more detailed distribution ranges.
  5. For matrix keys it may be useful to assign probabilities to discrete character states per taxon.
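
To make points 4 and 5 a bit more concrete, here is a rough Python sketch of a discrete spatial character and of per-taxon state probabilities. None of this is part of SDD: the Region class, the bounding boxes, the taxon names, and the probabilities are all invented for illustration, and multiplying probabilities is just one naive way to combine them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One state of a discrete spatial character, as a bounding box."""
    name: str
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat, lon):
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


def state_for(regions, lat, lon):
    """Map user-entered coordinates to the first matching region, if any."""
    for region in regions:
        if region.contains(lat, lon):
            return region.name
    return None


def score_taxa(probabilities, observed):
    """Multiply, per taxon, the probabilities of the observed states.

    States that are not scored for a taxon count as probability 0.
    """
    scores = {}
    for taxon, characters in probabilities.items():
        score = 1.0
        for character, state in observed.items():
            score *= characters.get(character, {}).get(state, 0.0)
        scores[taxon] = score
    return scores


# Hypothetical key: two regions, two taxa, two characters.
REGIONS = [
    Region("north", 50.0, 60.0, 0.0, 10.0),
    Region("south", 40.0, 50.0, 0.0, 10.0),
]
PROBABILITIES = {
    "Taxon A": {"region": {"north": 0.9, "south": 0.1},
                "wing colour": {"green": 0.8, "blue": 0.2}},
    "Taxon B": {"region": {"north": 0.2, "south": 0.8},
                "wing colour": {"green": 0.5, "blue": 0.5}},
}

observed = {"region": state_for(REGIONS, 52.1, 5.3), "wing colour": "green"}
scores = score_taxa(PROBABILITIES, observed)
best = max(scores, key=scores.get)
print(observed, best)
```

The key itself still only stores discrete states (regions), but the user enters coordinates, which is the behaviour a regular discrete character cannot offer.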

Lastly, supplement remains. This is a difficult one, even when just listing the taxonomic coverage. The goal when modelling this would be to list the differences between the original source and the “fixed” version in a standard format. If all original sources (i.e., the above types) could be expressed with linked data, the standard format could be the Linked Data Patch Format. Of course this would mean that unlike the other types, supplement would not be expressed in linked data, which is a bit unfortunate.
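
As a rough sketch of what a supplement could look like under that assumption, the following computes the difference between two sets of triples and writes it out using the Add and Delete statements of the Linked Data Patch Format. The triples, the #taxon1 identifier, and the misspelled name are made up for illustration, and I have not checked the output against an LD Patch parser.

```python
def diff_to_patch(original, fixed, prefixes):
    """Render the difference between two triple sets as Add/Delete statements.

    Triples are (subject, predicate, object) tuples of strings that are
    already written in Turtle syntax.
    """
    lines = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in sorted(prefixes.items())]
    # Triples only in the original source are deleted ...
    for s, p, o in sorted(original - fixed):
        lines.append(f"Delete {{ {s} {p} {o} }} .")
    # ... and triples only in the "fixed" version are added.
    for s, p, o in sorted(fixed - original):
        lines.append(f"Add {{ {s} {p} {o} }} .")
    return "\n".join(lines)


PREFIXES = {"dwc": "http://rs.tdwg.org/dwc/terms/"}

# Hypothetical original source with a misspelled name.
original = {
    ("<#taxon1>", "a", "dwc:Taxon"),
    ("<#taxon1>", "dwc:scientificName", '"Adscita statises"'),
}
# Hypothetical corrected version, as a supplement would describe it.
fixed = {
    ("<#taxon1>", "a", "dwc:Taxon"),
    ("<#taxon1>", "dwc:scientificName", '"Adscita statices"'),
}

patch = diff_to_patch(original, fixed, PREFIXES)
print(patch)
```

Only the changed triple ends up in the patch, so the supplement stays small even when the original key is large.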


¹ At the time of writing, TDWG only hosts the SDD Primer in the twiki source format on GitHub. I have taken the liberty of parsing the twiki source to produce HTML: https://larsgw.github.io/tdwg-wiki-archive/SDD/Primer/SddIntroduction.html