Last November I wrote a blog post about how to model the taxonomic coverage of identification keys. I wanted to model this coverage to be able to determine to what extent an identification key applies to a given observation or specimen, for use in my Library of Identification Resources project. For the same project I also find it useful to be able to archive identification keys. Many keys can be downloaded as PDFs or saved in the Internet Archive. However, some keys cannot be archived so easily:
- Matrix-type keys (also known as multi-access keys) are usually presented as interactive apps. Sometimes these work from the Wayback Machine, but sometimes they dynamically load additional web resources in which case they do not work. Although this is less common, it also occurs for single-access keys.
- Even non-interactive dichotomous keys can be spread out over multiple web pages. For example, in B477 each step has its own page. This can be crawled normally (although it is not yet in this example) but the key still is not in one place.
- Some keys only exist in obscure print (or CD) sources or outdated or otherwise obscure file formats, where it may be useful to have a more standard format to archive or re-publish the data in (taking copyright into account of course).
In these cases it may be useful to archive the keys in a different format, especially if conserving the source is not feasible. In this blog post I will examine how I think I can best express these keys in more standard formats.
Adscita statices, observed June 28th, 2020
The Library of Identification Resources supports seven types of resources: `collection`, `checklist`, `gallery`, `reference`, `key`, `matrix`, and `supplement`. The first one, `collection`, is used for pages that link to a variety of different resources that are not all in the same series, and can be modelled with `schema:Collection` or something similar. `checklist` is also relatively simple, and can be modelled with a list of `dwc:Taxon` entities, which could be linked to a `schema:ItemList`.
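As a minimal sketch of that last idea, a `checklist` could be written as JSON-LD in which a `schema:ItemList` holds `dwc:Taxon` entries. The helper name `make_checklist`, the checklist title, and the second species are assumptions for illustration only; this is not how the Library of Identification Resources currently stores its data.

```python
import json


def make_checklist(name, scientific_names):
    """Build a JSON-LD document: a schema:ItemList whose items are dwc:Taxon."""
    return {
        "@context": {
            "schema": "https://schema.org/",
            "dwc": "http://rs.tdwg.org/dwc/terms/",
        },
        "@type": "schema:ItemList",
        "schema:name": name,
        "schema:itemListElement": [
            {
                "@type": "schema:ListItem",
                "schema:position": position,
                "schema:item": {
                    "@type": "dwc:Taxon",
                    "dwc:scientificName": scientific_name,
                },
            }
            for position, scientific_name in enumerate(scientific_names, start=1)
        ],
    }


# Hypothetical example data, not taken from an actual checklist.
checklist = make_checklist(
    "Example checklist",
    ["Adscita statices", "Zygaena filipendulae"],
)
print(json.dumps(checklist, indent=2))
```

Wrapping each taxon in a `schema:ListItem` keeps the order of the checklist explicit, which may or may not matter for a given source.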
`gallery`, `reference`, `key`, and `matrix` are a bit more difficult to model. `gallery` and `reference` are both supersets of checklists: for each taxon, a `gallery` has just an image, whereas a `reference` has a description and optionally an image, a distribution range, and/or more information. `key` resources are single-access keys, and `matrix` resources are multi-access keys. Regarding `gallery`: sure, Schema.org has the type `schema:ImageGallery`, but that is specifically for websites, whereas `gallery` also includes booklets, flyers, and CD-based programs, and the `schema:Collection` type from before does not capture the cohesiveness in my opinion. `reference`, `key`, and `matrix` are even more difficult to capture accurately.
For these types, I would like to introduce Structured Descriptive Data (SDD)¹, a standard from TDWG from 2005 that defines a format to encode descriptions and single-access and multi-access keys in XML files. `gallery`, `reference`, `key`, and `matrix` can all be expressed in SDD files. However, I think it is missing some features:
- It is not linked data, so it is difficult to relate the content of the key with the metadata.
- SDD has a `<Taxon>` element, but it would be great if SDD files were instead interoperable with the definitions from `dwc:Taxon`.
- For `reference` and `gallery` resources, it should be possible to specify the relation between an image and a taxon.
- For `matrix` keys, SDD has qualitative/discrete characters and quantitative/continuous characters. Characters relating to temporal, seasonal, and spatial distributions are common in matrix keys, but are difficult to express in SDD. I suggest adding temporal, seasonal, and spatial variants of both discrete and continuous characters. The former because old keys may for example have divided the continent into regions, but the user should be able to enter coordinates (so a regular discrete character does not suffice); the latter because newer keys may use the feature to include more detailed distribution ranges.
- For `matrix` keys it may be useful to assign probabilities to discrete character states per taxon.
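The last two suggestions in the list above can be sketched together: a spatial character whose states are discrete regions but whose input is a pair of coordinates, with a probability per state and taxon. Everything here is hypothetical (the `Region` bounding boxes, the taxon names, the probabilities, and the function names), and none of it is part of SDD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A named bounding box in degrees; a stand-in for a real region polygon."""
    name: str
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat, lon):
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Hypothetical states of one discrete spatial character.
REGIONS = [
    Region("north", 50.0, 60.0, 0.0, 10.0),
    Region("south", 40.0, 50.0, 0.0, 10.0),
]

# Hypothetical probability of each state, per taxon.
STATE_PROBABILITIES = {
    "Taxon A": {"north": 0.9, "south": 0.1},
    "Taxon B": {"north": 0.2, "south": 0.8},
}


def region_for(lat, lon):
    """Map coordinates to the first matching state, or None if outside all regions."""
    for region in REGIONS:
        if region.contains(lat, lon):
            return region.name
    return None


def score_taxa(lat, lon):
    """Return the probability of the observed state for every taxon."""
    state = region_for(lat, lon)
    if state is None:
        return {}
    return {
        taxon: probabilities.get(state, 0.0)
        for taxon, probabilities in STATE_PROBABILITIES.items()
    }


print(score_taxa(45.0, 5.0))
```

The user only ever supplies coordinates, while the key itself still stores the coarse regions from the original source.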
Lastly, `supplement` remains. This is a difficult one, even when just listing the taxonomic coverage. The goal when modelling this would be to list the differences between the original source and the “fixed” version in a standard format. If all original sources (i.e., the above types) could be expressed with linked data, the standard format could be the Linked Data Patch Format. Of course this would mean that unlike the other types, `supplement` would not be expressed in linked data, which is a bit unfortunate.
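To illustrate what "listing the differences" could mean in practice, here is a rough sketch that treats the original and the fixed version as sets of triples and computes which triples a patch would have to delete and add. The triples and the `ex:` identifiers are made up, and the output is a plain Python dictionary, not actual Linked Data Patch Format syntax.

```python
def diff_triples(original, fixed):
    """Return the triples to delete from, and add to, the original source."""
    original, fixed = set(original), set(fixed)
    return {
        "delete": sorted(original - fixed),
        "add": sorted(fixed - original),
    }


# Hypothetical triples: the supplement corrects the name of one taxon.
original = [
    ("ex:taxon1", "dwc:scientificName", "Adscita statises"),
    ("ex:taxon1", "dwc:taxonRank", "species"),
]
fixed = [
    ("ex:taxon1", "dwc:scientificName", "Adscita statices"),
    ("ex:taxon1", "dwc:taxonRank", "species"),
]

patch = diff_triples(original, fixed)
print(patch)
```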
¹ At the time of writing, TDWG only hosts the SDD Primer in the twiki source format on GitHub. I have taken the liberty to parse the twiki source and produce HTML: https://larsgw.github.io/tdwg-wiki-archive/SDD/Primer/SddIntroduction.html