Saturday, August 6, 2022

Library of Identification Resources

Since around this time last year, I have been working on creating a library of identification resources. Here, “identification resources” are identification keys, multi-access (matrix) keys, and other works that can aid in the identification of species. The project is managed on GitHub: https://github.com/identification-resources.

Digital drawing of a black moth with red spots and light speckles on the forewings and red hindwings. Below the moth is a grey bar signifying a label, and the whole picture has a light beige background.
The logo: Zygaena filipendulae (Linnaeus, 1758)

About the project

Right now the project mainly consists of a catalog, a list of works (or rather “manifestations” in the FRBR model) that contain identification resources. For each work, the catalog contains bibliographic information as well as a summary of the identification resource(s) that it contains.

This summary consists of the starting taxon or taxa (e.g. Chrysididae), a region that the resource applies to (e.g. Europe), an optional scope (e.g. adults, males and females), the taxonomic rank(s) that the resource distinguishes (usually species), and whether the key was supposed to be complete for the region-scope combination at the time of publication.
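To make the shape of such a summary concrete, here is a minimal sketch of one catalog entry as a Python data class. The class name and all field names are hypothetical and chosen for illustration; they are not the actual column names of the catalog. Using `None` for "unknown" completeness is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResourceSummary:
    """Summary of the identification resource(s) in one catalog entry.

    All field names are hypothetical, not the catalog's real schema.
    """
    taxa: List[str]           # starting taxon or taxa, e.g. ["Chrysididae"]
    region: str               # region the resource applies to, e.g. "Europe"
    target_ranks: List[str]   # rank(s) the resource distinguishes, usually ["species"]
    complete: Optional[bool]  # meant to be complete at publication? None = unknown (assumption)
    scope: List[str] = field(default_factory=list)  # optional, e.g. ["adults"]


# An entry built from the examples mentioned in the text.
example = ResourceSummary(
    taxa=["Chrysididae"],
    region="Europe",
    target_ranks=["species"],
    complete=True,
    scope=["adults", "males and females"],
)
```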

This data is similar to that provided by BioInfo UK (example). They actually go into more detail, with more nuance in completeness, the equipment required for using the key (e.g. stereo microscope), whether and what specimen preparation is required, and expert reviews. The Library of Identification Resources does not include this information, but has a broader geographical scope¹ and can be searched in a more structured way.

The places (for the geographical scope of keys) and the authors and publishers linked to works as well as the works themselves are linked to Wikidata entities wherever possible. Journals and book series are linked to ISSNs.

Using the data

On top of this catalog, I built a website: https://identification-resources.github.io/. The website provides a user interface for structured searches through the catalog, links multiple versions (FRBR “manifestations”) of the same work (FRBR “work”), shows a few graphs exploring the distribution of languages, licenses, etc., and offers some additional functionality like exporting citations.

To improve searching, I am also working on an app that lists the most relevant resources for a given situation. In the proof-of-concept (GitHub) users can enter a parent taxon and a coordinate location and get (possibly) applicable resources for that situation. This uses the GBIF API for taxon searches, and the iNaturalist API for looking up places around coordinates.
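As a rough sketch of what these two lookups could look like, the snippet below only builds the request URLs and makes no network calls. The post does not say which endpoints the proof-of-concept uses; the GBIF name-matching endpoint and the iNaturalist "places nearby" endpoint shown here are assumptions, as are the function names and the size of the bounding box.

```python
from urllib.parse import urlencode

# Assumed endpoints; the proof-of-concept may use different ones.
GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"
INAT_NEARBY_URL = "https://api.inaturalist.org/v1/places/nearby"


def gbif_match_url(name: str) -> str:
    """Build a GBIF request URL that matches a taxon name to the backbone."""
    return f"{GBIF_MATCH_URL}?{urlencode({'name': name})}"


def inat_nearby_url(lat: float, lng: float, margin: float = 0.01) -> str:
    """Build an iNaturalist request URL for places around a coordinate.

    The endpoint takes a bounding box, so the point is padded by `margin`
    degrees on each side (a hypothetical default).
    """
    params = {
        "nelat": round(lat + margin, 6),
        "nelng": round(lng + margin, 6),
        "swlat": round(lat - margin, 6),
        "swlng": round(lng - margin, 6),
    }
    return f"{INAT_NEARBY_URL}?{urlencode(params)}"


taxon_url = gbif_match_url("Vespa crabro")
places_url = inat_nearby_url(52.37, 4.9)
```

The places returned for the coordinate can then be compared against the regions of the catalog entries, and the matched taxon against their starting taxa.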

Results are sorted according to a few heuristics, including:

  • The availability of full text.
  • How close the key taxon is: a key for Vespidae is likely better than one for Insecta when identifying Vespa.
  • How recent the resource is.
  • Whether the level of detail indicates that the key can actually improve on the parent taxon.
  • Whether the resource was considered complete at the time of publication.
  • Whether the resource is a (matrix) key, just a reference or a photo gallery, or even just a collection of other resources (which are not guaranteed to contain anything relevant, and if they did, they would likely show up in the results already).
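One simple way to combine heuristics like these is a lexicographic sort key, sketched below. This is not necessarily how the app does it: the field names, the type weights, and the priority order (here simply the order of the list above) are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical weights: keys and matrix keys beat references and galleries,
# which beat collections of other resources.
TYPE_WEIGHT = {"key": 3, "matrix": 3, "reference": 2, "gallery": 1, "collection": 0}


@dataclass
class Candidate:
    title: str
    has_full_text: bool
    rank_distance: int        # taxonomic levels between key taxon and query taxon (0 = same)
    year: int
    improves_on_parent: bool  # can the key get further than the parent taxon?
    complete: bool
    resource_type: str


def sort_key(c: Candidate) -> Tuple:
    """Larger tuples are better; earlier elements take priority."""
    return (
        c.has_full_text,
        -c.rank_distance,     # closer key taxon is better
        c.year,               # more recent is better
        c.improves_on_parent,
        c.complete,
        TYPE_WEIGHT.get(c.resource_type, 0),
    )


def rank(candidates: List[Candidate]) -> List[Candidate]:
    return sorted(candidates, key=sort_key, reverse=True)


# Made-up candidates for a query on Vespa.
insecta = Candidate("Insecta key", True, 5, 2020, True, True, "key")
vespidae = Candidate("Vespidae key", True, 1, 1990, True, True, "key")
ranked = rank([insecta, vespidae])
```

With this ordering the Vespidae key comes first despite being older, because taxon closeness is compared before recency. A weighted sum would be an alternative if no single heuristic should dominate.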

The data is currently biased towards the Netherlands and the rest of western Europe, as well as towards insects and mainly Hymenoptera. This is because I have mainly worked on adding resources that I was using, or resources referenced therein.

Screenshot of a website. The header has a logo of a moth, the text "Library of Identification Resources" and three links, "Catalog", "Statistics", and "GitHub". In the main part: in the top left is a paragraph of text; in the top right two input fields labeled "Taxon" and "Location" and a button labeled "Search"; on the left is a table listing items and on the right is an embedded PDF viewer
The proof-of-concept app in action.

Future plans

Improving the interface

I want to improve the interface of the proof-of-concept app and I have a number of ideas for this:

  • Allow users to “upload” photos. These would not be uploaded to a server but rather just displayed locally, but could serve a couple of purposes:
    • Automatic extraction of coordinates from EXIF data
    • Automatic determination of a parent taxon using computer vision [needed: a computer vision model :)]
    • Viewing the key side-by-side with the photos.
  • Same, but by entering a GBIF/iNaturalist observation ID.
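For the coordinate extraction, EXIF stores latitude and longitude as three rational numbers (degrees, minutes, seconds) plus a reference letter (N/S or E/W). The conversion to decimal degrees can be sketched as follows; the function name is hypothetical, and reading the tags from the image file is left to whichever EXIF library is used. The arithmetic is the same in any language.

```python
from fractions import Fraction
from typing import Sequence, Tuple

Rational = Tuple[int, int]  # (numerator, denominator), as stored in EXIF


def dms_to_decimal(dms: Sequence[Rational], ref: str) -> float:
    """Convert an EXIF GPS (degrees, minutes, seconds) triple to decimal degrees.

    `ref` is the matching GPSLatitudeRef or GPSLongitudeRef value:
    "N" or "E" gives a positive result, "S" or "W" a negative one.
    """
    degrees, minutes, seconds = (Fraction(n, d) for n, d in dms)
    value = degrees + minutes / 60 + seconds / 3600
    if ref in ("S", "W"):
        value = -value
    return float(value)


# 52° 22' 12" N, 4° 54' 0" E
latitude = dms_to_decimal([(52, 1), (22, 1), (12, 1)], "N")
longitude = dms_to_decimal([(4, 1), (54, 1), (0, 1)], "E")
```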

Improving the data model

One of the main improvements that I want to make is modeling individual keys/resources within works, for two reasons:

  1. Works can contain multiple resources with completely different properties. If a work contains a key of Family A to Genus B and C, a key to the species of Genus B, and a key to the species of Genus C, those can be modeled as a single key to the species of Family A (assuming B and C are the only relevant genera). But this is not always the case. In addition, a work can contain resources of different types, like a key and a checklist (B159).
  2. Listing the taxa in the resource. I am aware of a few problems with this, but a list of species in a key can be matched to a modern checklist. This gives a better idea about how well a British key could apply to a Dutch observation (or an older British key to a more recent British observation, with all the species migration going on).

A screenshot of a collapsible taxonomic tree, where each taxon has a taxon name, a rank in grey text, a link icon and usually a GBIF icon
Example of what this would look like.
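Once the taxa in a key are listed, matching them to a checklist can be as simple as comparing two sets of names. The sketch below assumes both lists have already been reconciled to the same taxonomy (synonyms resolved), which is a significant assumption in practice. The function name and the species names are placeholders.

```python
from typing import Dict, Iterable, List, Union


def checklist_coverage(
    key_species: Iterable[str], checklist: Iterable[str]
) -> Dict[str, Union[float, List[str]]]:
    """Compare the species in a key with a checklist for the target region."""
    key_set = set(key_species)
    checklist_set = set(checklist)
    covered = key_set & checklist_set
    return {
        # Fraction of checklist species that the key can lead to.
        "coverage": len(covered) / len(checklist_set) if checklist_set else 0.0,
        # Species present in the region that the key does not include.
        "missing_from_key": sorted(checklist_set - key_set),
        # Species in the key that are not on the checklist.
        "not_in_checklist": sorted(key_set - checklist_set),
    }


# Placeholder names, not real taxa.
key = ["Aus bus", "Aus cus", "Aus dus"]
local_checklist = ["Aus bus", "Aus cus", "Aus eus"]
report = checklist_coverage(key, local_checklist)
```

Here two of the three checklist species are in the key, so an observation identified with this key could still be the third species, which the key does not cover.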

Contributing data

You are welcome to contribute data to the catalog. I plan to make this easier in the future, as currently the master copy of the data is still in the non-version-controlled spreadsheet that I started the work in.


¹ Note that the current data has a geographical bias, but this is not the same as a strict scope in my opinion.