Showing posts with label taxonomy. Show all posts

Monday, September 1, 2025

New paper: "Library of Identification Resources: a FAIR overview of taxonomic keys"

Biodiversity research is supported by an ever-increasing volume of citizen science observations, on platforms such as Waarneming.nl/Observation.org and iNaturalist.org. Taxonomic expertise is essential to sustain these platforms, but can be difficult to spread due to the decentralized nature of many citizen science projects. In our new scientific article in Biodiversity Data Journal we describe how and why to record information resources for the taxonomic identification of organisms in a FAIR database, and how to query that data to find applicable resources for an observation.

So I created the Library of Identification Resources (LoIR) which so far contains 2,158 records of such information resources, 54% of which are freely available online. At the moment, most resources are meant for groups of insects in parts of Northwestern Europe, but anyone can help by adding more resources!

Fig. 1: Geographical and taxonomic focus of the resources currently included in the Library of Identification Resources. (A) Choropleth of the geographic scopes of resources in the catalog. 460 publications with a geographic scope that cannot be expressed in administrative borders were omitted. (B) Breakdown of publications by the taxonomic group and continent. Publications spanning multiple continents and/or multiple taxonomic groups are counted for the category “Other”.

A major feature of the LoIR is a special search engine, where someone can enter an observation of an organism, for example a hoverfly in Nijmegen, The Netherlands, and it returns the most applicable resources for that observation. It works by comparing the list of expected species of hoverflies in The Netherlands to the different available resources. Try it out!

As the database and search engine grow, more and more citizen scientists should be able to find the resources needed to continue their extensive work.

The article, written with Eelke Jongejans, can be found here: https://doi.org/10.3897/BDJ.13.e161726

Monday, March 24, 2025

New paper: Identification of Cholevinae larvae

In 2022, I started my Master’s in Biology, at Radboud University in the Netherlands where I had just finished my Bachelor’s degree. The Master’s programme includes two research internships of 36 EC (approx. 6 months), both of which include writing a thesis. As I had been working on a database of identification keys, I was interested in a project focused on taxonomy for my first research internship.

Thanks to Henk Siepel I ended up contacting Menno Schilthuizen at Naturalis, who suggested I work on Cholevinae larvae. Schilthuizen had been collecting Cholevinae larvae since the 1980s, and had also received material from Peter Zwick who started collecting larvae in different areas of Germany in the 1960s. The challenge was to use this material to make an identification key based on these specimens.

Although the first description of a larva of Cholevinae was published back in 1861 by J. C. Schiødte, descriptions have since been relatively few and far between. This also means that there are almost no existing identification keys for the larval Cholevinae. Making these descriptions and keys is difficult, as you need larvae from a known species. This is only possible if the larvae are cultured from adults, which takes time and effort, if molts are collected and the emerged adult is identified, or if DNA barcoding can be used. The specimens collected by Zwick and Schilthuizen were mainly obtained with the first method.

However, there happened to be a recent, detailed description of Sciodrepoides watsoni, a species for which I also had specimens. I started by comparing the larvae of S. watsoni (as well as a few of the related S. fumatus) to the drawings and descriptions made by Kilian and Mądra. From there, I could start looking at different species and identify potential areas and types of characteristics that are consistent enough within a species, but that differ between separate species. To illustrate these differences I also made schematic drawings (Fig. 1) of different sets of characteristic features. Finally, I measured certain parts of the larvae, where possible for specimens preserved in microscope slides.


Figure 1: Illustrations of Cholevinae larvae

At the end of the 6 months, I had a complete key to all species for which specimens were available, but only for the 1st instar. When the larvae molt for the first time, they gain secondary bristles, grow in size, and more, meaning the identifying characteristics cannot always be used for both the 1st instar, and the 2nd and 3rd instars. I ended up spending another year or so finalizing the key for all instars. This includes 28 of the 39 species of Cholevinae occurring in the Netherlands, and many species for which no (detailed) description was available previously. In a true full circle moment, I could add my own work to the aforementioned database of identification keys (as B1860).

Ultimately, collaborating with Schilthuizen, Siepel, and Zwick, this culminated in an article, Comparative morphology of the larval stages of Cholevinae (Coleoptera: Leiodidae), with special reference to those in the Netherlands. We were able to publish this in the final issue of Tijdschrift voor Entomologie, which is unfortunately being discontinued after 167 volumes. Again, many thanks to Menno Schilthuizen, Peter Zwick, and Henk Siepel for this great opportunity. Check it out!

References

  • Willighagen, L. (2022, August 6). Library of Identification Resources. Syntaxus Baccata. https://doi.org/10.59350/h8qka-z4a05
  • Schiødte, J. C. (1861). De metamorphosi eleutheratorum observationes: Bidrag til insekternes udviklingshistorie (pp. 1–558). Thieles Bogtrykkeri. https://doi.org/10.5962/bhl.title.8797
  • Kilian, A., & Mądra, A. (2015). Comments on the biology of Sciodrepoides watsoni watsoni (Spence, 1813) with descriptions of larvae and pupa (Coleoptera: Leiodidae: Cholevinae). Zootaxa, 3955(1), 45–64. https://doi.org/10.11646/zootaxa.3955.1.2
  • Willighagen, L. G., Schilthuizen, M., Siepel, H., & Zwick, P. (2025). Comparative morphology of the larval stages of Cholevinae (Coleoptera: Leiodidae), with special reference to those in the Netherlands. Tijdschrift Voor Entomologie, 167, 59–101. https://doi.org/10.1163/22119434-bja10033

Written with StackEdit.

Saturday, August 12, 2023

Finding shield bug nymphs on iNaturalist

While translating a key to the European shield bug nymphs (Puchkov, 1961), I went looking for pictures of the earlier life stages (nymphs, Fig. 1) of shield bugs (Pentatomoidea) on iNaturalist, and found that few observations actually had a life stage annotation. I do not have the exact numbers for Europe as a whole at that point in time, but Denmark currently has around 19.8% and the United Kingdom has around 29.4% of the observations annotated (GBIF.org, 2023).

Figure 1: Fourth instar nymph of Nezara viridula (Linnaeus, 1758). 2023.vi.22, Bad Bellingen, Germany.

So I set out to add those annotations myself instead, starting with the Netherlands, followed by the rest of the Benelux, Germany, and Ireland. Last Monday, I finished annotating the observations from France. These regions total about 80 000 observations, of which I annotated a bit more than 40 000 (again, I do not have the exact numbers from before I started).

Methods

I made these annotations on the iNaturalist Identify tool which has plenty of keyboard shortcuts that I found after using the mouse for 2000 observations. This allowed me to develop some muscle memory, and I ended up annotating a single page of 30 observations in around 60 seconds, so 2 seconds per observation. Most of that time was usually spent waiting for the images to load, and there were plenty of small glitches in the interface to further slow me down (including a memory leak requiring me to reload every 10-ish pages).

I was not able to annotate 715 of the verifiable observations (i.e. those with pictures, a location, and a time). In some cases, the pictures were simply not clear enough (or taken too closely) for me to determine the life stage with certainty. Another issue I had to work around was observations of multiple individuals at different life stages. Common were observations of egg clusters and just-hatched nymphs of Halyomorpha halys (Stål, 1855); the “parent bug” Elasmucha grisea (Linnaeus, 1758) doing parenting; kale plants infested with adults and nymphs of Eurydema; and adults of various species in the process of laying eggs. However, there were also many observations containing multiple pictures where one was of an adult and a second of a nymph, with no indication that it was the same individual at different times. There is currently no way to annotate multiple life stages on a single observation on iNaturalist except through non-standard observation fields, which are a lot more laborious to use and can be disabled by users.

Results

Coloring the observations by life stage on a map clearly shows the effect of the work, with the aforementioned countries covered in red, and most of the rest of Europe in blue (Fig. 2). (There are two other notable red patches, in Abruzzo, Italy and in Granada, Spain. These are not my doing, and seem to be caused by two prolific observers annotating their own observations, respectively esant and aggranada.)

Figure 2: Map of research-grade iNaturalist observations of Pentatomoidea in Europe, colored by whether or not they have a life stage annotation.

These annotations mean additional data is available on the seasonality of these species. For example, looking at the four most observed species already reveals that Pentatoma rufipes (Linnaeus, 1758) overwinters as nymphs, whereas the other three species overwinter as adults (Fig. 3). The larger volume of data also means that more detailed analyses with more explanatory variables can be carried out, for example on the effect of climate change on the life cycle of invasive species like H. halys.
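A minimal sketch of how such a seasonality overview could be tallied from annotated observations. The records below are made up for illustration and are not taken from iNaturalist or GBIF; the function name is likewise hypothetical.

```python
from collections import Counter
from datetime import date

# Made-up records: (observation date, annotated life stage).
observations = [
    (date(2023, 1, 14), "nymph"),
    (date(2023, 1, 30), "nymph"),
    (date(2023, 7, 2), "adult"),
    (date(2023, 7, 9), "nymph"),
    (date(2023, 8, 21), "adult"),
]

def seasonality(observations):
    """Count observations per (life stage, month) pair."""
    return Counter((stage, observed_on.month) for observed_on, stage in observations)

counts = seasonality(observations)
print(counts[("nymph", 1)])  # number of nymph observations in January
```

A species that overwinters as nymphs would show nymph counts in the winter months, while one that overwinters as adults would show adult counts there instead.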

Figure 3: Seasonality of nymphs and adults of the four most observed shield bug species.

In addition, for less common species the classification of life stages makes it possible to find more about the morphology of the earlier life stages of these species. This is useful for individuals who are working on keys (such as me), but perhaps also for computer vision models. Classifying the not-yet identified observations of nymphs as such also allows for more targeted searches by identifiers, potentially leading to even more research-grade observations of rarer species.

It should be said though, that even Chlorochroa pinicola (Mulsant & Rey, 1852), which is not particularly common in Western Europe, still has many more validated pictures on Waarneming.nl than on iNaturalist. In fact, nearly half (43.2%) of all observations with images of Pentatomoidea in Europe are in the Netherlands. These are not all annotated with a life stage though, and the Observation.org platform (which Waarneming.nl is part of) seemingly only allows curators and observers to add life stage annotations to an observation.

Luckily, iNaturalist does allow for this and enables me to contribute hopefully valuable data to GBIF for further analysis, by myself or by others. I will continue adding annotations — I have now started on the observations from Switzerland, luckily a lot fewer than those from France. At the same time, I am maintaining the high rate of annotation in the countries I have already annotated. In August, this means annotating about 200 observations per day (10–15 minutes) which is entirely doable. It does quickly start to add up if you are on holiday for a week, as you do in August, but that is still fewer observations than the entirety of France. Still, for this reason I hope other identifiers (or even better, observers) start annotating more as well.

References


Written with StackEdit.

Wednesday, February 1, 2023

On the modelling of the content of identification keys

Last November I wrote a blog post about how to model the taxonomic coverage of identification keys. I wanted to model this coverage to be able to determine to what extent an identification key applies to a given observation or specimen, for use in my Library of Identification Resources project. For the same project I also find it useful to be able to archive identification keys. Many keys can be downloaded as PDFs or saved in the Internet Archive. However, some keys cannot be archived so easily:

  • Matrix-type keys (also known as multi-access keys) are usually presented as interactive apps. Sometimes these work from the Wayback Machine, but sometimes they dynamically load additional web resources in which case they do not work. Although this is less common, it also occurs for single-access keys.
  • Even non-interactive dichotomous keys can be spread out over multiple web pages. For example, in B477 each step has its own page. This can be crawled normally (although it is not yet in this example) but the key still is not in one place.
  • Some keys only exist in obscure print (or CD) sources or outdated or otherwise obscure file formats, where it may be useful to have a more standard format to archive or re-publish the data in (taking copyright into account of course).

In these cases it may be useful to archive the keys in a different format, especially if conserving the source is not feasible. In this blog post I will examine how I think I can best express these keys in more standard formats.

Metallic blue-green moth with feathered antennae, sitting on a vertical flower stem. In the out-of-focus background there is grass, other stems, and pink flowers.
Adscita statices, observed June 28th, 2020

The Library of Identification Resources supports 7 resource types: collection, checklist, gallery, reference, key, matrix, and supplement. The first one, collection, is used for pages that link to a variety of different resources that are not all in the same series, and can be modelled with schema:Collection or something similar. checklist is also relatively simple, and can be modelled with a list of dwc:Taxon entities, which could be linked to a schema:ItemList.

gallery, reference, key, and matrix are a bit more difficult to model. gallery and reference are both supersets of checklists, that respectively have just an image, or a description and optionally an image, a distribution range, and/or more info for each taxon; key are single-access keys; and matrix are multi-access keys. Regarding gallery: sure, Schema.org has a type for schema:ImageGallery but that is specifically for websites, whereas gallery also includes booklets, flyers, and CD-based programs, and the schema:Collection type from before does not capture the cohesiveness in my opinion. reference, key, and matrix are even more difficult to capture accurately.

For these types, I would like to introduce Structured Descriptive Data (SDD)¹, a standard from TDWG from 2005 that defines a format to encode descriptions and single-access and multi-access keys in XML files. gallery, reference, key, and matrix can all be expressed in SDD files. However, I think it is missing some features.

  1. It is not linked data, so it is difficult to relate the content of the key with the metadata.
  2. SDD has a <Taxon> element but it would be great if SDD files were instead interoperable with the definitions from dwc:Taxon.
  3. For reference and gallery resources, it should be possible to specify the relation between an image and a taxon.
  4. For matrix keys, SDD has qualitative/discrete characters and quantitative/continuous characters. Characters relating to temporal, seasonal, and spatial distributions are common in matrix keys, but are difficult to express in SDD. I suggest adding temporal, seasonal, and spatial character types, in both discrete and continuous variants. The former because old keys may for example have divided the continent into regions, but the user should be able to enter coordinates (so a regular discrete character does not suffice); the latter because newer keys may use the feature to include more detailed distribution ranges.
  5. For matrix keys it may be useful to assign probabilities to discrete character states per taxon.
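To illustrate the last point, here is a minimal sketch of how per-taxon probabilities for discrete character states could be used to rank candidate taxa. This is not part of SDD. The taxa, characters, probabilities, and function name are all made up, and multiplying the probabilities assumes the characters are independent.

```python
# Hypothetical data: probability of each character state, per taxon.
state_probabilities = {
    "Taxon A": {
        "wing colour": {"red": 0.9, "black": 0.1},
        "season": {"spring": 0.2, "summer": 0.8},
    },
    "Taxon B": {
        "wing colour": {"red": 0.2, "black": 0.8},
        "season": {"spring": 0.6, "summer": 0.4},
    },
}

def score_taxa(observed, state_probabilities):
    """Rank taxa by the product of the probabilities of the observed states."""
    scores = {}
    for taxon, characters in state_probabilities.items():
        score = 1.0
        for character, state in observed.items():
            # A state that is not listed for a taxon counts as impossible.
            score *= characters.get(character, {}).get(state, 0.0)
        scores[taxon] = score
    total = sum(scores.values())
    # Normalize so the scores sum to 1 when at least one taxon is possible.
    if total > 0:
        scores = {taxon: score / total for taxon, score in scores.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = score_taxa({"wing colour": "red", "season": "spring"}, state_probabilities)
```

With probabilities instead of a strict yes/no per state, an unusual specimen lowers the score of a taxon rather than excluding it outright.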

Lastly, supplement remains. This is a difficult one, even when just listing the taxonomic coverage. The goal when modelling this would be to list the differences between the original source and the “fixed” version in a standard format. If all original sources (i.e., the above types) could be expressed with linked data, the standard format could be the Linked Data Patch Format. Of course this would mean that unlike the other types, supplement would not be expressed in linked data, which is a bit unfortunate.


¹ At the time of writing, TDWG only hosts SDD Primer in the twiki source format on GitHub. I have taken the liberty to parse the twiki source and produce HTML: https://larsgw.github.io/tdwg-wiki-archive/SDD/Primer/SddIntroduction.html

Wednesday, November 9, 2022

On the modelling and application of the taxonomic coverage of identification keys

The main feature of the Library of Identification Resources is the description of the identification key (or matrix, reference, etc.). This description should on its basis specify when the key can or should be used. I have initially split this description into the taxonomic coverage and the ‘scope’. The latter includes life stage and sex but also some restrictions on the taxonomic coverage that are more difficult to characterize, like “species that cause galls on plants in the Rosa genus” or “species that live in aquatic environments”.

Hairy, black moth with long antennae with a clubbed end, and narrow, blue-speckled wings with bright red spots, with a bright yellowish green background consisting of out-of-focus grass
Zygaena filipendulae (Linnaeus, 1758)

Simple solution

For the taxonomic coverage sensu stricto, I started with the parent taxon, e.g. the family or genus. However, many keys do not treat all the species in a family, and are instead limited by a geographic scope. This geographic scope should clearly also be included. Then there are more casual keys that can be very useful but may not be complete even at the time of publication, either excluding some rare species or only including some common species. This can be detailed (to some extent) with an incomplete/complete switch. Finally, although many keys are for species, there are some keys primarily for identifying genera, families, or other ranks. Below are some example works where these aspects do or do not apply.

| Title | Parent taxon | Geographic scope | Incomplete | Target rank |
|---|---|---|---|---|
| B460: A revision of the world Embolemidae (Hymenoptera Chrysidoidea) | Embolemidae | | | |
| B1: Identification Key to the European Species of the Bee Genus Nomada Scopoli, 1770 (Hymenoptera: Apidae), Including 23 New Species | Nomada Scopoli, 1770 | Europe | | |
| B81: Key to some European species of Xylomyidae | Xylomyidae | Europe | Yes | |
| B63: MOSCHweb - Interactive key to the genera of the Palaearctic Tachinidae (Insecta: Diptera) | Tachinidae | Palaearctic realm | | genus |

If only it were that simple

It became clear quite quickly that this is not enough. For one, parent taxon, geographic scope, and target rank should be able to contain multiple values. Additionally, as we saw before, some taxonomic coverages like “gallers on Rosa sp.” cannot be captured with these parameters unless the “Parent taxon” list gets very long and detailed.

Another, more common problem is that even the combination of a parent taxon and a geographic scope is hardly specific enough to be able to say whether an identification key is reliable and complete for a situation. Species are discovered, migrate, emigrate, and become extinct. Taxa are moved to different genera, families, and orders. In B659: Orthoptères et Dermaptères (Faune de France 3) from 1922, the order Orthoptera also contains Dictyoptera, currently a superorder consisting of cockroaches, stick insects, praying mantises, and more. This is a big problem too. Key questions that should be answerable are “How well does a British key apply to a Dutch observation?” and “How well does a British key from 1950 apply to a British observation from 2020?”

To account for the changes within higher-level taxa you might want to make a list of species that are included in the keys. For B81 for example, that could look like this:

Screenshot of a webpage titled "Identification resource B81:1", linked below. The relevant part of the webpage is a tree of taxa, consisting of one family containing two genera, containing 2 and 1 species respectively.
https://purl.org/identification-resources/resource/B81:1

This gives a very clear image of the taxonomic coverage of the key: it includes three species, Solva marginata, S. varia and Xylomya maculata. The inclusion of the species in this key has a permalink, and the taxon is linked to a GBIF identifier. The list of species can then be matched (especially with the identifiers) to current checklists for the region that the observation was made in. More on that later.

This solution can easily be used for more complex taxonomic coverages, like the aforementioned key for species that cause galls on Rosa species. The ‘species’ list can also be a taxon list and have e.g. genera as the lowest rank. Another advantage is that there is no longer a need to explicitly specify the geographic scope, whether or not the key was known to be incomplete at the time of writing, and what the target taxon rank of the key is. In addition, this specific implementation also allows for multiple keys per work, which has various uses.

Two purposes

One problem that still comes up is that these taxon lists have two purposes:

  1. They describe which taxa can be distinguished with the key.
  2. They describe which taxa are considered when making the key.

These two are at odds. In a key to genera it would be simple to make a list of these genera, fulfilling (1) but not (2): if a species in one of those genera gets moved to a different one, or if an additional species in one of the existing genera appears, the key ‘breaks’. The latter could be fixed by just listing all species but at that point you might as well fix both problems with (2) by grouping the species by genus. To simultaneously fulfill purpose (1) and (2) it is necessary to divide the taxonomic tree into units that can be distinguished by the key.

Even when fulfilling purpose (1) it is still possible to partially accomplish (2) in keys for e.g. families that key out to species. If, when making the species list, higher ranks such as genera are included, the key to genera can be validated according to (2) without having to go into more detail than the key. Therefore it is important to capture the entire taxonomic tree as presented in the work.
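A minimal sketch of what serving both purposes could look like in practice. The data layout, identifiers, and function name are hypothetical and not the format used by the Library: each unit the key can distinguish (purpose 1) lists the species that were considered when the key was made (purpose 2), and a current checklist is compared against both levels.

```python
# Hypothetical layout: distinguishable unit -> species considered for it.
# Species are given as opaque identifiers so that a move to another genus
# does not change how they are matched.
key_units = {
    "Genus A": {"sp1", "sp2"},
    "Genus B": {"sp3"},
}

# Hypothetical current checklist: species identifier -> current genus.
checklist = {"sp1": "Genus A", "sp2": "Genus B", "sp3": "Genus B", "sp4": "Genus A"}

def check_key(key_units, checklist):
    """Split checklist species into those the key still handles, those that
    moved to another unit since publication, and those never considered."""
    considered = {sp: unit for unit, species in key_units.items() for sp in species}
    ok, moved, missing = set(), set(), set()
    for species, genus in checklist.items():
        if species not in considered:
            missing.add(species)
        elif considered[species] != genus:
            moved.add(species)
        else:
            ok.add(species)
    return ok, moved, missing

ok, moved, missing = check_key(key_units, checklist)
```

Here "sp2" is flagged as moved and "sp4" as missing, the two ways in which a key to genera ‘breaks’.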

Another thing that might be useful to record is the synonyms. Many of the more academic works publish along with the key a series of descriptions including synonyms. If the status of that synonymic taxon is now different from the status in the key, this may be of importance to the results of the identification. It comes down to the same split of purposes: (1) no distinction is made in the key between the two taxa but (2) both taxa are, in theory, considered. Either way it is important to list both in some way.

If only it were that simple

At this point it became clear which taxa are necessary to include in such a list. But how to describe the taxa in those lists? This is an (enormous) rabbit hole in and of itself, and if I had more experience I could make a Falsehoods Programmers Believe About Names-style post about it. It starts with the normal parts: every taxon has a name, an author, a year, and a rank. That rank can be kingdom, phylum, class, order, family, genus, or species. But there is a seemingly endless list of increasingly obscure ranks, sometimes only in use by one or a few authors (what is a “stirps”??); taxa can have multiple authors and different ways of presenting this (et/and/& and et al. or listing everyone); the author names themselves can be spelled in different ways and include or omit initials; the year can be very unclear, especially for taxa published in older works published over multiple years; and the scientific names themselves can have different spellings, capitalizations, and hyphenations as well.

As with personal names, the ‘trick’ is to trust the source to some extent, and avoid focussing on the specifics of the scientific name. After all, there is no reason to make the computer understand the scientific names. The goal is to allow for matching (and there are ways to do that without trying to standardize everything) and, secondarily, to present the names in such a way that humans can understand them (and the original key already did that).
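One possible way to match names without standardizing them, as a minimal sketch: keep the name exactly as printed for display, and derive a separate, loose comparison key from it. The normalization rules and the function name below are assumptions for illustration, not the rules used by the Library.

```python
import unicodedata

def match_key(name):
    """Reduce a scientific name to a loose key that is only used for comparison."""
    # Decompose accented letters and drop the combining marks ("ë" -> "e").
    text = unicodedata.normalize("NFKD", name)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Ignore differences in capitalization and hyphenation.
    text = text.lower().replace("-", "")
    # Collapse runs of whitespace.
    return " ".join(text.split())

# The names as printed in the source stay untouched for display.
print(match_key("Polygonia C-album") == match_key("Polygonia calbum"))
```

This deliberately leaves authors and years out of the key, since those are the parts that vary most between sources.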

Measuring applicability (more problems)

Now, with a complete and structured taxonomic tree, the question becomes how to check whether it applies to a certain observation. The idea is to compare the taxon list to a list of all the species that a certain observation could be. But where does such a list come from? The simple solution would be a checklist: a professional, thorough list of all the species that are known to occur in a certain area.

If such a checklist is not (easily) available (I am not aware of a global database of checklists, let alone in a standardised format), it might be tempting to create one from a map of observations from e.g. GBIF. The advantage of this is that you have, to some extent, global coverage and a standardised format. However, this has some caveats. The correctness of the resulting checklist is entirely dependent on the quality and quantity of the observations in the region. To get a higher sensitivity it may be useful to instead make a checklist of a larger, encompassing region, but this lowers the specificity. A species occurring in Belgium might only have recorded observations in the Netherlands on GBIF. The same goes for the time scale. There might be museum specimens of species that are now extinct in the region, but there might also be rare species that are only seen every 50 years or so.

A good addition (or a risky alternative) to a checklist would be a measure of the (relative) abundance of species in the region, and weighing the taxon lists of the keys according to that. This prioritises keys that include more common species over keys that include more rare species. Of course, the key of common species can be wrong about an observation of a rare species. Another problem is how to determine the (relative) abundance. Again it is tempting to derive this from GBIF observations, but a species with a lot of observations is not necessarily more common. It might be the focus of an observation campaign, or national attention of a different kind, it might sit still more often, it might be easier to recognize, or it might even be easier to identify to a species level.
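A minimal sketch of such a weighting, assuming abundance is approximated by a count per taxon. The identifiers, counts, and function name are made up, and as noted above observation counts are a questionable proxy for abundance.

```python
def weighted_coverage(key_taxa, abundance):
    """Share of the total abundance that falls on taxa included in the key.

    abundance maps taxon identifiers to a non-negative weight,
    for example an observation count.
    """
    total = sum(abundance.values())
    if total == 0:
        return 0.0
    covered = sum(weight for taxon, weight in abundance.items() if taxon in key_taxa)
    return covered / total

# Made-up counts: two common species and one rare one.
abundance = {"a": 900, "b": 90, "c": 10}
common_key = weighted_coverage({"a", "b"}, abundance)
rare_key = weighted_coverage({"c"}, abundance)
```

The key covering the two common species scores far higher than the key covering only the rare one, even though it would be the wrong key for an observation of that rare species.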

An additional possibility of measuring applicability is bringing the scope restrictions mentioned at the start into this. Apart from belonging to a certain taxon, the observed organism also has a biological sex, a life stage, and more characteristics that may restrict which keys can be used. To compare these characteristics between keys and organisms they need to be described with a common, consistent vocabulary. iNaturalist has such a vocabulary (the “Annotations”), but there might be more suitable ones out there somewhere.

(For this it might also be useful to map keys that distinguish castes of ants, males/females of solitary bees, or life stages of shield bugs. How do you model that? Not in the same way as described above, that is for sure.)

Using the data

As I teased in the previous blog post about the Library of Identification Resources, I have started to collect this data, and attempted to recruit some others to contribute as well. All works which have their keys (and matrices, checklists, descriptions) mapped out can be found by searching for “TRUE” in the “Tax. data extr.” field. To get from the work page to a taxa list, look for the row titled “Resources” in the first table on the page. If available it lists the individual resources in the work for which the taxonomic data is extracted.

The pages for the taxonomic data of the individual resources contain some basic information about the work (including a link), as well as the page numbers of the resource within the work if available. There is also some info on the resource, derived from the same info in the work unless different for that resource. Then there is a list of taxa, displayed as a tree. Each taxon has a rank, an anchor, and if available a link to the corresponding taxon in GBIF.

The data is also used in the proof-of-concept app that I introduced in the previous blog post. Searching for a taxon and coordinates will now query GBIF for observations in that taxon in the country encompassing the coordinates, and match this with the GBIF identifiers in the taxon lists. It displays the relative amount of taxa in the ‘checklist’ that are also in the key as a percentage and a small pie chart. It does not yet deal with synonyms.
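The percentage boils down to a set comparison, which can be sketched as follows. This is a minimal illustration and not the code of the app; the identifiers are made up and the function name is hypothetical.

```python
def coverage(checklist_ids, key_ids):
    """Fraction of the taxa in the 'checklist' that are also in the key."""
    checklist_ids = set(checklist_ids)
    if not checklist_ids:
        return 0.0
    return len(checklist_ids & set(key_ids)) / len(checklist_ids)

# Made-up identifiers standing in for GBIF taxon identifiers.
observed_in_country = [101, 102, 103, 104]
in_key = [101, 102, 103, 999]
print(f"{coverage(observed_in_country, in_key):.0%}")
```

Because matching is done on identifiers only, a taxon listed in the key under a synonym with a different identifier would not be counted, which is the limitation mentioned above.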

Saturday, August 6, 2022

Library of Identification Resources

Since around this time last year, I have been working on creating a library of identification resources. Here, “identification resources” are identification keys, multi-access (matrix) keys, and other works that can aid in the identification of species. The project is managed on GitHub: https://github.com/identification-resources.

Digital drawing of a black moth with red spots and light speckles on the forewings and red hindwings. Below the moth is a grey bar signifying a label, and the whole picture has a light beige background.
The logo: Zygaena filipendulae (Linnaeus, 1758)

About the project

Right now the project mainly consists of a catalog, a list of works (or rather “manifestations” in the FRBR model) that contain identification resources. The catalog contains bibliographic information, but also a summary of the identification resource(s) that they contain.

This summary consists of the starting taxon or taxa (e.g. Chrysididae), a region that the resource applies to (e.g. Europe), an optional scope (e.g. adults, males and females), the taxonomic rank(s) that the resource distinguishes (usually species), and whether the key was supposed to be complete for the region-scope combination at the time of publication.

This data is similar to that provided by BioInfo UK (example). They actually go into more detail, with more nuance in completeness, the equipment required for using the key (e.g. stereo microscope), whether and what specimen preparation is required, and expert reviews. The Library of Identification Resources does not include this information, but has a broader geographical scope¹ and can be searched in a more structural way.

The places (for the geographical scope of keys) and the authors and publishers linked to works as well as the works themselves are linked to Wikidata entities wherever possible. Journals and book series are linked to ISSNs.

Using the data

On top of this catalog, I built a website: https://identification-resources.github.io/. This provides a user interface, allowing for structured searches through the catalog, linking multiple versions (FRBR “manifestation”) of the same work (FRBR “work”), a few graphs exploring the distribution of languages, licenses, etc., and some additional functionality like exporting citations.

To improve searching, I am also working on an app that lists the most relevant resources for a given situation. In the proof-of-concept (GitHub) users can enter a parent taxon and a coordinate location and get (possibly) applicable resources for that situation. This uses the GBIF API for taxon searches, and the iNaturalist API for looking up places around coordinates.

Results are sorted according to a few heuristics, including:

  • The availability of full text.
  • How close the key taxon is: a key for Vespidae is likely better than one for Insecta, when identifying Vespa.
  • How recent the resource is.
  • Whether the level of detail indicates that the key can actually improve on the parent taxon.
  • Whether the resource was considered complete at the time of publication.
  • Whether the resource is a (matrix) key, just a reference or a photo gallery, or even just a collection of other resources (which are not guaranteed to contain anything relevant, and if they did, they would likely show up in the results already).
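
Heuristics like these can be combined into a single sort key. The sketch below is one possible way to do that, not the actual ranking used by the app: the field names, the order in which the criteria are applied, and the priorities per resource type are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical search result with the properties the heuristics look at."""

    title: str
    kind: str                 # "key", "matrix key", "reference", "gallery" or "collection"
    taxon_distance: int       # rank steps between the key taxon and the queried taxon
    year: int
    full_text: bool
    complete: bool
    improves_on_parent: bool  # can the key get further than the parent taxon?


# Lower is better; unknown kinds sort last. These priorities are assumed.
KIND_PRIORITY = {"key": 0, "matrix key": 0, "reference": 1, "gallery": 2, "collection": 3}


def sort_key(c: Candidate) -> tuple:
    """Tuples compare element by element, so earlier criteria weigh more."""
    return (
        not c.improves_on_parent,
        KIND_PRIORITY.get(c.kind, len(KIND_PRIORITY)),
        c.taxon_distance,
        not c.full_text,
        not c.complete,
        -c.year,  # more recent first
    )


# Made-up results for an observation identified as Vespa
candidates = [
    Candidate("Collection of links", "collection", 1, 2021, True, False, True),
    Candidate("Key to Insecta", "key", 3, 2020, True, True, True),
    Candidate("Key to Vespidae", "key", 1, 2010, True, True, True),
]
ranked = sorted(candidates, key=sort_key)
```

With this ordering, the key to Vespidae ends up above the more recent key to Insecta because it is closer to Vespa, and the collection of links ends up last.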

The data is currently biased towards the Netherlands and the rest of western Europe, as well as towards insects and mainly Hymenoptera. This is because I have mainly worked on adding resources that I was using, or resources referenced therein.

Screenshot of a website. The header has a logo of a moth, the text "Library of Identification Resources" and three links, "Catalog", "Statistics", and "GitHub". In the main part: in the top left is a paragraph of text; in the top right two input fields labeled "Taxon" and "Location" and a button labeled "Search"; on the left is a table listing items and on the right is an embedded PDF viewer
The proof-of-concept app in action.

Future plans

Improving the interface

I want to improve the interface of the proof-of-concept app and I have a number of ideas for this:

  • Allow users to “upload” photos. These would not be uploaded to a server but rather just displayed locally, but could serve a couple of purposes:
    • Automatic extraction of coordinates from EXIF data
    • Automatic determination of a parent taxon using computer vision [needed: a computer vision model :)]
    • Viewing the key side-by-side with the photos.
  • Same, but by entering a GBIF/iNaturalist observation ID.

Improving the data model

One of the main improvements that I want to make is modeling of individual keys/resources within works for two reasons:

  1. Works can contain multiple resources with completely different properties. If a work contains a key of Family A to Genus B and C, a key to the species of Genus B, and a key to the species of Genus C, those can be modeled as a single key to the species of Family A (assuming B and C are the only relevant genera). But this is not always the case. In addition, a work can contain resources of different types, like a key and a checklist (B159).
  2. Listing the taxa in the resource. I am aware of a few problems with this, but a list of species in a key can be matched to a modern checklist. This gives a better idea about how well a British key could apply to a Dutch observation (or an older British key to a more recent British observation, with all the species migration going on).
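
Matching the taxa in a key to a checklist could, in its simplest form, be a set comparison. The sketch below makes that concrete with a hypothetical function and placeholder taxon names. It compares names literally, so synonyms and spelling variants are not handled.

```python
from typing import Iterable, List, Tuple


def checklist_coverage(key_taxa: Iterable[str], checklist: Iterable[str]) -> Tuple[float, List[str]]:
    """Return the fraction of checklist taxa that the key includes,
    together with the checklist taxa that the key is missing."""
    key_set = set(key_taxa)
    checklist_set = set(checklist)
    if not checklist_set:
        return 0.0, []
    missing = sorted(checklist_set - key_set)
    covered = len(checklist_set & key_set)
    return covered / len(checklist_set), missing


# Placeholder names, not real taxa
british_key = ["Genus alpha", "Genus beta", "Genus gamma"]
dutch_checklist = ["Genus alpha", "Genus beta", "Genus delta", "Genus epsilon"]

coverage, missing = checklist_coverage(british_key, dutch_checklist)
```

Here the key covers half of the checklist, and the list of missing taxa shows which expected species cannot be reached with it.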

A screenshot of a collapsible taxonomic tree, where each taxon has a taxon name, a rank in grey text, a link icon and usually a GBIF icon
Example of what this would look like.

Contributing data

You are welcome to contribute data to the catalog. I plan to make this easier in the future, as currently the master copy of the data is still in the non-version-controlled spreadsheet that I started the work in.


¹ Note that the current data has a geographical bias, but this is not the same as a strict scope in my opinion.

Sunday, September 4, 2016

Conifer taxonomy - Part 2

Continuation of this post.

I got an answer quite quickly (but after posting the previous post):

The Plant List marks what species are in what genus and family, and groups families in Major Groups, e.g. gymnosperms. It also marks synonyms. With a list of conifer species and the ContentMine output, I can determine which species are not conifers, and find how they interact with each other. Now I only have a list of species and genera, without any context.
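
As a minimal sketch of that filtering step: given a mapping from species to family (here a tiny hand-written stand-in for data taken from The Plant List), species can be split into conifers and non-conifers. The function name and the input format are assumptions, and the set of families is the list of main conifer families from the previous post.

```python
from typing import Dict, List, Tuple

# Main conifer families, as listed in the previous post
CONIFER_FAMILIES = {
    "Araucariaceae", "Cephalotaxaceae", "Cupressaceae", "Pinaceae",
    "Podocarpaceae", "Sciadopityaceae", "Taxaceae",
}


def split_conifers(species_to_family: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split species into (conifers, others) based on their family."""
    conifers, others = [], []
    for species, family in sorted(species_to_family.items()):
        if family in CONIFER_FAMILIES:
            conifers.append(species)
        else:
            others.append(species)
    return conifers, others


# Hand-written stand-in for a lookup built from The Plant List
species_to_family = {
    "Pinus sylvestris": "Pinaceae",
    "Ginkgo biloba": "Ginkgoaceae",
    "Quercus robur": "Fagaceae",
}
conifers, others = split_conifers(species_to_family)
```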

The site does not really answer the question we asked in the previous post, of how families are grouped inside the gymnosperm group, but I seem to have been mistaken on that part anyway. The page about gymnosperms does have some different ideas about how families are divided, but the Pinophyta page does not state that it isn't part of the gymnosperms. It just says it in different words. It seems to say that Pinophyta is part of one of the two groups, gymnosperms or angiosperms, rather than of the group containing both.

Pine (own photo)

Anyway, that is not really important. I now have the information I want, on how species are grouped in genera and families. That is what I need, at least for now. I don't have to focus on things outside the conifer families, so I do not think I really have to know how they are divided. This concludes our quest for data for now, see my other blog posts for progress with the project.

Friday, August 19, 2016

Conifer taxonomy

Recently, I tried to find out the exact taxonomy of conifers. I knew that a few years earlier, when I was actively working with it, there were a few issues on Wikipedia concerning the grouping of the main conifer families, namely Araucariaceae, Cephalotaxaceae, Cupressaceae, Pinaceae, Podocarpaceae, Sciadopityaceae, Taxaceae, and actually the grouping of genera in families as well. Guess what changed: not much, not on Wikipedia anyway. The disagreement between Wikipedia pages in different languages is one thing, but the English pages were contradicting each other pretty heavily. And that is not even mentioning the position of gymnosperms in the plant hierarchy, which looks even worse. Some examples:

Subclass Cycadidae

  • Order Cycadales
    • Family Cycadaceae: Cycas
    • Family Zamiaceae: Dioon, Bowenia, Macrozamia, Lepidozamia, Encephalartos, Stangeria, Ceratozamia, Microcycas, Zamia.

Subclass Ginkgoidae

Subclass Gnetidae

Subclass Pinidae

  • Order Pinales
    • Family Pinaceae: Cedrus, Pinus, Cathaya, Picea, Pseudotsuga, Larix, Pseudolarix, Tsuga, Nothotsuga, Keteleeria, Abies
  • Order Araucariales
    • Family Araucariaceae: Araucaria, Wollemia, Agathis
    • Family Podocarpaceae: Phyllocladus, Lepidothamnus, Prumnopitys, Sundacarpus, Halocarpus, Parasitaxus, Lagarostrobos, Manoao, Saxegothaea, Microcachrys, Pherosphaera, Acmopyle, Dacrycarpus, Dacrydium, Falcatifolium, Retrophyllum, Nageia, Afrocarpus, Podocarpus
  • Order Cupressales
    • Family Sciadopityaceae: Sciadopitys
    • Family Cupressaceae: Cunninghamia, Taiwania, Athrotaxis, Metasequoia, Sequoia, Sequoiadendron, Cryptomeria, Glyptostrobus, Taxodium, Papuacedrus, Austrocedrus, Libocedrus, Pilgerodendron, Widdringtonia, Diselma, Fitzroya, Callitris (incl. Actinostrobus and Neocallitropsis), Thujopsis, Thuja, Fokienia, Chamaecyparis, Callitropsis, Cupressus, Juniperus, Xanthocyparis, Calocedrus, Tetraclinis, Platycladus, Microbiota
    • Family Taxaceae: Austrotaxus, Pseudotaxus, Taxus, Cephalotaxus, Amentotaxus, Torreya

Subclasses of the division gymnosperms, according to the Wikipedia page of gymnosperms

Ginkgo biloba leaves (source: inbetweenbays, CC-BY 4.0, via iNaturalist)

In the list above, one subclass of gymnosperms (Pinidae) contains the "traditional" conifers in three orders, and the three other subclasses contain related species. A different page, however, talks about a division called Pinophyta, and a class called Pinopsida, where all "traditional" conifers are located. To recap: the division gymnosperms contains conifers AND some other things like the Ginkgo and cycads. The division Pinophyta contains ONLY conifers. This would be possible if Pinophyta were a part of gymnosperms, but they're both divisions, so, as far as I know, they should be on the same taxonomic level. And the Wikipedia pages do not indicate anything else.

Now, my knowledge of taxonomy may not be perfect, but this doesn't seem right. So, I tweeted Ross Mounce, who has been busy making phylogenetic trees.

To be continued...