Monday, March 29, 2021

CDLI catalogue growth over time

Since Google Summer of Code 2020, I have been contributing code to the new framework of the Cuneiform Digital Library Initiative (CDLI), a digital repository for metadata, transliterations, images, and other data of cuneiform inscriptions, as well as tools to work with that data.

One of the features of the new CDLI framework is improved editing of artifact metadata, as well as inscriptions. Artifacts will be editable in several ways: by uploading metadata in CSV format, by batch edits in the website interface, and by editing individual artifacts. At the basis of all those pathways is the database representation of individual revisions. To evaluate the planned representation, and to see if alternatives are possible, I took a look at the catalogue metadata (CDLI 2021), and what edits are being made currently.

Artifact registrations


Figure 1: The number of artifacts registered each year, split by the collection they are in. Collections with fewer than 7,000 artifacts represented are combined into the “Other” category to keep the legend manageable. About 13,000 artifacts without valid creation dates were excluded.

First, I took a quick look at the composition of the catalogue (Fig. 1). As it turns out, some collections had most of their (included) artifacts added in a single year, such as the University of Pennsylvania Museum of Archaeology and Anthropology in 2005 and the Anadolu Medeniyetleri Müzesi in 2012. Other collections seem to have had a steadier flow of artifact entries, particularly the British Museum. Overall, though, this does not help much with selecting a database representation of revisions.

One of the options we want to evaluate is storing revisions in a linked format, similar to how artifacts are stored now, instead of a flat format. This means that if each of those artifacts has about 10 links in total (say 1 material, 1 genre, 1 holding collection, 1 language, 1 associated composite, and 5 publications), each revision would need 10 rows for the links and 1 row for the rest of the entry. Therefore, the question is: are 11 rows per revision manageable for 343,371 active artifacts?
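As a quick sanity check of that multiplication (the link counts are the rough averages assumed above, not measurements):

```javascript
// Back-of-the-envelope estimate: rows needed to store one revision of
// every active artifact in the linked format.
const ACTIVE_ARTIFACTS = 343371
const LINKS_PER_ARTIFACT = 10 // material, genre, collection, language, composite, 5 publications
const ROWS_PER_REVISION = LINKS_PER_ARTIFACT + 1 // link rows plus the artifact row itself

console.log(ACTIVE_ARTIFACTS * ROWS_PER_REVISION) // 3777081
```

So a single revision of every artifact would already approach 4 million rows.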

Revisions

To find out, let’s take a look at the daily updates of the catalogue data. From the git history, we can determine how many artifacts were edited on each day. Since the commits are made daily, multiple consecutive edits to the same artifact are counted as a single revision. On the other hand, the removal of an artifact from the catalogue might also be counted as a revision. Whether that balances out is hard to tell, so these numbers are a rough estimate. Unfortunately, the analysis only goes back to 2017, as before that the catalogue was included as a ZIP file.
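The tally itself is straightforward. A sketch of the counting step, assuming one input line per changed artifact per daily commit (a format I made up for illustration, e.g. extracted from the git diffs):

```javascript
// Count how many artifacts were edited on each day, given lines of the
// form "YYYY-MM-DD<TAB>artifact_id". Repeated same-day edits to one
// artifact collapse into a single revision, mirroring the daily commits.
function countDailyRevisions (lines) {
  const perDay = new Map()
  for (const line of lines) {
    const [date, id] = line.split('\t')
    if (!perDay.has(date)) perDay.set(date, new Set())
    perDay.get(date).add(id)
  }
  // Return { date: number of distinct artifacts revised }
  return Object.fromEntries(
    [...perDay].map(([date, ids]) => [date, ids.size])
  )
}

console.log(countDailyRevisions([
  '2017-10-01\tP000001',
  '2017-10-01\tP000001', // same artifact twice on one day: one revision
  '2017-10-01\tP000002',
  '2017-10-02\tP000001'
]))
// → { '2017-10-01': 2, '2017-10-02': 1 }
```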


Figure 2: Number of artifact revisions per year. The 12 largest revisions are highlighted and explained below.

Figure 2 highlights in various colors the 12 revisions affecting the highest number of artifacts. Seven of these are consecutive revisions in October and November of 2017, which involved editing the ID column, something that should not happen in the current system. The other large revisions usually affected a single column, thereby revising almost every artifact:

  • Padding designation numbers with leading zeroes (“CDLI Lexical 000030, ex. 13” → “013”): 2020-02-11, 2018-01-26
  • Addition of new columns: 2019-09-10
  • New default values ("" → “no translation” for the column translation_source): 2018-10-30

Outside the top 12, the edits become a lot more meaningful. Often, credits for new transliterations or translations are added, sometimes with a previously undetermined language now specified.

As it turns out, approximately 3 million edits are made every year. If all those edits are stored as linked entities, we are looking at 30 million table rows per year. However, even if the edits are stored in a flat format, there would already be 3 million table rows per year. Either option might become a problem in 5–10 years, depending on SQL performance. With indexing it might not be a problem at all: usually the only query is by identifier anyway.
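For what it is worth, that projection is a single multiplication (using the round figures of 3 million edits per year and roughly 10 rows per linked revision):

```javascript
// Projected table growth: rows added after a number of years, for the
// flat format (1 row per revision) and the linked format (~10 rows each).
const EDITS_PER_YEAR = 3e6
const projectRows = (years, rowsPerRevision) => EDITS_PER_YEAR * years * rowsPerRevision

console.log(projectRows(10, 1))  // flat after 10 years: 30000000
console.log(projectRows(10, 10)) // linked after 10 years: 300000000
```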

Changed fields

That said, let’s presume I choose the flat format. Most revisions only change one or two fields (not counting the modification date, which would not be stored). Duplicating the whole row might be wasteful, so what could I do to avoid that?

Since the flat format is based on the schema of the CSV files used for batch edits, each column can be treated as text, with an empty string for empty values. This leaves NULL (i.e. “no value”) available to represent “no change”. Since NULL values are stored very compactly (InnoDB, for instance, flags them with a single bit in the row’s NULL bitmap), effectively only the edited fields would take up space. (Otherwise, each empty value would need to be signified as such; now, the actual values carry some extra information to the same end.)
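To illustrate the convention (a sketch, not actual framework code; the field names and values are made up): building such a revision row amounts to a field-wise comparison.

```javascript
// Build a flat revision row: null means "field unchanged", an empty
// string means "field cleared", anything else is the new value.
function makeRevision (before, after) {
  const revision = {}
  for (const field of Object.keys(before)) {
    revision[field] = before[field] === after[field] ? null : after[field]
  }
  return revision
}

console.log(makeRevision(
  { designation: 'MVN 10, 102', language: 'undetermined' },
  { designation: 'MVN 10, 102', language: 'Sumerian' }
))
// → { designation: null, language: 'Sumerian' }
```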

It would also make it even easier to display a list of changes, as there is no need to compare the value with the previous one. Other operations with revisions, such as merging two revisions made simultaneously on the same version of an artifact, would be easier for the same reason.
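For example, merging two revisions made against the same base version reduces to a field-wise combination, with a conflict only when both revisions set the same field to different values (again just a sketch of the idea):

```javascript
// Merge two revisions of the same base version. null means "unchanged",
// so a conflict can only occur when both revisions set the same field.
function mergeRevisions (a, b) {
  const merged = {}
  for (const field of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const va = a[field] ?? null
    const vb = b[field] ?? null
    if (va !== null && vb !== null && va !== vb) {
      throw new Error(`conflict on field "${field}"`)
    }
    merged[field] = va !== null ? va : vb
  }
  return merged
}

console.log(mergeRevisions(
  { genre: 'Lexical', period: null },
  { genre: null, period: 'Ur III' }
))
// → { genre: 'Lexical', period: 'Ur III' }
```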

Since this would not be possible, or at least not as easy, with a linked format, perhaps it was good that the sheer volume of edits pointed me that way anyway.

References

Cuneiform Digital Library Initiative. (2021, March 8). Monthly release 2021.03 (Version 2021.03). Zenodo. http://doi.org/10.5281/zenodo.4588551

Friday, March 26, 2021

GitHub pages 404 redirection

Recently I moved the Citation.js API documentation from /api to /api/0.3, to put the new documentation on /api/0.5. I fixed all the links to the documentation, but after just a few days I still got an issue report about a 404 error. In short, I had to redirect pages from /api/* to /api/0.3/*, while all these pages are hosted as static files on GitHub Pages.

There are three ways I found to do this:

  1. I make otherwise empty HTML files in /api/* that redirect to /api/0.3/* via JavaScript or a <meta> tag.
  2. I make use of jekyll-redirect-from. This is equivalent to option 1, I think.

Option 1 seemed like a hassle, and since I do not use Jekyll, option 2 seemed out of the question as well. However, we still have option 3 to consider:

  3. I add a 404.html to the repository which gets served automatically on a 404. It then redirects to /api/0.3/* with JavaScript, and gives guidance on how to find the new URL manually if JavaScript is disabled.

404.html is just a normal 404 page with 4 lines of JavaScript:

var docsPattern = /(\/api)(\/(?!0.[35]\/)|$)/

if (docsPattern.test(location.pathname)) {
    location.pathname = location.pathname.replace(docsPattern, '$1/0.3$2')
}

Breaking down the RegExp pattern:

  • (\/api) matches “/api” in the URL
  • (\/(?!0.[35]\/)|$) matches one of two things, immediately after “/api”
    • Either $, the end of the string (like “https://citation.js.org/api” without the trailing slash)
    • Or \/(?!0.[35]\/), which matches a forward slash (“/api/”) not followed by “0.3/” or “0.5/”. This is to avoid matching things like “/apical/” or “/api/0.3/does-not-exist”.
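A few worked examples of the pattern, wrapped in a hypothetical redirect helper for illustration:

```javascript
const docsPattern = /(\/api)(\/(?!0.[35]\/)|$)/

// Return the redirected path, or the original path if it is not an
// old-style docs URL.
const redirect = path => docsPattern.test(path)
  ? path.replace(docsPattern, '$1/0.3$2')
  : path

console.log(redirect('/api'))                // '/api/0.3'
console.log(redirect('/api/Cite.html'))      // '/api/0.3/Cite.html'
console.log(redirect('/api/0.5/index.html')) // unchanged
console.log(redirect('/apical/'))            // unchanged
```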

This is not the neatest solution but I like it conceptually. It shows a bit of potential for Single-Page Applications as well: you can serve the same HTML+JavaScript for every possible path without having to rely on URLs like https://example.org/#/path/page. The problem is that you still get the 404 HTTP status (as you should), so if a browser or search crawler decides to care you have a problem.

Try it out now: https://citation.js.org/api/

The new "Page not Found" page in the same style as the homepage.