Since Google Summer of Code 2020, I have been contributing code to the new framework of the Cuneiform Digital Library Initiative (CDLI), a digital repository of metadata, transliterations, images, and other data on cuneiform inscriptions, as well as tools to work with that data.
One of the features of the new CDLI framework is improved editing of artifact metadata, as well as of inscriptions. Artifacts will be editable in several ways: by uploading metadata in CSV format, by making batch edits in the website interface, and by editing individual artifacts. At the base of all these pathways lies the database representation of individual revisions. To evaluate the planned representation, and to see whether alternatives are feasible, I took a look at the catalogue metadata (CDLI 2021) and at the edits currently being made.
Artifact registrations
Figure 1: The number of artifacts registered each year, split by the collection they are in. Collections with fewer than 7,000 artifacts represented are combined into the “Other” category to keep the legend manageable. 13K artifacts without valid creation dates were excluded.
First, I took a quick look at the composition of the catalogue (Fig. 1). As it turns out, some collections had most of their (included) artifacts added in a single year, such as the University of Pennsylvania Museum of Archaeology and Anthropology in 2005 and the Anadolu Medeniyetleri Müzesi in 2012. Other collections seem to have had a steadier flow of artifact entries, particularly the British Museum. Overall, though, this does not help much with selecting a database representation of revisions.
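For the curious, the aggregation behind Figure 1 can be sketched in a few lines of pandas. The file name and column names used here (cdli_cat.csv, created_at, collection) are illustrative, not the actual schema of the catalogue dump:

```python
import pandas as pd

# Hypothetical file and column names; the real catalogue dump may differ.
cat = pd.read_csv("cdli_cat.csv", low_memory=False)

# Parse creation dates; rows without a valid date are dropped
# (the ~13K artifacts excluded from Figure 1).
created = pd.to_datetime(cat["created_at"], errors="coerce")
cat = cat.assign(year=created.dt.year).dropna(subset=["year"])

# Collections with fewer than 7,000 artifacts go into "Other".
counts = cat["collection"].value_counts()
small = counts[counts < 7000].index
cat["collection"] = cat["collection"].where(
    ~cat["collection"].isin(small), "Other"
)

# Artifacts registered per year, split by collection.
registrations = cat.groupby(["year", "collection"]).size().unstack(fill_value=0)
print(registrations)
```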
One of the options we want to evaluate is storing revisions in a linked format, similarly to how artifacts are stored now, instead of a flat format. This means that if each of those artifacts has about 10 links in total — say 1 material, 1 genre, 1 holding collection, 1 language, 1 associated composite, and 5 publications — each revision would need 10 rows for the links and 1 row for the rest of the entry. Therefore, the question is: are 11 rows per revision manageable for 343,371 active artifacts?
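As a quick back-of-the-envelope check, sticking to the round numbers above:

```python
# Row counts for the linked representation: 1 entity row plus ~10 link
# rows (material, genre, collection, language, composite, publications).
ACTIVE_ARTIFACTS = 343_371
ROWS_PER_REVISION = 1 + 10

# A single batch edit touching every artifact (like the column-wide
# revisions discussed below) would add this many rows in one go:
print(ACTIVE_ARTIFACTS * ROWS_PER_REVISION)  # 3,777,081
```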
Revisions
To find out, let’s take a look at the daily updates of the catalogue data. With the git history, we can find out how many artifacts were edited on each day. Since the commits are made daily, multiple consecutive edits to the same artifact are counted as a single revision. On the other hand, the removal of an artifact from the catalogue might be counted as a revision. Whether that balances out is hard to tell, so these numbers are a rough estimate. Unfortunately, the analysis only goes back to 2017, as before that the catalogue was included as a ZIP file.
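The counting can be sketched roughly as follows. The catalogue path is hypothetical, and the sketch naively treats the first commit (which adds every row) like any other:

```python
import subprocess
from collections import Counter

CATALOGUE = "cdli_cat.csv"  # hypothetical path of the daily catalogue dump

def commits() -> list[tuple[str, str]]:
    """(hash, date) of every commit touching the catalogue, oldest first."""
    out = subprocess.run(
        ["git", "log", "--reverse", "--format=%H %as", "--", CATALOGUE],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(line.split()) for line in out.splitlines()]

def changed_ids(commit: str) -> set[str]:
    """IDs of artifacts whose row was added, changed, or removed."""
    diff = subprocess.run(
        ["git", "show", "--unified=0", "--format=", commit, "--", CATALOGUE],
        capture_output=True, text=True, check=True,
    ).stdout
    # Assumes one artifact per CSV line, keyed by an ID in the first column.
    return {
        line[1:].split(",", 1)[0]
        for line in diff.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    }

# One daily commit = one revision per changed artifact.
revisions_per_year = Counter()
for commit, date in commits():
    revisions_per_year[date[:4]] += len(changed_ids(commit))
print(revisions_per_year)
```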
Figure 2: Number of artifact revisions per year. The 12 largest revisions are highlighted and explained below.
Figure 2 highlights in various colors the 12 revisions affecting the highest numbers of artifacts. Seven of these are consecutive revisions in October and November of 2017, which involved editing the ID column, something that should not happen in the current system. The other large revisions usually affected a single column, thereby revising almost every artifact:
- Padding designation numbers with leading zeroes (“CDLI Lexical 000030, ex. 13” → “CDLI Lexical 000030, ex. 013”): 2020-02-11, 2018-01-26
- Addition of new columns: 2019-09-10
- New default values (“” → “no translation” for the column translation_source): 2018-10-30
Outside the top 12, the edits become a lot more meaningful. Often, credits for new transliterations or translations are added, sometimes with a previously undetermined language now specified.
As it turns out, approximately 3 million revisions are made every year. If all those edits are stored as linked entities, we are looking at 30 million table rows per year. However, even in a flat format that is already 3 million table rows per year. Either option might become a problem in 5–10 years, depending on SQL performance. With indexing it might not be a problem at all: usually the only query is by identifier anyway.
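To put those numbers in perspective over that horizon:

```python
# Rough projection of table growth under both formats, using the
# estimates from the text: 3M revisions/year and the round figure of
# 10 link rows per revision for the linked format.
REVISIONS_PER_YEAR = 3_000_000
LINKS_PER_REVISION = 10

for years in (1, 5, 10):
    flat = REVISIONS_PER_YEAR * years
    linked = flat * LINKS_PER_REVISION
    print(f"{years:>2} yr: flat {flat:>11,} rows, linked {linked:>12,} rows")
#  1 yr: flat   3,000,000 rows, linked   30,000,000 rows
#  5 yr: flat  15,000,000 rows, linked  150,000,000 rows
# 10 yr: flat  30,000,000 rows, linked  300,000,000 rows
```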
Changed fields
That said, let’s suppose I choose the flat format. Most revisions only change one or two fields (excluding the modification date, which would not be included). Duplicating the whole row for every revision might be wasteful, so what could be done to avoid that?
Since the flat format is based on the schema of the CSV files used for batch edits, each column can be treated as text, with an empty string for empty values. This leaves NULL (i.e. “no value”) available to represent “no change”. Because MySQL’s InnoDB stores NULL columns at almost no cost (a single bit in the row’s NULL bitmap), only the edited fields would actually take up space. (The alternative, letting NULL mean “empty”, would require marking “no change” some other way; this way, a stored value signals both the new content and the fact that the field changed.)
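To make the convention concrete, here is a minimal sketch, with illustrative field names and values, of how NULL-as-“no change” composes: replaying flat revisions over a base record yields the current state, and the non-NULL fields of each revision are exactly its change list.

```python
# Base record; the empty string is a real (empty) value, not "no change".
base = {
    "designation": "CDLI Lexical 000030, ex. 13",
    "language": "",
    "translation_source": "",
}

# Flat revisions: None stands in for SQL NULL, i.e. "field not touched".
revisions = [
    {"designation": "CDLI Lexical 000030, ex. 013",
     "language": None, "translation_source": None},
    {"designation": None, "language": "Sumerian",
     "translation_source": "no translation"},
]

state = dict(base)
for revision in revisions:
    state.update({k: v for k, v in revision.items() if v is not None})
print(state)
# {'designation': 'CDLI Lexical 000030, ex. 013',
#  'language': 'Sumerian', 'translation_source': 'no translation'}
```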
It would also make it even easier to display a list of changes, as there is no need to compare the value with the previous one. Other operations with revisions, such as merging two revisions made simultaneously on the same version of an artifact, would be easier for the same reason.
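Merging could then be sketched as a purely field-wise operation, hypothetical and much simplified:

```python
def merge(a: dict, b: dict) -> dict:
    """Merge two revisions made against the same version of an artifact."""
    merged = {}
    for field in a.keys() | b.keys():
        va, vb = a.get(field), b.get(field)  # missing fields count as NULL
        if va is not None and vb is not None and va != vb:
            raise ValueError(f"conflict on {field!r}: {va!r} vs {vb!r}")
        merged[field] = va if va is not None else vb
    return merged
```

A field only conflicts when both revisions set it to different values; everything else merges without ever consulting the artifact’s previous state.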
Since this would not be possible, or at least not as easy, with a linked format, perhaps it is just as well that the sheer volume of edits pointed me toward the flat format anyway.
References
Cuneiform Digital Library Initiative. (2021, March 8). Monthly release 2021.03 (Version 2021.03). Zenodo. https://doi.org/10.5281/zenodo.4588551