
Tuesday, December 31, 2024

Citation.js: 2024 in review

This past year was relatively quiet for Citation.js as well.


Ulex europaeus, observed December 24th, 2024, Vlieland, The Netherlands.

Changes

  • BibTeX: output of non-ASCII characters was improved.
  • BibLaTeX: support for data annotations was added!
  • DOI: the DOI pattern was broadened to include non-standard DOI formats.
  • Support for ORCIDs was improved, making it possible to map authors’ ORCIDs to different formats.

New Year’s Eve tradition

After the releases on New Year’s Eve of 2016, 2017, 2021, 2022, and 2023, this New Year’s Eve also brings the new v0.7.17 release. The CSL field publisher is now mapped to the BibTeX field organization for paper-conference (inproceedings) entries.

Happy New Year!

Tuesday, October 15, 2024

Next.js, SWC, and citeproc-js

Last year I got a bug report that Citation.js was not working when built in a Next.js production environment for unclear reasons. Next.js is a popular server framework to make web applications with React, and by default transforms all JavaScript files and their dependencies into “chunks” to improve page load times. In production environments, Next.js uses the Rust-based “Speedy Web Compiler” SWC to optimize and minify JavaScript code.

I was able to figure out that somewhere along the way, this process transformed an already difficult-to-grok function (makeRegExp) in the citeproc dependency into actually broken code. After some trial and error I found the following MCVE (Minimal, Complete, and Verifiable Example):

function foo (bar) {
    var bar = bar.slice()
    return bar
}

foo(["bar"])

// equivalent to

function foo (bar) {
    var bar // a no-op in this case, apparently
    bar = bar.slice()
    return bar
}

foo(["bar"])

But then, in the chunks generated by Next.js, the argument bar gets optimized away from foo(), generating the following code (it also inlines the function).

var bar;
bar = bar.slice();

Now, this is a simple mistake to make. If you assume that var bar actually re-declares the bar argument, then the argument is clearly unused and can be removed. Due to a quirk of JavaScript, however, re-declaring a parameter with var is a no-op and the parameter keeps its value, so the incorrect assumption leads to incorrect code.
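The quirk is easy to check by hand: re-declaring a parameter with var leaves its value intact instead of resetting it to undefined.

```javascript
// Re-declaring a parameter with `var` is a no-op: the parameter
// keeps its value instead of being reset to `undefined`.
function keepsValue (x) {
    var x
    return x
}

console.log(keepsValue(42)) // 42, not undefined
```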

This is not a one-off thing though: last August I got another, similar bug report with the same cause: some slightly non-idiomatic code (CSL.parseXml) from citeproc got mis-compiled by SWC. I found another MCVE:

function foo (arg) {
    const a = { b: [] }
    const c = [a.b]
    c[0].push(arg)
    return a.b[0]
}

The compiler misses that c[0] refers to the same object as a.b and thinks that makes the function a no-op, though it does not optimize it away fully, instead producing the following:

function (n) {
    return [[]][0].push(n), [][0]
}
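The aliasing the compiler loses track of can be verified directly:

```javascript
// `c[0]` and `a.b` refer to the same array object, so a push
// through one reference is visible through the other.
const a = { b: [] }
const c = [a.b]

console.log(c[0] === a.b) // true
c[0].push('value')
console.log(a.b) // [ 'value' ]
```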

This was apparently already noticed and fixed last May though the SWC patch still has to land in a stable version of Next.js. Interestingly, the patch includes a test fixture that uses CSL.parseXml as example code; apparently citeproc is a good stress-test of JavaScript compilers.

This is all fine with me; I am not going to blame the maintainers of a complex open-source project like SWC for occasional bugs. However, I would like to see a popular framework like Next.js, with 6.5 million downloads per week and corporate backing, do more testing for such essential parts of their infrastructure. I also do not see them among the sponsors of SWC.

Edited 2024-10-15 at 17:26: Actually, the creator of SWC is also a maintainer of Next.js, though I do not know in which order. Given that, it makes more sense that they switched away from the well-tested but slower BabelJS in version 12, and more confusing why they did not test it a bit more thoroughly.

Saturday, December 31, 2022

Citation.js: 2022 in review

Following up on the previous two updates this year (Version 0.5 and a 2022 update and Version 0.6: CSL v1.0.2), here are some updates from the second half of 2022, as well as some statistics.


Sapygina decemguttata, a small wasp, observed July 8th, 2021

Changes

  • The mappings of the Wikidata plugin were updated, especially to accommodate software.
  • The default-locale setting of some CSL styles is now respected.

New plugins

New users

  • WikiPathways is in the process of moving to a new site which uses Citation.js to generate citations.
  • Page, R. D. (2022). Wikidata and the bibliography of life. PeerJ, 10, e13712. 10.7717/peerj.13712
  • Boer, D. (2021). A Novel Data Platform for Low-Temperature Plasma Physics. [Master’s thesis] Radboud University, Nijmegen, The Netherlands. URL

Statistics

@citation-js/core was downloaded approximately 240,000 times in 2022, against 205,000 times in 2021. The legacy package citation-js was downloaded approximately 160,000 times (compared to 190,000 times in 2021). This is good, as it indicates people are starting to use the new system more often. Note that most downloads of citation-js also lead to a download of @citation-js/core, so most people still use the legacy package. The shift away from citation-js seems to have started in October 2021, which coincides with the time I forgot to update that package for 2 months after releasing v0.5.2 of @citation-js/core.

Since February 2022 a relative decrease in downloads of @citation-js/plugin-wikidata compared to core is also visible. In December the Wikidata plugin was downloaded more than 50% fewer times, in fact. This is also good and the exact point of the modularisation introduced back in 2018: to let users choose which formats to include. The DOI plugin was similarly less popular, while the BibTeX and CSL plugins were — as expected — almost always included. Surprisingly, BibJSON, a very niche format, only had a 25% reduction (maybe due to confusion with BibTeX and the internal BibTeX JSON format), while RIS, after BibTeX the most common format, had a 50% reduction as well.

New Year’s Eve tradition

After the releases on New Year’s Eve of 2016, 2017, and 2021, this New Year’s Eve also brings the new v0.6.5 release which changes the priority of some RIS fields in very specific situations, most notably the date field of conference papers (now using DA instead of C2).

Happy New Year!

Tuesday, May 31, 2022

Citation.js Version 0.6: CSL v1.0.2

Since the citation-js npm package was first published, version 0.6 is the first major version of Citation.js that did not start out as a pre-release. Version 0.3 itself spent almost 6 months in pre-release, but only received updates for less than half a month. Version 0.4 spent more than a year in pre-release and received updates for about 4 months. Version 0.5 takes the cake with one and a half years in pre-release, receiving updates for a year, also making it the best-maintained version.

Yellow flowers with lots of little "rays" on greenish brown stems

Tussilago farfara, March 27th, 2022

Version 0.6 is a major version bump because it introduces a number of breaking changes, including raising the minimal Node.js version to 14. Since April 2022, Node.js 12 is End-Of-Life, which led to a lot of dependencies dropping support. Now, Citation.js does so too. Other changes include the following:

Update data format to CSL v1.0.2

The internal data format is now updated from CSL v1.0.1 to v1.0.2. This introduces the software type and the generic document type, as well as some other types, and some new fields. The event field is also renamed to event-title. That, and software replacing book, makes it so that CSL v1.0.2 is not compatible with CSL v1.0.1 styles, making it a breaking change.

  • CSL data is now automatically upgraded to v1.0.2 on input.
  • Cite#data ((new Cite()).data) now contains CSL v1.0.2 data.
  • Output formatters of plugins now receive CSL v1.0.2 data as input.
  • util (import { util } from '@citation-js/core') now has two functions, downgradeCsl and upgradeCsl, to convert data between the two versions.
  • The data formatter (.format('data')) now takes a version option. When set to '1.0.1', this downgrades the CSL data before outputting.
  • @citation-js/plugin-csl already automatically downgrades CSL to v1.0.1 for compatibility with the style files.
  • Custom fields are now generally put in the custom object, instead of prefixing an underscore to the field name.
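As a rough illustration of what such a downgrade involves (a toy sketch, not the actual util.downgradeCsl implementation), it mostly amounts to renaming and re-typing fields:

```javascript
// Toy sketch (NOT the library implementation) of downgrading CSL
// v1.0.2 items to v1.0.1: rename 'event-title' back to 'event' and
// fall back to a v1.0.1 type for new types such as 'software'.
function downgradeCslSketch (items) {
    return items.map(item => {
        const copy = { ...item }
        if ('event-title' in copy) {
            copy.event = copy['event-title']
            delete copy['event-title']
        }
        if (copy.type === 'software') copy.type = 'book'
        return copy
    })
}

const [downgraded] = downgradeCslSketch([
    { type: 'software', 'event-title': 'FooConf' }
])
console.log(downgraded) // { type: 'book', event: 'FooConf' }
```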

The mappings are also updated. Especially the RIS and BibLaTeX mappings were made more complete by the increased capabilities of the CSL data schema. Non-core plugins are also being updated, mainly affecting @citation-js/plugin-software-formats and @citation-js/plugin-zotero-translation-server.

Test coverage

While updating the plugin mappings, the test suites of the plugins were also expanded. This led to the identification of a number of bugs, which were also fixed in this release:

  • BibLaTeX
    • handling of CSL entries without a type
    • handling of bookpagination
    • handling of masterthesis
  • RIS
    • RegExp pattern for ISSNs
    • Name parsing of single-component names

Closing issues

A number of issues were also fixed in this release:

  • Adding full support for the Bib(La)TeX crossref field
  • Mapping BibLaTeX eid to number instead of page
  • Adding a mapping for the custom BibLaTeX s2id field
  • In Wikidata, getting issue/volume/page/publication date info from qualifiers as well as top-level properties.

CSL styles

The bundled styles (apa, vancouver, and harvard1) were updated. Note that harvard1 is now an alias for harvard-cite-them-right. Quoting the documentation:

The original “harvard1.csl” independent CSL style was not based on any style guide, but it nevertheless remained popular because it was included by default in Mendeley Desktop. We have now taken one step to remove this style from the central CSL repository by turning it into a dependent style that uses the Harvard author-date format specified by the popular “Cite Them Right” guide. This dependent style will likely be removed from the CSL repository entirely at some point in the future.
http://www.zotero.org/styles/harvard1, CC-BY-SA-3.0

Looking forward

Some breaking changes are still pending, mainly changes to the plugin API and the removal of some left-over APIs. However, I also want to work on a more comprehensive format for machine-readable mappings, a format for mappings for linked data, and of course implementing more mappings in general!

Monday, May 30, 2022

A story about a university login with a broken security configuration, and a mildly uncooperative help desk

Last semester I followed some courses at a different university, and went through the process of collecting login credentials and multi-factor authentication tokens and familiarizing myself with a network of university systems all over again. Most (but not all) of those systems use the main single sign-on login process of the university, at https://login.universityfoo.nl.

Note
The university has two main domains, let’s call them universityfoo.nl and universitybar.nl.

One of those systems is Brightspace, used by course coordinators to communicate course information, syllabi, and additional documents to students. Very important for someone new to the university, especially someone who did not go through the normal process of introduction weeks and tutorials. But when I logged in at https://login.universityfoo.nl, I was met with a blank screen. Other systems worked fine, including those to set up my email, but Brightspace did not.

Naturally I opened the trusted Chrome DevTools and saw the following error:

Refused to navigate to 'https://brightspace.universitybar.nl/d2l/lp/auth/login/samlLogin.d2l' because it violates the following Content Security Policy directive: "navigate-to 'self' https://*.universityfoo.nl:443 https://*.services.universitybar.nl:443".

That was already pretty clear: one of the Content Security Policy directives was simply blocking any navigation to any domains other than a short list of exceptions, which did not include the domain that Brightspace was on. But that seems like a major problem, one that would have been caught already unless I had some incredibly (un)lucky timing.

In the background, a Chrome window with a single tab showing a blank page. In the foreground, a Chrome DevTools showing the error mentioned above.

It turned out, however, that specifically the navigate-to directive was not supported yet at all, in any browser, at least according to MDN. However, in the Chromium code the following could be found:

// Content counterpart of ExperimentalContentSecurityPolicyFeatures in
// third_party/blink/renderer/platform/runtime_enabled_features.json5. Enables
// experimental Content Security Policy features ('navigate-to' and
// 'prefetch-src').
public static final String EXPERIMENTAL_CONTENT_SECURITY_POLICY_FEATURES = "ExperimentalContentSecurityPolicyFeatures";

Turns out I had the #enable-experimental-web-platform-features flag enabled, for some reason, and that flag probably included the EXPERIMENTAL_CONTENT_SECURITY_POLICY_FEATURES. I probably enabled the flag for development at some point? I do not even remember. But that meant the navigate-to directive was just wrong.

Since I did not want to disable the flag (or was not sure whether it would help), I instead turned to ModHeader: a Chrome web extension to modify requests and responses in the browser. I mainly use it to view DOI content negotiation requests in the browser instead of using cURL. With that I could modify the navigate-to part of the Content-Security-Policy header to the following (line breaks and [...] mine):

Content-Security-Policy: [...] navigate-to 'self'
https://*.universityfoo.nl:443
https://*.services.universitybar.nl:443
https://brightspace.universitybar.nl:443; [...]

This finally allowed me to log in to Brightspace.

Naturally I wanted to share my findings, especially since whenever navigate-to gets support without experimental flags, the Brightspace login breaks for everyone. So I went to the online university helpdesk, where I was also met with a blank page. Imagine that: suddenly logging in to Brightspace does not work anymore, and all the students going to the digital helpdesk are met with a blank page as well. Students panicking, the IT department (maybe) panicking because they were not doing any upgrades or maintenance or anything. Good thing I got a sneak preview of the problem, so I could warn them. First, I bypassed navigate-to for the helpdesk as well:

Content-Security-Policy: [...] navigate-to 'self'
https://*.universityfoo.nl:443
https://*.services.universitybar.nl:443
https://brightspace.universitybar.nl:443
https://helpdesk.universitybar.nl:443; [...]

However, when I sent a message detailing the problem, I was met with “can you try clearing your cache?” I did, even though I knew that was not the problem, and it did not help. I did know what would help, but they clearly did not care, since I can still reproduce the problem while writing this blog post almost 9 months later. When I confirmed that clearing the cache did not help, I was asked to disable #enable-experimental-web-platform-features. Which, sure, but that was not really the point. Anyway, I guess they will probably find out in time, but I was still a bit disappointed.

Friday, May 27, 2022

Citation.js Version 0.5 and a 2022 update

Version 0.5.0

Version 0.5.0 of Citation.js was released on April 1st, 2021.

BibTeX and BibLaTeX

After the update to the Bib(La)TeX file parser, described in the earlier BibTeX Rework: Syntax Update blog post, the mapping of BibTeX and BibLaTeX data to CSL-JSON was also updated. The mapping is now split in two, one for BibLaTeX (which is backwards-compatible with BibTeX) and one for BibTeX. The output formats were also updated to output either BibTeX-compatible files or BibLaTeX-compatible files. The most common difference there is the use of year and month versus date respectively. In addition, a number of updates were made to the file parser.

Core changes

In the Grammar utility class, bugs were fixed and behavior was updated to better account for the Bib(La)TeX parser. Some of the code for correcting CSL-JSON was also updated, including moving the code that corrects results of the Crossref API from the DOI plugin to the core module, as CSL-JSON from that API may end up in Citation.js through methods other than the DOI plugin. Earlier in 0.5 development, some of the HTTP handling code was also updated for increased stability.

2022 update

v0.5.1–v0.5.7

The versions released since v0.5.0 mostly contain bug fixes and small enhancements. The latter include some more descriptive errors in certain places, as well as mappings for some non-standard fields in Bib(La)TeX and RIS.

New site design

The design of the Citation.js site was updated for the first time since 2018. The changes were detailed in the recent Citation.js: New site blog post.

New plugins

New plugins for the refer file format (plugin-refer) and the RefWorks tagged format (plugin-refworks) were released.

More coming

More changes are expected, including more long-awaited output formats, better mappings for software and datasets, and more work on machine-readable mappings.

Citation.js: New site

I recently updated the website of Citation.js. This involved getting rid of the Material Design Lite framework, simplifying and refreshing the site design, and modernising some of the code behind it. Additionally, I updated the content of the homepage, and added some functionality to the interface of the blog page and the demo.

Homepage

The old layout of the homepage had a dark grey background, with a grid of four cards containing the main content of the site in the middle, and the Citation.js banner-variant logo between the top and bottom rows. The grid of cards had a background of syntax-highlighted source code: this is actually the start of the Citation.js v2 code, which at that point still consisted of a single file. At the very top of the page was a yellow header, and at the bottom a thin black footer.

The new layout incorporates a lot of the design elements of the first design, but in a way that hopefully improves the readability and feel of the page. The yellow header remains, but the links are centered instead of right-aligned. The footer is full-width (though the text is still centered) and has a larger font size and more vertical padding. The grid is gone; instead, the top of the page has a full-width background of code with the banner logo and some introductory text. The other content is now aligned in a single row, and the cards are replaced with plain text, although the headers still have white text with a slight shadow on a dark background.

Blog

The blog page had the same header, footer, and dark grey background in the old layout, with individual blog posts as cards and the introductory text and search bar as a slightly wider card.

The new layout mirrors the changes to the homepage, especially the white background and full-width code background and the changing of cards to plain text. To the right of the blog content is now a sidebar listing the blog posts per year, which moves to the bottom of the page on narrow screens. Below the search bar is now a clickable list of tags.

Demo page

The design of the demo page has not been updated since I made it in April 2016, being more or less plain-text but with paragraphs limited in width and centered.

The new design adds the header, footer, and code background from the homepage as well as some styles for the headers. The interface of the demo is simplified at the cost of easy-to-read code. That also means that the live view of the code is removed.

API documentation

The styles of the home page now also apply to the API documentation.

Thursday, December 2, 2021

Re-implementing the upload of images for the LaTeX→HTML converter

The CDLI is developing a new website. That website’s admin interface for its journals contains a page where a LaTeX source file, following a specific template, is converted to an HTML page. For this, apart from the LaTeX file itself, two additional components are needed: a BibTeX file containing the metadata of the references, and image files.

Current implementation

The current implementation involves uploading images separately from the form that creates the article. This can lead to “orphan” images if such a form is abandoned after uploading the images. The current implementation has an additional problem, where files are saved in a single directory, so multiple images with the same file name (say, Figure1.jpg) will overwrite previous images.

Client → Server: GET /admin/articles/add/cdlj
Server → Client: 200 (add page)
Client → Server: POST /admin/articles/convert-latex
Server → Client: 200 (converted HTML, list of images)
Client → Server: POST /admin/articles/image with 'Figure1.jpg'
Server: saves image as 'Figure1.jpg'
Server → Client: 200
Client → Server: POST /admin/articles/add/cdlj
Server → Client: 300 /articles/<sequential ID>

New implementation

With a new implementation of the rest of the forms, which simplifies a lot of the code, an attempt is also made to resolve the issues with the image uploads.

Saving images together with the metadata

To avoid the problem of “orphan” images resulting from abandoned forms, the images could be submitted in the same form that creates the article. If that form is not submitted, or contains invalid data and therefore cannot be saved, the images will not be uploaded.

Saving images according to metadata

The second problem could be solved by saving the images in subdirectories named after the metadata of the article, e.g. 2021-01/, where 2021 is the year the article is published and 01 is the article sequence number within that year. This however assumes that the metadata does not change after the initial submission.
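A minimal sketch of that directory scheme (a hypothetical helper, not actual CDLI code):

```javascript
// Hypothetical helper building the subdirectory name from article
// metadata: the publication year plus a zero-padded sequence number.
function imageDirectory (year, sequenceNumber) {
    return `${year}-${String(sequenceNumber).padStart(2, '0')}/`
}

console.log(imageDirectory(2021, 1)) // '2021-01/'
```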

Both of these solutions, however, create some constraints of their own, because they mean that the images can only be saved after the user submits the main form, and thus after the HTML containing <img> elements with references to the image locations is generated. Somehow, those image locations should be able to identify the article before any information about that article is known:

Client → Server: GET /admin/articles/add/cdlj
Server → Client: 200 (add page)
Client → Server: POST /admin/articles/convert-latex
  (at this point the HTML should contain links to the permanent locations of the images)
Server → Client: 200 (converted HTML, list of images)
Client → Server: POST /admin/articles/add/cdlj
  (at this point the images are saved in a permanent location)
Server → Client: 300 /articles/<sequential ID>

So, what is the solution? I propose the following:

Client → Server: GET /admin/articles/add/cdlj
Server: generates a random article ID
Server → Client: 200 (add page with embedded article ID)
Client → Server: POST /admin/articles/convert-latex
Server: generates image URLs according to the random article ID
Server → Client: 200 (converted HTML, list of images)
Client → Server: POST /admin/articles/add/cdlj
Server: verifies the random article ID, saves the images, generates a sequential ID
Server → Client: 300 /articles/<sequential ID>

Monday, March 29, 2021

CDLI catalogue growth over time

Since Google Summer of Code 2020 I have been contributing code to the new framework of the Cuneiform Digital Library Initiative (CDLI), a digital repository for metadata, transliterations, images, and other data of cuneiform inscriptions, as well as tools to work with that data.

One of the features of the new CDLI framework is improved editing of artifact metadata, as well as inscriptions. Artifacts will be editable in several ways: by uploading metadata in a CSV format, by batch edits in the website interface, and by editing individual artifacts. At the basis of all those pathways is the database representation of individual revisions. To evaluate the planned representation, and to see if alternatives are possible, I took a look at the catalogue metadata (CDLI 2021), and at the edits that are currently being made.

Artifact registrations


Figure 1: The number of artifacts registered each year, split by the collection they are in. Collections with less than 7000 artifacts represented are combined into the “Other” category to keep the legend manageable. 13K artifacts without valid creation dates were excluded.

First, I took a quick look at the composition of the catalogue (Fig. 1). As it turns out, some collections had most of their (included) artifacts added in a single year, such as the University of Pennsylvania Museum of Archaeology and Anthropology in 2005 and the Anadolu Medeniyetleri Müzesi in 2012. Other collections seem to have had a steadier flow of artifact entries, particularly the British Museum. Overall, though, this does not help much with selecting a database representation of revisions.

One of the options we want to evaluate is storing revisions in a linked format, similarly to how artifacts are stored now, instead of a flat format. This means that if each of those artifacts has about 10 links in total — say 1 material, 1 genre, 1 holding collection, 1 language, 1 associated composite, and 5 publications — each revision would need 10 rows for the links and 1 row for the rest of the entry. Therefore, the question is: are 11 rows per revision manageable for 343,371 active artifacts?

Revisions

To find out, let’s take a look at the daily updates of the catalogue data. With the git history, we can find out how many artifacts were edited on each day. Since the commits are made daily, multiple consecutive edits to the same artifact are counted as a single revision. On the other hand, the removal of an artifact from the catalogue might be counted as a revision. Whether that balances out is hard to tell, so these numbers are a rough estimate. The analysis unfortunately only goes back to 2017, as before that the catalogue was included as a ZIP file.
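The counting can be sketched roughly like this (a simplification of the actual analysis; the file name and output shape are assumptions):

```javascript
// Rough sketch of the revision counting: given `git log --numstat`
// output for the catalogue CSV, take the number of added lines per
// commit as an estimate of the artifacts revised that day.
function countRevisionsPerCommit (numstatOutput) {
    const revisions = {}
    let current = null
    for (const line of numstatOutput.trim().split('\n')) {
        const commit = line.match(/^commit (\w+)/)
        if (commit) {
            current = commit[1]
            revisions[current] = 0
        } else {
            const stat = line.match(/^(\d+)\t\d+\t/)
            if (stat && current) revisions[current] += Number(stat[1])
        }
    }
    return revisions
}

const sample = 'commit abc123\n14\t12\tcdli_catalogue.csv'
console.log(countRevisionsPerCommit(sample)) // { abc123: 14 }
```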


Figure 2: Number of artifact revisions per year. The 12 largest revisions are highlighted and explained below.

Figure 2 highlights in various colors the 12 revisions affecting the highest number of artifacts. Most of these are 7 consecutive revisions in October and November of 2017, which involved editing the ID column, something that should not happen in the current system. Other large revisions usually affected a single column, thereby revising almost every artifact:

  • Padding designation numbers with leading zeroes (“CDLI Lexical 000030, ex. 13” → “013”): 2020-02-11, 2018-01-26
  • Addition of new columns: 2019-09-10
  • New default values ("" → “no translation” for the column translation_source): 2018-10-30

Outside the top 12 the edits become a lot more meaningful. Often, credits for new transliterations or translations are added, sometimes with the previously undetermined language now specified.

As it turns out, approximately 3 million edits are made every year. If all those edits are stored as linked entities, we are looking at 30 million table rows, per year. However, even if the edits are stored in a flat format there would be 3 million table rows per year already. Either option might become a problem in 5–10 years, depending on SQL performance. With indexing it might not be a problem at all: usually the only query is by identifier anyway.

Changed fields

That said, let’s presume I choose the flat format. Most revisions only change one or two fields (excluding the modification date which would not be included). Duplicating the whole row might be wasteful, so what could I do to avoid that?

Since the flat format is based on the schema of the CSV files used for batch edits, each column can be considered as text, with an empty string for empty values. This leaves NULL (i.e. “no value”) available to represent “no change”. Together with MySQL’s SPARSE columns only edited fields would be stored. (Otherwise, each empty value would need to be signified as such. Now, actual values carry some extra information to the same end.)
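As a toy sketch of that representation (not the actual schema), a revision row would store NULL for untouched fields and the new value, including the empty string, for edited ones:

```javascript
// Toy sketch of the "NULL means unchanged" revision format: only
// fields that differ from the previous version keep a value; all
// untouched fields are stored as null.
function makeRevision (previous, updated, fields) {
    const revision = {}
    for (const field of fields) {
        revision[field] = updated[field] === previous[field] ? null : updated[field]
    }
    return revision
}

const revision = makeRevision(
    { designation: 'CDLI Lexical 000030', translation_source: '' },
    { designation: 'CDLI Lexical 000030', translation_source: 'no translation' },
    ['designation', 'translation_source']
)
console.log(revision) // { designation: null, translation_source: 'no translation' }
```

Listing the changes in a revision then reduces to filtering out the null fields.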

It would also make it even easier to display a list of changes, as there is no need to compare the value with the previous one. Other operations with revisions, such as merging two revisions made simultaneously on the same version of an artifact, would be easier for the same reason.

Since this would not be possible, or at least not as easy, with a linked format, perhaps it is good that the sheer volume of edits pointed me that way anyway.

References

Cuneiform Digital Library Initiative. (2021, March 8). Monthly release 2021.03 (Version 2021.03). Zenodo. http://doi.org/10.5281/zenodo.4588551

Friday, March 26, 2021

GitHub pages 404 redirection

Recently I moved the Citation.js API documentation from /api to /api/0.3, to put the new documentation on /api/0.5. I fixed all the links to the documentation, but I still got an issue report regarding a 404 error after just a few days. All in all, I had to redirect pages from /api/* to /api/0.3/*, while all these pages are hosted as static files on GitHub Pages.

There are three ways I found to do this:

  1. I make otherwise empty HTML files in /api/* that redirect to /api/0.3/* via JavaScript or a <meta> tag.
  2. I make use of jekyll-redirect-from. This is equivalent to option 1, I think.

Option 1 seemed like a hassle and I do not use Jekyll so option 2 seemed out of the question as well. However, we still have option 3 to consider:

  3. I add a 404.html to the repository which gets served automatically on a 404. It then redirects to /api/0.3/* with JavaScript, and gives guidance on how to find the new URL manually if JavaScript is disabled.

404.html is just a normal 404 page with 4 lines of JavaScript:

var docsPattern = /(\/api)(\/(?!0.[35]\/)|$)/  
  
if (docsPattern.test(location.pathname)) {  
    location.pathname = location.pathname.replace(docsPattern, '$1/0.3$2')  
}

Breaking down the RegExp pattern:

  • (\/api) matches “/api” in the URL
  • (\/(?!0.[35]\/)|$) matches one of two things, immediately after “/api”
    • Either $, the end of the string (like “https://citation.js.org/api” without the trailing slash)
    • Or \/(?!0.[35]\/), which matches a forward slash ("/api/") not followed by “0.3/” or “0.5/”. This is to avoid re-matching already-redirected paths like “/api/0.3/does-not-exist”, while the required slash avoids matching things like “/apical/”.
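The pattern can be exercised against a few paths to confirm this behaviour (the example paths are hypothetical):

```javascript
var docsPattern = /(\/api)(\/(?!0.[35]\/)|$)/

// Paths that should be redirected to /api/0.3/...
console.log(docsPattern.test('/api')) // true
console.log('/api/tutorial.html'.replace(docsPattern, '$1/0.3$2'))
// '/api/0.3/tutorial.html'

// Paths that should be left alone
console.log(docsPattern.test('/apical/')) // false
console.log(docsPattern.test('/api/0.3/does-not-exist')) // false
```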

This is not the neatest solution but I like it conceptually. It shows a bit of potential for Single-Page Applications as well: you can serve the same HTML+JavaScript for every possible path without having to rely on URLs like https://example.org/#/path/page. The problem is that you still get the 404 HTTP status (as you should), so if a browser or search crawler decides to care you have a problem.

Try it out now: https://citation.js.org/api/

The new "Page not Found" page in the same style as the homepage.

Monday, March 23, 2020

Economics of open source versus open science

Common postman

Common postman (Heliconius melpomene) on a Lantana

Almost two years ago I started participating on the then-new (now-archived) npm forum. I had been using npm for a few years at that point, and I had some free time to spend providing technical support, for fun. I fixed a number of bugs in the CLI, and users thanked me for those. My impact was limited, but the work was fulfilling. That is, until the developers at npm I had been working with got laid off.

Later in 2019 came the second hit: shortly after a popular JavaScript library started displaying ads in the terminal to fund its maintainers, which raised a ruckus on Twitter, npm started banning terminal ads. The ensuing chaos was a wake-up call for me. Lots of people started talking about the economics of open-source development, suggesting that open source is a fake ideology propagated by tech companies in Silicon Valley to generate value at no cost — to the companies, that is.

We were putting hours and hours of work into some ideology, and the corporations that profited from our open-source libraries gave us nothing in return. Everyone keeps laughing about the enormous dependency trees of Node.js projects, but that also means every project depends on a lot of other open-source projects, mostly by unpaid maintainers. Similarly, the bugs that I fixed for the npm CLI had a very small impact in the grand scheme of things, but npm is used by almost every company that uses JavaScript — most likely including Google, Amazon, Apple and Facebook. And a small percentage multiplied by almost all the tech capital in the world is still quite a lot.

This contradicts what I have been taught about Open Science: ideally, all aspects of all science should be open to everyone, to allow small players to take part. The more small players can take part, the better the science is, both morally and in quality & quantity.

While in the tech world, a small but vocal group is trying to bring about a revolution to rethink open source to help the individual, at the same time the science community has just gotten into the idea of expanding open source — again, to help the individual. Is open science just a few years behind open source?

One important thing to note is that both revolutions are trying to bring about the same thing: fair representation. In fair open source, this is about maintainers of public infrastructure (in the form of libraries) getting part of the profit generated by companies using it. In open science, this is about letting everyone take part in science, from people without affiliation to people whose institution cannot or does not want to pay for access, and lowering the barrier by making source code and data available.

The main difference is probably that scientists usually get paid, at which point it is easier to choose whether to make your work open or not: not making it open would be a waste. Additionally, there is the notion that any science is good for science (and the world) as a whole: even if commercial pharmaceutical companies get to use open research (and open-source software) from researchers they did not fund, advances in pharmaceutics are good for everyone. Plus, open science helps the smaller players, which would be beneficial for competition and thus for prices (if market forces finally follow through).

In the middle of this is me. I maintain an open-source project (Citation.js) aimed at people who care about bibliographical data — e.g. scientists and librarians. It has 142 stars on GitHub. I am proud of it. Neither side really applies to me: I cannot think of any commercial application that needs my library, nor do I receive funding for working on a (very small) part of the scientific community. So, which revolution should I follow? Fair open source or open science?

At the moment, I am fine with keeping it as it is. Though tiresome, it is also fulfilling, and right now I can still use the Exposure™. For the longer term, I guess I will naively carry on until I burn out or someone convinces me otherwise.


Note: a proposed solution for fair open source is the Parity Public License: it allows people to use it in private without limitations, and otherwise it requires the project using it to be open-source as well. Additionally, it is possible to buy licenses for closed-source work. To me, it seems a bit limited. Licenses like this can quickly become complex to use. Do I want people to be able to use Citation.js on their personal website without making the website open source? I do not think that would be possible with this license, without personally giving permission to people who would want that.

There are probably better blog posts to be read about the trade-offs of such licenses. If you find any, I will add them here.

Friday, October 11, 2019

BibTeX Rework: Syntax Update

A rework of the BibTeX parser has been on the backlog since at least August 15, 2017, and recently I started working on actually carrying it out — systematically. There were a number of things to be improved:

  1. Complete syntax support: again, supporting BibTeX by looking at examples leads to a lack of support for less-seen features like @string, @preamble and parentheses for enclosing entries instead of braces.
  2. More complete mappings: since I did not have any specifications when making the BibTeX parser, I could not find a complete list of fields, hence no complete mapping.
  3. Distinction between BibTeX and BibLaTeX: although there may not be any problems when importing, using year/month/day or date matters a lot if you have to output either.
  4. Proper schema validation: BibTeX defines required fields, but Citation.js does not check if those all exist.

In this blogpost, I will describe how I went about solving the first point: complete syntax support. Part of the problem was that the existing parser was difficult to extend and did not perform that well.

To improve this, I collected a number of BibTeX parsers and compared them on several criteria: performance, syntax support, build steps, and ease of maintenance. I used two single-entry BibTeX files for debugging, and a longer BibTeX file (5.2 MiB, 3345 entries) for some rough performance testing. The outcomes:

Parser      | Using                  | Time (single entry) | Time (3345 entries) | Syntax
Current     | TokenStack             | ~8ms                | ~1800ms             | old
Idea        | moo, Grammar           | ~2ms                | ~1150ms             | old
Idea (new)  | moo, Grammar           | ~3ms                | ~750ms              | new
Generator   | moo, nearley           | ~20ms               | N/A                 | new
astrocite   | PEG.js                 | ~9ms                | ~1670ms             | new
fiduswriter | biblatex-csl-converter | ~160ms              | ~119000ms           | new
Zotero      | BibTeX translator      | ~180ms              | ~31000ms            | old
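For context, numbers like these come from rough wall-clock timing rather than careful benchmarking; a minimal sketch of such a harness (parseBibtex below is a stand-in for whichever parser is being timed):

```javascript
// Rough timing sketch: run the parser a few times on the same input and
// keep the best wall-clock time, in milliseconds.
function timeParser(parse, input, runs = 5) {
  let best = Infinity;
  for (let i = 0; i < runs; i++) {
    const start = process.hrtime.bigint();
    parse(input);
    const elapsed = Number(process.hrtime.bigint() - start) / 1e6;
    best = Math.min(best, elapsed);
  }
  return best;
}

// e.g. timeParser(parseBibtex, bibtexFileContents)
```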

So, the current parser was performing pretty well actually, especially compared to astrocite which I still consider a good target to aim for. TokenStack, however, was an unnecessarily complex part resulting in poor performance — and poor maintainability.

I had some trouble with PEG.js so I turned to other approaches. One thing I came across was nearley. However, this would introduce both an extra build step and an extra run-time dependency and, as the table shows, did not perform very well. I assume that is on me, and my grammar-writing capabilities. One good thing that did come out of it was the use of a tokenizer or lexer, like moo.
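To illustrate what a tokenizer buys you, here is a hand-rolled sketch (moo compiles something similar, and much faster, from a rule table; the rule names and patterns here are made up and nowhere near full BibTeX):

```javascript
// Turn input into a flat stream of typed tokens that a grammar can
// consume one at a time. The first matching rule wins.
const rules = [
  ['at', /^@/],
  ['lbrace', /^\{/],
  ['rbrace', /^\}/],
  ['equals', /^=/],
  ['comma', /^,/],
  ['whitespace', /^\s+/],
  ['text', /^[^@{}=,\s]+/]
];

function tokenize(input) {
  const tokens = [];
  while (input.length > 0) {
    const rule = rules.find(([, pattern]) => pattern.test(input));
    if (!rule) throw new SyntaxError(`Unexpected character "${input[0]}"`);
    const value = input.match(rule[1])[0];
    tokens.push({ type: rule[0], value });
    input = input.slice(value.length);
  }
  return tokens;
}
```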

After nearly finishing an approach using moo and Grammar (a simplified version of TokenStack with built-in support for rules), something else came up and I dropped the subject for about a year. However, recently I started over, looked at my old approach and copied some things from it. This resulted in an even more simplified Grammar, with only matchToken, consumeToken and consumeRule — no backtracking was needed in the end. The performance was also pretty good, and it was easier to implement the new syntax.
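The method names matchToken, consumeToken and consumeRule come from the paragraph above; the implementation below is my guess at what such a no-backtracking Grammar could look like, not the actual Citation.js code:

```javascript
class Grammar {
  constructor(rules, tokens) {
    this.rules = rules;   // named parsing rules
    this.tokens = tokens; // pre-lexed token stream
    this.index = 0;
  }

  // peek: does the current token have this type?
  matchToken(type) {
    const token = this.tokens[this.index];
    return token !== undefined && token.type === type;
  }

  // consume the current token, failing loudly on a mismatch
  consumeToken(type) {
    const token = this.tokens[this.index];
    if (!token || token.type !== type) {
      throw new SyntaxError(`Expected ${type}, got ${token ? token.type : 'end of input'}`);
    }
    this.index++;
    return token;
  }

  // run a named rule at the current position
  consumeRule(name) {
    return this.rules[name].call(this);
  }
}

// Usage: a rule that parses `key = value` from pre-lexed tokens
const rules = {
  Assignment() {
    const key = this.consumeToken('text').value;
    this.consumeToken('equals');
    const value = this.consumeToken('text').value;
    return [key, value];
  }
};
```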

nearley grammar diagram

To make sure I had good results, I tried out some other parsers: Fidus Writer’s biblatex-csl-converter package and the Zotero BibTeX translator. The former was easy to set up, as it is just an npm package, while the latter involved quite some tricks: installing the Translation Server directly from GitHub, pointing an ENV variable to its config directory and running a piece of setup code, which I presume collects all the translators. Neither seemed to perform well in comparison to my old parser, my new parser or astrocite, and I stress-tested all of them in terms of syntax:

@String {maintainer = "Xavier D\'ecoret"}

@
  %a
preamble
  %a
{ "Maintained by " # maintainer }

@String(stefan = "Stefan Swe{\i}g")
@String(and = " and ")

@Book{sweig42,
  Author =	 stefan # and # maintainer,
  title =	 { The {impossible} TEL---book },
  publisher =	 { D\"ead Po$_{e}$t Society},
  year =	 1942,
  month =        mar
}

One area of expansion is all the ways BibTeX has to escape Unicode characters. Besides diacritics, which I should support completely, I think Zotero and astrocite are ahead in terms of completeness of symbols like \copyright. Then again, there is a great, really big list of LaTeX symbols, and not everyone needs every symbol — nor is everything represented in Unicode. I think the best way to do this is to expose a function in the configuration to expand the default set of supported macros, but let me know if something else comes to mind.
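That configuration hook might look something like this (a hypothetical API, not the actual Citation.js interface; the macro table is a tiny illustrative excerpt):

```javascript
// A default macro table that users can extend via configuration.
const defaultMacros = {
  '\\copyright': '\u00A9', // ©
  '\\dag': '\u2020'        // †
};

function expandMacros(text, extraMacros = {}) {
  const macros = { ...defaultMacros, ...extraMacros };
  // replace known commands, leave unknown ones untouched
  return text.replace(/\\[a-zA-Z]+/g, command => macros[command] ?? command);
}
```

Unknown commands pass through unchanged, so users would only need to register the symbols they actually use.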

The new parser, in its current form, has been published as part of Citation.js v0.5.0-alpha.3.

Friday, August 16, 2019

Debugging the Karmabug

For reasons I will not go through right now, I needed a new library for making synchronous HTTP requests in Node.js. I know what you are saying, “But that’s one of the seven deadly sins of JavaScript!” 1 Well, just know I had my reasons, and I wanted to replace sync-request.

Since I was already using the Fetch API with node-fetch in the async part of my library, I thought: why not build a sync-fetch that uses node-fetch under the hood, like sync-request uses (then-)request. Two days later, it was actually working, as far as I could tell. However, if I wanted to publish this horror, I needed some actual testing.

Luckily, node-fetch has a nice test suite; I only needed to convert 194 test cases to use the synchronous API. Not fun work, but maybe worth its while. Anyway, the first test cases worked, but then it got stuck on the first actual request.

This is where I have to introduce you to the Karmabug. You see, after some testing I figured out that it was specifically the combination of my sync-fetch and the test server that just… stopped. The arguments were correct, my fetch worked with https://example.org and the test server worked with node-fetch, but this combination simply did not. Investigating either pointed to the other, and I had no idea what to do next.

That would make a good tweet, I thought. “Karma for making sync HTTP requests I guess.” Literally two minutes later it hit me: that was exactly what was going on. The test server could not respond to the requests because the request itself was blocking the event loop.
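The deadlock fits in a few lines (a setImmediate callback stands in for the in-process test server, and a busy-wait stands in for the synchronous request):

```javascript
// While synchronous code blocks the event loop, nothing else in the
// process (including an in-process test server) gets to run.
let serverResponded = false;
setImmediate(() => { serverResponded = true; }); // the "test server"

const start = Date.now();
while (Date.now() - start < 100) {} // the "synchronous request", blocking the loop

console.log(serverResponded); // false: the callback has not run yet
```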

1 Incidentally, the seven deadly sins of JavaScript all happen to be Sloth.

Monday, August 5, 2019

Citation.js: RIS Rework Pt. 2

In the last post I explained how I started implementing the RIS specification that I found in the Internet Archive, only to discover that there is an older specification, which seems to be more common at times.

Now, I have implemented the old spec, which luckily was not nearly as complex. One thing that came to my attention was that there were a lot of redundant tags: for title, there are TI, T1, CT and usually BT; for journal names there’s JO & JF, and JA, J1, & J2 for abbreviated journal names. While I can imagine some nuanced difference in meaning between those tags, those meanings are not documented, and not trivial to figure out either. Anyway, that is not a problem for the implementation.

I also updated the implementation of the new spec, to fix some mistakes and add some more mappings. In addition, because in real life some implementations seem to export a mix of the two specifications, I created an implementation based on the new spec that, if needed, can defer to the old one — and to some stray properties that Wikipedia and Zotero have picked up somewhere, which appear in neither spec.
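That deferral can be sketched as a simple two-table lookup (the tables here are tiny illustrative excerpts, not the real Citation.js mappings):

```javascript
// Prefer the new (2011) spec, fall back to the old one. Tags found in
// neither table could then be checked against the stray
// Wikipedia/Zotero properties.
const newSpec = { TI: 'title', T2: 'container-title', PY: 'issued' };
const oldSpec = { T1: 'title', JO: 'container-title', Y1: 'issued' };

function mapTag(tag) {
  return newSpec[tag] ?? oldSpec[tag];
}
```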

How do the results look? First of all, the example that was giving me issues in the last post looks a lot nicer now:

{ issue: '1',
  page: '230-265',
  type: 'article-journal',
  volume: '47',
  title: 'On computable numbers, with an application to the Entscheidungsproblem',
  author: [ { family: 'Turing', given: 'Alan Mathison' } ],
  issued: { 'date-parts': [ [ 1937 ] ] },
  'container-title': 'Proc. of London Mathematical Society' }

The only thing that was missing when I first tried this out was the end of the page range, because SP in the new spec is the entire page range, while in the old spec you need both SP and EP. I had to fix that manually — not a problem, just something to keep in mind when re-running the scripts.
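That quirk boils down to something like this (a sketch, not the actual Citation.js code):

```javascript
// Old spec: SP is the first page and EP the last; new spec: SP already
// holds the entire page range.
function getPageRange(tags) {
  if (tags.EP !== undefined) {
    return `${tags.SP}-${tags.EP}`;
  }
  return tags.SP;
}
```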

One other thing to check was how the mappings look from above, without all the type-specific shenanigans. I keep a (public) spreadsheet with mappings from CSL-JSON to all kinds of different formats, so I added the RIS mappings. So, a sanity check. Does it make sense?

RIS mappings

No, not at all! The RIS tag SE is mapped to 10 different CSL variables, and the CSL number variable is mapped to 9 different RIS tags. To me, it does not make any sense, even accounting for the fact that I know there is some variation between entry types.

The question that remains is, does it work? Even if it does not look like it makes sense, the output could still make sense, if other implementations follow the specification to a similar degree. I know Zotero does not entirely follow it — all the spec anomalies that are implemented are attributed to weird EndNote quirks, not the weird spec quirks.

That made me wonder to what degree the EndNote implementation follows the specification. However, I do not have EndNote, so this is a call to action! Can you help me clear up the RIS cloud by submitting your RIS exports to a CC0-licensed repo? Preferably with all kinds of reference types — articles, books, chapters, conference papers, webpages, software, patents, bills, maps, artworks, whatever you can find. For legal reasons, please replace abstracts and other copyrightable content with [... omitted].

In the meantime, I will be collecting RIS exports from other sources, like Zotero and Mendeley and websites like Google Scholar, BMC and PubMed Central. If you know of any other sources, please let me know!

Tuesday, July 30, 2019

Citation.js: RIS Rework Pt. 1

So a while ago I was looking around for the RIS specification again. I had not found it earlier, only a reference implementation from Zotero, a surprisingly complete list of tags and types on Wikipedia and some examples from various websites and programs exporting RIS files. They did not seem to go together well, however. There were some slight differences in tags here and there, and a bunch of useful tags listed by Wikipedia were labelled “degenerate” in the Zotero codebase, and only used for imports — implying some sort of problem.

What could be going on? Well, I checked out the references on the Wikipedia page again, to see if there really was no official specification or some other more reliable source where it got its information from. And, suddenly, there was an actual source this time. I do not know how I missed it earlier, but there was a page (archived) that linked to a zip file containing a PDF file with general specifications and an Excel file with sheets with property lists for all different types.

That sounded useful, so I spent waaayy too much time automating a script to turn those sheets — with a bunch of user input — into usable mappings for Citation.js. I just finished that today, apart from some… questionable mappings, but I wanted to at least test the final script with an example. As for the results, well, see for yourself. The example, from the Wikipedia page (CC-BY-SA 3.0 Unported), was

TY  - JOUR
T1  - On computable numbers, with an application to the Entscheidungsproblem
A1  - Turing, Alan Mathison
JO  - Proc. of London Mathematical Society
VL  - 47
IS  - 1
SP  - 230
EP  - 265
Y1  - 1937
ER  -

and my results were

{ issue: 1, page: 230, type: 'article-journal', volume: 47 }

That looked really weird and disappointing. Again, what could possibly be going on here? The example on Wikipedia is using T1, A1, JO and Y1 while the specs say to use TI, AU, T2 and PY here. Where are these differences coming from?

After some digging around on Wikipedia I found a comment saying that there are in fact two specifications: one from 2011 and one from before. The archived spec I checked out was from 2012 (as linked by Wikipedia!), while they use the version from before 2011; which luckily is still available. To be continued.
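Whichever tag vocabulary ends up applying, the record syntax itself is the easy part; a minimal sketch of splitting a record like the one above into tag/value pairs (the hard part, mapping the tags to CSL, is what this rework is about):

```javascript
// Read one RIS record into a plain tag -> value object. Repeated tags
// (e.g. multiple author lines) are collected into arrays.
function parseRisRecord(text) {
  const record = {};
  for (const line of text.split(/\r?\n/)) {
    const match = line.match(/^([A-Z][A-Z0-9])  -(?: (.*))?$/);
    if (!match) continue;
    const [, tag, value = ''] = match;
    if (tag === 'ER') break; // end of record
    if (tag in record) {
      record[tag] = [].concat(record[tag], value);
    } else {
      record[tag] = value;
    }
  }
  return record;
}
```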