Friday, March 26, 2021

GitHub Pages 404 redirection

Recently I moved the Citation.js API documentation from /api to /api/0.3, to make room for the new documentation at /api/0.5. I fixed all the links to the documentation, but after just a few days I still got an issue reporting a 404 error. In short, I had to redirect pages from /api/* to /api/0.3/*, while all these pages are hosted as static files on GitHub Pages.

I found three ways to do this:

  1. I make otherwise empty HTML files in /api/* that redirect to /api/0.3/* via JavaScript or a <meta> tag.
  2. I make use of jekyll-redirect-from. This is equivalent to option 1, I think.

Option 1 seemed like a hassle, and since I do not use Jekyll, option 2 was out of the question as well. However, there is still option 3 to consider:

  3. I add a 404.html to the repository, which GitHub Pages serves automatically on a 404. It then redirects to /api/0.3/* with JavaScript, and gives guidance on how to find the new URL manually if JavaScript is disabled.

404.html is just a normal 404 page with 4 lines of JavaScript:

var docsPattern = /(\/api)(\/(?!0.[35]\/)|$)/

if (docsPattern.test(location.pathname)) {
    location.pathname = location.pathname.replace(docsPattern, '$1/0.3$2')
}

Breaking down the RegExp pattern:

  • (\/api) matches “/api” in the URL
  • (\/(?!0.[35]\/)|$) matches one of two things, immediately after “/api”
    • Either $, the end of the string (like “https://citation.js.org/api” without the trailing slash)
    • Or \/(?!0.[35]\/), which matches a forward slash ("/api/") followed by anything except “0.3/” or “0.5/”. This is to avoid matching things like “/apical/” (no slash right after “/api”) or “/api/0.3/does-not-exist” (already pointing at the new location).
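
To make the behaviour concrete, here is what that replacement does for a few hypothetical paths:

// Hypothetical paths, run through the same pattern and replacement:
'/api'.replace(docsPattern, '$1/0.3$2')
// -> '/api/0.3'
'/api/Cite.html'.replace(docsPattern, '$1/0.3$2')
// -> '/api/0.3/Cite.html'
'/api/0.3/does-not-exist'.replace(docsPattern, '$1/0.3$2')
// -> unchanged: that page genuinely does not exist, so the 404 should stay
'/apical/page.html'.replace(docsPattern, '$1/0.3$2')
// -> unchanged: not part of the documentation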

This is not the neatest solution, but I like it conceptually. It shows a bit of potential for Single-Page Applications as well: you can serve the same HTML+JavaScript for every possible path without having to rely on URLs like https://example.org/#/path/page. The catch is that you still get the 404 HTTP status (as you should), so if a browser or search crawler decides to care, you have a problem.

Try it out now: https://citation.js.org/api/

The new "Page not Found" page in the same style as the homepage.

Wednesday, February 24, 2021

Mid-week effect of Dutch COVID-19 case reporting

Every week since the start of January I heard headlines like “3963 new infections, a bit more than average”. In context, that meant that the number of positive COVID-19 tests that day was higher than the average daily number of cases over the previous seven days. Even though we heard that every week, overall the case numbers were going down. What was going on? Well, somewhat unsurprisingly, the number of reported cases is higher in the middle of the week than on Mondays and in the weekend. The news site I linked above mentions the midweek-effect as well, but on Twitter and on the radio you mainly hear the headline, not the caveats.

Anyway, I wanted to see what the midweek-effect actually looked like. The daily infection data is available from the RIVM website, RIVM being the Dutch National Institute for Public Health and the Environment. I started by reproducing the graph they made, partly to check the file and partly for fun (see Fig. 1).

Amount of new cases on each GGD notification date
Figure 1: Amount of positive COVID-19 tests since the start of December. The date used is the date when the test result was reported to the GGD, the municipal health service. In yellow are the test results RIVM got this week, in purple previous data.

To decompose the various effects in this data, I used R’s stats::stl() function on a time series with frequency = 7. This took a while to figure out, as a misconfiguration of the time series on my end led to STL interpreting each day, instead of each week, as a season. Because the seasonal effect is multiplicative rather than additive — the weekly component has a higher amplitude when the overall number of cases is higher — I had to log-transform the data as well (see Fig. 2).

Seasonal component and trend of data previously shown
Figure 2: Weekly component and general trend of positive COVID-19 tests, overlayed on daily data. The solid line indicates the overall trend, while the dashed line combines the overall trend with the seasonal component. Yellow still indicates test results from this week, purple the previous data.
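
For reference, the decomposition itself only takes a few lines of R. This is a rough sketch rather than the exact script I used; the file name and column name are placeholders.

# Daily reported cases as a numeric vector; the real data comes from the RIVM
# CSV, the file and column names here are made up.
cases <- read.csv("rivm_daily_cases.csv")$cases

# Log-transform because the weekly effect is multiplicative (assuming no
# zero-case days), and declare a weekly "season" with frequency = 7 -- the
# part I initially got wrong.
daily <- ts(log(cases), frequency = 7)

decomposition <- stl(daily, s.window = "periodic")

# Back to the original scale: the overall trend, and the trend combined with
# the weekly component (the solid and dashed lines in Fig. 2).
trend            <- exp(decomposition$time.series[, "trend"])
trend_and_weekly <- exp(decomposition$time.series[, "trend"] +
                        decomposition$time.series[, "seasonal"])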

In the end, this does not mean much at all. The accepted explanation of the midweek-effect has to do with when tests are carried out and reported, not when infections actually take place. If we look at the cases where the date of disease onset is available (source, Public Domain Mark 1.0), the weekly pattern is less pronounced (see Fig. 3).

Seasonal component and trend of new cases per day of disease onset
Figure 3: Number of new confirmed cases per day of disease onset. The solid line indicates the overall trend, while the dashed line combines the overall trend with the seasonal component. The gray area shows daily cases.

Comparing the two weekly patterns, the midweek-effect of reported cases is stronger than the weekly effect in the dates of disease onset. In addition, the latter peaks on Mondays and declines over the rest of the week, as opposed to the midweek-effect (see Fig. 4). Whether this effect is actually present or just another artifact of data reporting is unclear to me.

Weekly pattern of GGD test results and reported disease onsets
Figure 4: Weekly component of reported cases and disease onsets. The solid line shows the pattern for the day of disease onset, the dashed line for the day when the test result is reported.

Monday, March 23, 2020

Economics of open source versus open science

Common postman (Heliconius melpomene) on a Lantana

Almost two years ago I started participating on the then-new (now-archived) npm forum. I had been using npm for a few years at that point, and I had some free time to spend providing technical support, for fun. I fixed a number of bugs in the CLI, and users thanked me for those. My impact was limited, but the work was fulfilling. That is, until the developers at npm I had been working with got laid off.

Later in 2019 came the second hit: shortly after a popular JavaScript library started displaying terminal ads to fund its maintainers, raising a ruckus on Twitter, npm banned terminal ads. The ensuing chaos was a wake-up call for me. Lots of people started talking about the economics of open-source development, suggesting that open source is a fake ideology propagated by tech companies in Silicon Valley to generate value at no cost — to the companies, that is.

We were putting hours and hours of work into some ideology, and the corporations that profited from our open-source libraries gave us nothing in return. Everyone keeps laughing about the enormous dependency trees of Node.js projects, but that also means every project depends on a lot of other open-source projects, mostly by unpaid maintainers. Similarly, the bugs that I fixed for the npm CLI had a very small impact in the grand scheme of things, but npm is used by almost every company that uses JavaScript — most likely including Google, Amazon, Apple and Facebook. And a small percentage multiplied by almost all the tech capital in the world is still quite a lot.

This contradicts what I have been taught about Open Science: ideally, all aspects of all science should be open to everyone, to allow small players to take part. The more small players can take part, the better the science, both morally and in quality and quantity.

While in the tech world, a small but vocal group is trying to bring about a revolution to rethink open source to help the individual, at the same time the science community has just gotten into the idea of expanding open source — again, to help the individual. Is open science just a few years behind open source?

One important thing to note is that both revolutions are trying to bring about the same thing: fair representation. In fair open source, this is about maintainers of public infrastructure (in the form of libraries) getting part of the profit generated by companies using it. In open science, this is about letting everyone take part in science, from people without affiliation to people whose institution cannot or does not want to pay for access, and lowering the barrier by making source code and data available.

The main difference is probably that scientists usually get paid, at which point it is easier to choose to make your work open: not doing so would be a waste. Additionally, there is the notion that any science is good for science (and the world) as a whole: even if commercial pharmaceutical companies get to use open research (and open-source software) from researchers they did not fund, advances in pharmaceutics are good for everyone. Plus, open science helps the smaller players, which should be good for competition and therefore for prices (if market forces finally follow through).

In the middle of this is me. I maintain an open-source project (Citation.js) aimed at people who care about bibliographical data — e.g. scientists and librarians. It has 142 stars on GitHub. I am proud of it. Neither side really applies to me: I cannot think of any commercial application that needs my library, nor do I receive funding for working on a (very small) part of the scientific community. So, which revolution should I follow? Fair open source or open science?

At the moment, I am fine with keeping it as it is. Though tiresome, it is also fulfilling, and right now I can still use the Exposure™. For the longer term, I guess I will naively carry on until I burn out or someone convinces me otherwise.


Note: a proposed solution for fair open source is the Parity Public License: it allows private use without limitations, and otherwise requires projects that use the licensed code to be open source as well. Additionally, it is possible to buy licenses for closed-source work. To me, it seems a bit limited. Licenses like this can quickly become complex to use. Do I want people to be able to use Citation.js on their personal website without making the website open source? I do not think that would be possible with this license, short of personally giving permission to everyone who wants that.

There are probably better blog posts to be read about the trade-offs of such licenses. If you find any, I will add them here.

Friday, October 11, 2019

BibTeX Rework: Syntax Update

A rework of the BibTeX parser has been on the backlog since at least August 15, 2017, and recently I started working on actually carrying it out — systematically. There were a number of things to be improved:

  1. Complete syntax support: again, supporting BibTeX by looking at examples results in a lack of support for less common features like @string, @preamble and parentheses instead of braces for enclosing entries.
  2. More complete mappings: since I did not have any specification when I first made the BibTeX parser, there was no complete list of fields, and hence no complete mapping.
  3. Distinction between BibTeX and BibLaTeX: although there may not be any problems when importing, using year/month/day or date matters a lot if you have to output either.
  4. Proper schema validation: BibTeX defines required fields, but Citation.js does not check if those all exist.

In this blog post, I will describe how I went about solving the first point: complete syntax support. Part of the problem was that the parser I had was just bad: difficult to extend, and not performing that well.

To improve this, I collected a number of BibTeX parsers to compare them on several criteria: performance, syntax support, build steps, and ease of maintenance. I used two single-entry BibTeX files for debugging, and a longer BibTeX file (5.2 MiB, 3345 entries) for some rough performance testing. The outcomes:

Parser      | Using                  | Time (single entry) | Time (3345 entries) | Syntax
Current     | TokenStack             | ~8ms                | ~1800ms             | old
Idea        | moo, Grammar           | ~2ms                | ~1150ms             | old
Idea (new)  | moo, Grammar           | ~3ms                | ~750ms              | new
Generator   | moo, nearley           | ~20ms               | N/A                 | new
astrocite   | PEG.js                 | ~9ms                | ~1670ms             | new
fiduswriter | biblatex-csl-converter | ~160ms              | ~119000ms           | new
Zotero      | BibTeX translator      | ~180ms              | ~31000ms            | old

So, the current parser was actually performing pretty well, especially compared to astrocite, which I still consider a good target to aim for. TokenStack, however, was an unnecessarily complex component, resulting in poor performance — and poor maintainability.

I had some trouble with PEG.js, so I turned to other approaches. One thing I came across was nearley. However, this would introduce both an extra build step and an extra run-time dependency, and, as the table shows, it did not perform very well. I assume that is on me and my grammar-writing capabilities. One good thing that did come out of it was the use of a tokenizer (or lexer) like moo.

After nearly finishing an approach using moo and Grammar (a simplified version of TokenStack with built-in support for rules), something else came up and I dropped the subject for about a year. Recently, however, I started over, looked at my old approach and copied some things from it. This resulted in an even more simplified Grammar, with only matchToken, consumeToken and consumeRule support; no backtracking was needed in the end. The performance was also pretty good, and it was easier to implement the new syntax.
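
To give an idea of what that looks like, here is a rough sketch of a moo lexer plus a tiny Grammar-style consumer. This is illustrative only: the token names, rules and class below are made up and much simpler than what Citation.js actually uses.

const moo = require('moo')

// Token definitions for the outer structure of a BibTeX file.
const lexer = moo.compile({
  whitespace: { match: /\s+/, lineBreaks: true },
  comment:    /%[^\n]*/,
  at:         '@',
  lbrace:     '{',
  rbrace:     '}',
  lparen:     '(',
  rparen:     ')',
  equals:     '=',
  hash:       '#',
  comma:      ',',
  quoted:     /"(?:[^"\\]|\\.)*"/,
  identifier: /[^\s{}()=#,"@%]+/
})

// A minimal Grammar-style consumer: matchToken peeks at the next token,
// consumeToken asserts its type and advances. (The real Grammar also has
// consumeRule; this sketch does not.)
class Grammar {
  constructor (tokens) {
    this.tokens = tokens
    this.index = 0
  }

  matchToken (type) {
    const token = this.tokens[this.index]
    return token !== undefined && token.type === type
  }

  consumeToken (type) {
    const token = this.tokens[this.index++]
    if (type && (!token || token.type !== type)) {
      throw new SyntaxError(`Expected ${type}, got ${token && token.type}`)
    }
    return token
  }
}

const tokens = [...lexer.reset('@string{and = " and "}')]
const grammar = new Grammar(tokens.filter(token => token.type !== 'whitespace'))

grammar.consumeToken('at')         // { type: 'at', value: '@', ... }
grammar.consumeToken('identifier') // { type: 'identifier', value: 'string', ... }
grammar.matchToken('lbrace')       // true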

nearley grammar diagram

To make sure I had good results, I also tried some other parsers: Fidus Writer’s biblatex-csl-converter package and the Zotero BibTeX translator. The former was easy to set up, as it is just an npm package, while the latter involved quite some tricks: installing the Translation Server directly from GitHub, pointing an environment variable to its config directory and running a piece of setup code, which collects all the translators, I presume. Neither seemed to perform well compared to my old parser, my new parser or astrocite, and I stress-tested all of them in terms of syntax:

@String {maintainer = "Xavier D\'ecoret"}

@
  %a
preamble
  %a
{ "Maintained by " # maintainer }

@String(stefan = "Stefan Swe{\i}g")
@String(and = " and ")

@Book{sweig42,
  Author =	 stefan # and # maintainer,
  title =	 { The {impossible} TEL---book },
  publisher =	 { D\"ead Po$_{e}$t Society},
  year =	 1942,
  month =        mar
}

One area of expansion is all the ways BibTeX has to escape Unicode characters. Besides diacritics, which I should support completely, I think Zotero and astrocite are ahead in terms of completeness for symbols like \copyright. Then again, there is a great, really big list of LaTeX symbols, not everyone needs every symbol, and not everything is representable in Unicode. I think the best way to handle this is to expose a function in the configuration to extend the default set of supported macros, but let me know if something else comes to mind.

The new parser, in its current form, has been published as part of Citation.js v0.5.0-alpha.3.

Friday, August 16, 2019

Debugging the Karmabug

For reasons I will not go through right now, I needed a new library for making synchronous HTTP requests in Node.js. I know what you are saying, “But that’s one of the seven deadly sins of JavaScript!” 1 Well, just know I had my reasons, and I wanted to replace sync-request.

Since I was already using the Fetch API with node-fetch in the async part of my library, I thought: why not build a sync-fetch that uses node-fetch under the hood, like sync-request uses (then-)request. Two days later it was actually working, as far as I could tell. However, if I wanted to publish this horror, I needed some actual testing.

Luckily, node-fetch has a nice test suite; I only needed to convert 194 test cases to use the synchronous API. Not fun work, but perhaps worth the while. Anyway, the first test cases worked, but then it got stuck on the first actual request.

This is where I have to introduce you to the Karmabug. You see, after some testing I figured out that it was specifically the combination of my sync-fetch and the test server that just… stopped. The arguments were correct, my fetch worked with https://example.org, and the test server worked with node-fetch, but this particular combination simply did not. Investigating either pointed to the other, and I had no idea what to do next.

That would make a good tweet, I thought. “Karma for making sync HTTP requests I guess.” Literally two minutes later it hit me: that was exactly what was going on. The test server could not respond to the requests because the request itself was blocking the event loop.
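
You can reproduce the same deadlock without sync-fetch at all. In this sketch, execSync plus curl stands in for the synchronous request (so it assumes curl is installed), and the server lives in the same process, like the test server did:

const http = require('http')
const { execSync } = require('child_process')

// A server in the same process as the "client".
const server = http.createServer((request, response) => {
  response.end('pong')
})

server.listen(3000, () => {
  // Stand-in for a synchronous fetch: block until curl gets a response.
  // This never finishes, because the 'request' callback above cannot run
  // while the event loop is blocked by this very call.
  const body = execSync('curl -s http://localhost:3000/')
  console.log(body.toString())
})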

1 Incidentally, the seven deadly sins of JavaScript all happen to be Sloth.

Monday, August 5, 2019

Citation.js: RIS Rework Pt. 2

In the last post I explained how I started implementing the RIS specification that I found in the Internet Archive, only to discover that there is an older specification, which seems to be more common at times.

Now, I have implemented the old spec, which luckily was not nearly as complex. One thing that caught my attention was the number of redundant tags: for titles there are TI, T1, CT and usually BT; for journal names there are JO and JF, with JA, J1 and J2 for abbreviated journal names. While I can imagine some nuanced difference in meaning between those tags, those meanings are not documented, and not trivial to figure out either. Anyway, that is not a problem for the implementation.

I also updated the implementation of the new spec, fixing some mistakes and adding some more mappings. In addition, because in the wild there seem to be implementations that export a mix of the two specifications, I created an implementation based on the new spec that, if needed, can defer to the old one — and to some random properties that Wikipedia and Zotero have picked up somewhere, which are in neither spec.

How do the results look? First of all, the example that was giving me issues in the last post looks a lot nicer now:

{ issue: '1',
  page: '230-265',
  type: 'article-journal',
  volume: '47',
  title: 'On computable numbers, with an application to the Entscheidungsproblem',
  author: [ { family: 'Turing', given: 'Alan Mathison' } ],
  issued: { 'date-parts': [ [ 1937 ] ] },
  'container-title': 'Proc. of London Mathematical Society' }

The only thing that was missing when I first tried this out was the end of the page range, because SP in the new spec is the entire page range, while in the old spec you need both SP and EP. I had to fix that manually — not a problem, just something to keep in mind when re-running the scripts.
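
Concretely, as far as I understand the two specifications, the same page range would be written like this:

Old spec:
SP  - 230
EP  - 265

New spec:
SP  - 230-265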

One other thing to check was how the mappings look from above, without all the type-specific shenanigans. I keep a (public) spreadsheet with mappings from CSL-JSON to all kinds of different formats, so I added the RIS mappings there. So, a sanity check: does it make sense?

RIS mappings

No, not at all! The RIS tag SE is mapped to 10 different CSL variables, and the CSL number variable is mapped to 9 different RIS tags. To me, it does not make any sense, even accounting for the fact that I know there is some variation between entry types.

The question that remains is, does it work? Even if it does not look like it makes sense, the output could still make sense, if other implementations follow the specification to a similar degree. I know Zotero does not entirely follow it — all the spec anomalies that are implemented are attributed to weird EndNote quirks, not the weird spec quirks.

That made me wonder to what degree the EndNote implementation follows the specification. However, I do not have EndNote, so this is a call to action: can you help me clear up the RIS cloud by submitting your RIS exports to a CC0-licensed repo? Preferably with all kinds of reference types — articles, books, chapters, conference papers, webpages, software, patents, bills, maps, artworks, whatever you can find. For legal reasons, please replace abstracts and other copyrightable content with [... omitted].

In the meantime, I will be collecting RIS exports from other sources, like Zotero and Mendeley and websites like Google Scholar, BMC and PubMed Central. If you know of any other sources, please let me know!

Tuesday, July 30, 2019

Citation.js: RIS Rework Pt. 1

So a while ago I was looking around for the RIS specification again. I had not found it earlier; only a reference implementation from Zotero, a surprisingly complete list of tags and types on Wikipedia, and some examples from various websites and programs exporting RIS files. They did not seem to go together well, however. There were slight differences in tags here and there, and a bunch of useful tags listed by Wikipedia were labelled “degenerate” in the Zotero codebase and only used for imports — implying some sort of problem.

What could be going on? Well, I checked the references on the Wikipedia page again, to see if there really was no official specification or some other, more reliable source it got its information from. And suddenly, there was an actual source this time. I do not know how I missed it earlier, but there was a page (archived) that linked to a zip file containing a PDF with the general specification and an Excel file with sheets listing the properties for all the different types.

That sounded useful, so I spent waaayy too much time on a script to turn those sheets — with a bunch of user input — into usable mappings for Citation.js. I just finished that today, apart from some… questionable mappings, but I wanted to at least test the final script with an example. As for the results, well, see for yourself. The example, from the Wikipedia page (CC BY-SA 3.0 Unported), was

TY  - JOUR
T1  - On computable numbers, with an application to the Entscheidungsproblem
A1  - Turing, Alan Mathison
JO  - Proc. of London Mathematical Society
VL  - 47
IS  - 1
SP  - 230
EP  - 265
Y1  - 1937
ER  -

and my results were

{ issue: 1, page: 230, type: 'article-journal', volume: 47 }

That looked really weird and disappointing. Again, what could possibly be going on here? The example on Wikipedia uses T1, A1, JO and Y1, while the spec says to use TI, AU, T2 and PY. Where do these differences come from?
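
For comparison, rewriting the same entry with the tags the newer spec expects — as far as I understand it, including SP holding the entire page range — would look something like this:

TY  - JOUR
TI  - On computable numbers, with an application to the Entscheidungsproblem
AU  - Turing, Alan Mathison
T2  - Proc. of London Mathematical Society
VL  - 47
IS  - 1
SP  - 230-265
PY  - 1937
ER  -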

After some digging around on Wikipedia I found a comment saying that there are in fact two specifications: one from 2011 and one from before. The archived spec I checked out was from 2012 (as linked by Wikipedia!), while the example uses the version from before 2011, which luckily is still available. To be continued.