Showing posts with label bibliography. Show all posts
Showing posts with label bibliography. Show all posts

Tuesday, March 5, 2024

Citation.js: BibLaTeX Data Annotations support

Version 0.7.9 of Citation.js comes with a new feature: plugin-bibtex now supports the import and export of Data Annotations in BibLaTeX files. This means ORCID identifiers from DOI, Wikidata, CFF, and other sources can now be exported to BibLaTeX. Combined with a BibLaTeX style that displays ORCID identifiers, you can now quickly improve your reference lists with ORCIDs.

const { Cite } = require('@citation-js/core')  
require('@citation-js/plugin-bibtex')  
require('@citation-js/plugin-doi')

Cite
  .async('10.1111/icad.12730')
  .then(cite => cite.format('biblatex'))

This produces the following BibLaTeX file (note the author+an:orcid field):

@article{Willighagen2024Mapping,  
  author = {Willighagen, Lars G. and Jongejans, Eelke},  
  author+an:orcid = {1="http://orcid.org/0000-0002-4751-4637"; 2="http://orcid.org/0000-0003-1148-7419"},  
  journaltitle = {Insect Conservation and Diversity},  
  shortjournal = {Insect Conserv Diversity},  
  doi = {10.1111/icad.12730},  
  issn = {1752-458X},  
  date = {2024-03-02},  
  language = {en},  
  publisher = {Wiley},  
  title = {Mapping wing morphs of \textit{{Tetrix} subulata} using citizen science data: Flightless groundhoppers are more prevalent in grasslands near water},  
  url = {http://dx.doi.org/10.1111/icad.12730},  
}

References

Saturday, March 2, 2024

Including ORCID identifiers in BibLaTeX (and using them)

On the Fediverse, @petrichor@digipres.club posited the question how to include identifiers for authors in Bib(La)TeX-based bibliographies:

Any Bib(La)TeX/biber users have a preferred way to include author identifiers like ORCID or ISNI in your .bib file? Ideally supported by a citation style that will include the identifiers and/or hyperlink the authors.

https://digipres.club/@petrichor/112020378570169913

I have wanted to try including ORCIDs in bibliographies for a while now, and while CSL-JSON makes it nearly trivial to encode, neither CSL styles nor CSL processors are at the point where those can actually be inserted in the formatted bibliography. However, BibLaTeX may grant more opportunities, so this piqued my interest.

I first thought of the Extended Name Format (Kime et al., 2023, §3.4), which allows breaking names up in key-value pairs. Normally, those are reserved for name parts (family, given, etc.), but I believed I had seen a way to define additional “name parts”, one of which could be used for specifying the ORCID. However, in the process of figuring that out, I found the actual, intended, proper solution.

BibLaTeX has, exactly for things like this, Data Annotations (Kime et al., 2023, §3.7). For every field, or every item of every field in the case of list fields, additional annotations can be provided. (There are some additional features and nuances; for a full explanation see the manual.) For ORCIDs, data annotations could look like this:

@software{willighagen_2022_7017208,
  author          = {Willighagen, Lars and
                     Willighagen, Egon},
  author+an:orcid = {1="0000-0002-4751-4637"; 2="0000-0001-7542-0286"},
  title           = {ISAAC Chrome Extension},
  month           = aug,
  year            = 2022,
  publisher       = {Zenodo},
  version         = {v1.4.0},
  doi             = {10.5281/zenodo.7017208}
}

Now, implementing it in a BibLaTeX style proved more difficult than I hoped, but that might have been due to my inexperience with argument expansion and Biber internals. I started with the authoryear style and looked for the default name format that it uses in bibliographies; this turned out to be family-given/given-family. I copied that definition, and amended it to include the ORCID icon after each name (when available). To insert the icon, I used the orcidlink package. This part was tricky, as \getitemannotation does not work in an argument to \orcidlink or \href, but I ended up with the following.

\DeclareNameFormat{family-given/given-family}{%
  % ...
  \hasitemannotation[\currentname][orcid]
    {~\orcidlink{\expandafter\csname abx@annotation@literal@item@\currentname @orcid@\the\value{listcount}\endcsname}}
    {}%
  % ...
  }

References. Willighagen, Lars [ORCID icon with cyan outline] and Egon Willighagen [ORCID icon with cyan outline] (Aug. 2022). ISAAC Chrome Extension. Version v1.4.0. DOI: 10.5281/zenodo.7017208

You could repeat the same with ISNI links, or Wikidata, VIAF, you get the idea. Then you could put the \DeclareNameFormat in a new authoryear-orcid.bbx file so that the changes do not show up in the in-text citations, and set the bibliography style like so:

\usepackage[bibstyle=authoryear-orcid]{biblatex}

This can all be seen in action on Overleaf: https://www.overleaf.com/read/gvxqmrqmwswh#f156b5

References

Friday, February 2, 2024

Three new userscripts for Wikidata

Today I worked on three user scripts for Wikidata. Together, these tools hopefully make the data in Wikidata more accessible and make it easier to navigate between items. To enable these, include one or more of the following lines in your common.js (depending on which script(s) you want):

mw.loader.load('//www.wikidata.org/w/index.php?title=User:Lagewi/properties.js&oldid=2039401177&action=raw&ctype=text/javascript');
mw.loader.load('//www.wikidata.org/w/index.php?title=User:Lagewi/references.js&oldid=2039248554&action=raw&ctype=text/javascript');
mw.loader.load('//www.wikidata.org/w/index.php?title=User:Lagewi/bibliography.js&oldid=2039245516&action=raw&ctype=text/javascript');

User:Lagewi/properties.js

Note: It turns out there is a Gadget, EasyQuery, that does something similar to this user script.

Inspired by the interface for observation fields on iNaturalist, I wanted to easily find entities that also used a certain property, or that had a specific property-value combination. This script adds two types of links:

  • For each property, a link to query results listing entities that have that property.
  • For each statement, a link to query results listing entities that have that claim, e.g. this property-value combination. (This does not account for qualifiers.)

These queries are certainly not useful for all properties: listing instance of (P31) human (Q5) is not particularly meaningful in the context of a specific person. However, sometimes it just helps to find other items that use a property, and listing other compounds found in taxon (P703) Pinus sylvestris (Q133128) is interesting.

https://www.wikidata.org/wiki/User:Lagewi/properties.js

Screenshot of three claims on a Wikidata page. Below each property is a link titled "List entities with this property", and below each value a link titled "List entities with this claim".

User:Lagewi/references.js

Sometimes, the data on Wikidata does not answer all your questions. Some types of information are difficult to encode in statements, or simply has not been encoded on Wikidata yet. In such cases, it might be useful to go through the references attached to claims of the entity, for additional information. To simplify this process, this user script lists all unique references based on stated in (P248) and reference URL (P854). The references are listed in a collapsible list below the table of labels and descriptions, collapsed by default to not be obtrusive.

https://www.wikidata.org/wiki/User:Lagewi/references.js

Screenshot of the Wikidata page of Sylvie Deleurance (Q122350848) with, below the list of labels, descriptions, and aliases, a collapsible list titled "Other resources", containing links.

User:Lagewi/bibliography.js

Going a step further than the previous script, this user script appends a full bibliography at the bottom of the page. This uses the (relatively) new citation-js tool hosted on Toolforge (made using Citation.js). Every reference in the list of claims has a footnote link to its entry in the bibliography, and every entry in the bibliography has a list of back-references to the claims where the reference is used, labeled by the property number.

https://www.wikidata.org/wiki/User:Lagewi/bibliography.js

Screenshot of two claims on Wikidata, each with a reference. Both references have a darker grey header containing a link, respectively "[4]" and "[5]".

Screenshot of a numbered list titled "Bibliography". Each item in the list is a formatted reference, and below each reference is a series of links titled with a property number.

Saturday, December 31, 2022

Citation.js: 2022 in review

Following up on the previous two updates this year (Version 0.5 and a 2022 update and Version 0.6: CSL v1.0.2), here are some updates from the second half of 2022, as well as some statistics.


Sapygina decemguttata, a small wasp, observed July 8th, 2021

Changes

  • The mappings of the Wikidata plugin were updated, especially to accomodate for software.
  • The default-locale setting of some CSL styles is now respected.

New plugins

New users

  • WikiPathways is in the process of moving to a new site which uses Citation.js to generate citations.
  • Page, R. D. (2022). Wikidata and the bibliography of life. PeerJ, 10, e13712. 10.7717/peerj.13712
  • Boer, D. (2021). A Novel Data Platform for Low-Temperature Plasma Physics. [Master’s thesis] Radboud University, Nijmegen, The Netherlands. URL

Statistics

@citation-js/core was downloaded approximately 240,000 times in 2022, against 205,000 times in 2021. The legacy package citation-js was downloaded approximately 160,000 times (compared to 190,000 times in 2021). This is good, as it indicates people are starting to use the new system more often. Note that most downloads of citation-js also lead to a download of @citation-js/core so most people still use the legacy package. The shift to stop using citation-js seemed to have started in October 2021, which coincides with the time time I forgot to update that package for 2 months after releasing v0.5.2 of @citation-js/core.

Since February 2022 a relative decrease in downloads of @citation-js/plugin-wikidata compared to core is also visible. In December the Wikidata plugin was downloaded more than 50% fewer times, in fact. This is also good and the exact point of the modularisation introduced back in 2018: to let users choose which formats to include. The DOI plugin was similarly less popular, while the BibTeX and CSL plugins were — as expected — almost always included. Surprisingly, BibJSON, a very niche format, only had a 25% reduction (maybe due to confusion with BibTeX and the internal BibTeX JSON format), while RIS, after BibTeX the most common format, had a 50% reduction as well.

New Year’s Eve tradition

After the releases on New Year’s Eve of 2016, 2017, and 2021, this New Year’s Eve also brings the new v0.6.5 release which changes the priority of some RIS fields in very specific situations, most notably the date field of conference papers (now using DA instead of C2).

Happy New Year!

Tuesday, May 31, 2022

Citation.js Version 0.6: CSL v1.0.2

Since the citation-js npm package was first published, version 0.6 is the first major version of Citation.js that did not start out as a pre-release. Version 0.3 itself spent almost 6 months in pre-release, but only received updates for less than half a month. Version 0.4 spent more than a year in pre-release and received updates for about 4 months. Version 0.5 takes the cake with one and a half years in pre-release, receiving updates for a year, also making it the best-maintained version.

Yellow flowers with lots of little "rays" on greenish brown stems

Tussilago farfara, March 27th, 2022

Version 0.6 is a major version bump because it introduces a number of breaking changes, including raising the minimal Node.js version to 14. Since April 2022, Node.js 12 is End-Of-Life, which led to a lot of dependencies dropping support. Now, Citation.js does so too. Other changes include the following:

Update data format to CSL v1.0.2

The internal data format is now updated from CSL v1.0.1 to v1.0.2. This introduces the software type and the generic document type, as well as some other types, and some new fields. The event field is also renamed to event-title. That, and software replacing book, makes it so that CSL v1.0.2 is not compatible with CSL v1.0.1 styles, making it a breaking change.

  • CSL data is now automatically upgraded to v1.0.2 on input.
  • Cite#data ((new Cite()).data) now contains CSL v1.0.2 data.
  • Output formatters of plugins now receive CSL v1.0.2 data as input.
  • util (import { util } from '@citation-js/core') now has two functions, downgradeCsl and upgradeCsl, to convert data between the two versions.
  • The data formatter (.format('data')) now takes a version option. When set to '1.0.1', this downgrades the CSL data before outputting.
  • @citation-js/plugin-csl already automatically downgrades CSL to v1.0.1 for compatibility with the style files.
  • Custom fields are now generally put in the custom object, instead of prefixing an underscore to the field name.

The mappings are also updated. Especially the RIS and BibLaTeX mappings were made more complete by the increased capabilities of the CSL data schema. Non-core plugins are also being updated, mainly affecting @citation-js/plugin-software-formats and @citation-js/plugin-zotero-translation-server.

Test coverage

While updating the plugin mappings, the test suites of the plugins were also expanded. This led to the identification of a number of bugs, that were also fixed in this release:

  • BibLaTeX
    • handling of CSL entries without a type
    • handling of bookpagination
    • handling of masterthesis
  • RIS
    • RegExp pattern for ISSNs
    • Name parsing of single-component names

Closing issues

A number of issues were also fixed in this release:

  • Adding full support for the Bib(La)TeX crossref field
  • Mapping BibLaTeX eid to number instead of page
  • Adding a mapping for the custom BibLaTeX s2id field
  • In Wikidata, getting issue/volume/page/publication date info from qualifiers as well as top-level properties.

CSL styles

The bundled styles (apa, vancouver, and harvard1) were updated. Note that harvard1 is now an alias for harvard-cite-them-right. Quoting the documentation:

The original “harvard1.csl” independent CSL style was not based on any style guide, but it nevertheless remained popular because it was included by default in Mendeley Desktop. We have now taken one step to remove this style from the central CSL repository by turning it into a dependent style that uses the Harvard author-date format specified by the popular “Cite Them Right” guide. This dependent style will likely be removed from the CSL repository entirely at some point in the future.
http://www.zotero.org/styles/harvard1, CC-BY-SA-3.0

Looking forward

Some breaking changes are still pending, mainly changes to the plugin API and the removal of some left-over APIs. However, I also want to work on a more comprehensive format for machine-readable mappings, a format for mappings for linked data, and of course implementing more mappings in general!

Friday, May 27, 2022

Citation.js Version 0.5 and a 2022 update

Version 0.5.0

Version 0.5.0 of Citation.js was released on April 1st, 2021.

BibTeX and BibLaTeX

After the update to the Bib(La)TeX file parser, described in the earlier BibTeX Rework: Syntax Update blog post, the mapping of BibTeX and BibLaTeX data to CSL-JSON was also updated. The mapping is now split in two, one for BibLaTeX (which is backwards-compatible with BibTeX) and one for BibTeX. The output formats were also updated to output either BibTeX-compatible files or BibLaTeX-compatible files. The most common difference there is the use of year and month versus date respectively. In addition, a number of updates were made to the file parser.

Core changes

In the Grammar utility class, bugs were fixed an behavior was updated to better account for the Bib(La)TeX parser. Some of the code for correcting CSL-JSON was also updated, including moving the code correcting results of the Crossref API from the DOI plugin to the core module as CSL-JSON from the API may end up in Citation.js through other methods than the DOI plugin. Earlier in 0.5 development, some of the HTTP handling code was also updated for increased stability.

2022 update

v0.5.1v0.5.7

The version released since v0.5.0 mostly contain bug fixes and small enhancements. The latter includes some more descriptive errors in certain places, as well as mapping some non-standard fields in Bib(La)TeX and RIS.

New site design

The design of the Citation.js site was updated for the first time since 2018. The changes were detailed in the recent Citation.js: New site blog post.

New plugins

New plugins for the refer file format (plugin-refer) and the RefWorks tagged format (plugin-refworks) were released.

More coming

More changes are expected, including more long-awaited output formats, better mappings for software and datasets, and more work on machine-readable mappings.

Friday, October 11, 2019

BibTeX Rework: Syntax Update

BibTeX Rework: Syntax Update

A rework of the BibTeX parser has been on the backlog since at least August 15, 2017, and recently I started working on actually carrying it out — systematically. There were a number of things to be improved:

  1. Complete syntax support: again, supporting BibTeX by looking at examples leads in a lack of support for less seen features like @string, @preamble and parentheses for enclosing entries instead of braces.
  2. More complete mappings: since I did not have any specifications when making the BibTeX parser, I could not find a complete list of fields, hence no complete mapping.
  3. Distinction between BibTeX and BibLaTeX: although there may not be any problems when importing, using year/month/day or date matters a lot if you have to output either.
  4. Proper schema validation: BibTeX defines required fields, but Citation.js does not check if those all exist.

In this blogpost, I will describe how I went about solving the first point: complete syntax support. Part of the problem was that I was running a bad parser, which was difficult to extend and not performing that well.

To improve this, I collected a number of BibTeX parsers to compare them on a number of criteria: performance, syntax support, build steps, and ease of maintaining. I used two single-entry BibTeX files for debugging, and a longer BibTeX file (5.2 MiB, 3345 entries) for some rough performance testing. The outcomes:

Using Time (single entry) Time (3345 entries) Syntax
Current TokenStack ~8ms ~1800ms old
Idea moo, Grammar ~2ms ~1150ms old
Idea (new) moo, Grammar ~3ms ~750ms new
Generator moo, nearley ~20ms N/A new
astrocite PEG.js ~9ms ~1670ms new
fiduswriter biblatex-csl-converter ~160ms ~119000ms new
Zotero BibTeX translator ~180ms ~31000ms old

So, the current parser was performing pretty well actually, especially compared to astrocite which I still consider a good target to aim for. TokenStack, however, was an unnecessarily complex part resulting in poor performance — and poor maintainability.

I had some trouble with PEG.js so I turned to other approaches. One thing I came across was nearley. However, this would introduce both an extra build step and an extra run-time dependency, and as the table shows did not perform very well. I assume that is on me, and my grammar-writing capabilities. One good thing that did come out of it was the use of a tokenizer or lexer, like moo.

After nearly finishing an approach using moo and Grammar, a simplified version of TokenStack with built-in support of rules, something else came up and I dropped the subject for about a year. However, recently I started over, saw my old approach and copied some stuff from there. This resulted in a even more simplified Grammar, with only matchToken, consumeToken and consumeRule support — no backtracking was needed in the end. Also, the performance was pretty good, and it was easier to implement the new syntax.

nearley grammar diagram
nearley grammar diagram

To make sure I had good results, I took some other parsers: Fidus Writer’s biblatex-csl-converter package and the Zotero BibTeX translator. The former was easy to set up, as it was just an npm package, while the latter involved quite some tricks: installing the Translation Server directly from GitHub, pointing an ENV variable to its config directory and running a piece of setup code, collecting all the translators I presume. Neither seemed to perform well in comparison to either my old parser, my new parser or astrocite, and I stress-tested all of them in terms of syntax:

@String {maintainer = "Xavier D\\'ecoret"}

@
  %a
preamble
  %a
{ "Maintained by " # maintainer }

@String(stefan = "Stefan Swe{\\i}g")
@String(and = " and ")

@Book{sweig42,
  Author =	 stefan # and # maintainer,
  title =	 { The {impossible} TEL---book },
  publisher =	 { D\\"ead Po$_{e}$t Society},
  year =	 1942,
  month =        mar
}

One area of expansion is all the ways BibTeX has to escape Unicode characters. Besides diacritics, which I should support completely, I think Zotero and astrocite are ahead in terms of completeness of symbols like \copyright. Then again, there is a great, really big list of LaTeX symbols, and not everyone needs every symbol — nor is everything represented in Unicode. I think the best way to do this is to expose function in the configuration to expand the default supported macros, but let me know if something else comes to mind.

The new parser, in its current form, has been published as part of Citation.js v0.5.0-alpha.3.

Monday, August 5, 2019

Citation.js: RIS Rework Pt. 2

Citation.js: RIS Rework Pt. 2

In the last post I explained how I started implementing the RIS specification that I found in the Internet Archive, only to discover that there is an older specification, which seems to be more common at times.

Now, I have implemented the old spec, which luckily was not nearly as complex. One thing that came to my attention was that there were a lot of redundant tags: for title, there are TI, T1, CT and usually BT; for journal names there’s JO & JF, and JA, J1, & J2 for abbreviated journal names. While I can imagine some nuanced difference in meaning between those tags, those meanings are not documented, and not trivial to figure out either. Anyway, that is not a problem for the implementation.

I also updated the implementation of the new spec, to fix some mistakes and add some more mappings. In addition, because in real life there seem to be some implementations that export a mix of the two specifications, I created an implementation based on the new spec, that if needed can defer to the old one — and some random properties that Wikipedia and Zotero have picked up somewhere, and are not in either spec.

How do the results look? First of all, the example that was giving me issues in the last post looks a lot nicer now:

{ issue: '1',
  page: '230-265',
  type: 'article-journal',
  volume: '47',
  title: 'On computable numbers, with an application to the Entscheidungsproblem',
  author: [ { family: 'Turing', given: 'Alan Mathison' } ],
  issued: { 'date-parts': [ [ 1937 ] ] },
  'container-title': 'Proc. of London Mathematical Society' }

The only thing that was missing when I first tried this out was the end of the page range, because SP in the new spec is the entire page range, while in the old spec you need both SP and EP. I had to fix that manually — not a problem, just something to keep in mind when re-running the scripts.

One other thing to check was how to the mappings look from above, without all the type-specific shenanigans. I keep a (public) spreadsheet with mappings from CSL-JSON to all kinds of different formats, so I added the RIS mappings. So, a sanity check. Does it make sense?

RIS mappings

No, not at all! The RIS tag SE is mapped to 10 different CSL variables, and the CSL number variables is mapped to 9 different RIS tags. To me, it does not make any sense, even accounting for the fact that I know there is some variation between entry types.

The question that remains is, does it work? Even if it does not look like it makes sense, the output could still make sense, if other implementations follow the specification to a similar degree. I know Zotero does not entirely follow it — all the spec anomalies that are implemented are attributed to weird EndNote quirks, not the weird spec quirks.

That made me wonder to what degree the EndNote implementation follows the specification. However, I do not have EndNote, so this is a call to action! Can you help me with clearing up the RIS cloud for me by submitting your RIS exports to a CC0-licensed repo? Preferably with all kinds of reference types — articles, books, chapters, conference papers, webpages, software, patents, bills, maps, artworks, whatever you can find. For legal reasons, please replace abstracts and other copyrightable content by [... omitted].

In the meantime, I will be collecting RIS exports from other sources, like Zotero and Mendeley and websites like Google Scholar, BMC and PubMed Central. If you know of any other sources, please let me know!

Tuesday, July 30, 2019

Citation.js: RIS Rework Pt. 1

Citation.js: RIS Rework Pt. 1

So a while ago I was looking around for the RIS specification again. I had not found it earlier, only a reference implementation from Zotero, a surprisingly complete list of tags and types on Wikipedia and some examples from various websites and programs exporting RIS files. They did not seem to go together well, however. There were some slight differences in tags here and there, and a bunch of useful tags listed by Wikipedia were labelled “degenerate” in the Zotero codebase, and only used for imports — implying some sort of problem.

What could be going on? Well, I checked out the references on the Wikipedia page again, to see if there really was no official specification or some other more reliable source where it got its information from. And, suddenly, there was an actual source this time. I do not know how I missed it earlier, but there was a page (archived) that linked to a zip file containing a PDF file with general specifications and an Excel file with sheets with property lists for all different types.

That sounded useful, so I spent waaayy to much time automating a script to turn those sheets — with a bunch of user input — into usable mappings for Citation.js. I just finished that today, apart from some… questionable mappings, but I wanted to at least test the final script with an example. As for the results, well, see for yourself. The example, from the Wikipedia page (CC-BY-SA 3.0 Unported) was

TY  - JOUR
T1  - On computable numbers, with an application to the Entscheidungsproblem
A1  - Turing, Alan Mathison
JO  - Proc. of London Mathematical Society
VL  - 47
IS  - 1
SP  - 230
EP  - 265
Y1  - 1937
ER  -

and my results were

{ issue: 1, page: 230, type: 'article-journal', volume: 47 }

That looked really weird and disappointing. Again, what could possibly be going on here? The example on Wikipedia is using T1, A1, JO and Y1 while the specs say to use TI, AU, T2 and PY here. Where are these differences coming from?

After some digging around on Wikipedia I found a comment saying that there are in fact two specifications: one from 2011 and one from before. The archived spec I checked out was from 2012 (as linked by Wikipedia!), while they use the version from before 2011; which luckily is still available. To be continued.

Saturday, May 25, 2019

Citation.js: Wikidata Update

Citation.js: Wikidata Update

The new update, v0.4.4, contains a few Wikidata improvements (commit):

  • 22 new mappings
  • 2 fixed mappings (ISSN did not work and publisher-place was mapped to the wrong thing)
  • 2 improved mappings (container-title for chapters and more URL mappings)
  • 1 removed mapping (genre was inconsistent with the intended use, although it followed the specification)

Because most of these mappings require additional resources (recipient has to fetch people, review-* has to fetch the review subject and original-publisher-place has to fetch the country and location of the publisher of the original version of a work) this would have caused a lot more requests to the API. It is a good thing, then, that there is a new resolver in place, which pre-fetches such information for all requested works at once.

This has to be done in levels; you cannot fetch the country if you do not have the location, the location requires the publisher and to find out the publisher you have to have the work already. However, temporarily ignoring the cap of 50 items per request, the number of requests is now only based on those levels (which is at most five), instead of the number of claims requiring additional requests.

For example, item Q21972834 (Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data) takes:

  • 6 requests in v0.4.0-rc.2, before requests for properties were grouped (so it would make a request for every author)
  • 3 requests in v0.4.3: one for the work, one for the authors, and one for journal
  • 2 requests in v0.4.4, which includes a bunch of new mappings that would have cost a total of 5 requests in the previous version

Grouping the requests from all works has the added benefit of not having to fetch the same author and journal info repeatedly — which can be quite common for bibliographies. For example, the bibliography of Groovy Cheminformatics with the Chemistry Development Kit, currently containing 74 items, takes 11 requests in the new version. For the same bibliography, the previous version would make at least 150 requests — for each item one for all the authors and one for the journal — and probably more than that.

[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q24599948%7CQ24650571%7CQ25713029%7CQ27061829%7CQ27062363%7CQ27062596%7CQ27065423%7CQ27093381%7CQ27134682%7CQ27134746%7CQ27134827%7CQ27162658%7CQ27211680%7CQ27499209%7CQ27656255%7CQ27783585%7CQ27783587%7CQ27902272%7CQ28090714%7CQ28133283%7CQ28186592%7CQ28837846%7CQ28837922%7CQ28837925%7CQ28837939%7CQ28837943%7CQ28837947%7CQ28842810%7CQ28842968%7CQ28843132%7CQ29039683%7CQ29042322%7CQ29616639%7CQ30149558%7CQ30853915%7CQ31127242%7CQ33874102%7CQ34160151%7CQ34206190%7CQ36662828%7CQ37988904%7CQ39811432%7CQ42704791%7CQ47543807%7CQ47632144%7CQ54062338%7CQ55880270%7CQ55934414%7CQ55954394%7CQ56112883&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q56170978%7CQ56454405%7CQ57836257%7CQ59771351%7CQ60167226%7CQ60167327%7CQ60167615%7CQ60167690%7CQ61463648%7CQ61649587%7CQ61779181%7CQ61779373%7CQ61779901%7CQ61779905%7CQ61779940%7CQ61780124%7CQ62926016%7CQ62926155%7CQ62927888%7CQ62968825%7CQ62969352%7CQ62969354%7CQ63367539%7CQ63367548&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q135122%7CQ1223539%7CQ59196164%7CQ1860%7CQ8513%7CQ766195%7CQ7441%7CQ28946414%7CQ50332566%7CQ46330054%7CQ57218960%7CQ57218835%7CQ55965812%7CQ43370919%7CQ37391332%7CQ57677801%7CQ60446154%7CQ60447576%7CQ60449513%7CQ29946263%7CQ60465926%7CQ60890297%7CQ20895241%7CQ910067%7CQ3007982%7CQ52113739%7CQ5111731%7CQ29405902%7CQ47473872%7CQ121182%7CQ128570%7CQ910164%7CQ2383032%7CQ1130645%7CQ908710%7CQ5727848%7CQ27065426%7CQ28946652%7CQ42783959%7CQ4420286%7CQ749647%7CQ309823%7CQ28540892%7CQ29052381%7CQ29052386%7CQ4914910%7CQ12149006%7CQ853614%7CQ1418791%7CQ50731867&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q56007851%7CQ5195068%7CQ11351%7CQ75%7CQ151332%7CQ3002926%7CQ27061944%7CQ27134800%7CQ6294930%7CQ27061853%7CQ28540435%7CQ28854723%7CQ766383%7CQ1425625%7CQ28540616%7CQ55406442%7CQ483666%7CQ11173%7CQ39972290%7CQ1069211%7CQ36534%7CQ27211732%7CQ251%7CQ1768406%7CQ1689854%7CQ30046697%7CQ55406542%7CQ5227350%7CQ38372872%7CQ38373802%7CQ3841253%7CQ55213915%7CQ7440973%7CQ30045595%7CQ451553%7CQ466769%7CQ203250%7CQ54837%7CQ3355939%7CQ40023319%7CQ26842658%7CQ109081%7CQ82264%7CQ898902%7CQ27711423%7CQ1767639%7CQ7209103%7CQ27768873%7CQ900316%7CQ48803&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q1884753%7CQ15064016%7CQ895901%7CQ2166958%7CQ907701%7CQ309%7CQ28796322%7CQ19845641%7CQ27061849%7CQ28923506%7CQ30046701%7CQ28865170%7CQ43370334%7CQ29387575%7CQ50983316%7CQ7395247%7CQ192864%7CQ336658%7CQ58409692%7CQ56421913%7CQ55182163%7CQ55182165%7CQ56670283%7CQ925779%7CQ50731930%7CQ909510%7CQ381009%7CQ38173%7CQ30046335%7CQ43370883%7CQ3705921%7CQ1154615%7CQ2334061%7CQ1988917%7CQ898967%7CQ2425378%7CQ10354104%7CQ47  480%7CQ1709878%7CQ900502&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q1120519%7CQ1860%7CQ113337%7CQ1767639%7CQ27134800%7CQ6294930%7CQ251%7CQ5776092%7CQ56421877%7CQ463360%7CQ109081%7CQ50731930%7CQ176916%7CQ26707540%7CQ30046697%7CQ28925563%7CQ1122491%7CQ3186908%7CQ969707%7CQ1948400%7CQ192864%7CQ898902%7CQ2835897%7CQ3007982%7CQ46155617%7CQ905549%7CQ485223%7CQ900502%7CQ15766522&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q7050%7CQ32%7CQ64%7CQ225471&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q47501431&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q33467704&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q55&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q183%7CQ145&format=json&languages=en {}  

However, I do notice slightly more requests with a low number of items than I would expect. I will look into that.

Conference papers in Wikidata

Part of the mappings that were added in the recent update are the event-* properties, i.e. “name of the related event (e.g. the conference name when citing a conference paper)” (spec). Finding out how those are modelled in Wikidata proved a bit of a challenge, as most instances of Q23927052 (conference paper) are published in proceedings with the book type. To get some numbers (query):

type typeLabel items
Q571 book 715
Q5633421 scientific journal 676
Q1143604 proceedings 315
Q16024164 medical journal 53
Q23927052 conference paper 48
Q1002697 periodical literature 32
Q41298 magazine 29
other other 61

48 conference papers are published in conference papers? That does not seem right. But maybe they just have the wrong type, but still link to the event? Well no, not really (query).

hasEvent items
false 1952
true 6

Luckily, the information is not lost: usually, the label contains location and date information about the event, although extracting that would be probably be a tedious task. However, perhaps there are a lot of proper proceedings out there, but the articles linked to it just have the wrong type (query)?

type typeLabel items
Q13442814 scholarly article 3513
Q23927052 conference paper 315
Q3331189 version, edition, or translation 11
Q1143604 proceedings 9
Q10885494 scientific conference paper 7
Q1980247 chapter 6
Q191067 article 5
Q18918145 academic journal article 4
Q333291 abstract 3
other other 10

That seems to be the case: most works that are published in instances of proceedings are tagged as instances of scholarly articles, and only nine are tagged as both. And: relatively many have events linked to them (query).

hasEvent items
true 2692
false 1184

That’s better. I’ll look into working together with WikiCite to work on the rest.

Thursday, April 11, 2019

Citation.js Version 0.4: The Catch-up

Citation.js Version 0.4: The Catch-up

After about two years, finally v0.4.0 is ready for a release. Whether it is a real milestone given that there have been prereleases for about two years, while 0.3.x only lasted three weeks, is for you to decide but I am glad that everything I planned for this version is implemented now.

Citation.js logo

Changes

So, let’s go through the major changes.

Modules

Most importantly, the code is fully modulated now. The different components, the core, the CLI and all the individual format plugins, are available on their own now. The citation-js is now designed as a shell around those components, maintaining full backwards compatibility (apart from mapping changes) until that is no longer required.

Mappings

A lot of different formats where either improved or introduced. The main newcomer is RIS, still only as output format. Support for NDJSON output was also added. The handling of styling and case-preserving brackets in BibTeX fields was improved, although the method of parsing those files is still lagging behind.

Thanks to (independent) practical testing by Egon Willighagen and Jakob Voss and the wikidata-sdk library by Maxime Lathuilière we were also able to improve the Wikidata mapping quite a lot, although there is still work ahead.

Configuration

Apart from the potential for customization that comes with the separate modules, there is quite some new configuration available. First of all, input parsing has some options: strict which basically switches between errors and failing silently (not-so-long ago the default behavior); and target which allows the user to specify a certain point at which the parsing should be stopped, mainly useful for debugging.

Furthermore, individual input formats can be configured now too. For example, the default languages used in the Wikidata plugin can be configured now. The methods to add CSL templates and locales which was already available is using the same mechanism now.

Finally, CSL output with citeproc-js has been amended. First and foremost, support for citation has been added (or rather, the original citation-* has been renamed to bibliography), so it is possible to get actual citations now. Also, the nosort option has been added and the prepend/append options been improved.

Stability & best practices

Also important, there’s a lot more testing now and the commit messages and code style follow standard guidelines. Making a new repository for the different modules actually helped with that, allowing me to re-evaluate the decisions I made in that aspect.

The full changelog is available here.

Paper & Preprint

In the middle of all that (literally, the development halted for a few months) I wrote a paper about Citation.js, which I am currently revising after review. The preprint is available here.

Further development

There’s still a lot to be done, most of which has been discussed in the “Discussion” section of the preprint. I’ll go into some things that are planned for the (relatively) near future, other than additional mappings of course.

(Another) Wikidata refactor

Part of the recent development to finish the release was a refactoring of the Wikidata parsing code to facilitate some changes and mainly to reduce unnecessary code duplication, originally the result of sync/async variants of almost every function. However, while working on that I came up with an even better refactor which should minimize the problems of under-fetching that I described in the preprint.

To recap, under-fetching is the problem of not being able to fetch enough information in the desired amount of HTTP requests. For example, in Wikidata it’s not possible to get item labels of property values such as P50 (author), and they require another HTTP call. It is not possible to fetch them together with the publication item you are fetching since you do not know who the authors are yet.

So the minimal number of requests (not taking the limit of 50 items per request) is one for each “level” you need to go down: author labels would be the second level, and fetching the labels of name items if you decide to use P735 (given name) and P734 (family name) would be the third. That’s what this refactor will try to accomplish, and it shouldn’t be too complex but still non-trivial.

Turning the chain parser into a tree parser

Currently the input is parsed iteratively, in a “chain” of types. For example, a Wikidata ID becomes a Wikidata API URL, which returns JSON, which gets parsed, and the resulting API response gets transformed into CSL-JSON. This also ensures that people could input a Wikidata API URL if they want, and that fetching and parsing JSON from a web resource doesn’t have to be implemented over and over again.

When encountering arrays, if they are not recognized as some special array, each element is parsed until the target (CSL-JSON) is reached, at which point it returns. However, this proved problematic with some new features, like the target option (parse until this format is reached, instead of the default CSL-JSON). Firstly, such options cannot be passed down to the other parsing functions, and secondly if a certain target format is reached in the array elements then the total format is always going to be an array of that, and will never match the target.

A refactor of the parser, branching it instead of whatever is going on currently, should counter this.

More citeproc interactions

citeproc is currently used in an almost state-less manner, which works fine for bibliographies but is quite limiting for working with citations. While users could redirect CSL output to citeproc themselves, why not do it for them with more bibliography management, which Citation.js mostly lacks at the moment.

Ditching CSL-JSON as the central format?

CSL-JSON is great, but also quite limiting considering the available properties, the general problems of which are also discussed in the preprint. While additions could be made to CSL in terms of variables, which Frank Bennet has already done with CSL-M, and which the CSL team is doing now with a new release finally specifying a computer program type, CSL remains intended for generating citations, not for storing bibliographical data. At least, according to a quote that I cannot find anymore.

However, this becomes quite clear practically. I count 10 of the 78 fields being specific to citing the item. An additional 4 are (usually) redundant and 3 poorly defined, one with just a question mark as description. On top of that, 12 could be replaced by allowing references to other publications, which would require just 4 fields instead, and 4 that could be replaced by other entities such as venues requiring no additional properties.

Still, moving to a different format is quite something and this will have to be discussed. Of course, support for CSL-JSON wouldn’t be removed either way, just not exclusively used for storage. Also, these changes would take place way after any pending changes.


And that is it for this post. Any feedback on the release and ideas for the things discussed above is very welcome here, in the GitHub repo or on Twitter (@larswillighagen).

Thursday, January 3, 2019

Citation.js: Wikidata Subclasses

Citation.js: Wikidata Subclasses

I am in the process of creating a better way to map CSL types to Wikidata types together with Jakob Voß, and while listing all subtypes of web page, I came across a number of these cases (link):

#defaultView:Graph
SELECT ?a ?aLabel ?b ?bLabel WHERE {
  wd:Q58803899 wdt:P279* ?a .
  ?a wdt:P279 ?b .
  FILTER EXISTS {
    ?b wdt:P279* wd:Q386724 .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

I’m no expert on Wikidata data models, nor on the semantics of each of those items in specific, but I’m pretty sure Count of Barcelos (Q58803899) is not a subclass of web page (Q36774). However, the individual parts seem pretty reasonable, so where does it go wrong then?

Luckily, the description of subclass of (P279) provides some good guidance:

all instances of these items are instances of those items; this item is a class (subset) of that item.

So let’s check those:

  1. All counts of Barcelos are Portuguese counts (conde), as Barcelos lies in Portugal
  2. All condes are counts
  3. Counts are not hereditary titles themselves, so I think this should be a instance of (P31). Also, counts are indirectly subclass of royal or noble rank through hereditary titles, but directly instance of royal or noble rank.
  4. Hereditary titles do not seem like an exact subclass of royal or noble ranks to me, but it passes the basic test
  5. I am unsure about royal or noble rank as subclass of rank, since the latter seems way more abstract, but perhaps that’s the point
  6. (rank -> first-order metaclass) I have no idea what the intention with these metaclasses is, so I’m going to assume this is all correct (bold move)
  7. first-order metaclass -> fixed-order metaclass
  8. fixed-order metaclass -> Wikidata metaclass: “class of classes, class whose instances are classes”
  9. Wikidata metaclass -> class or metaclass of Wikidata ontology (should maybe be P31 as well, still no clue though)
  10. class or metaclass of Wikidata ontology -> Wikidata item: finally back into some form of concrete-ness. Problem being, not all instances of royal or noble ranks are Wikidata items, so something went wrong somewhere in between, maybe even at (6).
  11. Wikidata item -> Wikidata internal entity
  12. Wikidata internal entity -> Wikidata internal item
  13. Wikidata internal item -> MediaWiki page
  14. MediaWiki page -> web page: actually pretty reasonable, instances of Wikidata items are web pages.

There are a little less than 900 instances of web page so the impact is smaller than I had expected, but it’s still annoying. There are 191 subclasses of conde, not to mention the total of 280 royal or noble ranks that have no business being a subclass of web page.

Tuesday, July 17, 2018

Journal Metadata: Authors & Institutions

Journal Metadata: Authors & Institutions

I finished the General Plugin system for Citation.js a few days ago (more on that later), so I could finally publish a new beta release. Now, after that half-finished piece of code had been blocking other work for a long while, I can at last start… fixing bugs, and closing other items in the backlog.

One of the items that has been on the backlog for a long time, and was on the backlog of the previous major version too, was sorting out BibJSON. BibJSON has been “supported” since before CSL-JSON was introduced as the internal standard, but under the name of ContentMine JSON, as I only knew it as the output of ContentMine’s quickscrape tool.

quickscrape output
quickscrape output (source, license: MIT)

Since then, I learned it actually was a more standardised format, but never got to the act of reading the standard and updating the parser. Today, however, I did. Turns out, it is something in between JSON-LD and BibTeX. While searching around for more comprehensive documentation, I saw the journal-scrapers (used by quickscrape) again, which I used to compile some test cases.

Unfortunately, one of the first examples went wrong already. The meta tags, containing the bibliographical data that quickscrape scrapes, specifically data pertaining to the authors, are not structured in a machine-friendly way, in my opinion. Certainly, quickscrape has trouble with it.

...
<meta name="citation_author" content="P. Pandikumar"/>
<meta name="citation_author_institution" content="Division of Ethnopharmacology, Entomology Research Institute, Loyola College, Chennai, India"/>
<meta name="citation_author" content="S. Ignacimuthu"/>
<meta name="citation_author_institution" content="Division of Ethnopharmacology, Entomology Research Institute, Loyola College, Chennai, India"/>
<meta name="citation_author_institution" content="International Scientific Partnership Programme, King Saud University, Riyadh, Saudi Arabia"/>
<meta name="citation_author" content="N. A. Al-Dhabi"/>
<meta name="citation_author_institution" content="Addiriyah Chair for Environmental Studies, College of Science, King Saud University, Riyadh, Saudi Arabia"/>
...

This particular example is from Biomed Central. However, the pattern persists throughout multiple journals: Nature (example), PLOS One (example), PeerJ (example), and probably many more, as these were just the first four I checked.

Prepend view-source: to those example URLs to quickly view the HTML source, with the meta tags.

The pattern is so similar, especially the authors always being after a whole list of citation_references in the case of Nature and BMC, that there must be some sort of library or service that generates these, I thought. This quest first led me to search what kind of tags citation_ are. The fact that the answer wasn’t very easy to find and the amount of unanswered questions I found along the way quickly made it clear what kind of quest this was going to be.

First of all, the tags: they’re called HighWire Press tags. Normally I would link a website, but I don’t think there is any. They’re the preferred method of metadata tagging of Google Scholar, which lists 16 tags, they’re also the preferred format of Mendeley, which points to the Google Scholar documentation, and yet the only thing I find searching for some canonical list is people asking where that list could be, and getting no answers (1, 2).

Even with the 16-tag list, I can find at least two tags, each non-trivial (e.g. citation_reference and citation_author_institution), in any of the examples mentioned above, that aren’t on that list. Not to mention that, again, those examples weren’t chosen, they were picked semi-randomly.

Luckily, I’m not the first one to run into this problem. Someone previously compiled a list of 39 citation_ tags based on observations, which is very useful if I want to write a crosswalk for Citation.js sometime, but doesn’t really help with finding a generator.

Back to HighWire: they claim Nature is one of their customers. BMC and Springer Open, however, aren’t, and yet they share the same system, or a common standard that can’t be found anywhere else. That they share a system makes sense, but what system and/or standard are they using? I asked, and will report back when I get an answer.

Friday, December 22, 2017

Citation.js Version 0.4 Beta: New Docs and Input Plugins

Citation.js Version 0.4 Beta: New Docs and Input Plugins

It’s been a while. Really. But now it’s back, with a new release: v0.4.0-0, the v0.4 beta. Below I explain some of the changes in this release, and then the road-map of Citation.js for v0.4 and v0.5. Also, Citation.js has a DOI now:

DOI

Also, @jsterlibs tweeted about Citation.js.

Release

Input plugins

The main change in this release, is the addition of input plugins. This is the first step towards releasing v0.4, as explained below. Although there’s specific documentation available here and here, I’ll put an example here as well, as the API can be confusing.

The API for registering input plugins will be changed once or twice before the release of v0.4, once to make it less complex and weird, and possibly a second time to incorporate output plugins.

Adding a format

Say you wanted to add the RIS format (I should probably do that sometime). First, let’s define a type, to set things off.

const type = '@ris/text'

I can’t really define the actual parser here, but I’ll add a variable in the example code.

const parse = data => { /* return CSL-JSON or some format supported by any other parser */ }

Testing if a given string is in the RIS format, we’ll use regex. This regex matches if any line starts with two alphanumerical characters, two spaces and a hyphen. That’s a pretty fuzzy match; all lines should start with that sequence. However, this is just an example, and proper regex would be less elegant.

const risRegex = /^\w{2}\ {2}-/gm

Now, let’s define the dataType of RIS input. When using regex to test input, the dataType is automatically determined to be 'String' anyway, but for the sake of being clear:

const dataType = 'String'

Now, to combine it all:

Cite.parse.add(type, {
  parse,
  parseType: risRegex,
  dataType
})

Changing parsers

Now, say someone else wrote that code above, and you need to use it without modifying it, but you want a better regex? That can be arranged:

Cite.parse.add(type, {
  dataType: 'String',
  parseType: /^(?:\w{2}\ {2}-.*\n)+(\w{2}\ {2}-.*)?$/g
})

Because the options dataType, parseType, elementConstraint and propertyConstraint are all treated as one thing, you need to pass everyone of those when replacing the type checker. dataType is still not actually mandatory in this example, but is passed to demonstrate this.

Disabling a format

There is currently no way to remove a format.

In this scenario, some plugin registered a new format to get citation data from github.com URLs. Unfortunately, it doesn’t work. Instead of recognising the URL as a GitHub-specific URL, the type is @else/url, the generic URL type. This is because the generic version was registered earlier, and there is not yet a category for generic types (see #104). If you don’t use the @else/url type checker anyway, you can disable it like this:

Cite.parse.add('@else/url', {dataType: 'String', parseType: () => false})

I’m not sure why the dataType is needed here, as you’re disabling the parser anyway, but it doesn’t seem to work without it.

Docs

Proper CLI docs

Recently, I’ve been updating the documentation on Citation.js, and it has been a pleasant experience, despite some hiccups in the documentation engine. There are tutorials now, and a lot of the JSDoc comments have been improved. I still want to improve some things in the theme, like clickable header links (similar to GitHub and NPM behaviour), and showing sub-tutorials in the navigation.

Backwards compatibility

This release should be largely backwards compatible. If there are any regressions, please report them in the bug tracker.

Road-map

v0.4

v0.4 “is about making it easier to expand on input and output formats, possibly by creating schemes and methods parsing those schemes that can, say, convert BibJSON to CSL JSON based on a JSON file, something that can be stored independently of implementation.”

The idea is to allow for all kinds of plugins, both in input parsing and output formatting, to be registered on the Cite object, and to treat all internal parsers and formatters the same. Currently, input parser plugins are possible (see above), although they will be improved before v0.4. Output parsers will be made soon too, and will be backwards-incompatible, because of the change in the output option format.

v0.5

Currently, the plan is to use JSON-LD as the internal format in v0.5, while still keeping CSL-JSON as the internal scheme. Any input should then be converted to a list (or object) of fields, which will be added to the internal data store. Each field should be transformed to the CSL-JSON scheme individually, and added as a copy. As a result, there won’t be data loss when certain fields aren’t available in CSL-JSON, but are in the input and output schemes. It should also make it easier to write parsers for e.g. BibJSON.

Of course, edge cases should be taken care of. For example, there is no Wikidata property for the CSL-JSON field original-title, but that information can still be derived from the labels combined with the P364 (original language) property.

Sunday, August 27, 2017

Citation.js Version 0.3 Released!

It’s been in beta since January 30, but here it is: Citation.js version 0.3. Below I explain the changes since the last blog post, and under that is a more complete change log. Also some upcoming changes.

Citation.js

Recent changes

Custom citation formatting

  • One of the remaining milestones for v0.3.0 was a better API for custom citation formatting, as outlined in the previous post. The exact implementation differs a bit, but is essentially the same. Docs can be found here.

Async

  • Cite#add() and Cite#set() now have asynchronous siblings, Cite#addAsync() and Cite#setAsync().
  • Both async and sync test cases are needed for full coverage, so I won’t just change one into another, as I previously suggested.

Parsers

  • The BibTeX parser got a big update, improving both the text-parsing code and the JSON-transforming code.
  • The Wikidata issue (sorry for the GIF) is fixed now, and I like the current API and code. The solution is inspired by the recent BibTeX refactoring.

Older changes

Also look at previous blog posts for more info. List goes from newer to older.

Features

  • Code coverage
  • Test case and framework updates
  • DOI support
  • CLI fixes
  • Custom sorting
  • Cite as an Iterator
  • Bib.TXT support
  • Give jQuery.Citation.js its own repository
  • Async support
  • Build size disc & dependency badges
  • Exposing internal functions and splitting up the main file into smaller files
  • Change of some internal Cite properties
  • Using an actual code style and use of ES6+
  • Exposing version numbers

Bugs

  • CLI bugs
  • Variable name typos
  • Output JSON is now valid JSON (I know, right?)
  • Wikidata ID parsing
  • Cite#getIds()
  • CLI not working
  • Wikidata tests not working
  • CSL not working
  • Browserify not working

Upcoming changes

  • Web scrapers and schema.org, Open Graph, etc.
  • Input parsing extensions
  • More coverage
  • More input formats, like better support for BibJSON, and new formats.

Saturday, August 26, 2017

Citation.js: DOI update and more stability

Citation.js: DOI update and more stability

Finally, Citation.js supports DOIs. It took a while, but it’s finally there. One big ‘but’: synchronous fetching doesn’t work in Chrome. I’m still looking into that, but I should be recommending you to use Cite.async() anyway. Also in this blog post: more stability in Cite#get(), a welcome byproduct of using the DOI API, and looking forward (again).

Citation.js logo

DOI support

So, DOIs. That was (and is) a though one. Let me guide you through the process.

Initial development

I have been planning to add support for DOI input since the beginning of this year, going off the original feature request, and the pressure to implement it has grew right along me realising how important they are in the world of citations. Back then I thought it would be smart to query Wikidata for DOIs, as I had just finished some code on Wikidata ID input, that used regular API as well as the query part. More recently however, I learned about the Crossref API and, even more helpfully, the DOI Content Negotiation API, which combines the APIs from Crossref, DataCite and mEDRA. I’ll quote a piece from the docs:

Content negotiation allows a user to request a particular representation of a web resource. DOI resolvers use content negotiation to provide different representations of metadata associated with DOIs.

A content negotiated request to a DOI resolver is much like a standard HTTP request, except server-driven negotiation will take place based on the list of acceptable content types a client provides.

source

Basically, you just make a request to https://doi.org/$DOI (where hopefully obviously $DOI stands for the DOI you’re looking for) with an Accept header set to the format you want your data in. Conveniently, one of those formats is CSL-JSON, here charmingly named application/vnd.citationstyles.csl+json, but nonetheless the direct input format for Citation.js. This would allow me to only have to write code to extract DOIs from a whitespace-separated list (const dois = string.trim().split(/\s+/g)) and fetch them from the server (await Promise.all(dois.map(fetchDoi))), where fetchDoi is a simple function:

async function fetchDoi (doi) {
  const headers = new Headers({
    Accept: 'application/vnd.citationstyles.csl+json'
  })
  const response = await fetch('https://doi.org/' + doi, {headers})
  return response.json()
}

And Headers and fetch are built-in. Note that this uses some advanced syntax. Read more about async/await, Promise.all(), shorthand property names and const.

First problems

It wasn’t that simple. Heck, even getting the API to work took a (long) while and a lot of debugging other people’s code. You see, to make synchronous requests in Node.js I use sync-request, which is a synchronous wrapper around its asynchronous cousin, then-request, which is a wrapper around an also asynchronous but more low-level cousin, http-basic. They all use the same scheme of options. Because of that, sync-request passes its options down all the way to http-basic, so options from http-basic but not in sync-request can be used too. Turns out, http-basic has an option that removes all headers when a request is redirected, but it isn’t documented in sync-request. This removes, among other things, the Accept header, which, on omission, produces an HTTP 501 code not documented in the API.

Let’s just say I filed a feature request to mention this in the docs.

The actual code didn’t need much change, so soon I got the first JSON response, and since the API supports CSL-JSON out of the box, I didn’t need to do much else, except building some input recognition infrastructure.

CSL-JSON problems

However, nothing is perfect, and so is that out-of-the-box support of CSL-JSON. This was actually picked up quite well, and I’m happy with the outcome. Basically, extra code needed to fix this consists of two parts. One part transforms some invalid but essential parts of the API response to its CSL-JSON counterpart. The second part filters invalid and less essential parts out of the data, and basically acts as a safeguard for methods depending on certain variables having certain types. More on this later.

More problems, from an unexpected source

Yep, it isn’t even over yet. This may be the biggest problem of them all, in that it hasn’t got a solution yet, and that I don’t expect it to have one in the near future, and that it is very likely there is no solution. It depends, it really does. Anyway, let’s get to the point.

Chrome doesn’t support synchronous redirects from CORS domains. Or something. Even that isn’t really clear. Basically, the problem is that generally, the DOI content negotiation works like this (assuming we’re in the browser): the user domain requests data from https://doi.org, which redirects to https://data.crossref.org (or a different domain). Both are different domains. Both synchronous and asynchronous requests directly to https://data.crossref.org (or a different domain) work. Asynchronous requests via https://doi.org work too. Even synchronous requests via https://doi.org work, but not in Chrome. I don’t know why, I don’t know when (as it does support normal synchronous CORS requests), and I can’t find any document that says why it does that. The only things that have shed some light on this are a comment I can’t find anymore that only stated you can’t do that at all and this answer, that almost describes the exact problem I’m having, but fails to give any answer other than that it’s weird that Firefox and IE “don’t follow the jQuery spec.” Note that the jQuery spec doesn’t mention anything about this apart from the note that this shouldn’t be done ever anyway.

One obvious answer is that every mayor browser is currently in the process of factoring out all synchronous effect, and that Chrome might have taken this a step further, but even then, there should be some document somewhere that says it does this, otherwise I’d just consider it a bug. I haven’t gotten around to it yet, but I will try to see how this would affect using Citation.js synchronously in Web Workers. If that works, it means that it actually is another rule to prevent synchronous requests on the main thread, which is fine by me, although slightly inconvenient for people who do not care about user experience, such as me when I’m lazy.

Conclusion

DOIs work. Because of all the ranting, I haven’t shown you the api, but it’s pretty familiar. To use the CLI, do this:

> npm i -g citation-js
# to install the new version of the command
# might need to be run with admin rights

> citation-js -t 10.1021/ja01577a030 -f string -s citation-apa
# Fetches data from doi: 10.1021/ja01577a030
# in the format: string; in the formatted style: apa
 
  Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 5441–5444. https://doi.org/10.1021/ja01577a030

To use the API, do this:

const Cite = require('citation-js')

// For synchronous use
const data = new Cite('10.1021/ja01577a030')
const output = data.get({
  type: 'string',
  style: 'citation-apa'
})

/* output =
  Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 5441–5444. https://doi.org/10.1021/ja01577a030
*/

// For asynchronous use
Cite.async('10.1021/ja01577a030').then(data => {
  const output = data.get({
    type: 'string',
    style: 'citation-apa'
  })
  
  /* output =
    Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 5441–5444. https://doi.org/10.1021/ja01577a030
  */
})

Stability in Cite#get()

As I mentioned earlier, I had to write some code to filter out invalid but non-essential props from the CSL-JSON. Otherwise certain methods used by Cite#get() throw errors, as they except these props to have certain types. Because of this filtering function, I don’t have to type-check ever again, I hope. The implementation is pretty simple. Specially structured props like author and other names, and issued and other dates, get caught and handled on their own. Other props get checked against a map of expected data types and removed if they don’t match, and if the bestGuessConversions flag is set, if they also can’t be converted reliably. Note that, as of now, all Cite.get.*() functions expect CSL-JSON cleaned by Cite.parse.csl().

Version 0.3.0

Implementing DOI input was one of the big milestones on the way to version 0.3.0, besides exposing internal methods, making async input parsing available and generally making the browser version and the CLI less bad. Now that those three things are done, I think it’s a good moment to see what still needs to be done for the v0.3.0 release.

Better custom formatted citations

Using formatted citations is fine, but when you try to use custom ones, either because the style guide you want or need to follow isn’t built-in, or because you have some special use case, things may get confusing. Currently, the API works (should work) like this: when you pass a template in the Cite#get() options (important: not in the new Cite(<data>, <options>) options), it gets registered in a register of citeproc-js engines. After that, you can use it by referencing the name you used in the regular style option. This would work better with an API similar to this:

const template = '...'
const templateName = 'custom'

// Suggested new API
Cite.CSL.template.register(templateName, template)

// Nothing new, will stay the same
const data = new Cite(...)
data.get({
  type: 'html',
  style: 'citation-' + templateName
})

Also, sometimes you just want to append or prepend some text or HTML frames, and implementing that with CSL templates takes time and effort and the result isn’t always that pretty, as templates don’t support direct HTML. There will be a new API for that too, probably like this:

// prepends the id like this: "[$ID]: "
const prepend = ({id}) => `[${id}]: `
// appends an altmetric widget
const append = ({DOI}) => `<span class='altmetric-embed' data-doi='${DOI}'></span>`
// Nothing new, will stay the same
const data = new Cite(...)
data.get({
  type: 'html',
  style: 'citation-' + templateName,
  // Suggested new API
  // both properties either function or constant string
  append: append,
  prepend: prepend
})

One thing I still have to look at is extending the wrapping HTML elements outputted by citeproc-js.

Use asynchronism better

Now that we have asynchronous input parsing with Cite.async(), it may be interesting to look at asynchronous output formatting as well. There are also some functions that can use Cite.async() but don’t, like Cite#add() and Cite#set(), and the test cases are generally synchronous, with some special cases for async, while it should probably be the other way around. Adding a coverage tester will assist in determining which synchronous test cases can go.

Refactoring

There are still some things I’d like to refactor, like the Wikidata parser. I don’t like it right now, especially the hack to merge different types of authors.

After v0.3.0

Since we’re almost at the end of v0.3.0, I’ll also outline some of my plans for future versions.

  • BibJSON input (already partly supported)
  • Extensions (input parsing, output, etc.)
  • Zotero input (maybe not worth the work, as they already support export to CSL)
  • Scraper (could just be getting DOI and using my existing work)
  • Coverage testing and an expansion of test cases
  • More work on the dependants: