Wednesday, July 18, 2018

Citation.js: Use Case for a Wikidata GraphQL API

Citation.js has supported Wikidata input for a long time. However, I’ve always had some trouble with the API. See, when Citation.js processes Wikidata API output (which looks like this) and gets to, say, the P50 (author) property, it encounters this:

"P50": [
	{
		"mainsnak": {
			"snaktype": "value",
			"property": "P50",
			"hash": "1202966ec4cf715d3b9ff6faba202ac6c6ac3df8",
			"datavalue": {
				"value": {
					"entity-type": "item",
					"numeric-id": 2062803,
					"id": "Q2062803"
				},
				"type": "wikibase-entityid"
			},
			"datatype": "wikibase-item"
		},
		"type": "statement",
		"id": "Q46601020$692cc18d-4f54-eb65-8f0a-2fbb696be564",
		"rank": "normal"
	}
]

The problem is that no name string is readily available: to get the name of this author, or of any author, journal, publisher, etcetera, Citation.js has to make extra queries to the API.
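To make those extra queries concrete, here is a minimal Python sketch (not Citation.js's actual code, which is JavaScript). It collects the item IDs that a set of claims points to and builds a single follow-up request for their labels, using the `wbgetentities` action of the Wikidata API. The function names `referenced_item_ids` and `label_request_url` are made up for illustration, and the claim is a trimmed copy of the P50 example above.

```python
import json
from urllib.parse import urlencode

# One claim in the shape of the P50 example above, trimmed to the relevant keys.
CLAIMS = json.loads("""
{
  "P50": [
    {
      "mainsnak": {
        "snaktype": "value",
        "property": "P50",
        "datavalue": {
          "value": {"entity-type": "item", "numeric-id": 2062803, "id": "Q2062803"},
          "type": "wikibase-entityid"
        },
        "datatype": "wikibase-item"
      },
      "type": "statement",
      "rank": "normal"
    }
  ]
}
""")


def referenced_item_ids(claims):
    """Collect the IDs of all items the claims point to, without duplicates."""
    ids = []
    for statements in claims.values():
        for statement in statements:
            # Snaks without a value (e.g. "somevalue") have no datavalue key.
            datavalue = statement["mainsnak"].get("datavalue", {})
            if datavalue.get("type") == "wikibase-entityid":
                item_id = datavalue["value"]["id"]
                if item_id not in ids:
                    ids.append(item_id)
    return ids


def label_request_url(ids, language="en"):
    """Build one wbgetentities request asking for the labels of all given items."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(ids),
        "props": "labels",
        "languages": language,
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)


ids = referenced_item_ids(CLAIMS)
print(ids)
print(label_request_url(ids))
```

Batching the IDs into one request keeps the number of round trips down, but it is still at least one extra call per level of indirection.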

In the case of people, you could then just grab the label, but there’s also P735 (given name) and P734 (family name) in Wikidata. That saves some error-prone name parsing, you might think. However, this is what the API output looks like:

{
    "P735":[
        {
            "mainsnak":{
                "snaktype":"value",
                "property":"P735",
                "hash":"26c75e68a9844db73d0ff2e0da5652c5d571e46d",
                "datavalue":{
                    "value":{
                        "entity-type":"item",
                        "numeric-id":15635262,
                        "id":"Q15635262"
                    },
                    "type":"wikibase-entityid"
                },
                "datatype":"wikibase-item"
            },
            "type":"statement",
            "id":"Q22581$3554EADD-B8D8-4506-905B-014823ECC3EA",
            "rank":"normal"
        }
    ],
    "P734":[
        {
            "mainsnak":{
                "snaktype":"value",
                "property":"P734",
                "hash":"030e6786766f927e67ed52380f984be79d0f6111",
                "datavalue":{
                    "value":{
                        "entity-type":"item",
                        "numeric-id":41587275,
                        "id":"Q41587275"
                    },
                    "type":"wikibase-entityid"
                },
                "datatype":"wikibase-item"
            },
            "type":"statement",
            "id":"Q22581$598DF0D7-CEC7-470B-8D0F-DD320796BF01",
            "rank":"normal"
        }
    ]
}

Another two dead ends, and another two API calls (one, with some effort). It would be great if this data could be fetched with a single API call, and I think GraphQL would be a good option here: with GraphQL, you can specify exactly what data you want. I’m not the first one to think of this; in fact, a simple example is already implemented. This is what a query would look like (variables: {"item": "Q30000000"}):

query ($item: ID!) {
  entry: item(id: $item) {
    # get every property
    # to get specific properties, use "statements(properties: [...])"
    claims: statements {
      mainsnak {
        ... on PropertyValueSnak {
          # get property id and English label
          property {
            id
            name: label(language: "en") {
              text
            }
          }
          # get value
          value {
            ... on MonolingualTextValue {
              value: text
            }
            ... on StringValue {
              value
            }
            # if value is an item, get the label too
            ... on Item {
              id
              label(language: "en") {
                text
              }
            }
            ... on QuantityValue {
              amount
              unit {
                value: label(language: "en") {
                  text
                }
              }
            }
            ... on TimeValue {
              value: time
            }
          }
        }
      }
    }
  }
}
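A query like this is usually sent as a JSON POST body with a `query` and a `variables` key. The Python sketch below builds such a request without sending it. The `ENDPOINT` URL is a placeholder, since the address of the experimental API is not given here, and `build_request` is a hypothetical helper name. The query is a shortened version of the one above.

```python
import json
from urllib.request import Request

# Placeholder: the address of the experimental endpoint is an assumption.
ENDPOINT = "https://example.org/graphql"

# Shortened version of the query above: only item-valued statements.
QUERY = """
query ($item: ID!) {
  entry: item(id: $item) {
    claims: statements {
      mainsnak {
        ... on PropertyValueSnak {
          property { id }
          value {
            ... on Item {
              id
              label(language: "en") { text }
            }
          }
        }
      }
    }
  }
}
"""


def build_request(item_id):
    """Wrap the query and its variables in a JSON POST request (not sent here)."""
    body = json.dumps({"query": QUERY, "variables": {"item": item_id}})
    return Request(
        ENDPOINT,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("Q30000000")
print(request.get_method(), request.full_url)
print(json.loads(request.data)["variables"])
```

Passing the item ID as a variable rather than formatting it into the query string means the same query text can be reused for every item.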

Another handy thing is that the API output is basically the JSON equivalent of the query, but with the data filled in. I think a GraphQL API would be really useful for this and similar use cases, and it definitely seems feasible, given that an experimental API is already available.
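Because the response mirrors the query, reading it takes very little code. The Python sketch below walks a made-up response in the shape the query above would produce, including its aliases (`entry`, `claims`, `name`, `value`). The labels are placeholders rather than real API output, and the helper names are hypothetical.

```python
# Made-up response in the shape of the query above; the labels are placeholders.
RESPONSE = {
    "data": {
        "entry": {
            "claims": [
                {
                    "mainsnak": {
                        "property": {"id": "P50", "name": {"text": "author"}},
                        "value": {"id": "Q2062803", "label": {"text": "(author label)"}},
                    }
                },
                {
                    "mainsnak": {
                        "property": {"id": "P1476", "name": {"text": "title"}},
                        "value": {"value": "(title text)"},
                    }
                },
                # A snak that is not a PropertyValueSnak matches no fragment.
                {"mainsnak": {}},
            ]
        }
    }
}


def readable_value(value):
    """Return a display string for one value object from the response."""
    if "label" in value:  # Item fragment: use the label, fall back to the id
        label = value["label"]
        return label["text"] if label else value["id"]
    if "amount" in value:  # QuantityValue fragment
        return str(value["amount"])
    # StringValue, MonolingualTextValue and TimeValue are all aliased to "value".
    return value.get("value")


def claims_by_property(response):
    """Group readable values by property id."""
    result = {}
    for claim in response["data"]["entry"]["claims"]:
        snak = claim["mainsnak"]
        if "property" not in snak:
            continue  # skip snaks without a value
        values = result.setdefault(snak["property"]["id"], [])
        values.append(readable_value(snak["value"]))
    return result


print(claims_by_property(RESPONSE))
```

No second request is needed here: the author's label arrives in the same response as the claim that references it.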

Tuesday, July 17, 2018

Journal Metadata: Authors & Institutions

I finished the General Plugin system for Citation.js a few days ago (more on that later), so I could finally publish a new beta release. Now, after that half-finished piece of code had been blocking other work for a long while, I can at last start… fixing bugs, and closing other items in the backlog.

One of the items that has been on the backlog for a long time, and was on the backlog of the previous major version too, was sorting out BibJSON. BibJSON has been “supported” since before CSL-JSON was introduced as the internal standard, but under the name of ContentMine JSON, as I only knew it as the output of ContentMine’s quickscrape tool.

quickscrape output (source, license: MIT)

Since then, I learned it actually was a more standardised format, but I never got around to reading the standard and updating the parser. Today, however, I did. Turns out, it is something in between JSON-LD and BibTeX. While searching around for more comprehensive documentation, I came across the journal-scrapers (used by quickscrape) again, which I used to compile some test cases.

Unfortunately, one of the first examples went wrong already. The meta tags containing the bibliographical data that quickscrape scrapes, specifically the data pertaining to the authors, are not structured in a machine-friendly way, in my opinion. Certainly, quickscrape has trouble with them.

...
<meta name="citation_author" content="P. Pandikumar"/>
<meta name="citation_author_institution" content="Division of Ethnopharmacology, Entomology Research Institute, Loyola College, Chennai, India"/>
<meta name="citation_author" content="S. Ignacimuthu"/>
<meta name="citation_author_institution" content="Division of Ethnopharmacology, Entomology Research Institute, Loyola College, Chennai, India"/>
<meta name="citation_author_institution" content="International Scientific Partnership Programme, King Saud University, Riyadh, Saudi Arabia"/>
<meta name="citation_author" content="N. A. Al-Dhabi"/>
<meta name="citation_author_institution" content="Addiriyah Chair for Environmental Studies, College of Science, King Saud University, Riyadh, Saudi Arabia"/>
...

This particular example is from BioMed Central. However, the pattern persists throughout multiple journals: Nature (example), PLOS One (example), PeerJ (example), and probably many more, as these were just the first four I checked.

Prepend view-source: to those example URLs to quickly view the HTML source, with the meta tags.
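In the BioMed Central snippet shown earlier, the only thing tying an institution to an author is document order: each citation_author_institution tag follows the citation_author tag it belongs to. A minimal Python sketch of a parser that relies on that ordering, using the standard library's `html.parser`, could look like this. The class name `AuthorMetaParser` is made up, and this is not how quickscrape works internally.

```python
from html.parser import HTMLParser

# The meta tags from the BioMed Central example.
HTML = """
<meta name="citation_author" content="P. Pandikumar"/>
<meta name="citation_author_institution" content="Division of Ethnopharmacology, Entomology Research Institute, Loyola College, Chennai, India"/>
<meta name="citation_author" content="S. Ignacimuthu"/>
<meta name="citation_author_institution" content="Division of Ethnopharmacology, Entomology Research Institute, Loyola College, Chennai, India"/>
<meta name="citation_author_institution" content="International Scientific Partnership Programme, King Saud University, Riyadh, Saudi Arabia"/>
<meta name="citation_author" content="N. A. Al-Dhabi"/>
<meta name="citation_author_institution" content="Addiriyah Chair for Environmental Studies, College of Science, King Saud University, Riyadh, Saudi Arabia"/>
"""


class AuthorMetaParser(HTMLParser):
    """Attach each institution to the most recently seen author."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta .../> tags also end up here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if name == "citation_author":
            self.authors.append({"name": content, "institutions": []})
        elif name == "citation_author_institution" and self.authors:
            self.authors[-1]["institutions"].append(content)


parser = AuthorMetaParser()
parser.feed(HTML)
for author in parser.authors:
    print(author["name"], len(author["institutions"]))
```

The sketch silently drops an institution that appears before any author, and it would attach institutions to the wrong author if a page listed them in a different order, which is exactly why this structure is fragile.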

The pattern is so similar (in the case of Nature and BMC, the authors always come after a whole list of citation_reference tags) that I figured some library or service must be generating these. This quest first led me to look up what kind of tags the citation_ tags are. The answer wasn’t easy to find, and the number of unanswered questions I found along the way quickly made it clear what kind of quest this was going to be.

First of all, the tags: they’re called HighWire Press tags. Normally I would link a website, but I don’t think there is one. They’re the preferred method of metadata tagging for Google Scholar, which lists 16 tags. They’re also the preferred format of Mendeley, which points to the Google Scholar documentation. And yet the only thing I find when searching for a canonical list is people asking where that list could be, and getting no answers (1, 2).

Even with the 16-tag list, I can find at least two non-trivial tags that aren’t on it (e.g. citation_reference and citation_author_institution) in any of the examples mentioned above. Not to mention that, again, those examples weren’t cherry-picked; they were picked semi-randomly.

Luckily, I’m not the first one to run into this problem. Someone previously compiled a list of 39 citation_ tags based on observations, which is very useful if I want to write a crosswalk for Citation.js sometime, but doesn’t really help with finding a generator.
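To give an idea of what such a crosswalk might involve, here is a rough Python sketch that maps a handful of citation_ tags to CSL-JSON fields. The mapping is my own assumption based on commonly seen tags, not a canonical list and not existing Citation.js code. The input is a list of (name, content) pairs in document order, and the title value is a placeholder.

```python
# Assumed mapping from a few citation_ tags to CSL-JSON fields; not canonical.
SIMPLE_FIELDS = {
    "citation_title": "title",
    "citation_journal_title": "container-title",
    "citation_volume": "volume",
    "citation_issue": "issue",
    "citation_doi": "DOI",
}


def to_csl(tags):
    """Convert (name, content) pairs, in document order, to a CSL-JSON-like dict."""
    csl = {}
    pages = {}
    for name, content in tags:
        if name in SIMPLE_FIELDS:
            csl[SIMPLE_FIELDS[name]] = content
        elif name == "citation_author":
            # Kept as a literal name to avoid error-prone name parsing.
            csl.setdefault("author", []).append({"literal": content})
        elif name in ("citation_firstpage", "citation_lastpage"):
            pages[name] = content
        # Unknown tags are ignored in this sketch.
    if "citation_firstpage" in pages:
        first = pages["citation_firstpage"]
        last = pages.get("citation_lastpage")
        csl["page"] = f"{first}-{last}" if last else first
    return csl


# Placeholder input: author names from the example above, other values made up.
tags = [
    ("citation_title", "(article title)"),
    ("citation_author", "P. Pandikumar"),
    ("citation_author", "S. Ignacimuthu"),
    ("citation_firstpage", "1"),
    ("citation_lastpage", "10"),
]
print(to_csl(tags))
```

Tags such as citation_author_institution have no obvious CSL-JSON counterpart, so a real crosswalk would have to decide whether to drop them or keep them in a custom field.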

Back to HighWire: they claim Nature is one of their customers. BMC and Springer Open, however, aren’t, and yet they share the same system, or a common standard that can’t be found anywhere else. That they share a system makes sense, but what system and/or standard are they using? I asked, and will report back when I get an answer.