Showing posts with label Wikidata.

Friday, February 2, 2024

Three new userscripts for Wikidata

Today I worked on three user scripts for Wikidata. Together, these tools hopefully make the data in Wikidata more accessible and make it easier to navigate between items. To enable these, include one or more of the following lines in your common.js (depending on which script(s) you want):

mw.loader.load('//www.wikidata.org/w/index.php?title=User:Lagewi/properties.js&oldid=2039401177&action=raw&ctype=text/javascript');
mw.loader.load('//www.wikidata.org/w/index.php?title=User:Lagewi/references.js&oldid=2039248554&action=raw&ctype=text/javascript');
mw.loader.load('//www.wikidata.org/w/index.php?title=User:Lagewi/bibliography.js&oldid=2039245516&action=raw&ctype=text/javascript');

User:Lagewi/properties.js

Note: It turns out there is a Gadget, EasyQuery, that does something similar to this user script.

Inspired by the interface for observation fields on iNaturalist, I wanted to easily find entities that also used a certain property, or that had a specific property-value combination. This script adds two types of links:

  • For each property, a link to query results listing entities that have that property.
  • For each statement, a link to query results listing entities that have that claim, i.e. that exact property-value combination. (This does not account for qualifiers.)

These queries are certainly not useful for all properties: listing instance of (P31) human (Q5) is not particularly meaningful in the context of a specific person. However, sometimes it just helps to find other items that use a property, and listing other compounds found in taxon (P703) Pinus sylvestris (Q133128) is interesting.
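A rough sketch of what these links boil down to (the function names here are hypothetical, not the script's actual internals): each link points to the Wikidata Query Service with a small SPARQL query pre-filled.

```javascript
// SPARQL listing entities that use a given property at all.
function propertyQuery(propertyId) {
  return `SELECT ?item ?itemLabel WHERE {
  ?item wdt:${propertyId} ?value .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} LIMIT 100`;
}

// SPARQL listing entities with a specific property-value combination
// (qualifiers are not taken into account, as noted above).
function claimQuery(propertyId, valueId) {
  return `SELECT ?item ?itemLabel WHERE {
  ?item wdt:${propertyId} wd:${valueId} .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} LIMIT 100`;
}

// Link to the Wikidata Query Service with the query pre-filled.
function queryUrl(sparql) {
  return 'https://query.wikidata.org/#' + encodeURIComponent(sparql);
}

// e.g. other compounds found in taxon (P703) Pinus sylvestris (Q133128):
const url = queryUrl(claimQuery('P703', 'Q133128'));
```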

https://www.wikidata.org/wiki/User:Lagewi/properties.js

Screenshot of three claims on a Wikidata page. Below each property is a link titled "List entities with this property", and below each value a link titled "List entities with this claim".

User:Lagewi/references.js

Sometimes, the data on Wikidata does not answer all your questions. Some types of information are difficult to encode in statements, or simply have not been encoded on Wikidata yet. In such cases, it might be useful to go through the references attached to the entity's claims for additional information. To simplify this process, this user script lists all unique references based on stated in (P248) and reference URL (P854). The references are listed in a collapsible list below the table of labels and descriptions, collapsed by default so it is not obtrusive.
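In outline, the deduplication could look something like this (a sketch with a hypothetical helper name; the actual script reads the claims of the page it runs on):

```javascript
// Collect one entry per unique stated in (P248) item or reference URL (P854)
// across all references of all statements of an entity.
function collectUniqueReferences(claims) {
  const seen = new Set();
  for (const statements of Object.values(claims)) {
    for (const statement of statements) {
      for (const reference of statement.references || []) {
        const snaks = reference.snaks || {};
        for (const snak of snaks.P248 || []) {
          seen.add('item:' + snak.datavalue.value.id);
        }
        for (const snak of snaks.P854 || []) {
          seen.add('url:' + snak.datavalue.value);
        }
      }
    }
  }
  return [...seen];
}
```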

https://www.wikidata.org/wiki/User:Lagewi/references.js

Screenshot of the Wikidata page of Sylvie Deleurance (Q122350848) with, below the list of labels, descriptions, and aliases, a collapsible list titled "Other resources", containing links.

User:Lagewi/bibliography.js

Going a step further than the previous script, this user script appends a full bibliography at the bottom of the page. This uses the (relatively) new citation-js tool hosted on Toolforge (made using Citation.js). Every reference in the list of claims has a footnote link to its entry in the bibliography, and every entry in the bibliography has a list of back-references to the claims where the reference is used, labeled by the property number.
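The bookkeeping behind the footnotes and back-references can be sketched as follows (a hypothetical helper, not the script's actual code): each unique reference gets a bibliography number, and each bibliography entry records the properties of the claims that cite it.

```javascript
// citations: array of { reference, property } pairs in page order.
// Returns a Map from reference to its footnote number and back-references.
function buildBibliographyIndex(citations) {
  const entries = new Map();
  for (const { reference, property } of citations) {
    if (!entries.has(reference)) {
      // First occurrence: assign the next footnote number.
      entries.set(reference, { number: entries.size + 1, backReferences: [] });
    }
    // Every occurrence adds a back-reference, labeled by property number.
    entries.get(reference).backReferences.push(property);
  }
  return entries;
}
```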

https://www.wikidata.org/wiki/User:Lagewi/bibliography.js

Screenshot of two claims on Wikidata, each with a reference. Both references have a darker grey header containing a link, respectively "[4]" and "[5]".

Screenshot of a numbered list titled "Bibliography". Each item in the list is a formatted reference, and below each reference is a series of links titled with a property number.

Saturday, May 25, 2019

Citation.js: Wikidata Update

The new update, v0.4.4, contains a few Wikidata improvements (commit):

  • 22 new mappings
  • 2 fixed mappings (ISSN did not work and publisher-place was mapped to the wrong thing)
  • 2 improved mappings (container-title for chapters and more URL mappings)
  • 1 removed mapping (genre was inconsistent with the intended use, although it followed the specification)

Because most of these mappings require additional resources (recipient has to fetch people, review-* has to fetch the review subject, and original-publisher-place has to fetch the country and location of the publisher of the original version of a work), this would have caused many more requests to the API. It is a good thing, then, that there is a new resolver in place, which pre-fetches such information for all requested works at once.

This has to be done in levels; you cannot fetch the country if you do not have the location, the location requires the publisher, and to find out the publisher you have to have the work already. However, temporarily ignoring the cap of 50 items per request, the number of requests now depends only on the number of levels (at most five), instead of on the number of claims requiring additional requests.
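A minimal sketch of that idea, assuming the standard wbgetentities endpoint (the helper names are mine, not the actual resolver's): collect all IDs needed at one level, then fetch them together in batches of at most 50.

```javascript
// Split a list of entity IDs into batches respecting the 50-item API cap.
function chunk(ids, size = 50) {
  const batches = [];
  for (let i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}

// Per level: one wbgetentities call per batch of at most 50 IDs.
async function fetchLevel(ids) {
  const entities = {};
  for (const batch of chunk(ids)) {
    const url = 'https://www.wikidata.org/w/api.php?action=wbgetentities' +
      '&ids=' + batch.join('|') + '&format=json&languages=en';
    const response = await fetch(url);
    Object.assign(entities, (await response.json()).entities);
  }
  return entities;
}
```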

For example, item Q21972834 (Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data) takes:

  • 6 requests in v0.4.0-rc.2, before requests for properties were grouped (so it would make a request for every author)
  • 3 requests in v0.4.3: one for the work, one for the authors, and one for journal
  • 2 requests in v0.4.4, which includes a bunch of new mappings that would have cost a total of 5 requests in the previous version

Grouping the requests from all works has the added benefit of not having to fetch the same author and journal info repeatedly — which can be quite common for bibliographies. For example, the bibliography of Groovy Cheminformatics with the Chemistry Development Kit, currently containing 74 items, takes 11 requests in the new version. For the same bibliography, the previous version would make at least 150 requests — for each item one for all the authors and one for the journal — and probably more than that.

[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q24599948%7CQ24650571%7CQ25713029%7CQ27061829%7CQ27062363%7CQ27062596%7CQ27065423%7CQ27093381%7CQ27134682%7CQ27134746%7CQ27134827%7CQ27162658%7CQ27211680%7CQ27499209%7CQ27656255%7CQ27783585%7CQ27783587%7CQ27902272%7CQ28090714%7CQ28133283%7CQ28186592%7CQ28837846%7CQ28837922%7CQ28837925%7CQ28837939%7CQ28837943%7CQ28837947%7CQ28842810%7CQ28842968%7CQ28843132%7CQ29039683%7CQ29042322%7CQ29616639%7CQ30149558%7CQ30853915%7CQ31127242%7CQ33874102%7CQ34160151%7CQ34206190%7CQ36662828%7CQ37988904%7CQ39811432%7CQ42704791%7CQ47543807%7CQ47632144%7CQ54062338%7CQ55880270%7CQ55934414%7CQ55954394%7CQ56112883&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q56170978%7CQ56454405%7CQ57836257%7CQ59771351%7CQ60167226%7CQ60167327%7CQ60167615%7CQ60167690%7CQ61463648%7CQ61649587%7CQ61779181%7CQ61779373%7CQ61779901%7CQ61779905%7CQ61779940%7CQ61780124%7CQ62926016%7CQ62926155%7CQ62927888%7CQ62968825%7CQ62969352%7CQ62969354%7CQ63367539%7CQ63367548&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q135122%7CQ1223539%7CQ59196164%7CQ1860%7CQ8513%7CQ766195%7CQ7441%7CQ28946414%7CQ50332566%7CQ46330054%7CQ57218960%7CQ57218835%7CQ55965812%7CQ43370919%7CQ37391332%7CQ57677801%7CQ60446154%7CQ60447576%7CQ60449513%7CQ29946263%7CQ60465926%7CQ60890297%7CQ20895241%7CQ910067%7CQ3007982%7CQ52113739%7CQ5111731%7CQ29405902%7CQ47473872%7CQ121182%7CQ128570%7CQ910164%7CQ2383032%7CQ1130645%7CQ908710%7CQ5727848%7CQ27065426%7CQ28946652%7CQ42783959%7CQ4420286%7CQ749647%7CQ309823%7CQ28540892%7CQ29052381%7CQ29052386%7CQ4914910%7CQ12149006%7CQ853614%7CQ1418791%7CQ50731867&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q56007851%7CQ5195068%7CQ11351%7CQ75%7CQ151332%7CQ3002926%7CQ27061944%7CQ27134800%7CQ6294930%7CQ27061853%7CQ28540435%7CQ28854723%7CQ766383%7CQ1425625%7CQ28540616%7CQ55406442%7CQ483666%7CQ11173%7CQ39972290%7CQ1069211%7CQ36534%7CQ27211732%7CQ251%7CQ1768406%7CQ1689854%7CQ30046697%7CQ55406542%7CQ5227350%7CQ38372872%7CQ38373802%7CQ3841253%7CQ55213915%7CQ7440973%7CQ30045595%7CQ451553%7CQ466769%7CQ203250%7CQ54837%7CQ3355939%7CQ40023319%7CQ26842658%7CQ109081%7CQ82264%7CQ898902%7CQ27711423%7CQ1767639%7CQ7209103%7CQ27768873%7CQ900316%7CQ48803&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q1884753%7CQ15064016%7CQ895901%7CQ2166958%7CQ907701%7CQ309%7CQ28796322%7CQ19845641%7CQ27061849%7CQ28923506%7CQ30046701%7CQ28865170%7CQ43370334%7CQ29387575%7CQ50983316%7CQ7395247%7CQ192864%7CQ336658%7CQ58409692%7CQ56421913%7CQ55182163%7CQ55182165%7CQ56670283%7CQ925779%7CQ50731930%7CQ909510%7CQ381009%7CQ38173%7CQ30046335%7CQ43370883%7CQ3705921%7CQ1154615%7CQ2334061%7CQ1988917%7CQ898967%7CQ2425378%7CQ10354104%7CQ47480%7CQ1709878%7CQ900502&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q1120519%7CQ1860%7CQ113337%7CQ1767639%7CQ27134800%7CQ6294930%7CQ251%7CQ5776092%7CQ56421877%7CQ463360%7CQ109081%7CQ50731930%7CQ176916%7CQ26707540%7CQ30046697%7CQ28925563%7CQ1122491%7CQ3186908%7CQ969707%7CQ1948400%7CQ192864%7CQ898902%7CQ2835897%7CQ3007982%7CQ46155617%7CQ905549%7CQ485223%7CQ900502%7CQ15766522&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q7050%7CQ32%7CQ64%7CQ225471&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q47501431&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q33467704&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q55&format=json&languages=en {}  
[core] GET https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q183%7CQ145&format=json&languages=en {}  

However, I do notice slightly more requests with a low number of items than I would expect. I will look into that.

Conference papers in Wikidata

Part of the mappings that were added in the recent update are the event-* properties, i.e. “name of the related event (e.g. the conference name when citing a conference paper)” (spec). Finding out how those are modelled in Wikidata proved a bit of a challenge, as most instances of Q23927052 (conference paper) are published in proceedings with the book type. To get some numbers (query):

type typeLabel items
Q571 book 715
Q5633421 scientific journal 676
Q1143604 proceedings 315
Q16024164 medical journal 53
Q23927052 conference paper 48
Q1002697 periodical literature 32
Q41298 magazine 29
other other 61

48 conference papers are published in conference papers? That does not seem right. Perhaps they just have the wrong type, but still link to the event? Well, no, not really (query).

hasEvent items
false 1952
true 6

Luckily, the information is not lost: usually, the label contains location and date information about the event, although extracting that would probably be a tedious task. However, perhaps there are a lot of proper proceedings out there, and the articles linked to them just have the wrong type (query)?

type typeLabel items
Q13442814 scholarly article 3513
Q23927052 conference paper 315
Q3331189 version, edition, or translation 11
Q1143604 proceedings 9
Q10885494 scientific conference paper 7
Q1980247 chapter 6
Q191067 article 5
Q18918145 academic journal article 4
Q333291 abstract 3
other other 10

That seems to be the case: most works that are published in instances of proceedings are tagged as instances of scholarly articles, and only nine are tagged as both. And: relatively many have events linked to them (query).

hasEvent items
true 2692
false 1184

That’s better. I’ll look into working with WikiCite on the rest.

Thursday, January 3, 2019

Citation.js: Wikidata Subclasses

I am in the process of creating a better way to map CSL types to Wikidata types together with Jakob Voß, and while listing all subtypes of web page, I came across a number of these cases (link):

#defaultView:Graph
SELECT ?a ?aLabel ?b ?bLabel WHERE {
  wd:Q58803899 wdt:P279* ?a .
  ?a wdt:P279 ?b .
  FILTER EXISTS {
    ?b wdt:P279* wd:Q386724 .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

I’m no expert on Wikidata data models, nor on the semantics of each of those items in specific, but I’m pretty sure Count of Barcelos (Q58803899) is not a subclass of web page (Q36774). However, the individual parts seem pretty reasonable, so where does it go wrong then?

Luckily, the description of subclass of (P279) provides some good guidance:

all instances of these items are instances of those items; this item is a class (subset) of that item.

So let’s check those:

  1. All counts of Barcelos are Portuguese counts (conde), as Barcelos lies in Portugal
  2. All condes are counts
  3. Counts are not hereditary titles themselves, so I think this should be an instance of (P31). Also, counts are indirectly a subclass of royal or noble rank through hereditary titles, but directly an instance of royal or noble rank.
  4. Hereditary titles do not seem like an exact subclass of royal or noble ranks to me, but it passes the basic test
  5. I am unsure about royal or noble rank as subclass of rank, since the latter seems way more abstract, but perhaps that’s the point
  6. (rank -> first-order metaclass) I have no idea what the intention with these metaclasses is, so I’m going to assume this is all correct (bold move)
  7. first-order metaclass -> fixed-order metaclass
  8. fixed-order metaclass -> Wikidata metaclass: “class of classes, class whose instances are classes”
  9. Wikidata metaclass -> class or metaclass of Wikidata ontology (should maybe be P31 as well, still no clue though)
  10. class or metaclass of Wikidata ontology -> Wikidata item: finally back to some form of concreteness. Problem being, not all instances of royal or noble ranks are Wikidata items, so something went wrong somewhere in between, maybe even at (6).
  11. Wikidata item -> Wikidata internal entity
  12. Wikidata internal entity -> Wikidata internal item
  13. Wikidata internal item -> MediaWiki page
  14. MediaWiki page -> web page: actually pretty reasonable, instances of Wikidata items are web pages.

There are a little less than 900 instances of web page so the impact is smaller than I had expected, but it’s still annoying. There are 191 subclasses of conde, not to mention the total of 280 royal or noble ranks that have no business being a subclass of web page.

Wednesday, July 18, 2018

Citation.js: Use Case for a Wikidata GraphQL API

Citation.js has supported Wikidata input for a long time. However, I’ve always had some trouble with the API. See, when Citation.js processes Wikidata API output (which looks like this) and gets to, say, the P50 (author) property, it encounters this:

"P50": [
	{
		"mainsnak": {
			"snaktype": "value",
			"property": "P50",
			"hash": "1202966ec4cf715d3b9ff6faba202ac6c6ac3df8",
			"datavalue": {
			"value": {
				"entity-type": "item",
				"numeric-id": 2062803,
				"id": "Q2062803"
			},
			"type": "wikibase-entityid"
			},
			"datatype": "wikibase-item"
		},
		"type": "statement",
		"id": "Q46601020$692cc18d-4f54-eb65-8f0a-2fbb696be564",
		"rank": "normal"
	}
]

The problem with this is that there’s no name string readily available: to get the name of this author, or of any author, journal, publisher, etcetera, Citation.js has to make extra queries to the API to get the data.
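To illustrate, a sketch of the extra round trip (hypothetical helpers; Citation.js's internals differ): first extract the item IDs from the P50 claims, then build a second wbgetentities call just to resolve those IDs into labels.

```javascript
// Pull the author item IDs out of the P50 claims of an entity.
function authorIds(claims) {
  return (claims.P50 || []).map(
    statement => statement.mainsnak.datavalue.value.id
  );
}

// Build the follow-up API call that is needed just to get their names.
function labelQueryUrl(ids) {
  return 'https://www.wikidata.org/w/api.php?action=wbgetentities' +
    '&ids=' + ids.join('|') + '&props=labels&languages=en&format=json';
}
```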

In the case of people, you could then just grab the label, but there’s also P735 (given name) and P734 (family name) in Wikidata. That saves some error-prone name parsing, you might think. However, this is what the API output looks like:

{
    "P735":[
        {
            "mainsnak":{
                "snaktype":"value",
                "property":"P735",
                "hash":"26c75e68a9844db73d0ff2e0da5652c5d571e46d",
                "datavalue":{
                    "value":{
                        "entity-type":"item",
                        "numeric-id":15635262,
                        "id":"Q15635262"
                    },
                    "type":"wikibase-entityid"
                },
                "datatype":"wikibase-item"
            },
            "type":"statement",
            "id":"Q22581$3554EADD-B8D8-4506-905B-014823ECC3EA",
            "rank":"normal"
        }
    ],
    "P734":[
        {
            "mainsnak":{
                "snaktype":"value",
                "property":"P734",
                "hash":"030e6786766f927e67ed52380f984be79d0f6111",
                "datavalue":{
                    "value":{
                        "entity-type":"item",
                        "numeric-id":41587275,
                        "id":"Q41587275"
                    },
                    "type":"wikibase-entityid"
                },
                "datatype":"wikibase-item"
            },
            "type":"statement",
            "id":"Q22581$598DF0D7-CEC7-470B-8D0F-DD320796BF01",
            "rank":"normal"
        }
    ]
}

Another two dead ends, another two (one, with some effort) API calls. It would be great if it were possible to get this data with a single API call. I think GraphQL would be a good option here. With GraphQL, you can specify exactly what data you want. I’m not the first one to think of this; in fact, a simple example is already implemented. This is what a query would look like (variables: {"item": "Q30000000"}): Try it online!

query ($item: ID!) {
  entry: item(id: $item) {
    # get every property
    # to get specific properties, use "statements(properties: [...])"
    claims: statements {
      mainsnak {
        ... on PropertyValueSnak {
          # get property id and English label
          property {
            id
            name: label(language: "en") {
              text
            }
          }
          # get value
          value {
            ... on MonolingualTextValue {
              value: text
            }
            ... on StringValue {
              value
            }
            # if value is an item, get the label too
            ... on Item {
              id
              label(language: "en") {
                text
              }
            }
            ... on QuantityValue {
              amount
              unit {
                value: label(language: "en") {
                  text
                }
              }
            }
            ... on TimeValue {
              value: time
            }
          }
        }
      }
    }
  }
}

Another handy thing is that the API output is basically the equivalent of the query in JSON, but with the data filled in. I think a GraphQL API would be really useful for this and similar use cases, and it definitely seems possible given the fact that there is an experimental API available.

Sunday, November 27, 2016

Weekly Report 12: Including table facts in Wikidata

This week, the big achievement is the addition of a multi-step form to add the semantic triples from last week to Wikidata with QuickStatements, which we talked about before too. The new '+' icon in table rows now links to a page where you can curate the statement and add Wikidata IDs where necessary. At the last step, you get a table of the existing data, the added identifiers, and (soon) their Wikidata labels. There, you can open the statement in QuickStatements, where it will be added to Wikidata. This is a delicate procedure; it's important to keep an eye on the validity of the statement.

The multi-step form in action (see tweet)

Take the following table:

Fungus Insect Host
Fungus A Insect 1 Host 1
Fungus B Insect 2 Host 2

This is currently interpreted (by the program) as "Fungus A has Insect 1 as Insect and Host 1 as Host", while the actual meaning is "Fungus A is spread by the insect Insect 1, whose host is Host 1" (see this table). While stating that Host 1 is the host of Fungus A is arguably correct, this misinterpretation occurs in much worse ways as well, and it is hard to account for.

On top of that, table cells often don't contain the exact data, but abbreviations. It shouldn't be too hard to annotate abbreviations with their meaning before parsing the tables, but actually linking them to Wikidata entities is still a problem. This is, obviously, the next step, and I'll continue to work on it in the next few weeks. One thing I could do is search for entities based on their label. A problem with that is that I would have to make thousands of calls to the Wikidata API in a short period of time, so I'll probably create a dictionary with one big call, if the resulting size allows it.
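One possible shape for such a dictionary, assuming a single big wbgetentities response and reasonably unique English labels (a sketch, not an actual implementation): map each label to its entity ID, so table values can be looked up locally instead of one API call at a time.

```javascript
// entities: the "entities" object from a wbgetentities response.
// Returns a lowercase label -> entity ID lookup table.
function buildLabelDictionary(entities) {
  const dictionary = {};
  for (const [id, entity] of Object.entries(entities)) {
    const label = entity.labels && entity.labels.en && entity.labels.en.value;
    // Entities without an English label are skipped.
    if (label) dictionary[label.toLowerCase()] = id;
  }
  return dictionary;
}
```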

There is a pull request pending that would solve the issue norma has with tables, so I'll incorporate that into the program as well, making it easier for me as some things will already be annotated.

Conclusion: the 'curating' part of adding the statements to Wikidata is currently *very* important, but I'll do as much as possible to make it easier in the next few weeks.