SPARQL (pronounced “sparkle”, a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language, that is, a semantic query language for databases, able to retrieve and manipulate data stored in Resource Description Framework (RDF) format.
Well, this is not a very good definition. It hardly tells you what SPARQL can do. To translate it into human-readable language:
SPARQL is a query language that resembles SQL in syntax but works on a knowledge graph database like Wikidata, the structured-data sibling of Wikipedia. It lets you extract knowledge and information by defining a series of filters and constraints.
If this is still too abstract for you, look at the image below:
(Awarded Chemistry Nobel Prizes)
It is a timeline of awarded Nobel Prizes in Chemistry, generated by the Wikidata Query Service website using the code below:
#Awarded Chemistry Nobel Prizes
#defaultView:Timeline
SELECT DISTINCT ?item ?itemLabel ?when (YEAR(?when) AS ?date) ?pic
WHERE {
  ?item p:P166 ?awardStat .        # … with an award received (P166) statement
  ?awardStat ps:P166 wd:Q44585 .   # … that has the value Nobel Prize in Chemistry (Q44585)
  ?awardStat pq:P585 ?when .       # … and the point in time (P585) when the prize was awarded
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  OPTIONAL { ?item wdt:P18 ?pic }
}
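The timeline itself is drawn by the website, but the query text can also be sent to the query service from a script. Here is a minimal Python sketch, assuming the public endpoint https://query.wikidata.org/sparql and its format=json parameter. build_query_url is a made-up helper name, and nothing is actually sent over the network here; the code only builds the request URL. The #defaultView:Timeline line is left out because it is a hint for the website's own viewer rather than part of the query.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Same query as above, without the viewer hint and comments.
NOBEL_QUERY = """
SELECT DISTINCT ?item ?itemLabel ?when (YEAR(?when) AS ?date) ?pic
WHERE {
  ?item p:P166 ?awardStat .
  ?awardStat ps:P166 wd:Q44585 .
  ?awardStat pq:P585 ?when .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  OPTIONAL { ?item wdt:P18 ?pic }
}
"""


def build_query_url(query: str, endpoint: str = WDQS_ENDPOINT) -> str:
    """Return a GET URL that asks the endpoint for JSON results (hypothetical helper)."""
    params = urlencode({"query": query.strip(), "format": "json"})
    return endpoint + "?" + params


url = build_query_url(NOBEL_QUERY)
print(url[:80] + "...")
```

Fetching that URL with any HTTP client should return the rows as JSON instead of a timeline; the drawing part is something only the website does for you.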
Let’s start with a much simpler query:
SELECT ?person
WHERE {
  ?person wdt:P106 wd:Q5482740 .
}
Here we defined ?person as the subject of interest; it is also what will appear as a column in our query results. Then we specify some constraints with WHERE. The constraint is that wdt:P106 needs to be wd:Q5482740.
What? you say. Let me explain in more detail. wdt is the prefix of a ‘predicate’, or ‘attribute’, of the subject, while wd is the prefix of a value of the attribute (an ‘object’ in SPARQL terms, but that’s not important). wdt: means I am going to specify an attribute of the subject here, and wd: means I will specify what the value of this attribute is. So what are P106 and Q5482740? These are just codes for the specific attribute and value: P106 stands for ‘occupation’ and Q5482740 stands for ‘programmer’. This line of code means: I want the ?person subject to have an ‘occupation’ attribute with the value ‘programmer’. Not that scary anymore, right? You can find these codes easily on the Wikidata page mentioned above.
Run the query and you’ll get the following results:
We got a bunch of person items with different wd: values. If you look closer at the values, each one is actually the code of a different person. For example, the first one, wd:Q80, is Tim Berners-Lee, the inventor of the WWW. This is not intuitive; we want to see the names directly. To do that, we ask for each item’s ‘label’, which translates the code into a name, like so:
SELECT ?person ?personLabel
WHERE {
  ?person wdt:P106 wd:Q5482740 .
  ?person rdfs:label ?personLabel .
  FILTER ( LANGMATCHES ( LANG ( ?personLabel ), "fr" ) )
}
The syntax is similar: we want the person to have a ‘label’ attribute, and we define a personLabel variable to hold these values so we can display them in the query results. We also added personLabel to our SELECT clause so it will be displayed. Please note that I also added a FILTER to display only the French-language label; otherwise the results would show labels in multiple languages for one person, which is not what we want.
Next, we add each person’s notable work:
SELECT ?person ?personLabel ?notableworkLabel
WHERE {
  ?person wdt:P106 wd:Q5482740 .
  ?person rdfs:label ?personLabel .
  FILTER ( LANGMATCHES ( LANG ( ?personLabel ), "fr" ) )
  ?person wdt:P800 ?notablework .
  ?notablework rdfs:label ?notableworkLabel .
  FILTER ( LANGMATCHES ( LANG ( ?notableworkLabel ), "fr" ) )
}
Again, wdt:P800 means the ‘notable work’ attribute; everything else is similar. We then get the following results:
Each notable work comes back as its own row, so a person with several works appears several times. To merge them into one row per person:
SELECT ?person ?personLabel ( GROUP_CONCAT ( DISTINCT ?notableworkLabel; separator="; " ) AS ?works )
WHERE {
  ?person wdt:P106 wd:Q5482740 .
  ?person rdfs:label ?personLabel .
  FILTER ( LANGMATCHES ( LANG ( ?personLabel ), "fr" ) )
  ?person wdt:P800 ?notablework .
  ?notablework rdfs:label ?notableworkLabel .
  FILTER ( LANGMATCHES ( LANG ( ?notableworkLabel ), "fr" ) )
}
GROUP BY ?person ?personLabel
Here ‘GROUP BY’ is used. The GROUP_CONCAT function is also used to concatenate multiple notableworkLabel values into a new column, works. (I will not explain how these functions work; I just want to quickly show you what SPARQL can do. Please feel free to Google if you want to know more; there are plenty of tutorial articles and videos out there.)
Next, we add an optional image of each person:
SELECT ?person ?personLabel ( GROUP_CONCAT ( DISTINCT ?notableworkLabel; separator="; " ) AS ?works ) ?image
WHERE {
  ?person wdt:P106 wd:Q5482740 .
  ?person rdfs:label ?personLabel .
  FILTER ( LANGMATCHES ( LANG ( ?personLabel ), "fr" ) )
  ?person wdt:P800 ?notablework .
  ?notablework rdfs:label ?notableworkLabel .
  FILTER ( LANGMATCHES ( LANG ( ?notableworkLabel ), "fr" ) )
  OPTIONAL { ?person wdt:P18 ?image }
}
GROUP BY ?person ?personLabel ?image
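If GROUP BY and GROUP_CONCAT still feel like magic, here is roughly what they do to the rows, written as a small Python sketch. The sample rows are made up for illustration (they are not real query output), and group_concat is a hypothetical helper, not a library function.

```python
from collections import defaultdict

# Made-up sample rows of (person label, notable work label).
# One row per person/work pair, as the ungrouped query returns them.
rows = [
    ("Person A", "Work 1"),
    ("Person A", "Work 2"),
    ("Person A", "Work 1"),  # duplicate, dropped by DISTINCT
    ("Person B", "Work 3"),
]


def group_concat(rows, separator="; "):
    """Group rows by person and join each person's distinct works into one string."""
    grouped = defaultdict(list)
    for person, work in rows:
        if work not in grouped[person]:  # plays the role of DISTINCT
            grouped[person].append(work)
    # One entry per person, like one result row per GROUP BY key.
    return {person: separator.join(works) for person, works in grouped.items()}


works = group_concat(rows)
print(works)  # {'Person A': 'Work 1; Work 2', 'Person B': 'Work 3'}
```

One difference worth knowing: SPARQL does not promise any particular order inside the concatenated string, while this sketch keeps the first-seen order.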
#defaultView:ImageGrid
SELECT ?person ?personLabel ( GROUP_CONCAT ( DISTINCT ?notableworkLabel; separator="; " ) AS ?works ) ?image ?countryLabel ?cood
WHERE {
  ?person wdt:P106 wd:Q5482740 .
  ?person rdfs:label ?personLabel .
  FILTER ( LANGMATCHES ( LANG ( ?personLabel ), "fr" ) )
  ?person wdt:P800 ?notablework .
  ?notablework rdfs:label ?notableworkLabel .
  FILTER ( LANGMATCHES ( LANG ( ?notableworkLabel ), "fr" ) )
  OPTIONAL { ?person wdt:P18 ?image }
  OPTIONAL {
    ?person wdt:P19 ?country .
    ?country rdfs:label ?countryLabel .
    ?country wdt:P625 ?cood .
    FILTER ( LANGMATCHES ( LANG ( ?countryLabel ), "fr" ) )
  }
}
GROUP BY ?person ?personLabel ?image ?countryLabel ?cood
You can probably decipher the code above yourself. It basically says: I want this person’s place of birth (P19), put into a variable called country (the name is loose, since P19 can be any place, not only a country), then find the coordinates (P625) of that place and put them into a variable cood. With the coordinates, we can activate the ‘map’ view.
You can click the ‘Example’ button on the Wikidata page to find more fun and interesting examples of what you can do with it.
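If you take the results out of the website, the coordinate values in ?cood arrive as text of the form Point(longitude latitude), with the longitude first. Here is a minimal Python sketch for splitting such a value, assuming plain decimal numbers. parse_point is a hypothetical helper and the sample value is made up.

```python
import re
from typing import Tuple

# Matches literals such as "Point(2.35 48.86)"; longitude comes first.
_POINT_RE = re.compile(
    r"^\s*Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def parse_point(wkt: str) -> Tuple[float, float]:
    """Return (latitude, longitude) from a 'Point(lon lat)' text value."""
    match = _POINT_RE.match(wkt)
    if match is None:
        raise ValueError(f"not a point literal: {wkt!r}")
    lon, lat = (float(group) for group in match.groups())
    return lat, lon  # swapped into the more familiar lat, lon order


lat, lon = parse_point("Point(2.35 48.86)")  # made-up sample value
print(lat, lon)  # 48.86 2.35
```

Returning latitude first is only a convenience choice here, since many mapping tools expect that order; keep the original order if your tool wants longitude first.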
Previously published at //towardsdatascience.com/how-to-extract-knowledge-from-wikipedia-data-science-style-35f50f095d1a