 Feature Article

Science Forum Wikidata as a knowledge graph for the life sciences

SELECT DISTINCT ?compound ?compoundLabel WHERE {
  # (i) gene has genetic association with a respiratory disease
  ?gene wdt:P31 wd:Q7187 .
  ?gene wdt:P2293 ?diseaseGA .
  ?diseaseGA wdt:P279* wd:Q3286546 .

  # (ii) gene product is localized to the membrane
  ?gene wdt:P688 ?protein .
  ?protein wdt:P681 ?cc .
  ?cc wdt:P279*|wdt:P361* wd:Q14349455 .

  # (iii) gene is involved in a pathway with another gene ("gene2")
  ?pathway wdt:P31 wd:Q4915012 ;
           wdt:P527 ?gene ;
           wdt:P527 ?gene2 .
  ?gene2 wdt:P31 wd:Q7187 .

  # (iv) gene2 product has a Ser/Thr protein kinase domain AND
  # (v) known enzyme inhibitor
  ?gene2 wdt:P688 ?protein2 .
  ?protein2 wdt:P129 ?compound ;
            wdt:P527 wd:Q24787419 ;
            p:P129 ?s2 .
  ?s2 ps:P129 ?cp2 .
  ?compound wdt:P31 wd:Q11173 .
  FILTER EXISTS { ?s2 pq:P366 wd:Q427492 . }

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

[Schematic, right of the query: (i) Gene 1 has a genetic association with a respiratory system disease (disease nodes shown: disease of anatomical entity, endocrine system disease, immune system disease, respiratory system disease, lower respiratory tract disease, upper respiratory tract disease); (ii) Gene 1 encodes Protein 1, which is part of the membrane; (iii) a biological pathway has part Gene 1 and has part Gene 2; (iv) Gene 2 encodes Protein 2, which contains part serine/threonine protein kinase; (v) Protein 2 has a physical interaction with a chemical compound. Legend: items relating to pathology; items relating to cell biochemistry.]
Figure 3. A representative SPARQL query that integrates data from multiple data resources and annotation types. This example integrative query incorporates data on genetic associations to disease, Gene Ontology annotations for cellular compartment, protein target information for compounds, pathway data, and protein domain information. Specifically, this query (depicted schematically at right) retrieves genes that are (i) associated with a respiratory system disease, (ii) that encode a membrane-bound protein, and (iii) that sit within the same biochemical pathway as (iv) a second gene encoding a protein with a serine-threonine kinase domain and (v) a known inhibitor, and reports a list of those inhibitors. Aspects related to Disease Ontology in blue; aspects related to biochemistry in red/orange; aspects related to chemistry in green. Properties are shown in italics. Realtime query results can be viewed at https://w.wiki/6pZ.

Almost any competent informatician can perform the query described above by integrating cell localization data from Gene Ontology annotations, genetic associations from GWAS Catalog, disease subclass relationships from the Human Disease Ontology, pathway data from WikiPathways and Reactome, compound targets from the IUPHAR Guide to Pharmacology, and protein domain information from InterPro. However, actually performing this data integration is a time-consuming and error-prone process. At the time of publication of this manuscript, this Wikidata query completed in less than 10 s and reported 31 unique compounds. Importantly, the results of that query will always be up-to-date with the latest information in Wikidata. This query, and other example SPARQL queries that take advantage of the rich, heterogeneous knowledge network in Wikidata, are available at https://www.wikidata.org/wiki/User:ProteinBoxBot/SPARQL_Examples. That page additionally demonstrates federated SPARQL queries that perform complex queries across other biomedical SPARQL endpoints. Federated queries are useful for accessing data that cannot be included in Wikidata directly due to limitations in size, scope, or licensing.
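For readers who want to run such queries programmatically rather than through the web interface, the following is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not the authors' tooling: the function names (build_request, run_query, extract_rows) and the User-Agent string are hypothetical, and the example query is only step (i) of the Figure 3 query, shortened with a LIMIT. The endpoint URL is the public Wikidata Query Service, and the result parsing follows the standard SPARQL JSON results layout.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Step (i) of the Figure 3 query only, shortened for illustration.
EXAMPLE_QUERY = """
SELECT DISTINCT ?gene ?geneLabel WHERE {
  ?gene wdt:P31 wd:Q7187 .
  ?gene wdt:P2293 ?diseaseGA .
  ?diseaseGA wdt:P279* wd:Q3286546 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""


def build_request(query, user_agent="example-sparql-client/0.1"):
    """Build a GET request asking the endpoint for SPARQL JSON results."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        # Hypothetical value; replace it with one identifying your own tool.
        "User-Agent": user_agent,
    }
    return urllib.request.Request(url, headers=headers)


def run_query(query):
    """Send the query over the network and return the decoded JSON document."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return json.load(response)


def extract_rows(result, variables):
    """Flatten SPARQL JSON bindings into tuples of plain string values.

    Variables that are unbound in a given row are returned as None.
    """
    rows = []
    for binding in result["results"]["bindings"]:
        rows.append(tuple(binding.get(v, {}).get("value") for v in variables))
    return rows
```

Calling run_query(EXAMPLE_QUERY) requires network access; passing its return value to extract_rows(result, ["gene", "geneLabel"]) yields one (entity URI, label) tuple per row. Because the query is evaluated against live Wikidata, the rows returned may differ from one run to the next.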

Waagmeester et al. eLife 2020;9:e52614. DOI: https://doi.org/10.7554/eLife.52614

Crowdsourced curation

Ontologies are essential resources for structuring biomedical knowledge. However, even after the initial effort in creating an ontology is finalized, significant resources must be devoted to maintenance and further development. These tasks include cataloging cross references to other ontologies and vocabularies, and modifying the ontology as current knowledge evolves. Community curation has been explored in a variety of tasks in ontology curation and annotation (see, for example, Bunt et al., 2012; Gil et al., 2017; Putman et al., 2019; Putman et al., 2017; Wang et al., 2016). While community curation offers the potential of distributing these responsibilities over a wider set of scientists, it also has the potential to introduce errors and inconsistencies. Here, we examined how a crowd-based curation model through Wikidata works in practice. Specifically, we designed a hybrid system that combines the aggregated community effort of many individuals with the reliability of expert curation. First, we created a system to monitor, filter, and prioritize changes made by Wikidata contributors to items in the Human Disease Ontology. We initially seeded Wikidata with disease items from the Disease Ontology (DO) starting in late 2015. Beginning in 2018, we
