Page:Wikidata making of.pdf/2

 to thousands of wiki sites (successful, but not used for Wikidata), numerous funding proposals (some failed, some successful), and ideas from many different people, both in the Wikimedia movement and in the (Semantic) Web community.

Each step in this journey has also been witnessed by some or all of the authors of this paper, but a full account of these steps, their causes, and influences, has never been given in a coherent form. In this paper, we therefore embark upon the risky endeavor of recounting a history that is, in part, also our own. The result is nevertheless more than a piece of self-recorded oral history, since available online sources allow us to reconstruct not just what happened, but often also what the original plans and motivations have been, and how they have changed over time. Our subjective perspective will still play a major role in filling the gaps, offering explanations, and deriving objectives for the future.

Overall, we hope that our work can provide relevant insights not just about Wikidata, but also about the history of three influential ideals that have found their expression in many social, political, and technological developments of our time, especially on the Web:


 * 1) Community: the confidence that sensible people will work together to make the world a better place
 * 2) Sharing: the goal of making knowledge, and digital resources in general, freely available to every human
 * 3) Explication: the goal to formally specify information in explicit, unambiguous, and machine-processable ways

These ideals are neither universally accepted nor free of internal conflicts, but they continue to inspire. All three of them are closely tied to the development of the Web [8], to which they have also made important individual contributions: community is the basis of the wiki principle [34], sharing is the driving force of the open source and open knowledge movements, and explication through formal specification has motivated strong Web standards and the Semantic Web activity [7, 61]. Wikipedia naturally combines community and sharing, but Wikidata has pioneered the reconciliation of all three ideals.

2 WIKIDATA AT TEN YEARS OF AGE

Before discussing its development further, we take a closer look at what Wikidata is today, to set it apart both from other activities and from its own former visions. As stated above, Wikidata is a knowledge graph, a community, an online platform, and a Wikimedia project. The aforementioned ideals are strongly represented in its design:

 * 1) Community: all content (data and schema) is directly controlled by an open community, not by the development team at Wikimedia Deutschland
 * 2) Sharing: the data is licensed under Creative Commons CC-0, which imposes no restrictions on usage or distribution
 * 3) Explication: content is structured according to its own data model, is exported in the RDF standard, and is open to machine reading/writing through APIs

In addition to these fundamentals, Wikidata is also characterized by several further design choices:

 * 1) Multi-linguality: One Wikidata serves all languages; user-visible labels are translated, but underlying concepts and structures are shared; language-independent IDs are used
 * 2) Verifiability, not truth: Wikidata relies on external sources for confirmation; statements can come with references; conflicting or debated standpoints may co-exist
 * 3) Integration with Wikimedia: Wikidata is a data backbone for other Wikimedia projects (linking articles on the same topic across languages, providing data displayed in Wikipedia articles, supplying image tags for Wikimedia Commons, etc.)
 * 4) Identity provider: Wikidata concepts have stable, language-independent identifiers, linked with other resources (catalogs, archives, social networks, etc.) via external identifiers
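The first and fourth design choices can be illustrated with a minimal Python sketch. It is not Wikidata's actual data model or API: the `Item` class, its `label` method, and the `example-catalog` identifier are hypothetical names introduced here for illustration. The sketch only shows the idea of one shared, language-independent ID carrying per-language labels and links to external identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """A concept with one language-independent ID (illustrative only)."""
    item_id: str
    labels: dict[str, str] = field(default_factory=dict)        # language code -> label
    external_ids: dict[str, str] = field(default_factory=dict)  # catalog name -> identifier

    def label(self, lang: str, fallback: str = "en") -> str:
        # Prefer the requested language, then the fallback language,
        # and finally the bare ID, which is always available.
        return self.labels.get(lang) or self.labels.get(fallback) or self.item_id


# Q183 is the Wikidata ID for Germany; the external identifier is a placeholder.
germany = Item(
    item_id="Q183",
    labels={"en": "Germany", "de": "Deutschland", "fr": "Allemagne"},
    external_ids={"example-catalog": "ABC-123"},  # hypothetical catalog and value
)

print(germany.label("de"))  # label in the requested language
print(germany.label("xx"))  # unknown language falls back to English
```

Because statements would refer to the ID rather than to any label, adding or correcting a translation in this sketch never changes the underlying structure.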

These design choices distinguish Wikidata from many other structured knowledge collection efforts. Various projects rely on information extraction, partly from Wikipedia pages, most notably Yago [64], DBpedia [4], and Knowledge Vault [14]. These projects differ from Wikidata in their lack of direct community control, their mono-linguality, and their lack of verifiability (no references). Stronger similarities exist with the late Freebase [9], Metaweb’s (and later Google’s) discontinued knowledge graph community, and indeed some of this data was incorporated into Wikidata after the closing of that project [50]. Another related project is Semantic MediaWiki [26], on which we will have more to say later.

The data collected in most of these projects can also be considered knowledge graphs, i.e., structured data collections that encode meaningful information in terms of (typed, directed) connections between concepts. Nevertheless, the actual data sets are completely different, both in their vocabulary and their underlying data model. In comparison to other approaches, Wikidata has one of the richest graph formats, where each statement (edge in the graph) can have user-defined annotations (e.g., validity time) and references.
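To make the notion of an annotated edge concrete, the following Python sketch models a statement as a typed, directed connection with annotations and references. It is a simplified illustration under stated assumptions, not Wikidata's actual data model: the `Statement` class, the `select` helper, and all identifiers and values (`city-1`, `population`, `point_in_time`, `census-2010`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One typed, directed edge (subject --prop--> value) plus annotations."""
    subject: str
    prop: str
    value: str
    qualifiers: dict[str, str] = field(default_factory=dict)  # e.g., validity time
    references: list[str] = field(default_factory=list)       # supporting sources


def select(statements, subject, prop, **qualifiers):
    """Return statements for (subject, prop) that match every given qualifier."""
    return [
        s for s in statements
        if s.subject == subject
        and s.prop == prop
        and all(s.qualifiers.get(k) == v for k, v in qualifiers.items())
    ]


# Hypothetical data: several values for the same property can co-exist,
# distinguished by their annotations and by whether they cite a source.
graph = [
    Statement("city-1", "population", "1000", {"point_in_time": "2010"}, ["census-2010"]),
    Statement("city-1", "population", "1200", {"point_in_time": "2020"}, ["census-2020"]),
    Statement("city-1", "population", "1150", {"point_in_time": "2020"}),  # no reference
]

in_2020 = select(graph, "city-1", "population", point_in_time="2020")
referenced = [s.value for s in in_2020 if s.references]
print(referenced)
```

A plain triple store would have to pick one population value, whereas here the conflicting 2020 values are both kept and a consumer can choose to trust only the referenced one.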

Today, Wikidata is at the core of the Wikimedia projects, a central resource of the world-wide knowledge ecosystem, and an integral part of technologies the world uses every day. Basic statistics are summarized in Table 1. Users of Wikidata’s data include technology organizations (e.g., Google, IBM [19, 44], Quora [74], reddit, Wolfram Alpha [60], Apple, Amazon, OpenAI [53], Twitter [23]) and cultural and educational institutions (e.g., the Met [35], Smithsonian [62], Internet Archive [47], The Science Museum [15], dblp [55]).