
Citation Detective: a Public Dataset to Improve and Quantify Wikipedia Citation Quality at Scale

3.1.4 Citation Need Model Prediction.
We feed the word embeddings and section embeddings to the Citation Need model, which predicts a score ŷ in the range [0, 1]: the higher the score, the more likely the sentence needs a citation.
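The interface of this step can be sketched as follows. The real Citation Need model is a trained neural network; the linear-plus-sigmoid head below, and all names in it, are purely illustrative assumptions used to show how embeddings map to a score in [0, 1].

```python
import math

def citation_need_score(word_embeddings, section_embedding, weights, bias):
    """Hypothetical scoring head (NOT the actual Citation Need model):
    average the word embeddings, append the section embedding, and pass
    a linear score through a sigmoid so the result lies in (0, 1)."""
    dim = len(word_embeddings[0])
    avg = [sum(vec[i] for vec in word_embeddings) / len(word_embeddings)
           for i in range(dim)]
    features = avg + list(section_embedding)
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid keeps the score in (0, 1)

# toy example: two 3-d word vectors and a 2-d section vector
score = citation_need_score(
    word_embeddings=[[0.1, 0.4, -0.2], [0.3, 0.0, 0.5]],
    section_embedding=[1.0, 0.0],
    weights=[0.5, -0.3, 0.8, 1.2, -0.4],
    bias=0.1,
)
```

Whatever the internal architecture, the contract is the same: embeddings in, a single citation need score ŷ ∈ [0, 1] out.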

3.1.5 Storing Data into Database.
In the last step, we store in a SQL database each sentence with a predicted score ŷ ≥ 0.5, together with the text of the paragraph containing the sentence, the section title, the revision ID of the article, and the predicted citation need score. The schema is shown in Table 1.

Table 1: Schema of Citation Detective
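A minimal sketch of this storage step is shown below, using Python's built-in sqlite3 for self-containedness. The column names follow the fields listed above but are assumptions, not the exact production schema on Toolforge.

```python
import sqlite3

# Illustrative schema following Table 1 (column names are assumed)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE statements (
        sentence  TEXT,     -- sentence text, in original Wikicode
        paragraph TEXT,     -- paragraph containing the sentence
        section   TEXT,     -- section title
        rev_id    INTEGER,  -- revision ID of the article
        score     REAL      -- predicted citation need score
    )
""")

predictions = [
    ("Foo is the largest city in Bar.", "Example paragraph.", "History", 1234, 0.91),
    ("The sky is blue.", "Example paragraph.", "Intro", 1234, 0.12),
]
# keep only sentences the model flags as likely needing a citation
rows = [p for p in predictions if p[4] >= 0.5]
conn.executemany("INSERT INTO statements VALUES (?, ?, ?, ?, ?)", rows)
stored = conn.execute("SELECT COUNT(*) FROM statements").fetchone()[0]
```

Only the first sentence crosses the 0.5 threshold, so a single row is stored.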

3.2 System Implementation Details
In this section we briefly describe the implementation of Citation Detective and the key design decisions learned during the technology transfer.

When processing the text corpus of Wikipedia articles, we need to parse Wikitext, also known as Wiki markup or Wikicode, which consists of special syntax and templates for inserting images, hyperlinks, tables, etc. mwparserfromhell provides an easy-to-use and powerful parser for Wikicode, letting us extract section titles and filter out infoboxes, images, and tables that should not be passed to the model. While we preprocess the data for the Citation Need model, in the Citation Detective database we store sentences in the original, unprocessed Wikicode format, which means sentences may contain Wiki markup such as templates and links. This design decision makes the data easier for other tools to consume: a tool only has to look for the stored text in the Wikicode of the specified revision. While plain text is easier for humans to read, matching it back to the corresponding Wikicode is nontrivial for machine-assisted tools and other stakeholders.
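The design decision above can be illustrated with a toy example. A sentence stored in raw Wikicode can be located in the revision's wikitext with a plain substring search, while a plain-text rendering of the same sentence (here produced with a toy regex, not mwparserfromhell) no longer matches:

```python
import re

# toy wikitext for a hypothetical revision
wikitext = (
    "== History ==\n"
    "The city was founded in 1850 by [[John Smith]].\n"
    "{{Infobox settlement|name=Foo}}\n"
)

# sentence as Citation Detective stores it: raw Wikicode, markup included
stored = "The city was founded in 1850 by [[John Smith]]."
found = stored in wikitext  # a plain substring search succeeds

# a naive plain-text rendering (toy regex, NOT mwparserfromhell) loses
# the link markup, so it can no longer be found verbatim in the source
plain = re.sub(r"\[\[|\]\]", "", stored)
plain_found = plain in wikitext
```

Here `found` is true and `plain_found` is false, which is exactly why storing the original Wikicode keeps downstream matching trivial.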

Since the system is meant to work on a large number of Wikipedia articles, efficiency is an important concern. In practice, to classify sentences at scale, we leverage multiple processes on a given machine. We observed that one of the system's bottlenecks is the time the Wikipedia API needs to return the content of the articles; we therefore use a pool of worker processes to distribute the querying task. To parallelize model prediction as well, we load the Citation Need model in a separate process and share it with the worker processes, which communicate with the server process via a proxy to perform prediction tasks. In our experiments, the multiprocess version achieves a 3.3x speedup over the single-process version. For the first version of Citation Detective, we set the article_sample_rate to 0.2.

Table 2: Summary of data for Citation Quality Analysis

3.3 Database Release and Update
The Citation Detective database is now available on Wikimedia Toolforge as the public SQL database citationdetective_p. At every update, Citation Detective takes a random 2% sample of the articles in English Wikipedia, around 120 thousand articles, resulting in around 380 thousand sentences in the database classified as needing citations. Access to the database from outside the Toolforge environment is not currently possible, but is under investigation for the future.

4 ANALYZING CITATION QUALITY AT SCALE
In this section, we provide an example use case for systems like Citation Detective: quantifying the quality of citations in Wikipedia at scale. We use the Citation Need model to quantify citation quality on hundreds of thousands of articles from English Wikipedia, analyze the relation between article quality and citation quality, and break down these statistics by article topic.

4.1 Data Collection
To perform this analysis, we first need data about articles, their sentences, and their citation need. Since, at the time of writing, the Citation Detective system is still under refinement, we create a one-off dataset for this experiment. We sample 7% of the articles in English Wikipedia, and then randomly sample 10 sentences for each article. We report in Table 2 a summary of the data used for these experiments.
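The two-stage sampling described above can be sketched as follows; the toy corpus, its size, and the fixed seed are assumptions made only so the sketch is reproducible and self-contained.

```python
import random

random.seed(42)  # fixed seed purely for reproducibility of the sketch

# hypothetical corpus: article title -> list of its sentences
articles = {f"Article_{i}": [f"sentence {j}" for j in range(25)]
            for i in range(1000)}

# stage 1: sample 7% of the articles
sampled_titles = random.sample(list(articles), k=int(0.07 * len(articles)))

# stage 2: randomly sample up to 10 sentences from each sampled article
dataset = {t: random.sample(articles[t], k=min(10, len(articles[t])))
           for t in sampled_titles}
```

With a 1,000-article toy corpus this yields 70 articles of 10 sentences each, mirroring the structure (not the scale) of the one-off dataset summarized in Table 2.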

4.1.1 Extracting Article Quality, Topic, Popularity, and Reference Quality Labels.
We then extract basic article properties. First, we use the ORES scoring platform to extract each article's topic category (e.g. Science, Economics) and level of quality (Stub, Good Article, etc.). We also use the Pageviews API to get the total number of views received by each article during the month of May 2019. Finally, we check which articles in our data have been marked by editors as "Missing Sources", i.e. they appear in the category "All articles needing additional references". We will use these manual labels as ground truth to validate articles' citation quality.

4.1.2 Computing Article’s Citation Quality.
Using the Citation Need model, we then compute article citation quality, namely the proportion of "well sourced" sentences in an article. To do so, we classify all sentences with the model, and label each sentence with a binary Citation Need label 𝑦 according to the model output: 𝑦 = [ŷ], where [·] is the rounding function and ŷ is the output of the Citation Need model. When 𝑦 = 1, the sentence needs a citation; when 𝑦 = 0, it does not. Next, we aggregate sentence-level Citation Need labels to calculate the article citation quality 𝑄. 𝑄 is