
Information 2012, 3

similarity is the cosine between these two vectors. The inlink metric, modeled after Normalized Google Distance, measures the extent to which the inlinks X of article x intersect the inlinks Y of article y. If the intersection is total, i.e., X = Y, the metric is zero:

$$inlink(x,y)=\frac{\log(\max(|X|,|Y|))-\log(|X\cap Y|)}{\log(|A|)-\log(\min(|X|,|Y|))}$$
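The inlink metric above can be sketched directly from its set-based definition. The following is a minimal illustration, not the Wikipedia Miner implementation; the function name and the convention of returning infinity for disjoint inlink sets are our assumptions:

```python
import math

def inlink_similarity(X, Y, A):
    """Inlink distance between two articles, after the Normalized Google
    Distance-style formula: X and Y are the sets of articles linking in to
    x and y, and A is the set of all articles in Wikipedia."""
    overlap = len(X & Y)
    if overlap == 0:
        return math.inf  # no shared inlinks: maximally distant (our convention)
    return ((math.log(max(len(X), len(Y))) - math.log(overlap)) /
            (math.log(len(A)) - math.log(min(len(X), len(Y)))))
```

When X = Y, the numerator is log(max(|X|,|Y|)) - log(|X∩Y|) = 0, so the metric is zero, matching the property stated above.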

Inlink and outlink metrics are averaged to produce a composite score. Since each anchor defines a set of possible articles, the computations above produce a list of scored pairs of articles for a given pair of anchors. For example, the anchor bill links to bill-law and bill-beak, and the anchor board links to board-directors and board-game, leading to four similarity scores, one for each possible pairing. WLM selects a particular pair by applying the following heuristics. First, only articles that receive at least 1% of the anchor’s links are considered. Secondly, WLM accumulates the most related pairs (those within 40% of the score of the most related pair) and selects from this list the most related pair. It is not clear from the discussion in Milne & Witten whether this efficiency heuristic differs from simply selecting the most probable pair except in total search time.
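The two selection heuristics can be sketched as follows. This is our own illustrative reading of the procedure described above, not WLM's actual code; the function name, argument layout, and toy relatedness scores are assumptions:

```python
def select_sense_pair(cands_x, cands_y, relatedness,
                      min_commonness=0.01, window=0.4):
    """Sketch of WLM's sense-selection heuristics (names are ours).
    cands_x / cands_y map candidate articles to commonness, i.e., the
    fraction of the anchor's links pointing at that article;
    relatedness(a, b) returns a similarity score for two articles."""
    # Heuristic 1: keep only senses receiving at least 1% of the anchor's links
    xs = [a for a, c in cands_x.items() if c >= min_commonness]
    ys = [b for b, c in cands_y.items() if c >= min_commonness]
    scored = [((a, b), relatedness(a, b)) for a in xs for b in ys]
    best = max(score for _, score in scored)
    # Heuristic 2: shortlist pairs within 40% of the best, then take the best
    shortlist = [(pair, s) for pair, s in scored if s >= (1 - window) * best]
    return max(shortlist, key=lambda ps: ps[1])[0]
```

With the bill/board example, a pairing such as (bill-law, board-directors) would be chosen if it scores highest among the surviving candidate pairs.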

2.4. W3C3: Combined Model

In this section we present our combined model using implementations of the models described above. We call this model W3C3 because it combines information at the word-word, word-concept, and concept-concept levels. For each model except COALS, reference implementations were chosen that are freely available on the web.

To implement Wikipedia Miner’s WLM, we downloaded version 1.1 from SourceForge and an XML dump of Wikipedia from October 2010. ESA does not have a reference implementation provided by its creators. However, Gabrilovich recommends another implementation with specific settings to reproduce his results. Following these instructions, we installed a specific build of Wikiprep-ESA and used a preprocessed Wikipedia dump made available by Gabrilovich. We created our own implementation of COALS and built a COALS-SVD-500 matrix using the same October 2010 XML dump of Wikipedia as was used for WLM above.

One intuition that motivates combining all three techniques into a single model is that each represents a different kind of meaning at a different level: word-word, word-concept, and concept-concept. This intuition was the basis for our simplistic, unsupervised W3C3 model, which averages the relatedness scores given by these three techniques. Two relevant properties of the W3C3 model are worth noting. First, it has not been trained on any part of the data. Secondly, it has no parameters for combining the three constituent models; their three outputs are simply averaged to yield a single output score.
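Because W3C3 has no trained parameters, the combination step reduces to an unweighted mean. A minimal sketch, where the function name and the three score callables are our placeholders for the COALS, ESA, and WLM scorers:

```python
def w3c3(pair, coals_sim, esa_sim, wlm_sim):
    """Unsupervised W3C3 combination: the unweighted mean of the three
    constituent relatedness scores (no training, no tuned weights)."""
    x, y = pair
    return (coals_sim(x, y) + esa_sim(x, y) + wlm_sim(x, y)) / 3.0
```

Any alternative (e.g., learned weights) would make the model supervised, which is exactly what the design above avoids.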

3. Study 1: WordSimilarity-353

The WordSimilarity-353 collection is a standard dataset widely used in semantic relatedness research. It was developed as a means of assessing similarity metrics by comparing their output to human ratings. WordSimilarity-353 contains 353 pairs of nouns and their corresponding judgments of semantic association. The nouns range in frequency from low (Arafat) to high (love).