Page:The World Within Wikipedia： An Ecology of Mind.pdf/7

Information 2012, 3

$$\upsilon_{ij}=\frac{tf_{ij}*idf_i}{||tf_{ij}*idf_i||}$$

where υij is the jth component of vector υi, normalized by the length of the tf-idf vector. As a result, all ESA vectors have a length of 1. Similarity between terms is computed as the vector cosine between term vectors.

Because Wikipedia is a rather large corpus with many unique terms, ESA approaches have used pruning heuristics to reduce the space to a more manageable size,. Pruning heuristics include removing articles with less that 100 words, removing articles with fewer than five total links (inlinks and outlinks), removing high frequency words (stop words), removing rare words, and transforming the remaining words into their approximate uninflected root form (a process called stemming).

2.3. Wikipedia Link Measure

Previous work has used the link structure of Wikipedia to derive a semantic similarity measure using the Wikipedia Miner toolkit. The basic premise of this approach is that every Wikipedia page has pages that link to it (inlinks) as well as pages it links to (outlinks). The links are inline with the text, meaning that a word or phrase in the text has been hyperlinked. The words/phrases themselves are called anchors, and they can be viewed as synonyms for the target page, e.g., motor car, car, and automobile link to the Wikipedia page Automobile. The Wikipedia Link Measure (WLM) uses anchors to map input text to Wikipedia articles and then uses the inlinks and outlinks of these articles to derive the similarity between words. Because the meaning representation of each word is defined by the links between its associated concept and other concepts, WLM matches our definition of a concept-concept level model.

Consider two articles, Football and Sport. For a particular link type, say inlinks, and these two articles, we can place all other articles in Wikipedia into one of four categories as shown in Table 2. The frequencies in Table 2 are hypothetical, but they serve to illustrate a common structure. First, the number of links shared by two articles are likely to be relatively small compared to the number of links they possess individually. Secondly, the number of links that neither has is likely to be very large and relatively close to the total number of articles.

Table 2. Hypothetical inlink overlap for Football and Sport.

Intuitively, two articles that share inlinks and outlinks are likely to be similar; however, to the extent that some links may be common across many articles they should be weighted less. This intuition is captured in the WLM outlink metric, which weights each outlink o by log(|A|/|O|), the log of the number of total articles in Wikipedia |A| divided by the number of articles that link to that page |O|. Weighted outlink vectors are constructed based on the union of outlinks for two articles, and the outlink