Page:The World Within Wikipedia： An Ecology of Mind.pdf/10

Information 2012, 3

a similarity relationship or a more general relatedness relationship. To test this hypothesis with the COALS, ESA, WLM, and W3C3 models, we used the same partitioning of the dataset into similarity and relatedness pairs. The similar pairs are synonyms, antonyms, identical, or hyponym-hyperonym, and the related pairs are meronym-holonym or other relations. Inter-rater agreement in the coding of the pairs was high, Cohen’s kappa = 0.77. The similarity and relatedness subsets contained the similar and related pairs described above and shared the same unrelated pairs, yielding 203 pairs for similarity and 252 pairs for relatedness. Correlations for all models on these subsets are presented in Table 4. The difference in correlations between the W3C3 model and the previous best Agirre model is significant for both similarity, p = 0.0465, and relatedness, p = 0.02.
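For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Cohen's kappa can be computed for two raters labeling the same pairs. The function name `cohens_kappa` and the example labels are hypothetical and are not taken from the original coding study; the sketch only illustrates the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items.

    Hypothetical helper for illustration; assumes both lists have the same
    length and that the raters did not both use a single label throughout
    (in that case chance agreement is 1 and kappa is undefined).
    """
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Illustrative labels only (not data from the paper).
rater_1 = ["similar", "similar", "related", "related"]
rater_2 = ["similar", "related", "related", "related"]
kappa = cohens_kappa(rater_1, rater_2)
```

In this toy example the raters agree on three of four items (p_o = 0.75) while chance agreement is 0.5, so kappa is 0.5. A value of 0.77, as reported above, indicates agreement well beyond chance.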

Table 4. Correlations with WordSimilarity-353 similarity and relatedness subsets.

For both similarity and relatedness subsets, the W3C3 model performed significantly better than its constituent models, and each model performed significantly better on the similarity set than on the relatedness set, p < 0.05, except for ESA, p = 0.06. However, these are rather coarse sets: As mentioned above, the similarity set is an aggregation of common semantic relationships. In order to better understand the relative performance of each model on these subtypes, we used the labeled semantic categories for each WordSimilarity-353 pair provided in the similarity/relatedness subsets. These are antonym, hypernym (first word is a hypernym of the second word or vice versa), identical, part-of (first word is a part of the second or vice versa), siblings (share a parent category, e.g., dalmatian and collie are children of dog), synonyms, and topical (some relationship other than the previous relationships, e.g., ladder and lightbulb). Grouping pairs by semantic category, we calculated the average distance between each model's predicted rank and the human rank. The results are shown in Table 5. Since the ideal distance to the human ranking is zero, lower scores are better. The lowest score in each row is in boldface, and the second lowest score is italicized.
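The per-category rank distance described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used for Table 5: it assumes that ranks are computed over the full set of pairs, that tied scores share their average rank, and that "distance" means the absolute difference between the model rank and the human rank. The function names `to_ranks` and `mean_rank_distance_by_category` and the example values are hypothetical.

```python
from collections import defaultdict


def to_ranks(scores):
    """Rank scores from highest (rank 1) to lowest; ties share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j to cover the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mean_rank_distance_by_category(human, model, categories):
    """Average |model rank - human rank| within each semantic category.

    Assumption: absolute rank difference over the full set of pairs.
    """
    human_ranks = to_ranks(human)
    model_ranks = to_ranks(model)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for h, m, category in zip(human_ranks, model_ranks, categories):
        totals[category] += abs(h - m)
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


# Illustrative values only (not data from the paper).
human_scores = [9.0, 7.0, 3.0, 1.0]
model_scores = [0.8, 0.9, 0.1, 0.2]
pair_categories = ["synonyms", "synonyms", "topical", "topical"]
distances = mean_rank_distance_by_category(human_scores, model_scores, pair_categories)
```

Here the model swaps the order of the two pairs within each category, so every pair is one rank away from its human rank and both categories receive an average distance of 1.0. A model that reproduced the human ordering exactly would score 0 in every category.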

The most striking pattern in Table 5 is that two-thirds of the best scores per category belong to the W3C3 model. Moreover, for every category save one, the W3C3 model has either the best score or the second best score. Thus breaking down the WordSimilarity-353 pairs by semantic category produces the same pattern of results seen in Tables 3 and 4: The three constituent models provide different kinds of information, and averaging their outputs creates a more human-like measure of semantic comparison than any of them individually. A linear regression was conducted to explore this possibility. The scores given by COALS, ESA, WLM, and human judges were converted to ranks, and then a linear regression on the ranks was performed, using COALS, ESA, and WLM ranks to predict the human judgment ranks. The results of the linear regression are presented in Table 6. Tolerance