Page:The World Within Wikipedia： An Ecology of Mind.pdf/9

Information 2012, 3

and from concrete (car) to abstract (psychology). Judgments are made on a scale of 0 to 10, with 0 representing no relatedness and 10 representing maximal relatedness. The data is divided into two sets. The first set contains ratings by thirteen judges on 153 word pairs. The second set consists of ratings by sixteen judges on 200 word pairs. We assessed inter-rater reliability for both sets using Cronbach’s α and found a high level of agreement, α = 0.97. The following analyses present results on all 353 pairs using the average rating across judges on each pair. Previous work on this dataset has reported results in terms of a non-parametric Spearman correlation; this metric of performance is also adopted here.
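The two statistics named above can be illustrated with a minimal sketch. It assumes a ratings matrix with one row per word pair and one column per judge; the function names and the toy data are hypothetical and are not taken from the WordSimilarity-353 files or from the authors' code.

```python
from statistics import mean, pvariance


def cronbach_alpha(ratings):
    """Cronbach's alpha for ratings[i][j] = rating of word pair i by judge j."""
    k = len(ratings[0])  # number of judges
    judge_vars = [pvariance([row[j] for row in ratings]) for j in range(k)]
    total_var = pvariance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(judge_vars) / total_var)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            result[order[pos]] = shared
        start = end + 1
    return result


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


# Toy data (hypothetical): 4 word pairs rated by 3 judges on the 0-10 scale.
toy_ratings = [[1, 2, 1], [5, 6, 5], [9, 9, 10], [3, 3, 4]]
human_means = [mean(row) for row in toy_ratings]  # average rating per pair
model_scores = [0.1, 0.5, 0.9, 0.2]               # hypothetical model output

alpha = cronbach_alpha(toy_ratings)
rho = spearman(human_means, model_scores)
```

Because Spearman's coefficient depends only on rank order, a model's raw similarity scores do not need to be on the same 0 to 10 scale as the human ratings.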

COALS has been previously shown to exhibit impressive performance on semantic tasks, including the WordSimilarity-353 task. The previous best result of COALS on the WordSimilarity-353 task, r(351) = 0.67, was achieved using COALS-SVD with 500 singular values on a 1.2 billion word Usenet corpus. Using Wikipedia, our implementation yielded a stronger correlation with WordSimilarity-353 than previously reported, r(351) = 0.72, p < 0.001, but this difference is not significant, z = −1.28, p = 0.10.
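The text does not name the test used to compare the two correlations. One common choice is Fisher's r-to-z transformation for independent correlations, sketched below as an assumption rather than as the authors' procedure. The function name `compare_correlations` is hypothetical, and treating both correlations as computed on n = 353 pairs with a one-tailed p-value is likewise an assumption.

```python
from math import atanh, sqrt
from statistics import NormalDist


def compare_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for two independent correlations.

    Returns the z statistic and a one-tailed p-value (assumed here;
    the source does not say which tail convention was used).
    """
    # atanh is the Fisher transformation; its sampling variance is 1 / (n - 3).
    z = (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    p_one_tailed = NormalDist().cdf(-abs(z))
    return z, p_one_tailed


# Previously reported COALS result versus the Wikipedia-based replication.
z, p = compare_correlations(0.67, 353, 0.72, 353)
print(f"z = {z:.2f}, p = {p:.2f}")
```

Under these assumptions the sketch gives values close to the reported z = −1.28, p = 0.10, which suggests, without confirming, that a test of this form was used.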

The correlation between ESA and the WordSimilarity-353 data set is the highest previously reported, r(351) = 0.75. It should be noted that an ESA model using additional link information has been attempted, but yielded no improvements over this basic model. The ESA implementation we used also correlated with the human ratings, r(351) = 0.67, p < 0.001, which is significantly lower than the originally reported correlation, p = 0.02. A plausible explanation of this difference is that the reference implementation we used omits some tweaks, such as inverted index pruning, that are necessary to achieve the originally reported correlation.

Milne & Witten found that WLM was highly correlated with human judgments in WordSimilarity-353, r(351) = 0.69. The WLM reference implementation we used also correlated with human ratings in the data set, r(351) = 0.66, p < 0.001, which is lower than the originally reported correlation. The difference in correlations was not significant, z = −0.73, p = 0.23. The reason for this discrepancy is unclear, but it may be attributable to differences between the version of Wikipedia used here and the version used in the original research.

The W3C3 model has a state-of-the-art correlation with the WordSimilarity-353 data set, r(351) = 0.78, p < 0.001. Correlations for all models are presented in Table 3. The W3C3 model’s correlation is significantly higher than all correlations in the replicated results, p ≤ 0.03, but not significantly higher than the best previously published ESA result, p = 0.17.

Table 3. Current and previous correlations with WordSimilarity-353.

Previous work has found that distributional models and graphical (WordNet-based) models have differing performance on WordSimilarity-353 depending on whether the word pairs in question have