Page:The World Within Wikipedia： An Ecology of Mind.pdf/4

Information 2012, 3

However, traditional models such as LSA are based solely on language structure, and so they do not model the mutual influence between cognition and language. This is partly because the available environments for such models have been entirely linguistic, e.g., text dumps of books, newspapers, and other abundant sources of text. In contrast, the advance of the Internet has given rise to data sets that are created and organized in novel ways that reflect human conceptual/categorical organization. Wikipedia is the prototypical example of this new breed of cognitive-linguistic environment. It is read and edited daily by millions of users. As an online encyclopedia, Wikipedia is structured around articles, each devoted to a specific concept. Additionally, Wikipedia’s structure is augmented by hyperlinks between articles and other kinds of pages such as category pages, which provide loose hierarchical structure, and disambiguation pages, which disambiguate entries with exact or highly similar names. Using Wikipedia as a cognitive-linguistic environment, a computational model that incorporates the mutual influence of conceptual/categorical organization and the structure of language should produce behavior closer to human behavior than a model without such mutual influence.

Several researchers have already used Wikipedia’s structure in models that emulate human semantic comparisons. In this paper we extend their work in two significant ways. First, rather than focus on a single type of structure, e.g., link structure or concept structure, we present a model that utilizes three levels of structure: word-word, word-concept, and concept-concept (W3C3) to more fully represent the cognitive-linguistic environment of Wikipedia. As we will show in the following sections, each of these levels independently contributes to an explanation of human semantic behavior. Second, in addition to the common dataset considered by previous researchers using Wikipedia, the WordSimilarity-353 dataset, we apply the W3C3 model to a wider array of behavioral data, including word association norms, semantic feature production norms, and false memory formation. Studies 1 to 4 examine how the W3C3 model manifests language structure and categorization effects across this wide array of behavioral data. Our analysis suggests that, at multiple levels of structure, Wikipedia reflects the aspects of meaning that drive semantic associations. More specifically, meaning is reflected in the structure of language, the organization of concepts/categories, and the linkages between them. Our results inform the internalist/externalist debate by showing just how much internal cognitive-linguistic structure used in these tasks is preserved externally in Wikipedia.

2. Semantic Models

In the following sections we present three approaches that, when applied to Wikipedia, extract models of semantic association at three different levels. The first model, the Correlated Occurrence Analogue to Lexical Semantics, operates at a word-word level. The second model, Explicit Semantic Analysis, operates at a word-concept level. The third and final model, Wikipedia Link Measure, operates at a concept-concept level. We then describe a joint model (W3C3) that trivially combines these three models.
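The text above says only that the joint model combines the three levels trivially; it does not state the combination rule at this point. As a minimal sketch, assuming purely for illustration that the combination is an unweighted mean of the three similarity scores for a word pair, the idea could look as follows. The function name `combine_scores` and the example scores are hypothetical and are not taken from the paper.

```python
from statistics import mean


def combine_scores(word_word: float, word_concept: float, concept_concept: float) -> float:
    """Combine three per-level similarity scores into one joint score.

    ASSUMPTION: an unweighted mean is used here only as one simple
    reading of "trivially combines"; the paper's actual rule may differ.
    """
    return mean([word_word, word_concept, concept_concept])


# Hypothetical similarity scores for a single word pair (illustrative only).
joint = combine_scores(word_word=0.5, word_concept=0.25, concept_concept=0.75)
print(joint)
```

An unweighted mean treats the three levels as equally informative, which keeps the joint model free of extra parameters. Any weighting scheme would be a further assumption beyond what the text states.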

2.1. Correlated Occurrence Analogue to Lexical Semantics

The Correlated Occurrence Analogue to Lexical Semantics (COALS) model implements a sliding window strategy to build a word by word matrix of normalized co-occurrences. Because the