 who had published a rather surprising conclusion concerning rats. A visitor asked him if he could see more of the evidence. The researcher replied, ‘Sure, there’s the rat.’” [Wilson, 1952]

Without attention to statistical evidence and confirmatory power, the scientist falls into the most common pitfall of non-scientists: hasty generalization. One or a few chance associations between two attributes or variables are mistakenly inferred to represent a causal relationship. Hasty generalization is responsible for many popular superstitions, but even scientists such as Aristotle were not immune to it. Hasty generalizations are often inspired by coincidence, the unexpected and improbable association between two or more events. After compiling and analyzing thousands of coincidences, Diaconis and Mosteller [1989] found that coincidences could be grouped into three classes:
 * cases where there was an unnoticed causal relationship, so the association actually was not a coincidence;
 * nonrepresentative samples, focusing on one association while ignoring or forgetting examples of non-matches;
 * actual chance events that are much more likely than one might expect.

An example of this third type is the birthday problem: in a group of 23 people, the chance that at least two share a birthday is slightly better than 50%.
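The birthday figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes 365 equally likely birthdays (no leap years, no seasonal variation in birth rates). It multiplies out the probability that all birthdays differ and subtracts the result from one.

```python
def shared_birthday_probability(n, days=365):
    """Probability that at least two of n people share a birthday.

    Assumes every one of `days` birthdays is equally likely and that
    people's birthdays are independent of one another.
    """
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


for n in (22, 23):
    print(n, round(shared_birthday_probability(n), 3))
# 22 0.476
# 23 0.507
```

Under these assumptions, 23 is the smallest group for which the probability exceeds one half, which is why a match that feels remarkable is in fact slightly more likely than not.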

Coincidence is important in science, because it initiates a search for causal relationships and may lead to discovery. An apparent coincidence is a perfectly valid source for hypotheses. Coincidence is not, however, a hypothesis test; quantitative tests must follow.

Statistical methods seek to indicate quantitatively which apparent connections between variables are real and which are coincidental. Uncertainty is implicit in most measurements and hypothesis tests, but consideration of probabilities allows us to make decisions that appropriately weigh the impact of the uncertainties. With suitable experimental design, statistical methods are able to deal effectively with very complex and poorly understood phenomena, extracting the most fundamental correlations.

Correlation
“Every scientific problem is a search for the relationship between variables.” [Thurstone, 1925] Begin with two variables, which we will call X and Y, for which we have several measurements. By convention, X is called the independent variable and Y is the dependent variable. Perhaps X causes Y, so that the value of Y is truly dependent on the value of X. Such a condition would be convenient, but all we really require is the possibility that a knowledge of the value of the independent variable X may give us some ability to predict the value of Y.
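One common way to quantify how well X predicts Y is the Pearson correlation coefficient, which this passage has not yet introduced; it is used here only as an illustration. The sketch below is a minimal version under stated assumptions: the function name and the five data pairs are hypothetical, invented for the example, and are not taken from the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between paired measurements."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical measurements, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # independent variable X
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # dependent variable Y

print(pearson_r(x, y))  # close to +1: X predicts Y well in this sample
```

A value near +1 or -1 says that knowing X gives strong predictive ability for Y in the sample; it does not by itself show that X causes Y.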

To introduce some of the concerns implicit in correlation and pattern recognition, let’s begin with three examples: National League batting averages, the government deficit, and temperature variations in Anchorage, AK.