
$$m = \frac{N\sum X_iY_i-(\sum X_i)(\sum Y_i)}{N\sum X_i^2-(\sum X_i)^2}$$

$$b = \frac{(\sum Y_i)(\sum X_i^2)-(\sum X_iY_i)(\sum X_i)}{N\sum X_i^2-(\sum X_i)^2}$$
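As a minimal sketch, the two closed-form equations translate directly into code; the function name and the example data below are illustrative, not part of the original text:

```python
# Least-squares slope (m) and intercept (b) computed from the
# closed-form equations above; names follow the text's notation.

def linear_regression(xs, ys):
    """Return (m, b) for the line y = m*x + b fitted to (xs, ys)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2      # same denominator in m and b
    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_y * sum_x2 - sum_xy * sum_x) / denom
    return m, b

# Example: points lying exactly on y = 2x + 1
print(linear_regression([0, 1, 2, 3], [1, 3, 5, 7]))  # → (2.0, 1.0)
```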

Most spreadsheet and graphics programs include a linear regression option. None, however, mentions the implicit assumptions discussed above.

Linear regression fits the line that minimizes the sum of the squared residuals, the deviations of the $Y_i$ from the line. This concept is illustrated in Figure 15a, which shows a linear regression of leading National League batting averages for the years 1901-1920. Remembering that linear regression minimizes the squares of the $Y_i$ deviations is very important, for it accounts for several characteristics of the technique. First, we now understand the assumption that only the $Y_i$ have errors and that these errors are random, for it is these errors, or discrepancies from the trend, that we are minimizing. If instead the errors were all in the $X_i$, then we should minimize the $X_i$ deviations instead (or, much easier, just rename the variables so that Y is the one with the errors).
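The minimization property can be checked numerically: under illustrative data, the closed-form fit yields a smaller sum of squared vertical deviations than any nearby perturbed line (a sketch, not a proof):

```python
def sse(xs, ys, m, b):
    """Sum of squared vertical (Y) deviations from the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Illustrative data with small scatter about y ~ 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

# Closed-form least-squares fit (same formulas as above)
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sx2 = sum(x * x for x in xs)
d = n * sx2 - sx ** 2
m = (n * sxy - sx * sy) / d
b = (sy * sx2 - sxy * sx) / d

best = sse(xs, ys, m, b)
# Any nearby line has a larger sum of squared Y-deviations
for dm in (-0.1, 0.1):
    for db in (-0.1, 0.1):
        assert sse(xs, ys, m + dm, b + db) > best
```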

Second, minimizing the square of each deviation gives the greatest weight to extreme values, in the same way that extreme values dominate a standard deviation. Thus the researcher needs to investigate the possibility that one or two extreme values are controlling the regression. One approach is to examine the regression line on the same plot as the data. Even better, plot the regression residuals, the differences between the individual $Y_i$ and the predicted value of Y at each $X_i$, represented by the vertical line segments in Figure 15a. Regression residuals can be plotted either as a function of $X_i$ (Figure 15b) or as a histogram.
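A residual check of this kind is short enough to sketch in code; the fitted line and the data point at x = 4 below are hypothetical, chosen so that one extreme value stands out:

```python
# Regression residuals: observed Y_i minus the predicted Y at each X_i.
# A residual far larger than the rest flags a point that may be
# controlling the squared-error fit.

def residuals(xs, ys, m, b):
    return [y - (m * x + b) for x, y in zip(xs, ys)]

# Hypothetical line y = 0.5*x + 1 with one extreme point at x = 4
xs = [1, 2, 3, 4]
ys = [1.4, 2.1, 2.4, 6.0]
r = residuals(xs, ys, 0.5, 1.0)
print(r)  # the outlier's residual (3.0) dwarfs the others (~0.1)
```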

Third, the use of vertical deviations accounts for the name linear regression, rather than a name such as linear fit. If one were to fit a trend by eye through two correlated variables, the line would be steeper than that determined by regression. The best-fit line regresses from the true line toward a horizontal no-fit line as the random errors in Y increase. This corollary is little known but noteworthy; it predicts that two labs making the same type of measurements of ($X_i$, $Y_i$) will obtain different linear regression results if their measurement errors are different.
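The last point can be illustrated with a small simulation; the noise levels, sample sizes, and "two labs" framing are illustrative assumptions, not from the text. The lab with larger random Y errors obtains regression slopes that scatter far more widely from run to run:

```python
import random

def fit_slope(xs, ys):
    """Least-squares slope from the closed-form equation above."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    return (n * sxy - sx * sy) / (n * sx2 - sx ** 2)

random.seed(0)
xs = [float(i) for i in range(20)]
true_y = [2.0 * x + 1.0 for x in xs]

# Two hypothetical labs measuring the same trend with different Y errors
slopes_a = [fit_slope(xs, [t + random.gauss(0, 0.5) for t in true_y])
            for _ in range(200)]
slopes_b = [fit_slope(xs, [t + random.gauss(0, 5.0) for t in true_y])
            for _ in range(200)]

spread_a = max(slopes_a) - min(slopes_a)
spread_b = max(slopes_b) - min(slopes_b)
print(spread_a < spread_b)  # noisier lab's slopes scatter more widely
```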