# Common Errors in Statistics and How to Avoid Them - Good P.I

For example, suppose the losses are proportional to the square of the prediction errors, and we have chosen our model’s parameters so as to minimize the sum of squares of the differences $y_i - M[x_i]$ for the historical data. Unfortunately, minimizing this sum of squares is no guarantee that when we continue to make observations, we will continue to minimize the sum of squares between what we observe and what our model predicts. If you are a businessman whose objective is to predict market response, this distinction can be critical.

There are at least three reasons for the possible disparity:

1. The original correlation was spurious.

2. The original correlation was genuine but the sample was not representative.

3. The original correlation was genuine, but the nature of the relationship has changed with time (as a result of changes in the underlying politics, market, or environment, for example). We take up this problem again in our chapter on prediction error.
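The third reason can be illustrated with a small simulation. The sketch below is only an illustration under assumed conditions: the data are synthetic, the slopes (2.0 for the historical period, 1.0 afterward), the noise level, and the helper names `fit_line`, `mean_squared_error`, and `simulate` are all hypothetical choices, not taken from the text. A line is fitted by least squares to the historical data and then scored against later observations for which the relationship has changed.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def mean_squared_error(xs, ys, a, b):
    """Average squared difference between observations and predictions."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def simulate(rng, n, slope):
    """Synthetic data: y = 1 + slope*x + unit-variance normal noise (assumed)."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [1.0 + slope * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(0)
hist_x, hist_y = simulate(rng, 200, slope=2.0)      # historical period
future_x, future_y = simulate(rng, 200, slope=1.0)  # relationship has changed

a, b = fit_line(hist_x, hist_y)  # parameters minimize the historical sum of squares
in_sample = mean_squared_error(hist_x, hist_y, a, b)
out_of_sample = mean_squared_error(future_x, future_y, a, b)

print(f"fitted line: y = {a:.2f} + {b:.2f} x")
print(f"mean squared error, historical data: {in_sample:.2f}")
print(f"mean squared error, later data:      {out_of_sample:.2f}")
```

The fit to the historical data says nothing about the later error; only a comparison against new observations reveals the change.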

And lest we forget: Association does not “prove” causation, it can only contribute to the evidence.

Indicator Variables

The use of an indicator (yes/no) or a nonmetric ordinal variable (improved, much improved, no change) as the sole independent (X) variable is inappropriate. The two-sample and k-sample procedures described in Chapter 5 should be employed.
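As one possible two-sample procedure, the sketch below runs a Monte Carlo permutation test on the difference in means between the "yes" and "no" groups. The choice of test is an assumption made for illustration (the procedures of Chapter 5 are not reproduced here), and the data values and the function name `permutation_test` are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means.

    Returns the proportion of random relabelings whose absolute mean
    difference is at least as large as the observed one (with the usual
    +1 correction so the estimate is never exactly zero).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the yes/no labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for rounding
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical responses, grouped by the value of the indicator variable.
response_yes = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5, 15.8, 14.1]
response_no = [10.2, 11.5, 9.8, 12.0, 10.9, 11.1, 10.4, 11.8]

p_value = permutation_test(response_yes, response_no)
print(f"permutation p-value: {p_value:.4f}")
```

Comparing the groups directly answers the question a regression on a yes/no variable would be asked to answer, without assuming a linear relationship or normally distributed errors.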

Transformations

It is often the case that the magnitude of the residual error is proportional to the size of the observations; that is, $y = E(Y \mid x)\,\varepsilon$. A preliminary log transformation will restore the problem to linear form, $\log(y) = \log E(Y \mid x) + \varepsilon'$, where $\varepsilon' = \log \varepsilon$. Unfortunately, even if $\varepsilon$ is normal, $\varepsilon'$ is not, and the resulting confidence intervals need to be adjusted (Zhou and Gao, 1997).
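A short simulation can make both halves of this statement concrete. The sketch below assumes a hypothetical mean function, exp(1 + 0.5x), and a multiplicative error that is normal with mean 1 and standard deviation 0.2 (truncated at a small positive value so that the logarithm exists); none of these choices come from the text. It compares the spread of the residuals for small and large x before and after taking logs, and then measures the skewness of the error on each scale.

```python
import math
import random
import statistics

rng = random.Random(1)


def expected_y(x):
    """Assumed mean function E(Y | x), chosen only for illustration."""
    return math.exp(1.0 + 0.5 * x)


xs = [rng.uniform(0.0, 4.0) for _ in range(2000)]
# Multiplicative error: normal around 1, kept strictly positive.
errors = [max(rng.gauss(1.0, 0.2), 0.05) for _ in xs]
ys = [expected_y(x) * e for x, e in zip(xs, errors)]

# Residuals about the (known, assumed) mean on the raw and the log scale.
raw_resid = [y - expected_y(x) for x, y in zip(xs, ys)]
log_resid = [math.log(y) - math.log(expected_y(x)) for x, y in zip(xs, ys)]


def spread_ratio(resid):
    """Residual standard deviation for x >= 2 divided by that for x < 2."""
    low = [r for x, r in zip(xs, resid) if x < 2.0]
    high = [r for x, r in zip(xs, resid) if x >= 2.0]
    return statistics.stdev(high) / statistics.stdev(low)


def skewness(values):
    """Sample skewness (third standardized moment)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


raw_ratio = spread_ratio(raw_resid)
log_ratio = spread_ratio(log_resid)
print(f"spread ratio, raw scale: {raw_ratio:.2f}")
print(f"spread ratio, log scale: {log_ratio:.2f}")
print(f"skewness of the error:       {skewness(errors):.2f}")
print(f"skewness of the logged error: {skewness(log_resid):.2f}")
```

On the raw scale the residual spread grows with the size of the observations; on the log scale it is roughly constant. The logged error, however, is skewed to the left even though the original error is symmetric, which is why intervals computed as if it were normal need adjustment.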

Curve-Fitting and Magic Beans

Until recently, what distinguished statistics from the other branches of mathematics was that at least one aspect of each analysis was firmly grounded in reality. Samples were drawn from real populations and, in theory, one could assess and validate findings by examining larger and larger samples taken from that same population.

In this reality-based context, modeling has one or possibly both of the following objectives:

1. To better understand the mechanisms leading to particular responses.

2. To predict future outcomes.

Failure to achieve these objectives has measurable losses. While these losses cannot be eliminated because of the variation inherent in the underlying processes, it is hoped that by use of the appropriate statistical procedure, they can be minimized.

By contrast, the goals of curve fitting (nonparametric or local regression; see, for example, Green and Silverman [1994] and Loader [1999]) are aesthetic in nature; the resultant graphs, though pleasing to the eye, may bear little relation to the processes under investigation. To quote Green and Silverman [1994, p. 50], “there are two aims in curve estimation, which to some extent conflict with one another, to maximize goodness-of-fit and to minimize roughness.”

The first of these aims is appropriate if the loss function is mean-square error. The second creates a strong risk of overfitting. Validation is essential, yet most of the methods discussed in Chapter 11 do not apply. Validation via a completely independent data set cannot provide confirmation, because the new data would entail the production of a completely different, unrelated curve. The only effective method of validation is to divide the data set in half at random, fit a curve to one of the halves, and then assess its fit against the entire data set.
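The split-half procedure just described might be sketched as follows. The smoother used here, a Gaussian kernel (Nadaraya-Watson) average, is a stand-in chosen for brevity rather than the method of any of the cited authors; the synthetic data, the bandwidth values, and the function names `kernel_smooth` and `mse` are likewise assumptions. A smaller bandwidth gives a rougher curve that follows the fitted half more closely.

```python
import math
import random


def kernel_smooth(train_x, train_y, x0, bandwidth):
    """Gaussian-kernel weighted average of train_y, evaluated at x0."""
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in train_x]
    total = sum(weights)  # assumes some training point lies near x0
    return sum(w * y for w, y in zip(weights, train_y)) / total


def mse(train_x, train_y, eval_x, eval_y, bandwidth):
    """Mean squared error of the curve fitted to the training points."""
    squared = [
        (y - kernel_smooth(train_x, train_y, x, bandwidth)) ** 2
        for x, y in zip(eval_x, eval_y)
    ]
    return sum(squared) / len(squared)


# Synthetic data: a sine curve plus normal noise (assumed for illustration).
rng = random.Random(2)
n = 200
xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
ys = [math.sin(2.0 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]

# Divide the data set in half at random and fit to one half only.
half_idx = rng.sample(range(n), n // 2)
half_x = [xs[i] for i in half_idx]
half_y = [ys[i] for i in half_idx]

results = {}
for bandwidth in (0.005, 0.05, 0.5):
    fit_on_half = mse(half_x, half_y, half_x, half_y, bandwidth)
    fit_on_all = mse(half_x, half_y, xs, ys, bandwidth)  # entire data set
    results[bandwidth] = (fit_on_half, fit_on_all)
    print(f"bandwidth {bandwidth}: half {fit_on_half:.3f}, all {fit_on_all:.3f}")
```

With the smallest bandwidth the curve tracks the half it was fitted to far more closely than it tracks the full data set, which is the signature of overfitting that the split is meant to expose.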

SUMMARY

Regression methods work well with physical models. The relevant variables are known and so are the functional forms of the equations connecting them. Measurement can be done to high precision, and much is known about the nature of the errors—in the measurements and in the equations. Furthermore, there is ample opportunity for comparing predictions to reality.

Regression methods can be less successful for biological and social science applications. Before undertaking a univariate regression, you should have a fairly clear idea of the mechanistic nature of the relationship (and thus the form the regression function will take). Look for deviations from the model particularly at the extremes of the variable range. A plot of the residuals can be helpful in this regard; see, for example, Davison and Snell [1991] and Hardin and Hilbe [2003, pp. 143-159].
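A numerical counterpart to the residual plot is to average the residuals separately at the two ends and in the middle of the range of x. In the sketch below the data are synthetic and deliberately curved (y = x² plus a little noise), so that a straight line is the wrong functional form; the band limits and the helper names `fit_line` and `mean_residual` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


rng = random.Random(3)
xs = [rng.uniform(0.0, 1.0) for _ in range(300)]
ys = [x ** 2 + rng.gauss(0.0, 0.02) for x in xs]  # true relationship is curved

a, b = fit_line(xs, ys)
residuals = [(x, y - (a + b * x)) for x, y in zip(xs, ys)]


def mean_residual(lo, hi):
    """Average residual for observations with lo <= x <= hi."""
    band = [r for x, r in residuals if lo <= x <= hi]
    return sum(band) / len(band)


low_end = mean_residual(0.0, 0.1)
middle = mean_residual(0.4, 0.6)
high_end = mean_residual(0.9, 1.0)

print(f"mean residual, lowest tenth of range:  {low_end:+.3f}")
print(f"mean residual, middle of range:        {middle:+.3f}")
print(f"mean residual, highest tenth of range: {high_end:+.3f}")
```

Residuals that share one sign at both extremes and the opposite sign in the middle point to a misspecified functional form rather than to random error.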

A preliminary multivariate analysis (the topic of the next two chapters) will give you a fairly clear notion of which variables are likely to be confounded so that you can correct for them by stratification. Stratification will also allow you to take advantage of permutation methods that are to be preferred in instances where “errors” or model residuals are unlikely to follow a normal distribution.

It’s also essential that you have firmly in mind the objectives of your analysis, and the losses associated with potential decisions, so that you can adopt the appropriate method of goodness of fit. The results of a regression analysis should be treated with care; as Freedman [1999] notes,
