# Common Errors in Statistics and How to Avoid Them - Good P.I


Leave-one-out has the advantage of allowing us to study the influence of specific observations on the overall outcome.
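As a minimal sketch of this idea (the simulated data, the helper name `loo_slope_influence`, and the choice of a simple linear regression are all illustrative assumptions, not part of the text), the influence of each observation on a fitted slope can be measured by refitting with that observation left out:

```python
import numpy as np

def loo_slope_influence(x, y):
    """Leave-one-out: refit a simple linear regression with each
    observation removed and report how much the slope changes."""
    n = len(x)
    full_slope = np.polyfit(x, y, 1)[0]      # slope of the full fit
    influence = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        slope_i = np.polyfit(x[mask], y[mask], 1)[0]
        influence[i] = slope_i - full_slope  # shift caused by dropping i
    return full_slope, influence

# Illustrative data: y = 2x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 2.0 * x + rng.normal(0, 1, 30)
slope, infl = loo_slope_influence(x, y)
```

Observations whose removal shifts the slope far more than the rest are the influential ones worth examining.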

Our own opinion is that if any of the above methods suggest that the model is unstable, the first step is to redefine the model over a more restricted range of the various variables. For example, with the data of Figure 9.3, we would advocate confining attention to observations for which the predictor (TNFAlpha) was less than 200.

If a more general model is desired, then many additional observations should be taken in underrepresented ranges. In the cited example, this would be values of TNFAlpha greater than 300.

MEASURES OF PREDICTIVE SUCCESS

Whatever method of validation is used, we need to have some measure of the success of the prediction procedure. One possibility is to use the sum of the losses in the calibration and the validation sample. Even this procedure contains an ambiguity that we need to resolve. Are we more concerned with minimizing the expected loss, the average loss, or the maximum loss?

One measure of goodness of fit of the model is SSE = Σ(y_i − y_i*)², where y_i and y_i* denote the ith observed value and the corresponding value obtained from the model. The smaller this sum of squares, the better the fit.

If the observations are independent, then

Σ(y_i − y_i*)² = Σ(y_i − ȳ)² − Σ(y_i* − ȳ)².

The first sum on the right-hand side of the equation is the total sum of squares (SST). Most statistics software uses as a measure of fit R2 = 1 − SSE/SST. The closer the value of R2 is to 1, the better.
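A short illustration of computing R2 from SSE and SST (the function name and test values are our own):

```python
import numpy as np

def r_squared(y, y_star):
    """R2 = 1 - SSE/SST for observed values y and model predictions y_star."""
    y = np.asarray(y, float)
    y_star = np.asarray(y_star, float)
    sse = np.sum((y - y_star) ** 2)          # error sum of squares
    sst = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    return 1.0 - sse / sst

r_squared([1, 2, 3, 4], [1, 2, 3, 4])  # perfect fit gives 1.0
```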

The automated entry of predictors into the regression equation using R2 runs the risk of overfitting, because R2 is guaranteed to increase with each predictor entering the model. To compensate, one may use the adjusted R2

1 − [(n − i)(1 − R2)]/(n − p)

where n is the number of observations used in fitting the model, p is the number of estimated regression coefficients, and i is an indicator variable that is 1 if the model includes an intercept and is 0 otherwise.
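The adjusted R2 formula above translates directly into code; the helper below is an illustrative sketch using the same n, p, and i defined in the text:

```python
def adjusted_r2(r2, n, p, intercept=True):
    """Adjusted R2 = 1 - (n - i)(1 - R2)/(n - p), where n is the number
    of observations, p the number of estimated coefficients, and i is 1
    if the model includes an intercept, 0 otherwise."""
    i = 1 if intercept else 0
    return 1.0 - (n - i) * (1.0 - r2) / (n - p)
```

Because p appears in the denominator's penalty, adding a weak predictor can lower the adjusted R2 even while raw R2 rises.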

The adjusted R2 has two major drawbacks according to Rencher and Pun [1980]:

1. The adjustment algorithm assumes the predictors are independent; more often the predictors are correlated.

2. If the pool of potential predictors is large, multiple tests are performed, and R2 is inflated in consequence; the standard algorithm for adjusted R2 does not correct for this inflation.

A preferable method of guarding against overfitting the regression model, proposed by Wilks [1995], is to use validation as a guide for stopping the entry of additional predictors. Overfitting is judged to begin when entry of an additional predictor fails to reduce the prediction error in the validation sample.
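A rough sketch of this stopping rule follows. The greedy forward search, the function name, and the simulated data are our assumptions for illustration, not Wilks's exact procedure; the key idea from the text is simply to stop entering predictors once validation error no longer falls:

```python
import numpy as np

def forward_select(X_cal, y_cal, X_val, y_val):
    """Enter predictors one at a time (least squares on the calibration
    sample) and stop as soon as the best remaining candidate fails to
    reduce the prediction error in the validation sample."""
    def val_sse(cols):
        beta, *_ = np.linalg.lstsq(X_cal[:, cols], y_cal, rcond=None)
        return np.sum((y_val - X_val[:, cols] @ beta) ** 2)

    chosen = []
    remaining = list(range(X_cal.shape[1]))
    best = np.sum(y_val ** 2)              # empty model predicts zero
    while remaining:
        sse, j = min((val_sse(chosen + [j]), j) for j in remaining)
        if sse >= best:                    # validation error stops falling
            break
        best, chosen = sse, chosen + [j]
        remaining.remove(j)
    return chosen

# Illustrative data: only column 2 actually drives y
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = 3.0 * X[:, 2] + rng.normal(scale=0.5, size=60)
chosen = forward_select(X[:40], y[:40], X[40:], y[40:])
```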

Mielke et al. [1997] propose the following measure of predictive accuracy for use with either a mean-square-deviation or a mean-absolute-deviation loss function:

M = 1 − δ/μ, where δ = (1/n) Σ_{i=1}^{n} |y_i − y_i*| and μ = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} |y_i − y_j*|.

Uncertainty in Predictions

Whatever measure is used, the degree of uncertainty in your predictions should be reported. Error bars are commonly used for this purpose.


The prediction error is larger when the predictor data are far from their calibration-period means, and vice versa. For simple linear regression, the standard error of the estimate, s_e, and the standard error of prediction, s_y*, are related as follows:

s_y* = s_e √(1 + 1/n + (x_p − x̄)² / Σ_{i=1}^{n} (x_i − x̄)²)

where n is the number of observations, x_i is the ith value of the predictor in the calibration sample, x̄ is their mean, and x_p is the value of the predictor used for the prediction.
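A sketch of this computation for simple linear regression (the function name and simulated data are assumptions; the formula is the one given above, with s_e estimated from the residuals on n − 2 degrees of freedom):

```python
import numpy as np

def prediction_se(x, y, x_p):
    """Standard error of prediction at x_p:
    s_y* = s_e * sqrt(1 + 1/n + (x_p - xbar)^2 / sum((x_i - xbar)^2))."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    n = len(x)
    b, a = np.polyfit(x, y, 1)                    # slope, intercept
    resid = y - (a + b * x)
    s_e = np.sqrt(np.sum(resid ** 2) / (n - 2))   # standard error of estimate
    xbar = x.mean()
    return s_e * np.sqrt(1 + 1 / n + (x_p - xbar) ** 2 / np.sum((x - xbar) ** 2))

# Illustrative data: the error grows as x_p moves away from the mean
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 25)
y = 1.5 * x + rng.normal(size=25)
se_center = prediction_se(x, y, x.mean())
se_far = prediction_se(x, y, x.mean() + 20.0)
```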

The relation between s_y* and s_e is easily generalized to the multivariate case. In matrix terms, if Y = XA + E and y* = x_pᵀA, then s²_y* = s_e²{1 + x_pᵀ(XᵀX)⁻¹x_p}.

This equation is only applicable if the vector of predictors lies inside the multivariate cluster of observations on which the model was based. An important question is how far the predictor data can depart from the values observed in the calibration period before the predictions are considered invalid.
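Under the same caveat (an illustrative helper of our own, with the intercept carried as a column of ones in X), the multivariate prediction variance can be computed as:

```python
import numpy as np

def multivariate_prediction_var(X, y, x_p):
    """s^2_y* = s_e^2 * (1 + x_p^T (X^T X)^{-1} x_p) for the linear model
    y = X a + e; x_p is the predictor vector for the new case."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2_e = np.sum(resid ** 2) / (n - p)           # residual variance
    xtx_inv = np.linalg.inv(X.T @ X)
    # x_p^T (X^T X)^{-1} x_p is the leverage of the new point; large
    # values flag a prediction far outside the calibration cluster.
    return s2_e * (1.0 + x_p @ xtx_inv @ x_p)

# Illustrative data: intercept plus one predictor
rng = np.random.default_rng(3)
xcol = rng.uniform(0, 10, 25)
y = 1.0 + 2.0 * xcol + rng.normal(size=25)
X = np.column_stack([np.ones(25), xcol])
v_center = multivariate_prediction_var(X, y, np.array([1.0, xcol.mean()]))
v_far = multivariate_prediction_var(X, y, np.array([1.0, xcol.mean() + 20.0]))
```

The leverage term inside the braces offers one concrete way to judge the question above: predictions at points with unusually high leverage should be treated with suspicion.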

LONG-TERM STABILITY

Time is a hidden dimension in most economic models. Many an airline has discovered to its detriment that what was an optimal price today leads to half-filled planes and markedly reduced profits tomorrow. A careful reading of the newspapers lets them know a competitor has slashed prices, but more advanced algorithms are needed to detect a slow shifting in tastes of prospective passengers. The public, tired of being treated no better than hogs, turns to trains, personal automobiles, and teleconferencing.

An army base, used to a slow seasonal turnover in recruits, suddenly finds that all infirmary beds are occupied and the morning lineup for sick call stretches the length of a barracks.

To avoid a pound of cure:

Treat every model as tentative, best described, as any lawyer will advise you, as subject to change without notice.
