CHAPTER 10 MULTIVARIABLE REGRESSION 151
A great deal of publicity has heralded the arrival of new and more powerful data mining methods: neural networks, CART, and dozens of unspecified proprietary algorithms. In our limited experience, none of these has lived up to expectations; see a report of our tribulations in Good [2001a, Section 7.6]. Most of the experts we've consulted have attributed this failure to the small size of our test data set, 400 observations each with 30 variables. In fact, many publishers of data mining software assert that their wares are designed solely for use with terabytes of information.
This observation has led us to put our experience in the form of the following conjecture.
If $m$ points are required to determine a univariate regression line with sufficient precision, then it will take at least $m^n$ observations and perhaps $n!\,m^n$ observations to appropriately characterize and evaluate a model with $n$ variables.
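To get a feel for how quickly the conjecture's requirements grow, the short sketch below tabulates both bounds, read as m^n and n! m^n. The value m = 10 and the function name `observations_needed` are assumptions chosen purely for illustration.

```python
from math import factorial


def observations_needed(m, n):
    """Return (lower, upper) sample sizes under the conjecture: m**n and n! * m**n."""
    lower = m ** n
    upper = factorial(n) * lower
    return lower, upper


# m = 10 is an assumed number of points for a single regression line.
for n in (1, 2, 3, 5):
    lower, upper = observations_needed(10, n)
    print(f"n={n}: at least {lower:,} and perhaps {upper:,} observations")
```

Under the same reading, even m = 2 with the 30 variables of the test data set mentioned above would call for 2^30, or more than a billion, observations, against the 400 available.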
BUILDING A SUCCESSFUL MODEL
Rome was not built in one day,[4] nor was any reliable model. The only successful approach to modeling lies in a continuous cycle of hypothesis formulation, data gathering, and hypothesis testing and estimation. How you go about it will depend on whether you are new to the field, have a small data set in hand and are willing and prepared to gather more until the job is done, or have access to databases containing hundreds of thousands of observations. The following prescription, while directly applicable to the last case, can be readily modified to fit any situation.
[4] John Heywood, Proverbes, Part i, Chapter xi, 16th century.
1. A thorough literature search and an understanding of causal mechanisms are essential prerequisites to any study. Don't let the software do your thinking for you.
2. Using a subset of the data selected at random, see which variables appear to be correlated with the dependent variable(s) of interest. (As noted in this and the preceding chapter, two unrelated variables may appear to be correlated by chance alone or as a result of confounding factors. For the same reasons, two closely related factors may fail to exhibit a statistically significant correlation.)
3. Using a second, distinct subset of the data selected at random, see which of the variables selected at the first stage still appear to be correlated with the dependent variable(s) of interest. Alternatively, use the bootstrap method described by Gong to see which variables are consistently selected for inclusion in the model.
4. Limit attention to one or two of the most significant predictor variables. Select a subset of the existing data for which the remainder of the significant variables are (almost) constant. (Alternatively, gather additional data for which the remainder of the significant variables are almost constant.) Decide on a generalized linear model form that best fits your knowledge of the causal relations among the few variables on which you are now focusing. (A standard multivariate linear regression may be viewed as just another form, albeit a particularly straightforward one, of generalized linear model.) Fit this model to the data.
5. Select a second subset of the existing data (or gather an additional data set) for which the remainder of the significant variables are (almost) equal to a second constant. For example, if only men were considered at stage four, then you should focus on women at this stage. Attempt to fit the model you derived at the preceding stage to these data.
6. By comparing the results obtained at stages four and five, you can determine whether to continue to ignore or to include variables previously excluded from the model. Only one or two additional variables should be added to the model at each iteration of steps 4 through 6.
7. Always validate your results as described in the next chapter.
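Steps 2 and 3 of the prescription can be sketched on simulated data. This is a minimal illustration, not the authors' procedure: the simulated data, the coefficients, the 0.2 correlation cutoff, and the helper names `pearson` and `screen` are all assumptions made for the example, and the bootstrap loop is a simple stand-in for the method attributed to Gong rather than a reproduction of it.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def screen(X, y, rows, candidates, threshold):
    """Return the candidate columns whose |r| with y exceeds threshold on rows."""
    y_sub = [y[i] for i in rows]
    kept = set()
    for j in candidates:
        x_sub = [X[i][j] for i in rows]
        if abs(pearson(x_sub, y_sub)) > threshold:
            kept.add(j)
    return kept


# Simulated data (an assumption): 400 observations, 30 candidate variables,
# of which only variables 0 and 1 actually influence y.
rng = random.Random(0)
n_obs, n_vars = 400, 30
X = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_obs)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 1) for row in X]

# Steps 2 and 3: screen on one random subset, then re-check the survivors
# on a second, disjoint random subset.
order = list(range(n_obs))
rng.shuffle(order)
first, second = order[:200], order[200:]

THRESHOLD = 0.2  # assumed cutoff on |r|, chosen only for illustration
stage1 = screen(X, y, first, range(n_vars), THRESHOLD)
stage2 = screen(X, y, second, stage1, THRESHOLD)
print("selected on first subset:", sorted(stage1))
print("still selected on second subset:", sorted(stage2))

# Bootstrap alternative: resample rows with replacement and record how
# often each variable passes the same screen.
n_boot = 50
counts = [0] * n_vars
for _ in range(n_boot):
    rows = [rng.randrange(n_obs) for _ in range(n_obs)]
    for j in screen(X, y, rows, range(n_vars), THRESHOLD):
        counts[j] += 1
frequency = [c / n_boot for c in counts]
print("selection frequency of variables 0 and 1:", frequency[0], frequency[1])
```

A variable that passes the screen on only one subset, or that is selected in only a small fraction of bootstrap samples, is a candidate for a chance correlation rather than a real relationship.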
If all this sounds like a lot of work, it is. It takes several years to develop sound models, even with, or despite, the availability of lightning-fast, multifunction statistical software. The most common error in statistics is to assume that statistical procedures can take the place of sustained effort.
TO LEARN MORE
Inflation of $R^2$ as a consequence of multiple tests was also considered by Rencher.
Osborne and Waters review tests of the assumptions of multivariable regression. Harrell, Lee, and Mark review the effect of violation of assumptions on GLMs and suggest the use of the bootstrap for model validation. Hosmer and Lemeshow recommend the use of the bootstrap or some other validation procedure before accepting the results of a logistic regression.
Diagnostic procedures for use in determining an appropriate functional form are described by Mosteller and Tukey, Therneau and Grambsch, Hosmer and Lemeshow, and Hardin and Hilbe.
... the simple idea of splitting a sample in two and then developing the hypothesis on the basis of one part and testing it on the remainder may perhaps be said to be one of the most seriously neglected ideas in statistics, if we measure the degree of neglect by the ratio of the number of cases where a method could help to the number of cases where it is actually used. G. A. Barnard, in discussion following Stone [1974, p. 133].
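The split-sample idea in the quotation can be illustrated with a small simulation. In this sketch, which rests entirely on assumed data, the response is pure noise, the "hypothesis" is simply the single predictor that fits best on one half, and the helper names `fit_line`, `r_squared`, and `column` are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares (intercept, slope) for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(x, y, intercept, slope):
    """1 - SSE/SST for the given line, evaluated on the data (x, y)."""
    my = sum(y) / len(y)
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return 1.0 - sse / sst


# Assumed data: 30 predictors and a response that is unrelated to all of them.
rng = random.Random(1)
n_obs, n_vars = 400, 30
X = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_obs)]
y = [rng.gauss(0, 1) for _ in range(n_obs)]

# Split the sample in two at random.
order = list(range(n_obs))
rng.shuffle(order)
develop, test = order[:200], order[200:]


def column(rows, j):
    """Values of predictor j on the given rows."""
    return [X[i][j] for i in rows]


y_dev = [y[i] for i in develop]
y_test = [y[i] for i in test]

# Develop the hypothesis on one part: pick the predictor that fits best there.
fits = {j: fit_line(column(develop, j), y_dev) for j in range(n_vars)}
best = max(fits, key=lambda j: r_squared(column(develop, j), y_dev, *fits[j]))
a, b = fits[best]

# Test it on the remainder, using the line fitted on the first part.
r2_dev = r_squared(column(develop, best), y_dev, a, b)
r2_test = r_squared(column(test, best), y_test, a, b)
print(f"best predictor on the development half: x{best}")
print(f"R^2 on the half used to choose it: {r2_dev:.3f}")
print(f"R^2 on the untouched half: {r2_test:.3f}")
```

Because the response is noise by construction, whatever $R^2$ appears on the development half is produced by the selection itself; on the untouched half the same line will typically score near zero or below.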