“Even if significance can be determined and the null hypothesis rejected or accepted, there is a much deeper problem. To make causal inferences, it must in essence be assumed that equations are invariant under proposed interventions. . . . if the coefficients and error terms change when the variables on the right hand side of the equation are manipulated rather than being passively observed, then the equation has only a limited utility for predicting the results of interventions.”
4 Most published methods also require that the residuals be normally distributed.
Statistically significant findings should serve as a motivation for further corroborative and collateral research rather than as a basis for conclusions.
Checklist: Write down and confirm your assumptions before you begin.
• Data cover an adequate range. Slope of line not dependent on a few isolated values.
• Model is plausible and has or suggests a causal basis.
• Relationships among variables remained unchanged during the data collection period and will remain unchanged in the near future.
• Uncontrolled variables are accounted for.
• Loss function is known and will be used to determine the goodness of fit criteria.
• Observations are independent, or the form of the dependence is known or is a focus of the investigation.
• Regression method is appropriate for the types of data involved and the nature of the relationship.
• Distribution of residual errors is known.
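One item on the checklist, that the slope should not depend on a few isolated values, can be checked by refitting the line with each observation left out in turn. The sketch below is one simple way to do this, not a method prescribed by the text; the function names and the data values are hypothetical and chosen only for illustration.

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def leave_one_out_slopes(x, y):
    """Slope of the fitted line with observation i removed, for each i."""
    return [
        slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        for i in range(len(x))
    ]


# Illustrative data (made up): five clustered points and one isolated point.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [2.1, 1.9, 2.0, 2.2, 1.8, 10.0]

full_slope = slope(x, y)
loo_slopes = leave_one_out_slopes(x, y)
```

If one entry of `loo_slopes` differs sharply from `full_slope`, the fitted slope rests largely on that single observation, and the first item on the checklist is in doubt.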
TO LEARN MORE
David Freedman’s article on association and causation is must reading. Lieberson has many examples of spurious association. Friedman, Furberg and DeMets cite a number of examples of clinical trials using misleading surrogate variables.
Mosteller and Tukey expand on many of the points raised here concerning the limitations of linear regression. Mielke and Berry [2001, Section 5.4] provide a comparison of MRPP, Cade-Richards, and OLS regression methods. Distribution-free methods for comparing regression lines among strata are described by Good [2001, pp. 168-169].
For more on Simpson’s paradox, see http://www.cawtech.freeserve.co.uk/simpsons.2.html. For a real-world example, search under Simpson’s paradox for an analysis of racial bias in New Zealand Jury Service at http://www.stats.govt.nz.
Multivariable regression is plagued by the same problems univariate regression is heir to, plus many more of its own. Is the model correct? Are the associations spurious?
In the univariate case, if the errors were not normally distributed, we could take advantage of permutation methods to obtain exact significance levels in tests of the coefficients. Exact permutation methods do not exist in the multivariable case.
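For the univariate case, a permutation test of the slope can be sketched as follows. An exact test would enumerate every rearrangement of the responses; the version here samples rearrangements at random, so it is a Monte Carlo approximation to the exact significance level. The function names, the number of rearrangements, and the data are assumptions made for illustration.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided Monte Carlo permutation p value for H0: slope = 0."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        # Under H0 the responses are exchangeable, so shuffle them over x.
        rng.shuffle(y_perm)
        if abs(slope(x, y_perm)) >= observed:
            hits += 1
    # Count the observed arrangement itself so the p value is never zero.
    return (hits + 1) / (n_perm + 1)


# Illustrative data (made up): a strong linear trend with a small wobble.
x = list(range(12))
y = [3 * a + (-1) ** a for a in x]
p = permutation_p_value(x, y)
```

No assumption of normally distributed errors is needed here; the test relies only on the responses being exchangeable when the slope is zero.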
When selecting variables to incorporate in a multivariable model, we are forced to perform repeated tests of hypotheses, so that the resultant p values are no longer meaningful. One solution, if sufficient data are available, is to divide the data set into two parts, using the first part to select variables and using the second part to test these same variables for significance.
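The split-sample idea can be sketched as below, assuming NumPy is available. The selection rule (keep the variables with the largest absolute t statistics in the first half) is an assumption chosen for brevity; the text does not prescribe a particular rule. The names `ols_t_stats` and `split_sample_test` are hypothetical, and the data are simulated.

```python
import numpy as np


def ols_t_stats(X, y):
    """t statistics for the slopes of an OLS fit that includes an intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - k - 1)
    cov = sigma2 * np.linalg.inv(A.T @ A)
    return (beta / np.sqrt(np.diag(cov)))[1:]  # drop the intercept


def split_sample_test(X, y, n_keep=3, seed=0):
    """Select variables on one half of the data, test them on the other."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    first, second = idx[:half], idx[half:]
    # Selection half: keep the n_keep variables with the largest |t|.
    t_select = ols_t_stats(X[first], y[first])
    chosen = np.argsort(-np.abs(t_select))[:n_keep]
    # Test half: refit with only the chosen variables on untouched data.
    t_test = ols_t_stats(X[second][:, chosen], y[second])
    return chosen, t_test


# Simulated data: ten candidate variables, only column 4 affects y.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = 2.0 * X[:, 4] + rng.standard_normal(200)
chosen, t_test = split_sample_test(X, y)
```

Because the second half played no part in choosing the variables, the t statistics in `t_test` are not distorted by the repeated testing that produced `chosen`.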
If choosing the correct functional form of a model in a univariate case presents difficulties, consider that in the case of k variables, there are k linear terms (should we use logarithms? should we add polynomial terms?) and k(k - 1) first-order cross products of the form x_i x_k. Should we include any of the k(k - 1)(k - 2) second-order cross products?
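To get a feel for how quickly the number of candidate terms grows, the counts can be tabulated for a given k. The sketch below assumes that "second-order cross products" means products of three distinct variables. It reports both the ordered counts, k(k - 1) and k(k - 1)(k - 2), and the counts obtained when rearrangements of the same variables are treated as one term, k(k - 1)/2 and k(k - 1)(k - 2)/6. The function name is hypothetical.

```python
from math import comb


def candidate_term_counts(k):
    """Number of candidate terms for a model with k explanatory variables."""
    return {
        "linear": k,
        # Products of two distinct variables.
        "pairs_ordered": k * (k - 1),
        "pairs_distinct": comb(k, 2),
        # Products of three distinct variables.
        "triples_ordered": k * (k - 1) * (k - 2),
        "triples_distinct": comb(k, 3),
    }


counts = candidate_term_counts(10)
```

Even under the smaller, distinct-term count, ten variables already offer 45 pairwise products and 120 triple products to choose among, before any logarithms or polynomial terms are considered.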
Should we use forward stepwise regression, or backward, or some other method for selecting variables for inclusion? The order of selection can result in major differences in the final form of the model (see, for example, Roy and Goldberger).
David Freedman searched for and found a large and highly significant R² among totally independent normally distributed random variables. This article is reproduced in its entirety in Appendix A, and we urge you to read this material more than once. Freedman demonstrates how the testing of multiple hypotheses, a process that typifies the method of stepwise regression, can only exacerbate the effects of spurious correlation. As he notes in the introduction to the article, “If the number of variables is comparable to the number of data points, and if the variables are only imperfectly correlated among themselves, then a very modest search procedure will produce an equation with a relatively small number of explanatory variables, most of which come in with significant coefficients, and a highly significant R². This will be so even if Y is totally unrelated to the X’s.”
Freedman used computer simulation to generate 5100 independent normally distributed “observations.” He put these values into a data matrix in the form required by the SAS regression procedure. His organization of the values defined 100 “observations” on each of 51 random variables. Arbitrarily, the first 50 variables were designated as “explanatory” and the 51st as the dependent variable Y.
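The data layout described above is easy to reproduce. The sketch below, assuming NumPy rather than SAS, builds a 100 by 51 matrix of independent standard normal values and fits the 51st column on the first 50 by ordinary least squares. It covers only this first step and is not a reconstruction of Freedman's full procedure; the function name and the seed are hypothetical.

```python
import numpy as np


def noise_regression_r2(n_obs=100, n_vars=50, seed=0):
    """R-squared from regressing pure noise on n_vars pure-noise predictors."""
    rng = np.random.default_rng(seed)
    # n_obs "observations" on n_vars + 1 independent normal variables.
    data = rng.standard_normal((n_obs, n_vars + 1))
    X, y = data[:, :n_vars], data[:, n_vars]  # last column plays the role of Y
    A = np.column_stack([np.ones(n_obs), X])  # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


r2 = noise_regression_r2()
```

With p noise predictors and n observations, the expected value of R² when Y is unrelated to the predictors is p/(n - 1), which here is 50/99, or about one half. A sizable R² therefore appears before any searching among the variables has begun.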