# Common Errors in Statistics and How to Avoid Them - Good P.I


In the first of two passes through the “data,” all 50 of the explanatory variables were used. Fifteen of the 50 coefficients were significant at the 25% level, and one was significant at the 5% level.

Focusing attention on the “explanatory” variables that proved significant on the first pass, a second model was constructed using only those 15 variables. The resulting model had an R² of 0.36, and the model coefficients of six of the “explanatory” (but completely unrelated) variables were significant at the 5% level. Given these findings, how can we be sure whether the statistically significant variables we uncover in our own research via regression methods are truly explanatory or are merely the result of chance?
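The two-pass screening described above can be imitated with a few lines of simulation. The following is a minimal sketch, not Freedman's original code: the number of observations (100), the random seed, the helper names `ols_pvalues` and `two_pass_screen`, and the use of a normal approximation for the p-values are all assumptions made for illustration.

```python
import math
import numpy as np


def ols_pvalues(X, y):
    """Fit y on X plus an intercept; return R^2 and approximate two-sided p-values."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])            # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]                            # residual degrees of freedom
    sigma2 = (resid @ resid) / dof
    cov = sigma2 * np.linalg.inv(A.T @ A)
    t = beta / np.sqrt(np.diag(cov))
    # Assumption: normal approximation to the t distribution, to avoid extra dependencies.
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in t])
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return r2, p[1:]                                # drop the intercept's p-value


def two_pass_screen(n_obs=100, n_vars=50, screen_level=0.25, seed=0):
    """Regress pure noise on pure noise, screen, then refit on the survivors."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_obs, n_vars))        # "explanatory" variables: pure noise
    y = rng.standard_normal(n_obs)                  # response: unrelated to X
    r2_first, p_first = ols_pvalues(X, y)
    kept = np.flatnonzero(p_first < screen_level)   # first-pass survivors
    r2_second, p_second = ols_pvalues(X[:, kept], y)
    return {
        "r2_first": r2_first,
        "kept": kept,
        "r2_second": r2_second,
        "p_second": p_second,
    }


result = two_pass_screen()
print("variables kept after first pass:", len(result["kept"]))
print("second-pass R^2:", round(result["r2_second"], 3))
print("second-pass coefficients with p < 0.05:", int(np.sum(result["p_second"] < 0.05)))
```

Rerunning with different seeds gives a feel for how often unrelated variables survive the screen and then look significant in the second model. The exact counts depend on the seed and will not match the figures quoted in the text.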

A partial answer may be found in an article by Gail Gong published in 1986 and reproduced in its entirety in Appendix 2.

Gail Gong was among the first students, if not the first, to base a doctoral dissertation on the bootstrap. Reading her article, reprinted here with the permission of the American Statistical Association, we learn that the bootstrap can be an invaluable tool for model validation, a result we explore at greater length in the following chapter. We also learn not to take for granted the results of a stepwise regression.

Gong [1986] constructed a logistic regression model based on observations Peter Gregory made on 155 chronic hepatitis patients, 33 of whom died. The object of the model was to identify patients at high risk. In contrast to the computer simulations David Freedman performed, the 19 explanatory variables were real, not simulated, derived from medical histories, physical examinations, X-rays, liver function tests, and biopsies.

If one or more extreme values can influence the slope and intercept of a univariate regression line, think how much more impact, and how subtle the effect, these values might have on a curve drawn through 20-dimensional space.¹

Gong’s logistic regression models were constructed in two stages. At the first stage, each of the explanatory variables was evaluated on a univariate basis. Thirteen of these variables proved significant at the 5% level when applied to the original data. A forward multiple regression was applied to these thirteen variables and four were selected for use in the predictor equation.

When she took bootstrap samples from the 155 patients, the R² values of the final models associated with the individual bootstrap samples varied widely. Not reported in this article, but far more important, is that while two of the original four predictor variables always appeared in the final model generated from a bootstrap sample of the patients, five other variables appeared in only some of the models.

We strongly urge you to adopt Dr. Gong’s bootstrap approach to validating multivariable models. Retain only those variables that appear consistently in the bootstrap regression models. Additional methods for model validation are described in Chapter 11.
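The idea of keeping only variables that recur across bootstrap samples can be sketched as follows. This is an illustration under stated assumptions, not Gong's procedure: it substitutes ordinary least squares with a simple forward selection for her logistic regression, uses simulated data in which only the first two of 19 variables matter, and the F-to-enter cutoff of 4.0 and the 90% retention threshold are hypothetical choices. The function names `rss`, `forward_select`, and `selection_frequency` are likewise made up for this example.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares for an OLS fit of y on an intercept plus X[:, cols]."""
    cols = np.asarray(cols, dtype=int)
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)


def forward_select(X, y, f_to_enter=4.0):
    """Greedy forward selection; stop when the best candidate's F statistic is too small."""
    n, p = X.shape
    chosen = []
    current = rss(X, y, chosen)
    while len(chosen) < p:
        candidates = [j for j in range(p) if j not in chosen]
        trials = {j: rss(X, y, chosen + [j]) for j in candidates}
        best = min(trials, key=trials.get)
        dof = n - len(chosen) - 2                  # residual d.f. once `best` is added
        f_stat = (current - trials[best]) / (trials[best] / dof)
        if f_stat < f_to_enter:
            break
        chosen.append(best)
        current = trials[best]
    return chosen


def selection_frequency(X, y, n_boot=200, seed=0):
    """Fraction of bootstrap samples in which each variable enters the selected model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # resample subjects with replacement
        for j in forward_select(X[idx], y[idx]):
            counts[j] += 1
    return counts / n_boot


# Simulated stand-in data: 155 subjects, 19 variables, only variables 0 and 1 matter.
rng = np.random.default_rng(42)
X = rng.standard_normal((155, 19))
y = X[:, 0] + X[:, 1] + rng.standard_normal(155)

freq = selection_frequency(X, y, n_boot=100)
stable = np.flatnonzero(freq >= 0.9)               # hypothetical "consistent" threshold
print("selection frequency per variable:", np.round(freq, 2))
print("variables retained:", stable)
```

Variables with no real relationship to the response will typically enter some bootstrap models but not others, which is the instability the text warns about. The selection step must be repeated inside every bootstrap sample; resampling only the final model would hide that instability.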

Correcting for Confounding Variables

When your objective is to verify the association between predetermined explanatory variables and the response variable, multiple linear regression analysis permits you to provide for one or more confounding variables that could not be controlled otherwise.
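A small simulation may help show what providing for a confounding variable buys you. In this sketch, every number is an assumption chosen for illustration: `z` is a confounder that drives both the explanatory variable `x` and the response `y`, and the true coefficient of `x` is set to 0.5. The helper `ols` is a hypothetical convenience function.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
z = rng.standard_normal(n)                          # confounder
x = 0.8 * z + rng.standard_normal(n)                # explanatory variable, partly driven by z
y = 0.5 * x + 1.0 * z + rng.standard_normal(n)      # true coefficient of x is 0.5


def ols(columns, y):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


unadjusted = ols([x], y)[1]                         # z omitted: its effect leaks into x
adjusted = ols([x, z], y)[1]                        # z included as a covariate

print("coefficient of x, ignoring z:     ", round(unadjusted, 3))
print("coefficient of x, adjusting for z:", round(adjusted, 3))
```

With the confounder left out, the estimated coefficient of `x` absorbs part of the effect of `z`; including `z` in the model brings the estimate back toward the value used to generate the data. This only works for confounders that were actually measured.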

GENERALIZED LINEAR MODELS

Today, most statistical software incorporates new advanced algorithms for the analysis of generalized linear models (GLMs)² and extensions to panel data settings, including fixed-, random-, and mixed-effects models; logistic, Poisson, and negative-binomial regression; generalized estimating equations (GEEs); and hierarchical linear models (HLMs). These models take the form $Y = g^{-1}(Xb) + e$, where $b$ is a vector of to-be-determined coefficients, $X$ is a matrix of explanatory variables, and $e$ is a vector of identically distributed random variables. These variables may be normal, gamma, or Poisson depending on the specified variance of the GLM. The nature of the relationship between the outcome variable and the coefficients depends on the specified link function $g$ of the GLM. Panel data models include the following:

Fixed Effects. An indicator variable for each subject is added and used to fit the model. Though often applied to the analysis of repeated measures, this approach has bias that increases with the number of subjects. If data include a large number of subjects, the associated bias of the results makes this a very poor model choice.

Conditional Fixed Effects. These are applied in logistic regression, Poisson regression, and negative binomial regression. A sufficient statistic for the subject effect is used to derive a conditional likelihood such that the subject level effect is removed from the estimation.

While conditioning out the subject level effect in this manner is algebraically attractive, interpretation of model results must continue to be in terms of the conditional likelihood. This may be difficult and the analyst must be willing to alter the original scientific questions of interest to questions in terms of the conditional likelihood.

¹ That’s one dimension for risk of death, the dependent variable, and 19 for the explanatory variables.

² As first defined by Nelder and Wedderburn [1972].
