# Common Errors in Statistics and How to Avoid Them - Good P.I


Questions always arise as to whether some function of the independent variable might be more appropriate to use than the independent variable itself. For example, suppose X = Z² where E(Y | Z) satisfies the logistic equation; then E(Y | X) does not.
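As a sketch of why this happens, suppose (purely for illustration) that Z is nonnegative and that the logistic equation takes its usual linear-logit form with hypothetical coefficients β₀ and β₁. Substituting Z = √X gives:

$$
\operatorname{logit} E(Y \mid Z) = \log\frac{E(Y \mid Z)}{1 - E(Y \mid Z)} = \beta_0 + \beta_1 Z
\quad\Longrightarrow\quad
\operatorname{logit} E(Y \mid X) = \beta_0 + \beta_1 \sqrt{X}.
$$

Under these assumptions the logit is linear in √X but not in X, so a logistic model with X entered untransformed would be misspecified.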

Random Effects. The choice of a distribution for the random effect is too often driven by the need to find an analytic solution to the problem rather than by any actual knowledge. If we assume a normally distributed random effect when the random effect is really Laplace, we will get the same point estimates (since both distributions have mean zero), but we will get different standard errors. We have no way of checking the two approaches short of fitting both models.

If the true random-effects distribution has a nonzero mean, then the misspecification is more troublesome, as the point estimates of the fitted model differ from those that would be obtained by fitting the true model. Knowledge of the true random-effects distribution does not alter the interpretation of fitted model results. Instead, we are limited to discussing the relationship of the fitted parameters to those we would obtain if we had access to the entire population of subjects and fit the same model to that population. In other words, even given knowledge of the true random-effects distribution, we cannot easily compare fitted results to true parameters.

As discussed in Chapter 5 with respect to group-randomized trials, if the subjects are not independent (say, they all come from the same classroom), then the true random effect is actually larger. The attenuation of our fitted coefficient increases as a function of the number of supergroups containing our subjects as members; if classrooms are within schools and there is within school correlation, the attenuation is even greater.

GEE (Generalized Estimating Equation). Rather than deriving the estimating equation for a GLM with correlated observations from a likelihood argument, GEE introduces the within-subject correlation into the estimating equation itself. The correlation parameters are then nuisance parameters and can be estimated separately. (See also Hardin and Hilbe, 2003.)

Underlying the population-averaged GEE is the assumption that one is able to specify the correct correlation structure. If one hypothesizes an exchangeable correlation and the true correlation is time-dependent, the resulting regression coefficient estimator is inefficient. The naive variance estimates of the regression coefficients will then produce incorrect confidence intervals. Analysts specify a correlation structure to gain efficiency in the estimation of the regression coefficients, but typically calculate the sandwich estimate of variance to protect against misspecification of the correlation.³ This variance estimator is more variable than the naive variance estimator, and many analysts do not pay adequate attention to the fact that the asymptotic properties depend on the number of subjects (not the total number of observations).
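The contrast between the naive and sandwich variance estimates can be illustrated with a small simulation. The following is only a sketch under assumptions that are not from the text: an identity link, a single subject-level covariate, an independence working correlation (deliberately misspecified, since the simulated observations share a subject effect), and hypothetical helper names such as `fit_independence_gee`. In practice one would use a statistics library rather than hand-written matrix code.

```python
import random

def inv2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul2(p, q):
    """Multiply two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def fit_independence_gee(clusters):
    """Fit E[y] = b0 + b1*x under an independence working correlation.

    clusters: one list of (x, y) pairs per subject.
    Returns (beta, naive_cov, sandwich_cov).
    """
    xtx = [[0.0, 0.0], [0.0, 0.0]]
    xty = [0.0, 0.0]
    n_obs = 0
    for obs in clusters:
        for x, y in obs:
            row = (1.0, x)  # design row: intercept and covariate
            for a in range(2):
                xty[a] += row[a] * y
                for b in range(2):
                    xtx[a][b] += row[a] * row[b]
            n_obs += 1
    bread = inv2(xtx)
    beta = [bread[a][0] * xty[0] + bread[a][1] * xty[1] for a in range(2)]

    meat = [[0.0, 0.0], [0.0, 0.0]]
    rss = 0.0
    for obs in clusters:
        score = [0.0, 0.0]  # X_i' r_i, summed within one subject
        for x, y in obs:
            r = y - beta[0] - beta[1] * x
            rss += r * r
            score[0] += r
            score[1] += x * r
        for a in range(2):
            for b in range(2):
                meat[a][b] += score[a] * score[b]

    sigma2 = rss / (n_obs - 2)
    # Naive (model-based) covariance treats all observations as independent.
    naive = [[sigma2 * bread[a][b] for b in range(2)] for a in range(2)]
    # Sandwich covariance sums score outer products over subjects.
    sandwich = matmul2(matmul2(bread, meat), bread)
    return beta, naive, sandwich

def simulate(n_subjects, n_per_subject, rng):
    """Simulate y = 1 + 2x + u + e with a shared subject effect u."""
    clusters = []
    for _ in range(n_subjects):
        x = rng.gauss(0.0, 1.0)  # subject-level covariate
        u = rng.gauss(0.0, 1.0)  # induces within-subject correlation
        clusters.append([(x, 1.0 + 2.0 * x + u + rng.gauss(0.0, 1.0))
                         for _ in range(n_per_subject)])
    return clusters

rng = random.Random(0)
beta, naive, sandwich = fit_independence_gee(simulate(200, 5, rng))
print("slope estimate:", beta[1])
print("naive SE:      ", naive[1][1] ** 0.5)
print("sandwich SE:   ", sandwich[1][1] ** 0.5)
```

Note that the sandwich "meat" is a sum over subjects, not over observations, which is why its behavior is governed by the number of subjects: rerunning the sketch with far fewer subjects and more observations per subject keeps the total sample size but makes the sandwich estimate noticeably less stable.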

HLM. This includes hierarchical linear models, linear latent models, and others. While the previous models are limited for the most part to a single random effect, HLM allows more than one. Unfortunately, most commercially available software requires one to assume that each random effect is Gaussian with mean zero. The variance of each random effect must be estimated.

Mixed Models. These allow both linear and nonlinear mixed-effects regression (with various links). They allow you to specify each level of repeated measures. Imagine a hierarchy such as districts, schools, teachers, classes, students.

In this description, each sublevel is nested within the previous level, and we can hypothesize a fixed or random effect for each level. We also suppose that observations sharing the same unit at any of these levels are correlated.
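A small simulation can make the nesting concrete. The sketch below is an illustration only: it uses three levels instead of five, assumes independent Gaussian random effects with made-up variances (school 1, class 1, student 2), and all names are hypothetical. Under those assumptions the theoretical correlation between two students is (1 + 1)/4 = 0.5 if they share a class and 1/4 = 0.25 if they share only a school.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
# Assumed standard deviations (variances 1, 1, and 2).
SD_SCHOOL, SD_CLASS, SD_STUDENT = 1.0, 1.0, 2.0 ** 0.5

first, classmate, schoolmate = [], [], []
for _ in range(4000):  # one iteration per simulated school
    school = rng.gauss(0.0, SD_SCHOOL)
    class_a = school + rng.gauss(0.0, SD_CLASS)  # class nested in school
    class_b = school + rng.gauss(0.0, SD_CLASS)  # second class, same school
    first.append(class_a + rng.gauss(0.0, SD_STUDENT))
    classmate.append(class_a + rng.gauss(0.0, SD_STUDENT))
    schoolmate.append(class_b + rng.gauss(0.0, SD_STUDENT))

same_class = pearson(first, classmate)
same_school = pearson(first, schoolmate)
print("same class:            ", same_class)
print("same school, not class:", same_school)
```

Students who share more levels of the hierarchy share more random effects and are therefore more strongly correlated, which is the structure a mixed model has to represent.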

The caveats revealed in this and the previous chapter apply to the GLMs. The most common sources of error are the use of an inappropriate or erroneous link function, the wrong choice of scale for an explanatory variable (for example, using x rather than log[x]), neglecting important variables, and the use of an inappropriate error distribution when computing confidence intervals and p values. Firth [1991, pp. 74-77] should be consulted for a more detailed analysis of potential problems.
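The effect of choosing the wrong scale for an explanatory variable can be seen in a toy example. This sketch assumes, for illustration only, that the response is truly linear in log(x) with made-up coefficients and Gaussian noise; it then compares straight-line fits on the two scales using R², which is one crude check among many. The function name `r_squared` is hypothetical.

```python
import math
import random

def r_squared(xs, ys):
    """R^2 of the least-squares straight line of ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

rng = random.Random(2)
x = [float(i) for i in range(1, 101)]
# Assumed truth: y is linear in log(x), not in x.
y = [1.0 + 2.0 * math.log(xi) + rng.gauss(0.0, 0.3) for xi in x]

r2_raw = r_squared(x, y)                           # wrong scale
r2_log = r_squared([math.log(xi) for xi in x], y)  # scale used to generate y
print("R^2 using x:     ", r2_raw)
print("R^2 using log(x):", r2_log)
```

A plot of the residuals against x would show the same problem more directly: on the wrong scale the residuals follow a systematic curve instead of scattering around zero.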

REPORTING YOUR RESULTS

In reporting the results of your modeling efforts you need to be explicit about the methods used, the assumptions made, the limitations on your model’s range of application, potential sources of bias, and the method of validation (see the following chapter). The section on “Limitations of the Logistic Regression” from Bent and Archfield [2002] is ideal in this regard: “The logistic regression equation developed is applicable for stream sites with drainage areas between 0.02 and 7.00 mi² in the South Coastal Basin and between 0.14 and 8.94 mi² in the remainder of Massachusetts, because these were the smallest and largest drainage areas used in equation development for their respective areas.” (The authors go on to subdivide the area.)

³ See Hardin and Hilbe [2003, p. 28] for a more detailed explanation.
