# Common Errors in Statistics and How to Avoid Them - Good P.I


But quadratic terms and log transforms are irrelevancies, artifacts resulting from an attempt to squeeze the data into the confines of a linear regression model. The issue appears to be whether the rate of change of crime rates with poverty levels is a constant, increasing, or decreasing function of poverty levels. Resolution of this issue requires a totally different approach.

Suppose Y denotes the variable you are trying to predict and X denotes the predictor. Replace each of the y[i] by the slope y*[i] = (y[i + 1] - y[i])/(x[i + 1] - x[i]). Replace each of the x[i] by the midpoint of the interval over which the slope is measured, x*[i] = (x[i + 1] + x[i])/2. Use the permutation methods described in Chapter 5 to test for the correlation, if any, between y* and x*. A positive correlation means an accelerating slope; a negative correlation, a decelerating slope.
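The transformation above can be sketched in a few lines of Python. This is a minimal illustration, not the book's own code: the function names, the toy data, and the choice of a Monte Carlo permutation test on Pearson's correlation are all assumptions (Chapter 5 may prescribe a different statistic). The midpoint is taken as the average of the two interval endpoints, and x is assumed to be sorted with no repeated values.

```python
import random

def slopes_and_midpoints(x, y):
    """Return (x_star, y_star): interval midpoints and successive slopes.

    Assumes x is sorted in strictly increasing order.
    """
    n = len(x) - 1
    x_star = [(x[i + 1] + x[i]) / 2 for i in range(n)]
    y_star = [(y[i + 1] - y[i]) / (x[i + 1] - x[i]) for i in range(n)]
    return x_star, y_star

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

def permutation_pvalue(u, v, n_perm=2000, seed=0):
    """Two-sided Monte Carlo permutation p-value for the correlation of u and v."""
    rng = random.Random(seed)
    observed = abs(pearson(u, v))
    shuffled = list(v)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any pairing between u and v
        if abs(pearson(u, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Made-up illustrative data: y = x**2, so the slope grows with x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [xi ** 2 for xi in x]

x_star, y_star = slopes_and_midpoints(x, y)
r = pearson(x_star, y_star)               # positive: accelerating slope
p = permutation_pvalue(x_star, y_star)
```

A positive `r` with a small `p` would point to an accelerating slope. Note that adjacent slopes share a data point, so with noisy real data the permutation null is only approximate.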

Correlations can be deceptive. Variable X can have a statistically significant correlation with variable Y solely because X and Y are both dependent on a third variable Z. The price of corn is inversely related to the number of hay-fever cases only because the weather that produces a bumper crop of corn (and so lowers its price) generally yields a bumper crop of ragweed as well.
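A small simulation makes the point concrete. Everything here is hypothetical: the variable names, the noise levels, and the linear dependence on a standardized "weather" index are assumptions chosen only to mimic the corn and hay-fever story. The sketch compares the raw correlation between X and Y with the first-order partial correlation once Z is held fixed.

```python
import random

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

rng = random.Random(42)
n = 5000

# Hypothetical standardized weather index Z that drives both series.
weather = [rng.gauss(0, 1) for _ in range(n)]
corn_price = [-z + rng.gauss(0, 0.5) for z in weather]  # good weather -> lower price
hay_fever = [z + rng.gauss(0, 0.5) for z in weather]    # good weather -> more ragweed

r_xy = pearson(corn_price, hay_fever)
r_xz = pearson(corn_price, weather)
r_yz = pearson(hay_fever, weather)

# Partial correlation of X and Y given Z (standard first-order formula).
partial = (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5
```

In this setup `r_xy` is strongly negative even though neither series influences the other, while `partial` should sit close to zero.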

Even if the causal force X under consideration has no influence on the dependent variable Y, the effects of unmeasured selective processes can produce an apparent test effect. Children were once taught that storks brought babies. This juxtaposition of bird and baby makes sense (at least to a child) because where there are houses there are both families and chimneys where storks can nest. The bad air or miasma model (“common sense” two centuries ago) works rather well at explaining respiratory illnesses and not at all at explaining intestinal ones. An understanding of the role that bacteria and viruses play unites the two types of illness and enriches our understanding of both.

We often try to turn such pseudo-correlations to advantage in our research, using readily measured proxy variables in place of their less easily measured “causes.” Examples are our use of population change in place of economic growth, M2 for the desire to invest, arm cuff blood pressure measurement in place of the width of the arterial lumen, and tumor size for mortality. At best, such surrogate responses are inadequate (as in attempting to predict changes in stock prices); in other instances they may actually point in the wrong direction.

At one time, the level of CD-4 lymphocytes in the blood appeared to be associated with the severity of AIDS; the result was that a number of clinical trials used changes in this level as an indicator of disease status. Reviewing the results of 16 sets of such trials, Fleming [1995] found that the concentration of CD-4 rose to favorable levels in 13 instances even though clinical outcomes were only favorable in eight.

Stratification

Gender discrimination lawsuits based on the discrepancy in pay between men and women could be defeated once it was realized that pay was related to years in service and that women who had only recently arrived on the job market in great numbers simply didn’t have as many years on the job as men.

These same discrimination lawsuits could be won once the gender comparison was made on a years-in-service basis—that is, when the salaries of new female employees were compared with those of newly employed men, when the salaries of women with three years of service were compared with those of men with the same time in grade, and so forth. Within each stratum, men always had the higher salaries.
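The years-in-service comparison can be sketched as a group-by over strata. The records below are made-up numbers (salaries in arbitrary units), and the function name is hypothetical; the sketch only shows the mechanics of comparing like with like.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (gender, years_of_service, salary). Made-up numbers.
records = [
    ("F", 0, 40), ("F", 0, 41), ("F", 0, 39), ("M", 0, 43),
    ("F", 3, 50), ("M", 3, 53), ("M", 3, 54),
    ("M", 10, 70), ("M", 10, 72), ("F", 10, 68),
]

def stratified_gaps(rows):
    """Mean male salary minus mean female salary within each years-of-service stratum."""
    strata = defaultdict(lambda: {"M": [], "F": []})
    for gender, years, salary in rows:
        strata[years][gender].append(salary)
    return {
        years: mean(groups["M"]) - mean(groups["F"])
        for years, groups in sorted(strata.items())
        if groups["M"] and groups["F"]  # skip strata missing one gender
    }

# Pooled comparison ignores years of service entirely.
pooled_gap = (mean(s for g, _, s in records if g == "M")
              - mean(s for g, _, s in records if g == "F"))

# Stratified comparison: one gap per years-of-service level.
gaps = stratified_gaps(records)
```

In this toy data the pooled gap is inflated because the women are concentrated in the low-seniority stratum, yet a smaller gap remains in every stratum, which is the pattern the paragraph describes.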

If the effects of additional variables other than X on Y are suspected, they should be accounted for either by stratifying or by performing a multivariate regression as described in the next chapter.

The two approaches are not equivalent unless all terms are included in the multivariate model. Suppose we want to account for the possible effects of gender. Let I [ ] be an indicator function that takes the value 1 if its argument is true and 0 otherwise. Then to duplicate the effects of stratification, we would have to write the multivariate model in the following form:

$$Y = a_m I[\text{male}] + a_f (1 - I[\text{male}]) + b_m I[\text{male}]\,X + b_f (1 - I[\text{male}])\,X + e.$$
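To see why this form mimics stratification, substitute the two possible values of the indicator, taking each slope coefficient to multiply X. A short worked reduction, using the same symbols as the model above:

$$
I[\text{male}] = 1:\quad Y = a_m + b_m X + e, \qquad\qquad I[\text{male}] = 0:\quad Y = a_f + b_f X + e.
$$

Each stratum gets its own intercept and its own slope. A model with only a gender main effect and a single common slope would force the two slopes to be equal and so would no longer reproduce the stratified analysis.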

In a study by Kanarek et al. [1980], whose primary focus is the relation between asbestos in drinking water and cancer, results are stratified by sex, race, and census tract. Regression is used to adjust for income, education, marital status, and occupational exposure.

Lieberson [1985] warns that if the strata differ in the levels of some third unmeasured factor that influences the outcome variable, the results may be bogus.

Simpson's Paradox

A third omitted variable may also result in two variables appearing to be independent when the opposite is true. Consider the following table, an example of what is termed Simpson’s paradox:

Treatment Group

Control Treated
