# Common Errors in Statistics and How to Avoid Them - Good P.I

**Download**(direct link)

**:**

**55**> 56 57 58 59 60 61 .. 90 >> Next

How Model III behaves will depend upon whether p" and a" are both positive or whether one is positive and the other negative. If p" and a" are both positive, then the graph of Model III will lie below the graph of Model I for small positive values of X and above it for large values. If p" is positive and a" is negative, then Model III will behave more like Model II. Thus Model III is more flexible than either Models I or II and can usually be made to give a better fit to the data—that is, to minimize some

130 PART III BUILDING A MODEL

o y a Fitted values

1 -

-.2

A

A

O A O

A

A

O A A O

O

0 A A

FIGURE 9.1 A Straight Line Appears to Fit the Data.

0

0

2

x

function of the differences between what is observed, y, and what is predicted by the model, Y [x].

The coefficients a, b, g for all three models can be estimated by a technique known (to statisticians) as linear regression. Our knowledge of this technique should not blind us to the possibility that the true underlying model may require nonlinear estimation as in

Model IV: Y = a + bX + gX + e.

d - fX

This latter model may have the advantage over the first three in that it fits the data over a wider range of values.

Which model should we choose? At least two contradictory rules apply:

• The more parameters, the better the fit; thus, Model III and Model IV are to be preferred.

• The simpler, more straightforward model is more likely to be correct when we come to apply it to data other than the observations in hand; thus, Models I and II are to be preferred.

CHAPTER 9 UNIVARIATE REGRESSION 131

o z a Fitted values

1 -

o o

o

-.2

aaaaaaaaaaaaaaaaaaaaa

o

o

o

o

FIGURE 9.2 Fitting an Inappropriate Model.

0

0

2

x

Again, the best rule of all is not to let statistics do your thinking for you, but to inquire into the mechanisms that give rise to the data and that might account for the relationship between the variables X and Y. An example taken from physics is the relationship between volume V and temperature T of a gas. All of the preceding four models could be used to fit the relationship. But only one, the model V = a + KT, is consistent with Kinetic Molecular Theory.

Inappropriate Models

An example in which the simpler, more straightforward model is not correct comes when we try to fit a straight line to what is actually a higher-order polynomial. For example, suppose we tried to fit a straight line to the relationship Y = (X - 1)2 over the range X = (0,+2). We’d get a line with slope 0 similar to that depicted in Figure 9.2. With a correlation of 0, we might even conclude in error that X and Y were not related. Figure 9.2 suggests a way we can avoid falling into a similar trap. Always plot the data before deciding on a model.

132 PART III BUILDING A MODEL

900

800 700 600 cp 500 400 300 200 100 0

0 100 200 300 400 500 600 700

TNF-alpha

FIGURE 9.3 Relation Between Two Inflammatory Reaction Mediators in Response to Silicone Exposure. Data taken from Mena et al. [1995].

The data in Figure 9.3 are taken from Mena et al. [1995]. These authors reported in their abstract that, “The correlation . . . between IL-6 and TNF-alpha was .77 . . . statistically significant at a p-value less than .01.” Would you have reached the same conclusion?

With more complicated models, particularly those like Model IV that are nonlinear, it is advisable to calculate several values that fall outside the observed range. If the results appear to defy common sense (or the laws of physics, market forces, etc.), the nonlinear approach should be abandoned and a simpler model utilized.

Often it can be difficult to distinguish which variable is the cause and which one is the effect. But if the values of one of the variables are fixed in advance, then this variable should always be treated as the so-called independent variable or cause, the X in the equation Y = a + bX + e.

Here is why:

When we write Y = a + bx + e, we actually mean Y = E(Y |x) + e, where E(Y|X) = a + bx is the expected value of an indefinite number of independent observations of Y when X = x. If X is fixed, the inverse equation x = (E(x|Y) - a)/b + e = makes little sense.

Confounding Variables

If the effects of additional variables other than X on Y are suspected, these additional effects should be accounted for either by stratifying or by performing a multivariate regression.

CHAPTER 9 UNIVARIATE REGRESSION 133

SOLVE THE RIGHT PROBLEM

Don't be too quick to turn on the computer. Bypassing the brain to compute by reflex is a sure recipe for disaster.

Be sure of your objectives for the model: Are you trying to uncover cause-and-effect mechanisms? Or derive a formula for use in predictions? If the former is your objective, standard regression methods may not be appropriate.

A researcher studying how neighborhood poverty levels affect violent crime rates hit an apparent statistical roadblock. Some important criminological theories suggest that this positive relationship is curvilinear with an accelerating slope while other theories suggest a decelerating slope. As the crime data are highly variable, previous analyses had used the logarithm of the primary end point—violent crime rate—and reported a significant negative quadratic term (poverty*poverty) in their least-squares models. The researcher felt that such results were suspect, that the log transformation alone might have biased the results toward finding a significant negative quadratic term for poverty.

**55**> 56 57 58 59 60 61 .. 90 >> Next