# Common Errors in Statistics and How to Avoid Them - Good P.I

| | Control | Treated |
|---|---|---|
| Alive | 6 | 20 |
| Dead | 6 | 20 |

We don’t need a computer program to tell us the treatment has no effect on the death rate. Or does it? Consider the following two tables that result when we examine the males and females separately:

Treatment Group (males)

| | Control | Treated |
|---|---|---|
| Alive | 4 | 8 |
| Dead | 3 | 5 |

Treatment Group (females)

| | Control | Treated |
|---|---|---|
| Alive | 2 | 12 |
| Dead | 3 | 15 |

In the first of these tables, treatment reduces the male death rate from 3 out of 7 (0.43) to 5 out of 13 (0.38). In the second, the female death rate is reduced from 3 out of 5 (0.60) to 15 out of 27 (0.56). Both sexes show a reduction, yet the combined population does not. We resolve this paradox by avoiding a knee-jerk response to statistical significance when association is involved. One needs to think deeply about underlying cause-and-effect relationships before analyzing data. Thinking about cause and effect in the preceding example might have led us to consider possible differences between the sexes and to stratify the data by sex before analyzing it.
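The arithmetic behind the paradox can be checked directly. The following is a minimal sketch that recomputes the death rates from the counts in the tables above; the names `males`, `females`, `death_rate`, and `combine` are illustrative and do not come from the text.

```python
from fractions import Fraction

# Counts from the tables above, stored as (alive, dead) for each group.
males = {"control": (4, 3), "treated": (8, 5)}
females = {"control": (2, 3), "treated": (12, 15)}


def death_rate(alive, dead):
    """Proportion of deaths among all subjects in a group."""
    return Fraction(dead, alive + dead)


def combine(*strata):
    """Add the alive and dead counts across strata, group by group."""
    return {
        group: tuple(sum(s[group][k] for s in strata) for k in range(2))
        for group in strata[0]
    }


combined = combine(males, females)

for label, table in [("males", males), ("females", females), ("combined", combined)]:
    rates = {g: death_rate(*counts) for g, counts in table.items()}
    print(label, {g: f"{float(r):.2f}" for g, r in rates.items()})
```

Within each sex the treated rate is lower, while the pooled rates are both 1/2. The pooled comparison is distorted because the groups are unbalanced: 27 of the 40 treated subjects are female, against 5 of the 12 controls, and the females have the higher death rate in both groups.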

## Estimating Coefficients

Write down and confirm your assumptions before you begin.

In this section we consider problems and solutions associated with three related challenges:

1. Estimating the coefficients of a model.

2. Testing hypotheses concerning the coefficients.

3. Estimating the precision of our estimates.

The techniques we employ will depend upon the following:

1. The nature of the regression function (linear, nonlinear, logistic).

2. The nature of the losses associated with applying the model.

3. The distribution of the error terms in the model—that is, the e’s.

4. Whether these error terms are independent or dependent.

The estimates we obtain will depend upon our choice of fitting function. Our choice should not be dictated by the software but by the nature of the losses associated with applying the model. Our software may specify a least-squares fit (most commercially available statistical packages do), but our real concern may be with minimizing the sum of the absolute values of the prediction errors or the maximum loss to which one will be exposed.

Algorithms for least absolute deviation (LAD) regression are given in Barrodale and Roberts [1973]. The qreg function of Stata provides for LAD regression. The Blossom package available as freeware from http://www.mesc.usgs.gov/blossom/blossom.html includes procedures for LAD and quantile regression.
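To illustrate how the choice of fitting function changes the estimates, here is a minimal sketch that fits a straight line to the same data by least squares and by LAD. It is not the Barrodale and Roberts algorithm: for a straight-line fit some LAD-optimal line passes through at least two data points, so the sketch simply tries every such pair, which is practical only for small samples. The function names and the data are assumptions made for illustration.

```python
from itertools import combinations


def ols_fit(x, y):
    """Intercept and slope minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def lad_fit(x, y):
    """Intercept and slope minimizing the sum of absolute residuals.

    Brute force: evaluate the line through every pair of points with
    distinct x values and keep the one with the smallest loss.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a candidate
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(abs(yi - intercept - slope * xi) for xi, yi in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# Hypothetical data: five points on y = 2x plus one outlier at x = 6.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 30]

print("OLS (intercept, slope):", ols_fit(x, y))
print("LAD (intercept, slope):", lad_fit(x, y))
```

On these data the least-squares line is pulled toward the outlier, whereas the LAD line stays on the five collinear points. Which behavior is preferable depends on the losses associated with applying the model, not on the software default.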

In the univariate linear regression model, we assume that

$$y = E(Y \mid x) + e$$

where $E(Y \mid x)$ denotes the mathematical expectation of Y given x and could be any deterministic function of x in which the parameters appear in linear form. The error term e stands for all the other unaccounted-for factors that make up the observed value y.

How accurate our estimates are and how consistent they will be from sample to sample will depend upon the nature of the error terms. If none of the many factors that contribute to the value of e makes more than a small contribution to the total, then e will have an approximately Gaussian distribution. If the $\{e_i\}$ are independent and normally distributed (Gaussian), then the ordinary least-squares estimates of the coefficients produced by most statistical software will be unbiased and have minimum variance.

These desirable properties, indeed the ability to obtain coefficient values that are of use in practical applications, will not be present if the wrong model has been adopted. They will not be present if successive observations are dependent. The values of the coefficients produced by the software will not be of use if the associated losses depend on some function of the observations other than the sum of the squares of the differences between what is observed and what is predicted. In many practical problems, one is more concerned with minimizing the sum of the absolute values of the differences or with minimizing the maximum prediction error. Finally, if the error terms come from a distribution that is far from Gaussian, a distribution that is truncated, flattened, or asymmetric, the p values and precision estimates produced by the software may be far from correct.

Alternatively, we may use permutation methods to test for the significance of the resulting coefficients. Provided that the $\{e_i\}$ are independent and identically distributed (Gaussian or not), the resulting p values will be exact. They will be exact regardless of which goodness-of-fit criterion is employed.

Suppose that our hypothesis is that $y_i = a + b x_i + e_i$ for all $i$ and $b = b_0$. First, we substitute $y'_i = y_i - b_0 x_i$ in place of the original observations $y_i$. Our translated hypothesis is $y'_i = a + b' x_i + e_i$ for all $i$ and $b' = 0$, where $b' = b - b_0$, or, equivalently, $\rho = 0$, where $\rho$ is the correlation between the variables $Y'$ and $X$. Our test for correlation is based on the permutation distribution of the sum of the cross-products $y'_i x_i$ [Pitman, 1938]. Alternative tests based on permutations include those of Cade and Richards [1996], and tests based on MRPP LAD regression include those of Mielke and Berry [1997].
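The procedure just described can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: it samples random permutations rather than enumerating all of them, so the p value is a Monte Carlo approximation to the exact one, and it is two-sided, measuring how far the sum of cross-products lies from its mean over permutations. The function names, the default number of permutations, and the data are hypothetical.

```python
import random


def cross_product_sum(x, y):
    """Test statistic: the sum of the cross-products x_i * y_i."""
    return sum(xi * yi for xi, yi in zip(x, y))


def permutation_test_slope(x, y, b0=0.0, n_perm=10000, seed=0):
    """Approximate two-sided permutation p value for H0: slope = b0."""
    rng = random.Random(seed)
    # Translate the observations so that the hypothesis becomes slope = 0.
    y_adj = [yi - b0 * xi for xi, yi in zip(x, y)]
    # Mean of the statistic over all permutations of y_adj.
    center = sum(x) * sum(y_adj) / len(x)
    observed = abs(cross_product_sum(x, y_adj) - center)
    shuffled = list(y_adj)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # The small tolerance guards against floating-point ties.
        if abs(cross_product_sum(x, shuffled) - center) >= observed - 1e-9:
            hits += 1
    # Count the observed arrangement itself among the permutations.
    return (hits + 1) / (n_perm + 1)


# Hypothetical data generated as y = 1 + 2x + e with fixed small errors.
x = list(range(1, 11))
noise = [0.3, -0.5, 0.1, 0.4, -0.2, -0.4, 0.5, -0.1, 0.2, -0.3]
y = [1 + 2 * xi + ei for xi, ei in zip(x, noise)]

p_slope_zero = permutation_test_slope(x, y, b0=0.0)
p_slope_two = permutation_test_slope(x, y, b0=2.0)
print("p value for b0 = 0:", p_slope_zero)
print("p value for b0 = 2:", p_slope_two)
```

With these data the hypothesis $b_0 = 0$ yields a very small p value, while $b_0 = 2$, the slope used to generate the data, does not. For small samples one could enumerate every permutation instead of sampling, which would give the exact p value the text refers to.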
