# Common Errors in Statistics and How to Avoid Them - Good, P.I.


• Many charts benefit from the addition of grid lines. Bar charts in particular benefit from horizontal grid lines drawn from the y-axis labels, especially in wide displays. Grid lines should be drawn in a lighter shade than the lines used to draw the major features of the graphic.

• Criticize your graphics and tables after production by isolating them, together with their associated captions. Determine whether the salient information is obvious by asking a colleague to interpret the display. If we are serious about producing efficient, communicative graphics, we must take the time to ensure that our graphics are interpretable.

TO LEARN MORE

Wilkinson (1999) presents a formal grammar for describing graphics, but more importantly (for our purposes), the author lists graphical element hierarchies from best to worst. Cleveland (1985) focuses on the elements of common illustrations where he explores the effectiveness of each element in communicating numeric information. A classic text is Tukey (1977), where the author lists both graphical and text-based graphical summaries of data. More recently, Tufte (1983, 1990) organized much of the previous work and combined that work with modern developments. For specific illustrations, subject-specific texts can be consulted for particular displays in context; for example, Hardin and Hilbe (2003, pp. 143-167) illustrate the use of graphics for assessing model accuracy.

CHAPTER 8 GRAPHICS

Part III

BUILDING A MODEL

Chapter 9

Univariate Regression

Are the data adequate? Does your data set cover the entire range of interest? Will your model depend on one or two isolated data points?

The simplest example of a model, the relationship between exactly two variables, illustrates at least five of the many complications that can interfere with the task of model building:

1. Limited scope—the model we develop may be applicable for only a portion of the range of each variable.

2. Ambiguous form of the relationship—a variable may give rise to a statistically significant linear regression without the underlying relationship being a straight line.

3. Confounding—undefined confounding variables may create the illusion of a relationship or may mask an existing one.

4. Assumptions—the assumptions underlying the statistical procedures we use may not be satisfied.

5. Inadequacy—goodness of fit is not the same as prediction.

We consider each of these error sources in turn along with a series of preventive measures. Our discussion is divided into problems connected with model selection and difficulties that arise during the estimation of model coefficients.

MODEL SELECTION

Limited Scope

Almost every relationship has both a linear and a nonlinear portion, where the nonlinear portion becomes increasingly evident at both extremely large and extremely small values. One can think of many examples from physics, such as Boyle's Law, which fails at high pressures, and particle symmetries that are broken as the temperature falls. In medicine, radioimmunoassay fails to deliver reliable readings at very low dilutions, and for virtually every drug there will always be an increasing proportion of nonresponders as the dosage drops. In fact, almost every measuring device, whether electrical, electronic, mechanical, or biological, is reliable only in the central portion of its scale.

We need to recognize that while a regression equation may be used for interpolation within the range of measured values, we are on shaky ground if we try to extrapolate, to make predictions for conditions not previously investigated. The solution is to know the range of application and to recognize, even if we do not exactly know the range, that our equations will be applicable to some but not all possibilities.
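The interpolation-versus-extrapolation risk is easy to sketch numerically. In the hypothetical example below (the data-generating curve and measurement range are invented purely for illustration), a straight line fit over a narrow range predicts well inside that range but fails badly outside it:

```python
import numpy as np

# Hypothetical illustration: data from a saturating (Michaelis-Menten-like)
# curve, measured only over x in [1, 5], where it happens to look nearly linear.
rng = np.random.default_rng(0)
x = np.linspace(1, 5, 40)
truth = lambda v: 10 * v / (v + 10)          # true (nonlinear) relationship
y = truth(x) + rng.normal(0, 0.05, x.size)   # noisy observations

slope, intercept = np.polyfit(x, y, 1)       # straight-line fit to the data

def predict(x_new):
    return slope * x_new + intercept

# Interpolation: inside the measured range the line tracks the truth closely.
err_in = abs(predict(3.0) - truth(3.0))

# Extrapolation: far outside the range the fitted line keeps climbing while
# the true curve levels off near 10, so the prediction error grows sharply.
err_out = abs(predict(50.0) - truth(50.0))
```

Within the measured range the error stays small; at x = 50 the line overshoots the plateau by a wide margin, which is exactly the danger of predicting for conditions not previously investigated.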

Ambiguous Relationships

Think why rather than what.

The exact nature of the formula connecting two variables cannot be determined by statistical methods alone. If a linear relationship exists between two variables X and Y, then a linear relationship also exists between Y and any monotone (nondecreasing or nonincreasing) function of X. Assume that X can take only positive values. If we can fit Model I: Y = α + βX + ε to the data, we also can fit Model II: Y = α′ + β′log[X] + ε, and Model III: Y = α″ + β″X + γX² + ε. It can be very difficult to determine which model, if any, is the “correct” one in either a predictive or mechanistic sense.

A graph of Model I is a straight line (see Figure 9.1). Because Y includes a stochastic or random component ε, the pairs of observations (x₁, y₁), (x₂, y₂), . . . will not fall exactly on this line but above and below it. The function log[X] does not increase as rapidly as X does; when we fit Model II to these same pairs of observations, its graph rises above that of Model I for small values of X and falls below that of Model I for large values. Depending on the set of observations, Model II may give just as good a fit to the data as Model I.
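This ambiguity can be reproduced numerically. In the hypothetical sketch below, data generated from Model I over a narrow positive range of X are fit by ordinary least squares under both Model I and Model II; because log X is nearly linear over a short interval, the two residual sums of squares come out nearly the same:

```python
import numpy as np

# Hypothetical data generated from Model I over a narrow positive range of X.
rng = np.random.default_rng(1)
x = np.linspace(10, 14, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, x.size)

def rss(design, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ coef) ** 2))

ones = np.ones_like(x)
rss_linear = rss(np.column_stack([ones, x]), y)        # Model I: a + b*X
rss_log = rss(np.column_stack([ones, np.log(x)]), y)   # Model II: a' + b'*log X

# The two fits are nearly indistinguishable: the data alone
# cannot identify the "correct" functional form.
ratio = rss_log / rss_linear
```

Even though the data were generated from Model I, the ratio of residual sums stays close to 1, so goodness of fit by itself cannot tell us which functional form is right.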
