# Common Errors in Statistics and How to Avoid Them - Good P.I


3. When the data span several orders of magnitude, as with the concentration of pollutants.

Because bacterial populations can double in number in only a few hours, many government health regulations utilize the geometric rather than the arithmetic mean.7 A number of other government regulations also use it, though the sample median would be far more appropriate.8
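To make the distinction concrete, here is a minimal sketch comparing the three summaries on data that span several orders of magnitude. The concentration values are invented for illustration and do not come from any regulation or data set cited here.

```python
import math
import statistics

# Hypothetical pollutant concentrations (illustrative values only).
concentrations = [0.5, 2.0, 8.0, 40.0, 1000.0]

arithmetic_mean = statistics.mean(concentrations)
geometric_mean = statistics.geometric_mean(concentrations)
median = statistics.median(concentrations)

# The geometric mean is the exponential of the mean of the logarithms.
log_based = math.exp(statistics.mean([math.log(x) for x in concentrations]))

print(f"arithmetic mean: {arithmetic_mean:.1f}")
print(f"geometric mean:  {geometric_mean:.1f}")
print(f"median:          {median:.1f}")
```

In this made-up sample the single large value pulls the arithmetic mean far above both the geometric mean and the median.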

Whether you report a mean or a median, be sure to report only a sensible number of decimal places. Most statistical packages can give you 9 or 10. Don’t use them. If your observations were to the nearest integer, your report on the mean should include only a single decimal place. For guides to the appropriate number of digits, see Ehrenberg [1977]; for percentages, see van Belle [2002, Table 7.4].
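As a small sketch of this advice, the snippet below computes a mean at full floating-point precision and then formats it to one decimal place for reporting. The observations are hypothetical integers chosen for illustration.

```python
import statistics

# Hypothetical observations recorded to the nearest integer.
observations = [12, 15, 11, 14, 13, 17, 12]

mean = statistics.mean(observations)  # carries far more digits than the data justify
reported = f"{mean:.1f}"              # one decimal place for integer-valued data

print(f"as computed: {mean!r}")
print(f"as reported: {reported}")
```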

The standard error is a useful measure of population dispersion if the observations come from a normal or Gaussian distribution. If the observations are normally distributed as in the bell-shaped curve depicted in Figure 7.1, then in 95% of the samples we would expect the sample mean to lie within two standard errors of the mean of our original sample.

But if the observations come from a nonsymmetric distribution like an exponential or a Poisson, or a truncated distribution like the uniform, or a mixture of populations, we cannot draw any such inference.

Recall that the standard error equals the standard deviation divided by the square root of the sample size, $SE = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n(n-1)}}$. Because the standard error depends on the squares of individual observations, it is particularly sensitive to outliers. A few extra large observations will have a dramatic impact on its value.
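The sensitivity to outliers can be checked directly. The sketch below computes the standard error as the square root of the sum of squared deviations divided by n(n - 1), first for a tight sample and then for the same sample with one value replaced by an outlier. The function name `standard_error` and the data are assumptions made for this example.

```python
import math
import statistics

def standard_error(xs):
    """Standard error of the mean: sqrt(sum((x - mean)^2) / (n * (n - 1)))."""
    n = len(xs)
    mean = sum(xs) / n
    sum_sq = sum((x - mean) ** 2 for x in xs)
    return math.sqrt(sum_sq / (n * (n - 1)))

# Hypothetical measurements, then the same data with one extreme value.
clean = [9.0, 10.0, 11.0, 10.0, 9.0, 11.0, 10.0, 10.0]
with_outlier = clean[:-1] + [50.0]

se_clean = standard_error(clean)
se_outlier = standard_error(with_outlier)

# Cross-check: the same quantity is the standard deviation over sqrt(n).
assert math.isclose(se_clean, statistics.stdev(clean) / math.sqrt(len(clean)))

print(f"SE without outlier: {se_clean:.3f}")
print(f"SE with outlier:    {se_outlier:.3f}")
```

Replacing a single observation out of eight increases the standard error more than tenfold in this example.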

If you can’t be sure your observations come from a normal distribution, then consider reporting your results either in the form of a histogram (as in Figure 7.2) or in a box and whiskers plot (Figure 7.3). See also Lang and Secic [1997, p. 50].
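For readers without a plotting package at hand, the quantities a box and whiskers plot displays can be computed directly. This is a sketch using Python's standard library; the heights are invented and are not the data behind Figure 7.2 or Figure 7.3, and the quartile convention is the library default, which may differ slightly from other packages.

```python
import statistics

# Hypothetical heights in cm (illustrative only).
heights = [137, 138.5, 140, 141, 142, 143.5, 145, 147,
           148, 150, 151, 153, 155, 158, 167]

# Quartiles mark the edges of the box; min and max are the whisker ends.
q1, q2, q3 = statistics.quantiles(heights, n=4)
summary = {
    "min": min(heights),
    "q1": q1,
    "median": statistics.median(heights),
    "q3": q3,
    "max": max(heights),
    "mean": statistics.mean(heights),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.1f}")
```

In this made-up sample the mean sits above the median because of the single tall value, the same pattern described in the caption of Figure 7.3.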

7 See, for example, 40 CFR part 131, 62 Fed. Reg. 23004 at 23008 (28 April 1997).

8 Examples include 62 Fed. Reg. 45966 at 45983 (concerning the length of a hospital stay) and 62 Fed. Reg. 45116 at 45120 (concerning sulfur dioxide emissions).

96 PART II HYPOTHESIS TESTING AND ESTIMATION

FIGURE 7.1 Bell-Shaped Symmetric Curve of a Normal Distribution.

FIGURE 7.2 Histogram of Heights in a Sixth-Grade Class. Horizontal axis: Height (cm), with tick labels at 135 and 170.

If your objective is to report the precision of your estimate of the mean or median, then the standard error may be meaningful, provided that the mean of your observations is normally distributed.

CHAPTER 7 REPORTING YOUR RESULTS 97

FIGURE 7.3 Box and Whiskers Plot. The box encompasses the middle 50% of each sample while the “whiskers” lead to the smallest and largest values. The line through the box is the median of the sample; that is, 50% of the sample is larger than this value, while 50% is smaller. The plus sign indicates the sample mean. Note that the mean is shifted in the direction of a small number of very large values. Horizontal axis: Treatment.

The good news is that the sample mean often will have a normal distribution even when the observations themselves do not come from a normal distribution. This is because the sum of a large number of random variables, each of which makes only a small contribution to the total, is approximately a normally distributed random variable.9 And in a sample mean based on n observations, each contributes only 1/nth the total. How close the fit is to a normal will depend upon the size of the sample and the distribution from which the observations are drawn.
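A small simulation can illustrate how the fit improves with sample size. The sketch below draws repeated samples from an exponential distribution (strongly nonsymmetric) and measures the skewness of the resulting sample means, which is zero for a normal distribution. The helper names, the seed, the sample sizes, and the number of replications are all arbitrary choices made for this example.

```python
import random

def skewness(values):
    """Third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def simulated_means(sample_size, replications, rng):
    """Means of repeated exponential samples of the given size."""
    return [
        sum(rng.expovariate(1.0) for _ in range(sample_size)) / sample_size
        for _ in range(replications)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
skew_by_size = {
    n: skewness(simulated_means(n, 5000, rng)) for n in (1, 5, 30)
}

for n, skew in skew_by_size.items():
    print(f"sample size {n:>2}: skewness of the sample mean = {skew:.2f}")
```

The skewness of the sample mean shrinks toward zero as the sample size grows, which is the behavior the Central Limit Theorem leads one to expect.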

The distribution of a uniform random number U[0,1] is a far cry from the bell-shaped curve of Figure 7.1. Only values between 0 and 1 have a positive probability, and in stark contrast to the normal distribution, no range of values between zero and one is more likely than another of the same length. The only element the uniform and the normal distributions have in common is their symmetry about the population mean. Yet to obtain approximately normally distributed random numbers for use in simulations, a frequently employed technique is to generate 12 uniformly distributed random numbers and then take their average.

9 This result is generally referred to as the Central Limit Theorem. Formal proof can be found in a number of texts including Feller [1966, p. 253].


FIGURE 7.4 Rugplot of 50 Bootstrap Medians Derived from a Sample of Sixth Graders’ Heights. Horizontal axis: medians of bootstrap samples, from 142.25 to 158.25.

Apparently, 12 is a large enough number for a sample mean to be normally distributed when the variables come from a uniform distribution.
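The claim about 12 uniform numbers is easy to probe by simulation. The sketch below averages 12 uniform random numbers many times and checks what fraction of the averages falls within two standard deviations of 0.5; for a normal distribution that fraction is about 95%. The seed and the number of replications are arbitrary choices for this example.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
REPLICATIONS = 20_000

# Each value is the average of 12 independent uniform numbers on [0, 1).
averages = [
    sum(rng.random() for _ in range(12)) / 12 for _ in range(REPLICATIONS)
]

# A uniform on [0, 1] has variance 1/12, so the average of 12 has
# variance (1/12)/12 and standard deviation 1/12.
theoretical_mean = 0.5
theoretical_sd = 1 / 12

within_two_sd = sum(
    abs(a - theoretical_mean) <= 2 * theoretical_sd for a in averages
) / REPLICATIONS

print(f"mean of averages: {statistics.fmean(averages):.4f}")
print(f"sd of averages:   {statistics.pstdev(averages):.4f}")
print(f"fraction within two sd: {within_two_sd:.3f}")
```

A closely related variant sums the 12 numbers and subtracts 6, which yields a value with mean 0 and standard deviation 1.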

But if you take a smaller sample of observations from a U[0,1] population, the distribution of its mean would look less like a bell-shaped curve.

A loose rule of thumb is that the mean of a sample of 8 to 25 observations will have a distribution that is close enough to the normal for the standard error to be meaningful. The more nonsymmetric the original distribution, the larger the sample size required. At least 25 observations are needed for a binomial distribution with p = 0.1.
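One way to see why a nonsymmetric parent distribution calls for a larger sample is to look at skewness, which is zero for any normal distribution. This is offered as an illustration, not as the derivation behind the rule of thumb. For a binomial count based on $n$ observations with success probability $p$, the skewness is

$$\gamma_1 = \frac{1 - 2p}{\sqrt{n\,p\,(1 - p)}}$$

With $p = 0.1$ this gives

$$\gamma_1 \approx 0.94 \;(n = 8), \qquad \gamma_1 \approx 0.53 \;(n = 25), \qquad \gamma_1 \approx 0.27 \;(n = 100),$$

whereas with $p = 0.5$ the skewness is zero for every $n$. The asymmetry fades only in proportion to $1/\sqrt{n}$, so a skewed population needs considerably more observations before its sample mean looks normal.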
