3. A quarterly newsletter sent to participants will substantially increase retention (and help you keep track of address changes).
DETERMINING SAMPLE SIZE
Determining optimal sample size is simplicity itself once we specify all of the following:
• Desired power and significance level.
• Distributions of the observables.
• Statistical test(s) that will be employed.
• Anticipated losses due to nonresponders, noncompliant participants, and dropouts.
Power and Significance Level
Understand the relationships among sample size, significance level, power, and precision of the measuring instruments.
Sample size must be determined for each experiment; there is no universally correct value (Table 3.1). Increase the precision (and hold all other parameters fixed) and we can decrease the required number of observations.
TABLE 3.1 Ingredients in a Sample Size Calculation
Type I error (α): Probability of falsely rejecting the hypothesis when it is true.
Type II error (1 − β[A]): Probability of falsely accepting the hypothesis when an alternative hypothesis A is true. Depends on the alternative A.
Power = β[A]: Probability of correctly rejecting the hypothesis when an alternative hypothesis A is true. Depends on the alternative A.
Distribution functions: F[(x − μ)/σ], e.g., normal distribution.
Location parameters: For both hypothesis and alternative hypothesis: μ1, μ2.
Scale parameters: For both hypothesis and alternative hypothesis: σ1, σ2.
Sample sizes: May be different for different groups in an experiment with more than one group.
Permit a greater number of Type I or Type II errors (and hold all other parameters fixed) and we can decrease the required number of observations.
Explicit formulas for power and significance level are available when the underlying observations are binomial, the results of a counting or Poisson process, or normally distributed. Several off-the-shelf computer programs, including nQuery Advisor™, Pass 2000™, and StatXact™, are available to do the calculations for us.
To use these programs, we need to have some idea of the location (mean) and scale parameter (variance) of the distribution both when the primary hypothesis is true and when an alternative hypothesis is true.
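As a rough illustration of how these ingredients combine, the sketch below applies the standard textbook formula for a one-sided z-test with normally distributed observations and a known standard deviation, n = ((z(1 − α) + z(power)) σ / |μ1 − μ0|)². The function name and its arguments are hypothetical, chosen only for this example; it is not the calculation performed by any of the programs named above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_one_sided_z(mu0, mu1, sigma, alpha=0.05, power=0.80):
    """Smallest n for a one-sided z-test of mean mu0 against mean mu1.

    Assumption: observations are normal with known standard deviation sigma.
    """
    if mu0 == mu1:
        raise ValueError("hypothesis and alternative must differ")
    std_normal = NormalDist()                 # mean 0, standard deviation 1
    z_alpha = std_normal.inv_cdf(1 - alpha)   # cutoff set by the Type I error rate
    z_power = std_normal.inv_cdf(power)       # quantile set by the desired power
    n = ((z_alpha + z_power) * sigma / abs(mu1 - mu0)) ** 2
    return ceil(n)


if __name__ == "__main__":
    # Halving sigma (a more precise instrument) or tolerating a larger alpha
    # both reduce the required number of observations.
    print(sample_size_one_sided_z(0, 1, sigma=2))
    print(sample_size_one_sided_z(0, 1, sigma=1))
    print(sample_size_one_sided_z(0, 1, sigma=2, alpha=0.10))
```

Calling the function with different values of sigma, alpha, and power shows the trade-offs described earlier: greater precision or a greater tolerance for error lowers the required sample size when everything else is held fixed.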
Since there may well be an infinity of alternatives in which we are interested, power calculations should be based on the worst-case or boundary value, that is, the alternative closest to the hypothesis. For example, if we are testing a binomial hypothesis p = 1/2 against the alternatives p ≥ 2/3, we would assume that p = 2/3.
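The binomial example can be worked through exactly. The sketch below assumes a one-sided test that rejects p = 1/2 when the number of successes is large, with the critical value chosen so that the significance level does not exceed α. The helper names are hypothetical and the test construction is an assumption made for illustration.

```python
from math import comb


def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_value(n, p0=0.5, alpha=0.05):
    """Smallest k with P(X >= k | p0) <= alpha; n + 1 means 'never reject'."""
    for k in range(n + 1):
        if binom_tail(n, k, p0) <= alpha:
            return k
    return n + 1


def power_at_boundary(n, p0=0.5, p_boundary=2 / 3, alpha=0.05):
    """Power of the one-sided test when the true p sits at the boundary value."""
    return binom_tail(n, critical_value(n, p0, alpha), p_boundary)


if __name__ == "__main__":
    for n in (10, 25, 50, 100):
        print(n, round(power_at_boundary(n), 3))
```

Because any alternative with p greater than 2/3 yields more power than the boundary value does, a sample size that is adequate at p = 2/3 is adequate for every alternative of interest.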
If the data do not come from one of the preceding distributions, then we might use a bootstrap to estimate the power and significance level.
In preliminary trials of a new device, the following test results were observed: 7.0 in 11 out of 12 cases and 3.3 in 1 out of 12 cases. Industry guidelines specified that any population with a mean test result greater than 5 would be acceptable. A worst-case or boundary-value scenario would include one in which the test result was 7.0 3/7th of the time, 3.3 3/7th of the time, and 4.1 1/7th of the time.
The statistical procedure required us to reject if the sample mean of the test results were less than 6. To determine the probability of this event for various sample sizes, we took repeated samples with replacement from the two sets of test results. Some bootstrap samples consisted of all 7’s, whereas some, taken from the worst-case distribution, consisted only of 3.3’s. Most were a mixture. Table 3.2 illustrates the results; for example, in our trials, 23% of the bootstrap samples of size 3 from our starting sample of test results had means less than 6. If, instead, we drew our bootstrap samples from the hypothetical “worst-case” population, then 84% had means less than 6.
TABLE 3.2 Power Estimates (proportion of bootstrap samples with test mean < 6)
Sample Size    Observed Results    Worst-Case Population
3              0.23                0.84
4              0.04                0.80
5              0.06                0.89
If you want to try your hand at duplicating these results, simply take the test values in the proportions observed, stick them in a hat, draw out bootstrap samples with replacement several hundred times, compute the sample means, and record the results. Or you could use the Stata™ bootstrap procedure as we did.¹
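The hat-drawing procedure can also be sketched in a few lines of Python. This is not the Stata procedure mentioned above; it is a minimal simulation that assumes the two populations described in the text, and the function name, the number of replications, and the seed are arbitrary choices made for the example.

```python
import random

# Test results in the proportions given in the text.
OBSERVED = [7.0] * 11 + [3.3]                # 7.0 in 11 of 12 cases, 3.3 in 1 of 12
WORST_CASE = [7.0] * 3 + [3.3] * 3 + [4.1]   # boundary population with mean 5


def prob_mean_below(values, n, cutoff=6.0, reps=20000, seed=0):
    """Estimate P(sample mean < cutoff) for samples of size n drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so that reruns give the same estimate
    hits = 0
    for _ in range(reps):
        sample = rng.choices(values, k=n)    # one draw from the hat, with replacement
        if sum(sample) / n < cutoff:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n,
              round(prob_mean_below(OBSERVED, n), 2),
              round(prob_mean_below(WORST_CASE, n), 2))
```

With enough replications the estimates should land close to the entries in Table 3.2, differing only by simulation error.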
Prepare for Missing Data
The relative ease with which a program like Stata or StatXact can produce a sample size may blind us to the fact that the number of subjects with which we begin a study may bear little or no relation to the number with which we conclude it.
A midsummer hailstorm, an early frost, or an insect infestation can lay waste to all or part of an agricultural experiment. In the National Institute on Aging’s first years of existence, a virus wiped out the entire primate colony, destroying a multitude of experiments in progress.
Large-scale clinical trials and surveys have a further burden, namely, the subjects themselves. Potential subjects can and do refuse to participate. (Don’t forget to budget for a follow-up study, bound to be expensive, of responders versus nonresponders.) Worse, they agree to participate initially, then drop out at the last minute (see Figure 3.1).
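One common way to allow for such losses is to inflate the computed sample size by the fraction of subjects expected to remain. The sketch below assumes that refusals and dropouts occur independently and at rates you can estimate in advance; the function name and the rates used are hypothetical.

```python
from math import ceil


def subjects_to_recruit(n_required, refusal_rate=0.0, dropout_rate=0.0):
    """Number of subjects to approach so that about n_required complete the study.

    Assumption: refusals and dropouts are independent, with known rates in [0, 1).
    """
    retained = (1 - refusal_rate) * (1 - dropout_rate)  # expected completing fraction
    if retained <= 0:
        raise ValueError("rates leave no subjects to analyze")
    return ceil(n_required / retained)


if __name__ == "__main__":
    # Hypothetical rates: 20% refuse to participate, 10% of the rest drop out.
    print(subjects_to_recruit(100, refusal_rate=0.20, dropout_rate=0.10))
```

The rates themselves are the hard part; they are best taken from earlier studies of similar populations rather than guessed.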