# Common Errors in Statistics and How to Avoid Them - Good P.I


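$$q = q(\hat F, F) = E_{x_0 \sim F}\, Q\bigl(y_0, h_{\hat F}(t_0)\bigr).$$

In addition, I call the quantity

$$\hat q_{\mathrm{app}} = q(\hat F, \hat F) = E_{x_0 \sim \hat F}\, Q\bigl(y_0, h_{\hat F}(t_0)\bigr) = \frac{1}{n} \sum_{i=1}^{n} Q\bigl(y_i, h_{\hat F}(t_i)\bigr)$$

the apparent error of $h_{\hat F}$. The difference

$$R(\hat F, F) = q(\hat F, F) - q(\hat F, \hat F)$$

is the excess error of $h_{\hat F}$. The expected excess error is

$$r = E_{\hat F \sim F}\, R(\hat F, F),$$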

where the expectation is taken over $\hat F$, which is obtained from $x_1, \dots, x_n$ generated by $F$. In Section 4, I will clarify the distinction between excess error and expected excess error. I will consider estimates of the expected excess error, although what we would rather have are estimates of the excess error.

I will consider three estimates (the bootstrap, the jackknife, and cross-validation) of the expected excess error. The bootstrap procedure for estimating $r = E_{\hat F \sim F}\, R(\hat F, F)$ replaces $F$ with $\hat F$. Thus

$$\hat r_{\mathrm{boot}} = E_{\hat F^* \sim \hat F}\, R(\hat F^*, \hat F),$$

where $\hat F^*$ is the empirical distribution function of a random sample $x_1^*, \dots, x_n^*$ from $\hat F$. Since $\hat F$ is known, the expectation can in principle be calculated. The calculations are usually too complicated to perform analytically, however, so we resort to Monte Carlo methods.

1. Generate $x_1^*, \dots, x_n^*$, a random sample from $\hat F$. Let $\hat F^*$ be the empirical distribution of $x_1^*, \dots, x_n^*$.

2. Construct $h_{\hat F^*}$, the realized prediction rule based on $x_1^*, \dots, x_n^*$.

3. Form

APPENDIX B EXCESS ERROR ESTIMATION IN FORWARD LOGISTIC REGRESSION 175

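$$R^* = q(\hat F^*, \hat F) - q(\hat F^*, \hat F^*) = \frac{1}{n} \sum_{i=1}^{n} Q\bigl(y_i, h_{\hat F^*}(t_i)\bigr) - \frac{1}{n} \sum_{i=1}^{n} Q\bigl(y_i^*, h_{\hat F^*}(t_i^*)\bigr) \tag{2.1}$$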

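4. Repeat steps 1-3 a large number $B$ of times to get $R_1^*, \dots, R_B^*$. The bootstrap estimate of expected excess error is the average of these replicates:

$$\hat r_{\mathrm{boot}} = \frac{1}{B} \sum_{b=1}^{B} R_b^*.$$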

See Efron (1982) for more details.

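The jackknife estimate of expected excess error is

$$\hat r_{\mathrm{jack}} = (n - 1)\bigl(R_{(\cdot)} - \hat R\bigr),$$

where $\hat F_{(i)}$ is the empirical distribution function of $(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$, and

$$R_{(i)} = R(\hat F_{(i)}, \hat F), \qquad R_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} R_{(i)}, \qquad \hat R = R(\hat F, \hat F).$$

Efron (1982) showed that the jackknife estimate can be reexpressed as

$$\hat r_{\mathrm{jack}} = \frac{1}{n} \sum_{i=1}^{n} Q\bigl(y_i, h_{\hat F_{(i)}}(t_i)\bigr) - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{i'=1}^{n} Q\bigl(y_{i'}, h_{\hat F_{(i)}}(t_{i'})\bigr).$$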

The cross-validation estimate of expected excess error is

$$\hat r_{\mathrm{cross}} = \frac{1}{n} \sum_{i=1}^{n} Q\bigl(y_i, h_{\hat F_{(i)}}(t_i)\bigr) - \frac{1}{n} \sum_{i=1}^{n} Q\bigl(y_i, h_{\hat F}(t_i)\bigr).$$

Let the training sample omit patients one by one. For each omission, apply the prediction rule to the remaining sample and count the number (0 or 1) of errors that the realized prediction rule makes when it predicts the omitted patient. In total, we apply the prediction rule n times and predict the outcome of n patients. The proportion of errors made in these n predictions is the cross-validation estimate of the error rate and is the first term on the right-hand side. [Stone (1974) is a key reference on cross-validation and has a good historical account. Also see Geisser (1975).]
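The leave-one-out estimates can be sketched as follows. Both share the same first term, the proportion of errors made when each omitted patient is predicted by the rule fitted without that patient. They differ in what is subtracted: cross-validation subtracts the apparent error of the rule fitted to the full sample, while the jackknife (in the sum form attributed to Efron) subtracts the average full-sample error of the n leave-one-out rules. As an assumption for illustration, `fit_rule` is a hypothetical one-covariate threshold rule and the loss is taken to be 0-1.

```python
def fit_rule(sample):
    """Hypothetical stand-in rule: threshold one covariate at the
    midpoint between the two class means."""
    ones = [t for t, y in sample if y == 1]
    zeros = [t for t, y in sample if y == 0]
    if not ones or not zeros:
        label = 1 if ones else 0  # degenerate sample: one class only
        return lambda t: label
    m1 = sum(ones) / len(ones)
    m0 = sum(zeros) / len(zeros)
    cut = (m0 + m1) / 2
    sign = 1 if m1 >= m0 else -1
    return lambda t: 1 if sign * (t - cut) > 0 else 0


def error_rate(rule, sample):
    """Average 0-1 loss of `rule` over `sample` (assumed form of Q)."""
    return sum(rule(t) != y for t, y in sample) / len(sample)


def leave_one_out_rules(sample, fit):
    """Rule i is fitted to the sample with observation i removed."""
    return [fit(sample[:i] + sample[i + 1:]) for i in range(len(sample))]


def leave_one_out_error(sample, rules):
    """Proportion of omitted observations that are mispredicted."""
    n = len(sample)
    return sum(rules[i](t) != y for i, (t, y) in enumerate(sample)) / n


def cross_validation_excess_error(sample, fit):
    rules = leave_one_out_rules(sample, fit)
    # Subtract the apparent error of the rule fitted to all n points.
    return leave_one_out_error(sample, rules) - error_rate(fit(sample), sample)


def jackknife_excess_error(sample, fit):
    rules = leave_one_out_rules(sample, fit)
    # Subtract the mean full-sample error of the n leave-one-out rules.
    mean_full = sum(error_rate(r, sample) for r in rules) / len(sample)
    return leave_one_out_error(sample, rules) - mean_full
```

Each estimate refits the rule n times, so the cost is roughly n fits, compared with one fit per replicate for a Monte Carlo bootstrap.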


3. CHRONIC HEPATITIS: AN EXAMPLE

We now discuss a real prediction rule. From 1975 to 1980, Peter Gregory (personal communication, 1980) of Stanford Hospital observed n = 155 chronic hepatitis patients, of whom 33 died from the disease. For each patient, p = 19 covariates were recorded, summarizing medical history, physical examinations, X rays, liver function tests, and biopsies. (Missing values were replaced by sample averages before further analysis of the data.) An effective prediction rule, based on these 19 covariates, was desired to identify future patients at high risk. Such patients require more aggressive treatment.

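Gregory used a prediction rule based on forward logistic regression. We assume $x_1 = (t_1, y_1), \dots, x_n = (t_n, y_n)$ are independent and identically distributed such that, conditional on $t_i$, $y_i$ is Bernoulli with probability of success $\theta(t_i)$, where $\operatorname{logit} \theta(t_i) = \beta_0 + t_i \beta$, and where $\beta$ is a column vector of $p$ elements. If $(\hat\beta_0, \hat\beta)$ is an estimate of $(\beta_0, \beta)$, then $\hat\theta(t_0)$, such that $\operatorname{logit} \hat\theta(t_0) = \hat\beta_0 + t_0 \hat\beta$, is an estimate of $\theta(t_0)$. We predict death if the estimated probability $\hat\theta(t_0)$ of death is greater than $\tfrac{1}{2}$:

$$h_{\hat F}(t_0) = \begin{cases} 1 & \text{if } \hat\theta(t_0) > \tfrac{1}{2}, \text{ i.e., } \hat\beta_0 + t_0 \hat\beta > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.1}$$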
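Rule (3.1) can be sketched directly. Because the logit is increasing and equals zero at probability one half, thresholding the estimated probability at 1/2 is the same as checking the sign of the linear predictor. The function names and the coefficient values used in the usage lines are hypothetical and are not estimates from Gregory's data.

```python
import math


def linear_predictor(t0, beta0_hat, beta_hat):
    """beta0_hat + t0 . beta_hat for one covariate vector t0."""
    return beta0_hat + sum(t * b for t, b in zip(t0, beta_hat))


def estimated_probability(t0, beta0_hat, beta_hat):
    """Inverse logit of the linear predictor, written to avoid overflow."""
    z = linear_predictor(t0, beta0_hat, beta_hat)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_death(t0, beta0_hat, beta_hat):
    """Rule (3.1): predict 1 (death) when the linear predictor is positive."""
    return 1 if linear_predictor(t0, beta0_hat, beta_hat) > 0 else 0


# Hypothetical coefficients, for illustration only.
beta0_hat, beta_hat = -1.0, [2.0, -0.5]
for t0 in ([1.0, 0.0], [0.2, 3.0]):
    print(t0, estimated_probability(t0, beta0_hat, beta_hat),
          predict_death(t0, beta0_hat, beta_hat))
```

Testing the sign of the linear predictor rather than comparing the probability with 0.5 avoids rounding effects when the linear predictor is extremely close to zero.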

Gregory’s rule for estimating $(\beta_0, \beta)$ consists of three steps.

1. Perform an initial screening of the variables by testing $H_0\colon \beta_j = 0$ in the simple logistic model, $\operatorname{logit} \theta(t_0) = \beta_0 + t_{0j} \beta_j$, for $j = 1, \dots, p$ separately at level $\alpha = 0.05$. Retain only those variables $j$ for which the test is significant. Applied to Gregory’s data, the initial screening retained 13 variables, 17, 12, 14, 11, 13, 19, 6, 5, 18, 10, 1, 4, 2, in increasing order of p-values.

2. To the variables that were retained in the initial screening, apply forward logistic regression that adds variables one at a time in the following way. Assume variables $j_1, j_2, \dots, j_{P_1}$ are already added to the model. For each remaining $j$, test $H_0\colon \beta_j = 0$ in the linear logistic model that contains variables $j_1, j_2, \dots, j_{P_1}, j$ together with the intercept. Rao’s (1973, pp. 417-420) efficient score test requires calculating the maximum likelihood estimate only under $H_0$. If the most significant variable is significant at $\alpha = 0.05$, we add that variable to the model as variable $j_{P_1 + 1}$ and start again. If none of the remaining variables is significant at $\alpha = 0.05$, we stop. From the aforementioned 13 variables, forward logistic regression applied to Gregory’s data chose four variables (17, 11, 14, 2) that are, respectively, albumin, spiders, bilirubin, and sex.
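The control flow of the screening and forward-addition steps above can be sketched without committing to a particular test implementation. In this sketch the two p-value functions are hypothetical callbacks supplied by the caller: `simple_p_value(j)` stands for the test in the single-covariate model, and `p_value_given(model, j)` stands for the score test of adding variable `j` to the current model. Neither is implemented here, and treating "significant" as a p-value strictly below `alpha` is an assumption.

```python
def screen(variables, simple_p_value, alpha=0.05):
    """Step 1: keep variables that are significant on their own,
    returned in increasing order of p-value."""
    scored = sorted((simple_p_value(j), j) for j in variables)
    return [j for p, j in scored if p < alpha]


def forward_select(candidates, p_value_given, alpha=0.05):
    """Step 2: repeatedly add the most significant remaining variable,
    stopping when none is significant at level alpha."""
    model = []
    remaining = list(candidates)
    while remaining:
        best_p, best_j = min((p_value_given(model, j), j) for j in remaining)
        if best_p >= alpha:
            break  # no remaining variable is significant: stop
        model.append(best_j)
        remaining.remove(best_j)
    return model


# Toy p-values, invented purely to exercise the control flow.
toy_simple = {1: 0.001, 2: 0.04, 3: 0.20, 4: 0.01}


def toy_given(model, j):
    # Pretend variable 4 carries no extra information once 1 is included.
    if 1 in model and j == 4:
        return 0.30
    return toy_simple[j]


retained = screen(toy_simple, toy_simple.get)
chosen = forward_select(retained, toy_given)
```

In the toy run, screening drops variable 3, and forward selection then adds variable 1 followed by variable 2 before stopping.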
