Rencher AC; Pun FC. Inflation of R2 in Best Subsets Regression, Technometrics, 1980; 22:49-53.
APPENDIX A A NOTE ON SCREENING REGRESSION EQUATIONS 171
Cross-Validation, the Jackknife, and the Bootstrap: Excess Error Estimation in Forward Logistic Regression
Given a prediction rule based on a set of patients, what is the probability of incorrectly predicting the outcome of a new patient? Call this probability the true error. An optimistic estimate is the apparent error, or the proportion of incorrect predictions on the original set of patients, and it is the goal of this article to study estimates of the excess error, or the difference between the true and apparent errors. I consider three estimates of the excess error: cross-validation, the jackknife, and the bootstrap. Using simulations and real data, the three estimates for a specific prediction rule are compared. When the prediction rule is allowed to be complicated, overfitting becomes a real danger, and excess error estimation becomes important. The prediction rule chosen here is moderately complicated, involving a variable-selection procedure based on forward logistic regression.
KEY WORDS: Prediction; Error rate estimation; Variables selection.
A common goal in medical studies is prediction. Suppose we observe n patients, x1 = (t1, y1), ..., xn = (tn, yn), where yi is a binary variable indicating whether or not the ith patient dies of chronic hepatitis and ti is a vector of explanatory variables describing various medical measurements
* Gail Gong is Assistant Professor, Department of Statistics, Carnegie-Mellon University, Pittsburgh, PA 15217.
Reprinted with permission by the American Statistical Association.
APPENDIX B EXCESS ERROR ESTIMATION IN FORWARD LOGISTIC REGRESSION 173
on the ith patient. These n patients are called the training sample. We apply a prediction rule h to the training sample x = (x1, ..., xn) to form the realized prediction rule hx. Given a new patient whose medical measurements are summarized by the vector t0, we predict whether or not he will die of chronic hepatitis by hx(t0), which takes on values death or not death. Allowing the prediction rule to be complicated, perhaps including transforming and choosing from many variables and estimating parameters, we want to know: What is the error rate, or the probability of predicting a future observation incorrectly?
A possible estimate of the error rate is the proportion of errors that hx makes when applied to the original observations x1, ..., xn. Because the same observations are used for both forming and assessing the prediction rule, this proportion, which I call the apparent error, underestimates the error rate.
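The optimism of the apparent error can be illustrated with a minimal sketch. It assumes a hypothetical nearest-class-mean rule (`fit_nearest_mean`) standing in for the forward logistic regression rule studied here, and simulated data in which the outcome is unrelated to the explanatory variables; none of these names or settings come from the article.

```python
import numpy as np


def fit_nearest_mean(t, y):
    """Hypothetical prediction rule h (not the forward logistic rule).

    Takes a training sample and returns the realized rule h_x, which assigns
    a new point to the class whose training mean is closest.
    """
    classes = np.unique(y)
    means = np.stack([t[y == c].mean(axis=0) for c in classes])

    def h_x(t_new):
        # Distance from every new point to every class mean.
        dists = np.linalg.norm(t_new[:, None, :] - means[None, :, :], axis=2)
        return classes[np.argmin(dists, axis=1)]

    return h_x


def apparent_error(fit, t, y):
    """Proportion of training observations the realized rule gets wrong."""
    h_x = fit(t, y)
    return float(np.mean(h_x(t) != y))


# Simulated data (an assumption for illustration): y is a fair coin flip
# drawn independently of t, so the true error of any rule is exactly 0.5.
rng = np.random.default_rng(0)
n, p = 40, 20
t = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)

app_err = apparent_error(fit_nearest_mean, t, y)
print(f"apparent error: {app_err:.3f} (true error is 0.5 by construction)")
```

Because each training point helps determine its own class mean, the rule tends to look better on the training sample than it can be on a new observation; the gap between the two is the excess error.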
To correct for this bias, we might use cross-validation, the jackknife, or the bootstrap for estimating excess errors (e.g., see Efron 1982). We study the performance of these three methods for a specific prediction rule. Excess error estimation is especially important when the training sample is small relative to the number of parameters requiring estimation, because the apparent error can be seriously biased. In the chronic hepatitis example, if the dimension of ti is large relative to n, we might use a prediction rule that selects a subset of the variables that we hope are strong predictors. Specifically, I will consider a prediction rule based on forward logistic regression. I apply this prediction rule to some chronic hepatitis data collected at Stanford Hospital and to some simulated data. In the simulated data, I compare the performance of the three methods and find that cross-validation and the jackknife do not offer significant improvement over the apparent error, whereas the improvement given by the bootstrap is substantial.
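Two of these estimates can be sketched in code. The sketch below is an illustration under stated assumptions, not the procedure used in this article: the formulas follow my reading of the general form in Efron (1982), the rule `fit_nearest_mean` is a hypothetical stand-in for forward logistic regression, and the number of bootstrap replications (`n_boot`) and the simulated data are arbitrary choices.

```python
import numpy as np


def fit_nearest_mean(t, y):
    """Hypothetical rule h: returns the realized nearest-class-mean rule."""
    classes = np.unique(y)
    means = np.stack([t[y == c].mean(axis=0) for c in classes])

    def h_x(t_new):
        dists = np.linalg.norm(t_new[:, None, :] - means[None, :, :], axis=2)
        return classes[np.argmin(dists, axis=1)]

    return h_x


def apparent_error(fit, t, y):
    """Proportion of training observations the realized rule gets wrong."""
    return float(np.mean(fit(t, y)(t) != y))


def cv_excess_error(fit, t, y):
    """Leave-one-out error rate minus the apparent error."""
    n = len(y)
    wrong = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        h = fit(t[keep], y[keep])  # rule formed without observation i
        wrong[i] = h(t[i:i + 1])[0] != y[i]
    return float(wrong.mean()) - apparent_error(fit, t, y)


def bootstrap_excess_error(fit, t, y, n_boot=200, seed=0):
    """Average over resamples of (error on original sample) minus
    (error on the bootstrap sample) for the rule formed on that resample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    excess = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # draw n points with replacement
        h_star = fit(t[idx], y[idx])
        wrong = (h_star(t) != y).astype(float)
        weights = np.bincount(idx, minlength=n) / n  # resample proportions
        excess[b] = np.sum((1.0 / n - weights) * wrong)
    return float(excess.mean())


# Simulated data (an assumption): outcome independent of the predictors.
rng = np.random.default_rng(1)
n, p = 40, 20
t = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)

print("apparent error:         ", apparent_error(fit_nearest_mean, t, y))
print("cross-validation excess:", cv_excess_error(fit_nearest_mean, t, y))
print("bootstrap excess:       ", bootstrap_excess_error(fit_nearest_mean, t, y))
```

Adding either excess error estimate to the apparent error gives a bias-corrected estimate of the error rate. The jackknife estimate is omitted from the sketch.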
A review of required definitions appears in Section 2. In Section 3, I discuss a prediction rule based on forward logistic regression and apply it to the chronic hepatitis data. In Sections 4 and 5, I apply the rule to simulated data. Section 6 concludes.
I briefly review the definitions that will be used in later discussions. These definitions are essentially those given by Efron (1982). Let x1 = (t1, y1), ..., xn = (tn, yn) be independent and identically distributed from an unknown distribution F, where ti is a p-dimensional row vector of real-valued explanatory variables and yi is a real-valued response. Let F̂ be the empirical distribution function that puts mass 1/n at each point x1, ..., xn. We apply a prediction rule h to this training sample and form the realized prediction
rule hF̂(t0). Let Q(y0, hF̂(t0)) be the criterion that scores the discrepancy between an observed value y0 and its predicted value hF̂(t0). The forms of both the prediction rule h and the criterion Q are given a priori. I define the true error of hF̂ to be the expected error that hF̂ makes on a new observation x0 = (t0, y0) from F,