Independent verification can also be obtained through the use of surrogate or proxy variables. For example, we may want to investigate past climates and test a model of the evolution of a regional or worldwide climate over time. We cannot go back directly to a period before direct measurements of temperature and rainfall were made, but we can observe the width of growth rings in long-lived trees or measure the amount of carbon dioxide in ice cores.
Splitting the sample into two parts—one for estimating the model parameters, the other for verification—is particularly appropriate for validating time series models where the emphasis is on prediction or reconstruction. If the observations form a time series, the more recent observations should be reserved for validation purposes. Otherwise, the data used for validation should be drawn at random from the entire sample.
Unfortunately, when we split the sample and use only a portion of it, the resulting estimates will be less precise.
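The splitting strategies above can be sketched in a few lines. This is a minimal illustration, not a prescription: the function name `split_sample` and the 25% default are ours, and the time-series branch simply reserves the end of the series, as the text recommends.

```python
import random

def split_sample(data, validation_fraction=0.25, time_series=False):
    """Split observations into a calibration set and a validation set.

    For a time series, the most recent observations are reserved for
    validation; otherwise the validation set is drawn at random.
    """
    n_valid = max(1, int(len(data) * validation_fraction))
    if time_series:
        # Hold out the end of the series (the most recent observations).
        return data[:-n_valid], data[-n_valid:]
    shuffled = random.sample(data, len(data))  # random order, no replacement
    return shuffled[n_valid:], shuffled[:n_valid]

# For 100 time-ordered observations, the last 25 form the validation set.
calib, valid = split_sample(list(range(100)), time_series=True)
```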
Browne suggests we pool rather than split the sample if:
1 For examples and discussion of AutoRegressive Integrated Moving Average processes, see Brockwell and Davis.
CHAPTER 11 VALIDATION 157
(a) The predictor variables to be employed are specified beforehand (that is, we do not use the information in the sample to select them).
(b) The coefficient estimates obtained from a calibration sample drawn from a certain population are to be applied to other members of the same population.
The proportion to be set aside for validation purposes will depend upon the loss function. If both the goodness-of-fit error in the calibration sample and the prediction error in the validation sample are based on mean-squared error, Picard and Berk report that we can minimize their sum by using between one-fourth and one-third of the sample for validation purposes.
A compromise proposed by Mosier is worth revisiting: The original sample is split in half; regression variables and coefficients are selected independently for each of the subsamples; if they are more or less in agreement, then the two samples should be combined and the coefficients recalculated with greater precision.
A further proposal by Subrahmanyam to use weighted averages where there are differences strikes us as equivalent to painting over cracks left by the last earthquake. Such differences are a signal to probe deeper, to look into causal mechanisms, and to isolate influential observations that may, for reasons that need to be explored, be marching to a different drummer.
We saw in the report of Gail Gong, reproduced in Appendix B, that resampling methods such as the bootstrap may be used to validate our choice of variables to include in the model. As seen in the last chapter, they may also be used to estimate the precision of our estimates.
But if we are to extrapolate successfully from our original sample to the population at large, then our original sample must bear a strong resemblance to that population. When only a single predictor variable is involved, a sample of 25 to 100 observations may suffice. But when we work with n variables simultaneously, sample sizes on the order of 25^n to 100^n may be required to adequately represent the full n-dimensional region.
Because of dependencies among the predictors, we can probably get by with several orders of magnitude fewer data points. But the fact remains that the sample size required for confidence in our validated predictions grows exponentially with the number of variables.
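To make the exponential growth concrete, a quick tabulation of the 25^n-to-100^n rule of thumb:

```python
# Observations needed to cover an n-dimensional predictor space at the
# same density as 25 to 100 observations cover a single dimension.
for n in range(1, 5):
    print(f"n={n}: {25 ** n:>10,} to {100 ** n:>12,}")
```

Already at n = 3 the upper end of the range is a million observations.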
Five resampling techniques are in general use:
1. K-fold, in which we subdivide the data into K roughly equal-sized parts, then repeat the modeling process K times, leaving one section out each time for validation purposes.
2. Leave-one-out, an extreme example of K-fold, in which we subdivide into as many parts as there are observations. We leave one observation out of our classification procedure and use the remaining n - 1 observations as a training set. Repeating this procedure n times, omitting a different observation each time, we arrive at a figure for the number and percentage of observations classified correctly. A method that requires this much computation would have been unthinkable before the advent of inexpensive, readily available high-speed computers. Today, at worst, we need only step out for a cup of coffee while our desktop completes its efforts.
3. Jackknife, an obvious generalization of the leave-one-out approach, where the number left out can range from one observation to half the sample.
4. Delete-d, where we set aside a random percentage d of the observations for validation purposes, use the remaining (100 - d)% as a training set, and then average over 100 to 200 such independent random samples.
5. The bootstrap, which we have already considered at length in earlier chapters.
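Techniques 1, 2, and 4 differ only in how the held-out indices are chosen, which a short sketch makes plain. The function names are ours, and the striding used to form the K parts is one of several reasonable choices:

```python
import random

def k_fold_indices(n, k):
    """Yield (training, validation) index lists for K-fold resampling;
    leave-one-out is the special case k == n."""
    folds = [list(range(i, n, k)) for i in range(k)]  # K roughly equal parts
    for held_out in folds:
        omit = set(held_out)
        yield [i for i in range(n) if i not in omit], held_out

def delete_d_indices(n, d_percent, repeats=100):
    """Delete-d: repeatedly hold out a random d% of the observations."""
    d = max(1, n * d_percent // 100)
    for _ in range(repeats):
        held_out = random.sample(range(n), d)
        omit = set(held_out)
        yield [i for i in range(n) if i not in omit], held_out
```

In each round the model would be refit on the training indices and scored on the held-out indices, with the scores averaged across rounds.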
The correct choice among these methods in any given instance is still a matter of controversy (though any individual statistician will assure you the matter is quite settled). See, for example, Wu and the discussion following, and Shao and Tu.