# Common Errors in Statistics and How to Avoid Them - P.I. Good


In the second regression, the t statistic for testing whether a coefficient vanishes is asymptotically distributed as

$$Z\sqrt{\frac{1-\alpha\rho}{1-g(\lambda)\rho}}\,, \qquad \text{where } Z \text{ is standard normal conditioned on } |Z| > \lambda.$$

These results may be interpreted as follows. The number of variables in the first-pass regression is $p_n = \rho n + o(n)$; the number in the second pass is $q_{n,\alpha} = \alpha\rho n + o(n)$. That is, as may be expected, the fraction $\alpha$ of the variables are significant at level $\alpha$. Since $g(\lambda) < 1$, the $R^2$ in the second-pass regression is essentially the fraction $g(\lambda)$ of the $R^2$ in the first pass. Likewise, $g(\lambda) > \alpha$, so the asymptotic value of the F statistic exceeds 1. Since the number of degrees of freedom is growing, off-scale P values will result. Finally, the real level of the t test may differ appreciably from the nominal level.

Example. Suppose $n = 100$ and $p = 50$, so $\rho = 1/2$; and $\alpha = 0.25$, so $\lambda = 1.15$. Then $g(\lambda) = 0.72$, and $E\{Z^2 \mid |Z| > \lambda\} = 2.9$. In a regression with 50 explanatory variables and 100 data points, on the null hypothesis $R^2$ should be nearly $1/2$.
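These constants are easy to check numerically. A minimal sketch in Python (standard library only; the helper names `tail`, `cutoff`, and `g` are ours, not the text's) recovers $\lambda$, $g(\lambda)$, and $E\{Z^2 \mid |Z| > \lambda\}$ from $\alpha$, using the standard-normal identity $\int_\lambda^\infty z^2\varphi(z)\,dz = \lambda\varphi(\lambda) + P(Z > \lambda)$:

```python
import math

def tail(x):
    """P(|Z| > x) for a standard normal Z."""
    return math.erfc(x / math.sqrt(2))

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def cutoff(alpha):
    """Two-sided cutoff: the lambda with P(|Z| > lambda) = alpha (bisection)."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if tail(mid) > alpha:   # tail is decreasing, so move up
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def g(lam):
    """E(Z^2; |Z| > lam) = 2 * (lam * phi(lam) + P(Z > lam))."""
    return 2 * (lam * phi(lam) + tail(lam) / 2)

alpha = 0.25
lam = cutoff(alpha)
print(round(lam, 2))             # 1.15
print(round(g(lam), 2))          # 0.72
print(round(g(lam) / alpha, 1))  # 2.9, i.e. E(Z^2 | |Z| > lam) = g(lam)/alpha
```

Note that $E\{Z^2 \mid |Z| > \lambda\}$ is just $g(\lambda)/\alpha$, since $P(|Z| > \lambda) = \alpha$.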

APPENDIX A: A NOTE ON SCREENING REGRESSION EQUATIONS

Next, run the regression again, keeping only the variables significant at the 25 percent level. The new $R^2$ should be around $g(\lambda) = 72$ percent of the original $R^2$. The new F statistic should be around

$$\frac{g(\lambda)/\alpha}{\bigl(1-g(\lambda)\rho\bigr)/(1-\alpha\rho)} = 4.0. \tag{10}$$

The number of degrees of freedom should be around $\alpha\rho n = 12$ in the numerator and $100 - 12 = 88$ in the denominator. (However, $q_{n,\alpha}$ is still quite variable, its standard deviation being about 3.) On this basis, a P value on the order of $10^{-4}$ may be anticipated.
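Plugging the constants into the F formula of display (10) reproduces the numbers quoted here. In the sketch below (our variable names, not the text's), `sd_q` is the binomial standard deviation behind the parenthetical remark about the variability of $q_{n,\alpha}$:

```python
import math

n, p = 100, 50
alpha, lam = 0.25, 1.1503          # P(|Z| > lam) = alpha
rho = p / n

phi = math.exp(-lam * lam / 2) / math.sqrt(2 * math.pi)
g = 2 * lam * phi + alpha          # g(lam) = E(Z^2; |Z| > lam), about 0.72

r2_second = g * rho                # second-pass R^2, about 0.36
F = (g / alpha) * (1 - alpha * rho) / (1 - g * rho)   # display (10)
sd_q = math.sqrt(p * alpha * (1 - alpha))  # binomial sd of the number kept

print(round(r2_second, 2), round(F, 1), round(sd_q, 1))  # 0.36 4.0 3.1
```

The expected number of kept variables is $\alpha p = 12.5$, so the degrees of freedom hover around 12 and 88, as stated.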

What about the t tests? Take $\lambda' > \lambda$, corresponding to level $\alpha' < \alpha$. The nominal level for the test is $\alpha'$, but the real level is

$$P\Bigl\{\,|Z| > \lambda'\sqrt{\tfrac{1-g(\lambda)\rho}{1-\alpha\rho}}\,\Bigr\}.$$

Since $g(\lambda) > \alpha$, it follows that $1 - \alpha\rho > 1 - g(\lambda)\rho$, so the real level exceeds the nominal level. Keep $\alpha = 0.25$, so $\lambda = 1.15$; take $\alpha' = 5$ percent, so $\lambda' = 1.96$; keep $\rho = 1/2$. Now

$$\lambda'\sqrt{\tfrac{1-g(\lambda)\rho}{1-\alpha\rho}} = 1.96\sqrt{\tfrac{0.64}{0.875}} \approx 1.67,$$

and the real level is 9 percent. This concludes the example.
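The 9 percent figure can be reproduced the same way. In the sketch below (our variable names), `shrink` is the limiting value of the residual scale in the second-pass regression, $\sqrt{(1-g(\lambda)\rho)/(1-\alpha\rho)}$, which is below 1 and so loosens the effective cutoff:

```python
import math

alpha, lam = 0.25, 1.1503    # screening level and its two-sided normal cutoff
lam2 = 1.96                  # cutoff for a nominal 5 percent two-sided test
rho = 0.5

phi = math.exp(-lam * lam / 2) / math.sqrt(2 * math.pi)
g = 2 * lam * phi + alpha    # g(lam), about 0.72

# The second-pass residual scale converges to shrink < 1, so the effective
# cutoff on a coefficient is lam2 * shrink rather than lam2.
shrink = math.sqrt((1 - g * rho) / (1 - alpha * rho))
real_level = math.erfc(lam2 * shrink / math.sqrt(2))  # P(|Z| > lam2 * shrink)
print(round(lam2 * shrink, 2), round(real_level, 2))  # 1.67 0.09
```

So a variable screened in at the 25 percent level is declared significant at the nominal 5 percent level about 9 percent of the time.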

Turn now to the proofs. Without loss of generality, suppose the $i$th column of $X$ has a 1 in the $i$th position and 0's everywhere else. Then

$$\hat\beta_i = Y_i \quad \text{for } i = 1, \ldots, p_n,$$

and the sum of squares for error in the first-pass regression corresponding to the model (1) is

$$\sum_{i=p_n+1}^{n} Y_i^2.$$

Thus

$$R_n^2 = \sum_{i=1}^{p_n} Y_i^2 \Big/ \sum_{i=1}^{n} Y_i^2,$$

and

$$1 - R_n^2 = \sum_{i=p_n+1}^{n} Y_i^2 \Big/ \sum_{i=1}^{n} Y_i^2.$$

Now (3) follows from the weak law of large numbers. Of course, $E(R_n^2)$ and $\operatorname{var} R_n^2$ are known: see Kendall and Stuart (1969).


To prove (10), note that the t statistic for testing $\beta_i = 0$ is $Y_i/s_n$, where

$$s_n^2 = \frac{1}{n - p_n}\sum_{i=p_n+1}^{n} Y_i^2.$$

Thus, column $i$ of $X$ enters the second regression iff $|Y_i/s_n| > t_{\alpha,n-p}$, the cutoff for a two-tailed t test at level $\alpha$, with $n - p_n$ degrees of freedom.

In what follows, suppose without loss of generality that $\sigma^2 = 1$. Given $s_n$, the events

$$A_i = \{\, |Y_i| > t_{\alpha,n-p}\, s_n \,\}, \qquad i = 1, \ldots, p_n,$$

are conditionally independent, with common conditional probability $F(t_{\alpha,n-p}\, s_n)$. Of course, $t_{\alpha,n-p} \to \lambda$ and $s_n \to 1$, so this conditional probability converges to $F(\lambda) = \alpha$. The number $q_{n,\alpha}$ of the events $A_i$ that occur is therefore

$$q_{n,\alpha} = \alpha p_n + o(p_n) = \alpha\rho n + o(n)$$

by (2). This can be verified in detail by computing the conditional expectation and variance.

Next, condition on $s_n$ and $q_{n,\alpha} = q$ and on the identity of the $q$ columns going into the second regression. By symmetry, suppose that it is columns 1 through $q$ of $X$ that enter the second regression. Then

$$F_{n,\alpha} = \frac{\frac{1}{q}\sum_{i=1}^{q} Y_i^2}{\frac{1}{n-q-1}\sum_{i=q+1}^{n} Y_i^2}.$$

It remains only to estimate $\sum_{i=1}^{q} Y_i^2$, to within $o(n)$. However, these $Y$'s are conditionally independent, with common conditional distribution: they are distributed as $Z$ given $|Z| > z_n$, where $Z$ is $N(0,1)$ and $z_n = t_{\alpha,n-p}\, s_n$. In view of (5), the conditional expectation of $\sum_{i=1}^{q} Y_i^2$ is

$$q_{n,\alpha}\; g(z_n)/F(z_n).$$

But $q_{n,\alpha} = \alpha\rho n + o(n)$ and $z_n \to \lambda$. So the last display is, up to $o(n)$,

$$\alpha\rho n\, g(\lambda)/\alpha = g(\lambda)\rho n.$$

Likewise, the conditional variance of $\sum_{i=1}^{q} Y_i^2$ is $q_{n,\alpha}\{2 + \eta(z_n)\} = O(n)$, where $\eta(z_n)$ stays bounded as $z_n \to \lambda$; the conditional standard deviation is $O(\sqrt{n})$. Thus

$$\sum_{i=1}^{q} Y_i^2 = g(\lambda)\rho n + o(n),$$

and

$$\frac{1}{q}\sum_{i=1}^{q} Y_i^2 = g(\lambda)/\alpha + o(1).$$

Now $\sum_{i=1}^{n} Y_i^2 = n + o(n)$; and in the denominator of $F_{n,\alpha}$,

$$\sum_{i=q+1}^{n} Y_i^2 = \sum_{i=1}^{n} Y_i^2 - \sum_{i=1}^{q} Y_i^2.$$

Hence

$$\frac{1}{n-q-1}\sum_{i=q+1}^{n} Y_i^2 = \frac{1 - g(\lambda)\rho}{1 - \alpha\rho} + o(1).$$

This completes the argument for the convergence in probability. The assertion about the t statistic is easy to check, using the last display.
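The asymptotics can also be spot-checked by simulation. The sketch below is our own illustration, not code from the text: it uses the orthonormal design from the proof (so each $\hat\beta_i = Y_i$) and the fixed asymptotic cutoff $\lambda = 1.1503$ in place of the exact cutoff $t_{\alpha,n-p}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 50, 1.1503
qs, r2s, fs = [], [], []
for _ in range(2000):
    y = rng.standard_normal(n)                 # pure noise: every beta is 0
    s = np.sqrt(np.sum(y[p:] ** 2) / (n - p))  # first-pass residual scale s_n
    keep = np.abs(y[:p] / s) > lam             # screen at the 25 percent level
    q = int(keep.sum())
    if q == 0:                                 # (vanishingly rare) nothing kept
        continue
    r2 = np.sum(y[:p][keep] ** 2) / np.sum(y ** 2)   # second-pass R^2
    f = (r2 / q) / ((1 - r2) / (n - q - 1))          # second-pass F
    qs.append(q); r2s.append(r2); fs.append(f)

print(round(float(np.mean(qs)), 1))   # kept variables: near alpha*p = 12.5
print(round(float(np.std(qs)), 1))    # spread of q: near 3
print(round(float(np.mean(r2s)), 2))  # near g(lam)*rho = 0.36
print(round(float(np.mean(fs)), 1))   # near 4
```

With $n = 100$ and $p = 50$, the screened regressions average about 12-13 kept variables, a second-pass $R^2$ near 0.36, and an F near 4, in line with the example above.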

REFERENCES

Diaconis P; Freedman D. “On the Maximum Difference Between the Empirical and Expected Histograms for Sums,” Pacific Journal of Mathematics, 1982; 100:287-327.

Freedman D. “Some Pitfalls in Large-Scale Econometric Models: A Case Study,” Journal of Business, 1981; 54:479-500.

Freedman D; Rothenberg T; Sutch R. “A Review of a Residential Energy End Use Model,” Technical Report No. 14, University of California, Berkeley, Dept. of Statistics, 1982.

Freedman D; Rothenberg T; Sutch R. “On Energy Policy Models,” Journal of Business and Economic Statistics, 1983; 1:24-32.

Judge G; Bock M. The Statistical Implications of Pre-Test and Stein-Rule Estimators in Econometrics, Amsterdam: North-Holland, 1978.

Kendall MG; Stuart A. The Advanced Theory of Statistics, London: Griffin, 1969.

Olshen RA. “The Conditional Level of the F-Test,” Journal of the American Statistical Association, 1973; 68:692-698.
