Sample Question #229 (statistics)

Let’s assume you have a good random sample of data that you can comfortably run OLS on. However, due to a coding bug, you accidentally duplicate every observation, so that each one appears twice in the dataset you feed into your OLS. For example, if the original data had been:

35 21 2 7
32 38 1 -2
93 74 10 61
…

where the first number on each line is the dependent variable and the rest are the independent variables, the erroneous dataset looks like:

35 21 2 7
35 21 2 7
32 38 1 -2
32 38 1 -2
93 74 10 61
93 74 10 61
…

(As you can see, every original input line was duplicated.)

When you run OLS on this faulty dataset, what happens to the regression estimates and statistics? Have they changed? If so, are they larger or smaller?

[A real phone interview question I was given]

ANSWER

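The biggest problem is with the t-stats of the coefficient estimates. The estimated residual variance of the erroneous dataset is almost the same as in the original dataset, but OLS now believes it has twice as many observations, so the standard errors shrink by a factor of about sqrt(2) and the t-stats grow by about sqrt(2), i.e. roughly 41% (not double). Exactly, with n original observations and k estimated coefficients, each t-stat is multiplied by sqrt((2n - k)/(n - k)). Obviously this is not good: the coefficients look more significant than the data justify.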

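Coefficient estimates and R-squared will remain the same: duplicating every row doubles X'X and X'y, and it doubles both the residual and the total sum of squares, so the factors of two cancel.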
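
If you want to check this numerically, here is a minimal sketch using NumPy. The simulated data, the sample size, the true coefficients and the helper name `ols` are all assumptions made up for illustration; they are not part of the original question. It fits OLS by hand on a sample and on the same sample with every row duplicated, then compares the results.

```python
import numpy as np

def ols(X, y):
    """Hypothetical helper: coefficients, standard errors, t-stats and R-squared for y ~ X."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    rss = resid @ resid
    s2 = rss / (n - k)                      # estimated residual variance
    se = np.sqrt(s2 * np.diag(XtX_inv))     # classical OLS standard errors
    r2 = 1.0 - rss / np.sum((y - y.mean()) ** 2)
    return beta, se, beta / se, r2

# Simulated data (assumed for illustration): intercept plus three regressors.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# The "bug": every row appears twice, back to back.
X_dup = np.repeat(X, 2, axis=0)
y_dup = np.repeat(y, 2)

beta, se, t, r2 = ols(X, y)
beta_dup, se_dup, t_dup, r2_dup = ols(X_dup, y_dup)

k = X.shape[1]
t_ratio = t_dup / t
expected_ratio = np.sqrt((2 * n - k) / (n - k))

print("coefficients equal:", np.allclose(beta, beta_dup))
print("R-squared equal:   ", np.isclose(r2, r2_dup))
print("t-stat ratios:     ", t_ratio)
print("sqrt((2n-k)/(n-k)):", expected_ratio)
```

Under these assumptions, every t-stat ratio should match sqrt((2n - k)/(n - k)), which is close to sqrt(2) for any reasonably large n.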