Sample Question #284 (programming case question)
You’re working with a large set of data purchased from an outside vendor. You need to write a program to check and clean the data. (For example, the data could be daily quotes for all the stocks in the world.)
- How do you check the data for errors? In other words, how do you check whether a price is "bad"?
- How do you clean the data of such errors? (Do you delete the "bad" prices, or do you do something else?)
[Lest you think I pulled this case question out of my lazy ass, let me assure you that this is a real-life interview question — and not an unpopular one]
Good question. This reminds me of the data an experimental scientist (e.g., an experimental physicist) has to grapple with. Two broad types of error creep into the data: random errors and systematic errors. There is little one can do about random errors; their mean is close to zero. Systematic errors, however, whose mean may not be zero, can be discovered and then corrected or compensated for. Systematic errors are due to faulty assumptions, whether originating from a human source or a computer program, whose effects propagate through the production of the data. In order to tell whether a set of data suffers from serious errors, we need to compare it with some standard: a clean, relatively error-free set of data. Systematic errors evince certain patterns that need to be discovered, and a programmer, like a detective, needs to spell out his/her hunches and write code to uncover such patterns. As the data are run through this code, none of the data should be discarded!
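One way to turn such a hunch into code is sketched below. This is only an illustration, not a method given in the question or the comment: it assumes a single stock's daily prices in a plain list, and it swaps in a Hampel-style check (compare each price with the median of its neighbours). The function name `flag_suspect_prices` and the parameters `half_window`, `k` and `min_rel_move` are made up for the example. In keeping with the point above, nothing is deleted; each price just gets a flag.

```python
from statistics import median


def flag_suspect_prices(prices, half_window=3, k=3.0, min_rel_move=0.05):
    """Return one boolean per price, True where the price looks suspect.

    Hypothetical helper (Hampel-style check): a price is flagged when it
    sits further from the median of its neighbours than both
      - k scaled median absolute deviations (MAD), and
      - min_rel_move times the local median (an assumed floor that stops
        tiny moves in a very quiet window from being flagged).
    Non-positive prices are always flagged.
    """
    flags = []
    n = len(prices)
    for i, price in enumerate(prices):
        # Window of up to half_window neighbours on each side.
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = prices[lo:hi]

        local_median = median(window)
        mad = median([abs(x - local_median) for x in window])

        # 1.4826 rescales the MAD to match a standard deviation
        # when the data are normally distributed.
        threshold = max(k * 1.4826 * mad, min_rel_move * abs(local_median))

        if price <= 0:
            flags.append(True)
        else:
            flags.append(abs(price - local_median) > threshold)
    return flags


# Made-up quotes: one misplaced decimal point and one zero price.
prices = [100.0, 101.0, 100.5, 1005.0, 101.5, 102.0, 101.0, 0.0, 102.5]
flags = flag_suspect_prices(prices)

# Keep every original value and carry the flag alongside it.
annotated = list(zip(prices, flags))
suspects = [p for p, bad in annotated if bad]
```

Here `suspects` holds the two planted errors, while `annotated` still has all nine prices. A check like this only catches isolated spikes; a systematic error (say, a whole series quoted in the wrong currency) would look locally consistent and would need a comparison against a trusted reference set instead.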