What are some ways to clean a dataset of outliers? Name at least three.

Tougher: If the dataset is huge, say with over 100 million observations, what would you do to identify and filter out the outliers?

(Hint: what is an "outlier"?)

    1) Visual inspection: plot the data and just eyeball data points that lie far away from the majority of the data.
    2) Divide the data into percentiles and identify cutoff thresholds (e.g. below the 1st or above the 99th percentile) beyond which values are too "extreme".
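    3) Draw repeated random sub-samples and recalculate the standard deviation until it’s "stable" from one sub-sample to the next, then flag values that lie more than a chosen number of standard deviations from the mean.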
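
Approaches 2) and 3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a definitive recipe: the function names (`percentile_bounds`, `stable_std`, `flag_by_sigma`), the 1st/99th percentile cutoffs, the 5% stability tolerance, the sub-sample size, and the 4-sigma threshold are all hypothetical choices made for the example. It uses only the standard library (Python 3.8+); on a dataset with 100 million rows one would likely reach for a vectorised library, but the sub-sampling idea stays the same.

```python
import random
import statistics


def percentile_bounds(values, lower_pct=1, upper_pct=99):
    """Return (low, high) cutoffs at the given whole-number percentiles (1..99)."""
    # quantiles(n=100) yields 99 cut points; cuts[i] is the (i + 1)-th percentile.
    cuts = statistics.quantiles(values, n=100)
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def stable_std(values, rng, sample_size=2_000, tol=0.05, max_rounds=50):
    """Sub-sample until two consecutive standard deviations agree within `tol`.

    Returns (mean, std) of the last sub-sample drawn. If the estimate never
    stabilises within `max_rounds`, the last estimate is returned anyway.
    """
    size = min(sample_size, len(values))
    prev_std = None
    for _ in range(max_rounds):
        sample = rng.sample(values, size)  # random sub-sample, no replacement
        mean = statistics.fmean(sample)
        std = statistics.stdev(sample)
        if prev_std is not None and abs(std - prev_std) <= tol * prev_std:
            break  # "stable" from one sub-sample to the next
        prev_std = std
    return mean, std


def flag_by_sigma(values, mean, std, k=4.0):
    """Return the values lying more than k standard deviations from the mean."""
    return [x for x in values if abs(x - mean) > k * std]
```

A typical use would be to call `percentile_bounds(data)` and keep only values between the two cutoffs, or to call `stable_std(data, random.Random(0))` and pass the resulting mean and standard deviation to `flag_by_sigma`. Because the standard deviation is estimated from small sub-samples, the second route never has to scan the full dataset to choose its threshold; only the final filtering pass touches every observation.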

