I read a recent interview with Hadley Wickham. Two things stood out to me. The first is how down-to-earth he seems, even given how well-known he is in the data science community. The second was this quote:

> Big data problems [are] actually small data problems, once you have the right subset/sample/summary. Inventing numbers on the spot, I’d say 90% of big data problems fall into this category.

## The technical side of sampling

Even if you don’t have huge data sets (defined for me personally as anything over 10GB or 5 million rows, whichever comes first), you usually run into situations where even a fast computer will process the data too slowly in memory (especially if you’re using R). It will go even slower if you’re processing the data remotely, as is usually the case when pulling it down from relational databases. (I’m not considering Hadoop or other NoSQL solutions in this post, since they’re a different animal entirely.)

In the cases where pulling down the data takes longer than running regressions on it, you’ll need to sample.

But how big a sample is big enough? As I’ve been working through a couple of rounds of sampling lately, I’ve found that there’s no standard rule of thumb, either in the data science community or in specific industries like healthcare and finance. The answer is, as always, “It depends.”

Before the rise of computer-generated data collection, statisticians had to work up to a large-enough sample. The question was, “I have to collect a lot of data. The process of collecting data, usually through paper surveys, will take a long time and be extremely expensive. How much data is enough to be accurate?”

Today, the question is the opposite: “How much of this massive amount of data that we’ve collected can we throw out and still be accurate?”

That was the question I was trying to answer a couple of weeks ago when I was working with a SQL table that had grown to 1+ billion rows.

## The business side of sampling

To understand the relationship between an entire population and a s...