Data Mining and Statistics: What’s the Connection?

This 1997 paper by Jerome H. Friedman questions the relevance of the emerging discipline of Data Mining to Statistics.

As someone who has come to Statistics from a data processing background, I see it as confirmation of how important a solid understanding of Statistics is in Data Mining, or Business Intelligence, or Big Data, or whatever other buzzword a marketer can conjure.

There are a lot of great quotes in this paper.

A difference between statisticians and computer scientists in this field seems to be that when a statistician has an idea he or she writes a paper; a computer scientist starts a company.

Slightly tongue-in-cheek, though technological understanding and entrepreneurial spirit seem to correlate.

DM can be done with ROLAP [Relational On-Line Analytical Processing] but it requires a sophisticated (domain knowledge) user who (according to Parsaye) “does not sleep or age”.

Around deadline time, sleep deprivation is familiar. But you definitely feel older after it.

A favorite quote of Chuck Dickens (former Director of Computing at SLAC) over the years has been “Every time computing power increases by a factor of ten we should totally rethink how and what we compute.” A corollary to this might be “Every time the amount of data increases by a factor of ten, we should totally rethink how we analyze it”.

We now have enough data to answer questions that were solved by intuition in the past, if only we can work out how to formulate the question.

As Brad Efron reminds us: “Statistics has been the most successful information science.” … “Those who ignore Statistics are condemned to reinvent it.”

However, as Jerome explains, it’s no longer the only information science. There are plenty of information workers without a good grasp of statistics, who either ignore it completely or fumble to replicate it. I want to do something about my own ignorance here.

John Tukey [Tukey (1962)] holds that Statistics ought to be concerned with data analysis. The field should be defined in terms of a set of problems rather than a set of tools, namely those problems that pertain to data.

I believe that everything variable is usefully expressed as data. I infer from this that statistics can be used to analyze just about everything.

… DM paradigms may also require modification. The DM community may have to moderate its romance with ‘big’. A prevailing attitude seems to be that unless an analysis involves gigabytes or terabytes of data, it cannot possibly be worthwhile.

Traditional transactional and analytical processing exhibits problems at scale that just don’t appear in small datasets. Big data does force you to innovate. However, big is a relative term; it’s all just data. Big or small, the right analysis can produce a useful result.

Sampling methodology … can be profitably used to improve accuracy while mitigating computational requirements. … A powerful computationally intense procedure may in fact produce superior accuracy than a less sophisticated one using the entire data base.

Restated: bigger is not always better. Choose the right algorithm instead of blindly throwing more compute or more data at the problem.
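
To make the sampling point concrete, here is a minimal sketch of my own (not from the paper). It assumes a made-up population of 200,000 normally distributed values and a hypothetical helper, sample_mean_estimate, and compares the mean of a 1% simple random sample against the mean of the full dataset.

```python
import random
import statistics


def sample_mean_estimate(data, sample_size, seed=0):
    """Estimate the mean of `data` from a simple random sample.

    Hypothetical helper for illustration. Returns (estimate, standard_error).
    """
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)  # sampling without replacement
    estimate = statistics.fmean(sample)
    # Standard error of the mean, ignoring the finite population correction.
    std_error = statistics.stdev(sample) / sample_size ** 0.5
    return estimate, std_error


# Synthetic "big" dataset: an assumption made purely for this illustration.
population_rng = random.Random(42)
population = [population_rng.gauss(100.0, 15.0) for _ in range(200_000)]

full_mean = statistics.fmean(population)                          # touches every row
estimate, std_error = sample_mean_estimate(population, 2_000)     # touches 1% of rows

print(f"full-data mean: {full_mean:.3f}")
print(f"sample mean:    {estimate:.3f} (standard error {std_error:.3f})")
```

The standard error shrinks with the square root of the sample size, not the size of the full dataset, so a modest sample already pins the mean down closely. The compute saved could then be spent on a more sophisticated procedure, which is the trade-off the quote describes.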

