Data Mining and Statistics: What’s the connection?

This 1997 paper by Jerome H. Friendman questions the relevance of the emerging discipline of Data Mining to Statistics.

As someone who has come to Statistics from a data processing background, I see it as a confirmation how important a solid understanding of Statistics is in Data Mining, or Business Intelligence, or Big Data, or whatever other buzzword a marketer can conjure.

There are a lot of great quotes in this paper.

A difference between statisticians and computer scientists in this field seems to be that when a statistician has an idea he or she writes a paper; a computer scientist starts a company.

Slightly tongue-in-cheek, though technological understanding and entrepreneurial spirit seem to correlate.

DM can be done with ROLAP [Relational On-Line Analytical Processing] but it requires a sophisticated (domain knowledge) user who (according to Parsaye) “does not sleep or age”.

Around deadline time, sleep deprevation is familiar. But you definitely feel older after it.

A favorite quote of Chuck Dickens (former Director of Computing at SLAC) over the yeras has been “Every time computing power increases by a factor of ten we should totally rethink how and what we compute.” A corollary to this might be “Every time the amount of data increases by a factor of ten, we should totally rethink how we analyze it”.

We now have enough data to answer questions that were solve by intuition in the past, if only we can work out how to formulate the question.

As Brad Efron reminds us: “Statistics has been the most successful information science.” … “Those who ignore Statistics are condemned to reinvent it.”

However, as Jerome explains, its no longer the only information science. There are plenty of information workers without a good grasp of statistics, who either ignore it completely or fumble to replicate it. I want to do something about my own ignorance here.

John Tukey [Tukey (1962)] holds that Statistics ought to be concerned with data analysis. The field should be defined in terms of a set of problems rather than a set of tools, namely those problems that pertain to data.

I believe that everything variable is usefully expressed as data. I infer from this that statistics can be used to analyze just about everything.

… DM paradigms may also require modification. The DM community may have to moderate its romance with ‘big’. A prevailing attitude seems to be that unless an analysis involves gigabytes or terabytes of data, it cannot possibly be worthwhile.

Traditional tansactional and analytical processingexhibits problems at scale that just don’t appear in small datasets Big data does force you to innovate. However, big is a relative term; it’s all just data. Big or small, the right analysis can produce a useful result.

Sampling methodology … can be profitably used to improve accuracy while mitigating computational requirements. … A powerful computationally intense procedure may in fact produce superior accuracy than a less sophisticated one using the entire data base.

Restated: bigger is not always better. Choose the right algorithm instead of blindly throwing more compute or more data at the problem.

Advertisements

Restarting SQL Server to fix msdb

My laptop screwed up badly today. After several failed restarts, my desktop was stable and somewhat responsive, so I tried to continue normally.

SSMS’s Object Explorer is now broken. Expanding any node produces this error message:

An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)

Database ‘msdb’ cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details. (Microsoft SQL Server, Error: 945)

To fix this, I had to restart SQL Server.

Start a new PowerShell session as Administrator and run the following:

Stop-Service MSSQLSERVER -Force
Start-Service MSSQLSERVER

Restart SSMS and the Object Explorer will work again.

Fixing Notepad++’s “Load langs.xml failed!” Error

“Load langs.xml failed!” is a really annoying error from Notepad++ that means syntax highlighting is broken.

langs.xml is a file Notepad++ uses to define the syntax of supported languages. Copies of the file live in more than one place.

Sean Walther’s instructions to fix the “Load langs.xml failed!” message have saved me once. In short, delete langs.xml from the Notepad++ application folder and replace it with a copy of langs.model.xml from the same folder before restarting Notepad++.

Today I had to try a different fix. Start a new cmd session and run the following command:

del %USERPROFILE%\AppData\Roaming\Notepad++\langs.xml

This deletes a copy of langs.xml from your profile’s Notepad++ configuration.

When you restart Notepad++, it should recreate a valid langs.xml in your profile, and syntax highlighting should work again.

Working with multiple versions of SMO in PowerShell

 

There are lots of ways to load the SMO assembly in PowerShell.

You can use the PowerShell V1 method:

[void] [Reflection.Assembly]::LoadWithPartialName('Microsoft.SqlServer.Smo');

This loads the most recent version of the assembly. It doesn’t provide a way to specify a version. For these reasons, the method has been obsoleted in the .NET API for a long time.

The PowerShell V2 way is to use the Add-Type cmdlet. According to Max Trinidad, this works well in an environment with one version of SQL Server:

Add-Type -AssemblyName 'Microsoft.SqlServer.Smo';

I don’t work in such an environment, so the above command fails with an error.

You can specify the version to the Add-Type cmdlet, but you also have to specify the Culture and the PublicKeyToken of the underlying assembly.

I don’t know much about how .NET assemblies work. I do know that for this particular assembly, the Culture is always neutral, and the PublicKeyToken is always 89845dcd8080cc91.

Because the only attribute that varies is the version number, we can use this simple function to load an assembly of a specific version:

function Add-Smo ($Version) {

  Add-Type -AssemblyName "Microsoft.SqlServer.Smo, Version=$Version, Culture=neutral, PublicKeyToken=89845dcd8080cc91";

};

 

Add-Smo -Version '10.0.0.0';

 

New-Object 'Microsoft.SqlServer.Management.Smo.Server';

In my environment, the first few lines of output of the above script looks like this:

 

 

AuditLevel                  : Failure

BackupDirectory             : C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Backup

BrowserServiceAccount       : NT AUTHORITY\LOCAL SERVICE

BrowserStartMode            : Disabled

BuildClrVersionString       : v2.0.50727

BuildNumber                 : 5500

Collation                   : Latin1_General_CI_AS

CollationID                 : 53256

ComparisonStyle             : 196609

...

In my environment, the valid values of the Version parameter are ‘10.0.0.0’ and ‘11.0.0.0’. For you, it may vary.

To use this in another script, you might just want to copy the function body and hard-code the version number you need.

I don’t know how to create PowerShell function libraries yet, but I might like to add this to my own library of SQL Server functions.

RFC 5646: Tags for Identifying Languages

ISO 639 language codes are great, but they are not enough to specify all the cultural and regional differences that distinguish a language.

The ISO 639-1 code ‘en’ means English. But there are many forms of English, both de jure and de facto. How do you distinguish among British English, American English, Candian English, Indian English, and Austrialian English?

RFC 5646 is a standard for identifying not just languages like English (en), French (fr), but also the cultural things that might make them distinct, like country or script. For example, you can represent Brazilian Protuguese usinh pt-BR, and Russian written in a Latin script using ru-Latn.

Microsoft uses codes like these in its own products and calls them culture codes.

It’s a big document, though. Might take a while to understand.