Base Table

Iain Elder's data development blog

Ad hoc update to system catalogs is not supported.

Today someone handed over to me a server that was partially set up so that I could finish the task.

One thing yet to configure was the Database Mail settings. We use execute script on each new server to configure Database Mail so that it can send server alerts to our inboxes.

The first part of the script is simple enough. It reconfigures SQL Server to enable the database mail stored procedures.

The batch looks like this:

EXEC msdb.dbo.sp_configure
@configname = ‘Database Mail XPs’,
@configvalue = 1;

RECONFIGURE;

Today the RECONFIGURE statement failed with an error I’ve never seen before. The output of the batch looked like this:

Configuration option ‘Database Mail XPs’ changed from 0 to 1. Run the RECONFIGURE statement to install.
Msg 5808, Level 16, State 1, Line 2
Ad hoc update to system catalogs is not supported.

You normally see an error like this when you try to do something silly like directly modify the system tables:

DELETE FROM sys.objects;

The server will respond with a similar but distinct error:

Msg 259, Level 16, State 1, Line 1
Ad hoc updates to system catalogs are not allowed.

This error is a good thing. It protects us from making careless updates that trash the system.

But why should I see this error when I call sp_configure? By using the procedure, I’m asking the server to modify system tables on my behalf.

Nakul Vachhrajani at [Beyond Relational](http://beyondrelational.com/modules/2/blogs/77/posts/11282/ad-hoc-update-to-system-catalogs-is-not-supported-when-using-reconfigure.aspx) has faced the same problem, and documented how to fix it.

There is a deprecated SQL Server setting called ‘[allow updates](http://msdn.microsoft.com/en-us/library/ms179262.aspx)’ that is set to 0 by default. When it is set to 1, it causes the RECONFIGURE statement to fail.

To fix the issue, I executed this batch on the server:

EXECUTE sys.sp_configure
@configname = ‘allow updates’,
@configvalue = 0;

RECONFIGURE;

It produced the usual output message, and no errors:

Configuration option ‘allow updates’ changed from 1 to 0. Run the RECONFIGURE statement to install.

Why didn’t the reconfigure statement raise an error this time? I’m unsure, but it looks like the effect of sp_configure takes place immediately for this setting.

Querying sys.configurations shows ‘allow updates’ is now set to 0:

SELECT *
FROM sys.configurations
WHERE name = N’allow updates’;

configuration_id name value minimum maximum value_in_use description is_dynamic is_advanced
102 allow updates 0 0 1 0 Allow updates to system tables 1 0

Now I can set up Database Mail!

Warehouse-Scale Computing

Google Distinguished Engineer Luiz André Berroso talks first to Sigmod and later the ACM about Warehouse-Scale Computing.

You often hear how Google has multiple massive data centers across the world. He prefers to think of each one as a massive computer supporting a range of Internet services.

He introduces the concept of PUE (Power Use Effectiveness), a ratio that Google uses to measure how power efficient its computers are. It’s the amount of power spent on computation over the total amount of power spent on the computer.

I’ve seen all the Sigmod talk and I’m about halfway through the ACM talk.

Some good quotes (slightly paraphrased):

“I’m not sure what the cloud is, but I think it’s got something to do with my talk.”

“If you have things at the petabyte scale, you can think about them as a data center. A friend of mine recently got paged by the system because there were only a few petabytes of space left. That’s the kind of warehouse scale we’re talking about here.”

Ola Hallengren’s Maintenance Solution

A colleague asked me today to help him solve a maintenance issue that is affecting performance in the data warehouse.

Our server’s indexes are heavily fragmented and our statistics are months out of date in some places. Slow data retrieval makes it more difficult for the BI team to meet its operational KPIs because it takes longer to process data.

Well-organized indexes and fresh statistics are very important for quick data retrieval.

Without regular maintenance of these objects, SQL Server still works, but it eventually retrieves data very slowly as it rummages around the messy data. Think of trying to find a book in a library when all the books are scattered on the floor.

So what’s a busy DBA to do? Leverage the work of others: use Ola Hallengren’s Maintenance Solution!

Intelligent Index Maintenance

The SQL Server Maintenance Solution lets you intelligently rebuild or reorganize only the indexes that are fragmented. In the IndexOptimize procedure, you can define a preferred index maintenance operation for each fragmentation group. Take a look at this code:

EXECUTE dbo.IndexOptimize @Databases = ‘USER_DATABASES’,

@FragmentationLow = NULL,

@FragmentationMedium = ‘INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE’,

@FragmentationHigh = ‘INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE’,

@FragmentationLevel1 = 5,

@FragmentationLevel2 = 30

In this example, indexes that have a high fragmentation level will be rebuilt, online if possible. Indexes that have a medium fragmentation level will be reorganized. Indexes that have a low fragmentation level will remain untouched.

Update Statistics

The IndexOptimize procedure can also be used to update statistics. You can choose to update all statistics, statistics on indexes only, or statistics on columns only. If an index is rebuilt, the statistics is not being updated. You can also choose to update the statistics only if any rows have been modified since the most recent statistics update.

EXECUTE dbo.IndexOptimize @Databases = ‘USER_DATABASES’,

@FragmentationLow = NULL,

@FragmentationMedium = ‘INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE’,

@FragmentationHigh = ‘INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE’,

@FragmentationLevel1 = 5,

@FragmentationLevel2 = 30,

@UpdateStatistics = ‘ALL’,

@OnlyModifiedStatistics = ‘Y’

Ola Hallengren is a SQL Server DBA working for a Scandinavian bank. He gave a talk about his maintenance solution at SQLBits a couple of years ago.

I’m keen to try it out. If it’s good enough for all these companies that you’ve possibly heard of, then it’s good enough for me.

Plus, it’s free!

Don’t install Perl in “Program Files”

The Perl module HTML::TableExtract looks good for extracting table data from HTML tables.

I just need to install Perl to try it out. On Windows, I have a choice between Strawberry Perl and Active Perl.

Strawberry Perl actively forbids you from installing it in a directory whose full path contains “spaces or non-ASCII” characters. Try to override the default C:\strawberry with a more Windows-friendly C:\Program Files\Strawberry Perl to see an error like this:

"Stawberry Perl (64-bit) cannot be installed in a directory with spaces or non-ASCII characters.

After scoffing at this and installing ActivePerl, I now understand why Strawberry Perl was so defensive.

ActivePerl installed to C:\Program Files\ActivePerl with no complaint. Everything was going totally smoothly, until the time came to install the killer app using cpan.

Set up gcc environment - gcc.exe (rubenvb-4.5.4) 4.5.4
Checking if your kit is complete...
Looks good
Unable to find a perl 5 (by these names: C:\Program Files\ActivePerl\bin\perl.ex
e perl.exe perl5.exe perl5.16.1.exe miniperl.exe, in these dirs: . C:\Program Fi
les\ActivePerl\site\bin C:\Program Files\ActivePerl\bin C:\Windows\System32 C:\W
indows C:\Windows\System32\Wbem C:\Windows\System32\WindowsPowershell\v1.0 C:\Pr
ogram Files\TortoiseHg C:\Program Files\TortoiseSVN\bin C:\Program Files\Java\jd
k1.7.0_09\bin C:\Program Files\Python27 C:\Program Files\ActivePerl\bin)
Writing Makefile for HTML-TableExtract
Writing MYMETA.yml and MYMETA.json
dmake.exe:  Error: -- `C:\Program' not found, and can't be made
  MSISK/HTML-TableExtract-2.11.tar.gz
  C:\PROGRA~1\ACTIVE~1\site\bin\dmake.exe -- NOT OK
Running make test
  Can't test without successful make
Running make install
  Make had returned bad status, install seems impossible

Strawberry Perl tried to protect me from what could have been a frustrating evening, but made itself look lame instead of calling out dmake as a flaky tool.

I’m going to try again with ActivePerl.

Remove “Choose an operating system” from startup

I want Windows Server 2012 to start quickly when I boot.

When I turn on my computer I see a screen like this:

Choose an operating system to start, or press tab to select a tool:
(Use the arrow keys to highlight your choice, then press ENTER.)

Windows Server 2012
Continue with install process

Windows Server 2012 starts automatically after 30 seconds. If I don’t want to wait that long I have to press enter.

If I choose Continue with install process, I see a screen like this:

Windows failed to start. A recent hardware or software change might be the cause.
1. Insert your Windows installation disc and restart your computer.
2. Choose language settings, and then chick “Next.”
3. Click “Repair your computer.”

If you do not have this disc, contact your system administrator or computer manufacturer for assistance.

This is all that’s left of almost installing Debian on my laptop. I thought this would be gone when the Windows Server 2012 installer formatted my hard drive, but I guess the boot record is stored in a special place that survives the format.

The rest of this post describes how to remove the screen and start Windows automatically.

Start Command Prompt as Administrator and run bcedit.

C:\Windows\system32>bcdedit

Windows Boot Manager
——————–
identifier              {bootmgr}
device                  partition=\Device\HarddiskVolume1
description             Windows Boot Manager
locale                  en-US
inherit                 {globalsettings}
bootshutdowndisabled    Yes
default                 {current}
resumeobject            {c63af640-219a-11e2-b236-83f141b7b624}
displayorder            {current}
                        {30071991-e553-11df-bc3e-a35baee11a34}
toolsdisplayorder       {memdiag}
timeout                 30

Windows Boot Loader
——————-
identifier              {current}
device                  partition=C:
path                    \Windows\system32\winload.exe
description             Windows Server 2012
locale                  en-US
inherit                 {bootloadersettings}
recoverysequence        {7561fe3c-2159-11e2-93e8-0c60761e3c34}
recoveryenabled         Yes
allowedinmemorysettings 0×15000075
osdevice                partition=C:
systemroot              \Windows
resumeobject            {c63af640-219a-11e2-b236-83f141b7b624}
nx                      OptOut

Real-mode Boot Sector
———————
identifier              {30071991-e553-11df-bc3e-a35baee11a34}
device                  partition=C:
path                    \g2ldr.mbr
description             Continue with install process

Type bcdedit /? to get information about all the options available.

Type bcdedit /? /delete to get more information about deleting an entry.

To delete the Continue with install process entry, type bcdedit /delete {30071991-e553-11df-bc3e-a35baee11a34}.

C:\Windows\system32>bcdedit /delete {30071991-e553-11df-bc3e-a35baee11a34}
The operation completed successfully.

Type bcdedit again to confirm.

C:\Windows\system32>bcdedit

Windows Boot Manager
——————–
identifier              {bootmgr}
device                  partition=\Device\HarddiskVolume1
description             Windows Boot Manager
locale                  en-US
inherit                 {globalsettings}
bootshutdowndisabled    Yes
default                 {current}
resumeobject            {c63af640-219a-11e2-b236-83f141b7b624}
displayorder            {current}
toolsdisplayorder       {memdiag}
timeout                 30

Windows Boot Loader
——————-
identifier              {current}
device                  partition=C:
path                    \Windows\system32\winload.exe
description             Windows Server 2012
locale                  en-US
inherit                 {bootloadersettings}
recoverysequence        {7561fe3c-2159-11e2-93e8-0c60761e3c34}
recoveryenabled         Yes
allowedinmemorysettings 0×15000075
osdevice                partition=C:
systemroot              \Windows
resumeobject            {c63af640-219a-11e2-b236-83f141b7b624}
nx                      OptOut

The output shows Windows Server 2012 as the only entry.

Restart the computer to go through the boot process again.

The computer should now boot Windows without a prompt.

Path Editor

At some point in your career as a developer or an administrator, you’re going to care about the value of the Path environment variable.

For most people, it looks like this:

Edit System Variable Interface

The default tool for manipulating the variable is painful to use. It forces you to manipulate the value as a string, it performs no validation on the input, and it won’t even show you the whole value you’re working with.

I shouldn’t have to manipulate the Path as a string just because it’s serialized as one. Logically, the Path object is a sequence of directories.

With Redfern Place’s Path Editor, you can edit the Windows Path environment variable with an interface that better reflects the logical model.

For me, the Path looks like this:

Path Editor Interface

To insert a new item, context-click below the existing list entries and click ‘Insert New Item’:

Insert_New_Item

Type in the new expression to add to the path:

Type Item Name

Here, I’m adding Java binaries to my Path by using the JAVA_HOME environment variable I defined earlier. Because I’ve referred to the value of another environment variable by name, the shell will evaluate the Path variable at execution time to derive the true absolute path.

This means if I change JAVA_HOME because I change my version of Java, then I just have to update JAVA_HOME and the Path is updated for free!

Path Editor validates my input and shows me a green tick:

New_Directory_is_Valid

So I know I did it right.

Hit ‘Save to Registry’ (it can take a few seconds to finish saving), open a new Command Prompt, and type ‘java’ to test:

C:\Users\Iain>java -version
java version “1.7.0_09″
Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode)

Success!

Use handle to disconnect external storage safely… sometimes

The problem

Yesterday I received my shiny Samsung M3 1TB Portable Hard Drive from Amazon to solve my storage problems (I hoard MP3s). It looks like this:

Samsung M3 1TB Slimline

A static picture does little justice to the case. It’s not covered by a tacky plaid pattern; the surface is all angular. You should see the light reflects beautifully when you wobble it about.

Today I plugged it in to my workstation to transfer a few personal downloads. When I had finished, I played the responsible user and safely disconnected the device before yanking the USB cable from the socket.

In Windows 7, you do that by context-clicking the system tray icon that looks like a USB connector and choosing ‘Eject Samsung M3 Portable’ or whatever matches the name of your device:

Eject_Samsung_M3_Portable_ContextItem

Instead of rewarding me with a signal that I could now yank my device from the socket, Windows gave an impudent error dialog that declares “This device is currently in use”:

Problem_Ejecting_USB_Mass_Storage_Device_Dialog

The dialog indirectly warns me that I could potentially trash data by ejecting the device prematurely. However, the proposed resolution to “Close any programs or windows that might be using the device, and then try again.” is vague, and that sucks.

I multitask a lot at my workstation, and have open files all over the place. I don’t feel like wading through each window to find the one that won’t let go.

A solution

Thankfully, you can make up for the dialog’s shortcomings with Mark Russinovich’s awesome Handle utility. The following steps assume you have put handle.exe in your Windows folder so that it is on the path.

Start a new PowerShell session and use handle to search for processes that have a file open anywhere on the device. For me, the root directory of my device is I:\, so the command and output look like this:

PS Z:\> handle I:\

 

Handle v3.5

Copyright (C) 1997-2012 Mark Russinovich

Sysinternals - www.sysinternals.com

 

System             pid: 4      type: File          2990: I:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000002

System             pid: 4      type: File          3464: I:\$Extend\$RmMetadata\$Txf

System             pid: 4      type: File          3DB8: I:\$Extend\$RmMetadata\$TxfLog\$TxfLog.blf

System             pid: 4      type: File          3F9C: I:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000001

WINWORD.EXE        pid: 6748   type: File           100: I:\Iain Elder.docx

After the copyright notice, each line of output represents a file handle – a process that has opened a file.

The last line of output is the useful one here. It means that the file I:\Iain Elder.docx, a copy of my CV, is open in Microsoft Word, whose executable name is WINWORD.EXE.

Make sure you’re finished working with the file (I have) and then close the file. I’ve got no other documents open in Word, so I can just close Word from the task bar like this:

Close_window_ContextItem

Go back to the PowerShell session and repeat the previous command. You should see one less line of output:

PS Z:\> handle I:\

 

Handle v3.5

Copyright (C) 1997-2012 Mark Russinovich

Sysinternals - www.sysinternals.com

 

System             pid: 4      type: File          2990: I:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000002

System             pid: 4      type: File          3464: I:\$Extend\$RmMetadata\$Txf

System             pid: 4      type: File          3DB8: I:\$Extend\$RmMetadata\$TxfLog\$TxfLog.blf

System             pid: 4      type: File          3F9C: I:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000001

The handle output now shows that only Windows’ internal System process has opened files on the device. I’m not sure what the files in I:\$Extend\$RmMetadata are for, but they looks like something important to Windows.

If your output shows more lines than this, then continue to close files until only the System process holds open files.

If you try eject the device again, you should see a popup indicating successful removal like this:

Safe_To_Remove_Hardware_Popup

You can now safely yank the cable!

A problem with the solution

Sometimes, even though the System process is the only one holding opening files, you will still see the error dialog when you try to eject the device.

Right now the same handle command produces similar output, but it shows that the System process now also has a handle on the root directory of the device:

PS Z:\> handle I:\

 

Handle v3.5

Copyright (C) 1997-2012 Mark Russinovich

Sysinternals - www.sysinternals.com

 

System             pid: 4      type: File          2200: I:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000001

System             pid: 4      type: File          314C: I:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000002

System             pid: 4      type: File          3670: I:\$Extend\$RmMetadata\$Txf

System             pid: 4      type: File          3AAC: I:\

System             pid: 4      type: File          5C80: I:\$Extend\$RmMetadata\$TxfLog\$TxfLog.blf

The System is greedily and unjustifiably  hogging my device. I want it back, even if I have to kill the System process to do that.

The System process starts before I get to control the computer, so it’s not really mine to kill. We can be kinder by using handle to break its fingers instead of killing it outright.

The handle utility lets you close a file handle by force if you give it a couple of magic numbers. In each line of output, the number after pid: is the ID of the process, and the number before the file location is the ID of the handle.

From the above output, you can close the file handle of the System process on the root directory of the device using handle like this:

PS Z:\> handle -c 3AAC -p 4

 

Handle v3.5

Copyright (C) 1997-2012 Mark Russinovich

Sysinternals - www.sysinternals.com

 

 3AAC: File  (RW-)   I:\

Close handle 3AAC in System (PID 4)? (y/n)

Because forcing a file handle to close is a potentially dangerous operation, handle asks you to confirm the action before committing it.

I’m feeling lucky, so I press y then return. The remaining output looks like this:

Error closing handle:

T

Bugger.

After a little searching, it seems that this is a well-known problem on TechArena and Talking Technical, and it’s existed since Windows XP.

I tried the close-and-reopen-Explorer workaround on TechArena. It didn’t affect the output of handle and I still can’t eject my device.

I tried searching for services as suggested on Talking Technical, but I couldn’t find any.

I don’t have time left to try anything else.The only thing left to do for now is to shut down the computer before removing the drive.

Kinda defeats the purpose of having a removable drive, eh?

RoadRunner

David Embley replied to say the the code for his table-understanding tool is not available.

A small disappointment, but there’s plenty more to discover. His effort is just one in a whole community of people working towards automatic data extraction.

RoadRunner is a tool developed by the database groups at Università di Roma Tre and Università della Basilicata for automating the generation of ‘wrappers’ (or scrapers) to extract data from HTML pages.

The source code of a prototype system is realeased under  GPL licence.

I’m looking forward to using RoadRunner with my growing dataset.

Of the 28,000 pages I predict I will collect, I have about 3,000.

I’ve realised that starting a new session of HTTrack overwrites the log of the previous session, so I’ve started copying out the logs to a safe place for later analysis.

A Table-Understanding Tool

HTTrack is solving my data-collection problem with aplomb, so I’ve started thinking about how to turn the raw HTML into something I can analyze more easily.

Computer scientists David Embley and Cui Tao in 2005 published a paper called “Automating the Extraction of Data from HTML Tables with Unknown Structure”.

They describe a technique that allows one extract data from a semi-structured source with an unknown schema by mapping it to a known schema using a expert’s ontology of the information domain.

The paper shows impressive practical results for the “table-understanding tool”. If I can acquire a copy, I’ll be keen to try it out on my own collection of HTML tables.

Use a crawler instead of writing a custom scraper

My council tax band analysis project is moving again. I’m trying to tackle the extraction part of the project from a different angle using different tools.

My first attempt to solve the problem was to write a Python program using the Scrapy framework to crawl the source site and parse out all the data I wanted to analyse. My program was a little too aggressive for the source site, and I only got part of the data before getting shut out. So now I have to get more clever.

Today I chose to split the extraction task in to a collection step and a parsing step. The collection step is complex, but good tools exist to handle the complexity. The parsing step is requires site-specific code, and I’ve already written that.

Using HTTrack I created a site mirroring project that crawls the source the same way as before. On Windows HTTrack comes with a GUI for configuring the crawling engine that lets you specify which resources to look for, which to ignore, how it presents itself, and how aggressive it is.

After using this to collect the raw responses from the source, I shoud be able to use the parsing code of the Python program to turn a bunch of HTML files into a single CSV file, or perhaps a SQLite database.

Tonight I had more trouble from the source site, but I hope this might be because of nightly maintenance rather than that I’m still not being discrete enough.

The source presents chunks of data in a way that is straightforward to parse, but hard to aggregate. The source is optimized for human lookups at the postcode level and address level rather than for a bulk download of the data. I haven’t worked out a way of getting the source to emit all the data in less than hundreds of thousands of separate HTTP responses.

I wrote the scraper with the naive assumption that I could just collect data as fast as the network would allow. I didn’t implement any throttling or scheduling logic; just left the scraper running overnight and hoped for the best!

My Fiddler HTTP log shows that the server started responding strangely after about 3000 requests, with a mix of HTTP 500 status codes, slow responses, and explicit go-away messages like “You have exceeded your request limit for this period. Try again later”. These messages marked the end of any useful data my computer would receive for a while, so I had to cancel the job with no easy way to pick up from where I left off.

Scrapy lets you control the frequency of requests, so it would be a simple modification to limit my scraper to make one request every 30 seconds or so, set it off, and come back in a month.

The longer it takes to complete the extraction process, the more likely it is to be interrupted. I don’t really want to write code to handle all the maintenance of that state. I just want the data, dammit!

Unfortunately, I don’t have a spare computer that I can reliably donate to the process. Amazon has a suitable platform, but it is expensive. ScraperWiki is free and has all the right tools, but the platform is too constrained for a task of this size.

Using HTTrack still doesn’t solve my resource problem, but now I can focus on logistics rather than programming.

Follow

Get every new post delivered to your Inbox.

Join 248 other followers