Loading a flat file into a table

Follow these steps to load a Google Docs flat file export into a temporary table in an SQL Server database using the SQL Server Import and Export Wizard.

Flat files are a portable way to share data, but it’s difficult to analyze and play with the data until it’s in a database. This guide shows you how to do it using a SQL Server tool.

In this example, I’m going to load urban area population data scraped from Wikipedia.

Acquire the flat file

To prepare for this example, I already extracted data from the English Wikipedia’s article on the largest urban areas of the European Union using Google Docs. My guide on extracting data from Wikipedia using Google Docs shows you how to do the same.

On my own computer, the file is saved to “C:\Users\iain.CORP\Blog\Largest urban areas of the European Union – ImportHtmlOutput.tsv”. Remember your path, because you’ll need it soon.

For your convenience, I have uploaded a copy of the file. The XLSX extension is a workaround for WordPress.com’s stupid file extension filter. Remove the XLSX extension before working with this file.

Start the import process

Start the SQL Server Import and Export Wizard by running the following command:

DTSWizard
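
If Windows can’t find the command, you can launch the executable directly from the SQL Server installation folder. For a default SQL Server 2012 installation it should be here (the 110 folder is version-specific, so adjust for other releases):

C:\Program Files\Microsoft SQL Server\110\DTS\Binn\DTSWizard.exe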

The SQL Server Import and Export Wizard dialog should appear. Continue to configure the first panel.

Choose a data source

Follow these steps to configure the first panel of the dialog:

  1. Select ‘Flat File Source’ as your data source and wait for the interface to load.
  2. Paste the full path of your flat file, without quotes, into the file name field.
  3. Select ‘English (United Kingdom)’ as your locale. (Default for me; others may work.)
  4. Specify the file encoding. For UTF-8 files, such as those exported from Google Docs, select ‘65001 (UTF-8)’ as your code page. Leave the Unicode box unchecked. (In Windows-speak, Unicode means little-endian UTF-16.)
  5. Select ‘Delimited’ as your format. (Default)
  6. Set ‘<none>’ as your text qualifier. (Default)
  7. Set ‘{LF}’ as the header row delimiter. (The default Windows line ending is CRLF, but Google uses the Unix default.)
  8. Set header rows to skip as 0.
  9. Check ‘Column names in the first data row’.

The dialog should look like this:

[Screenshot: the ‘Choose a Data Source’ panel]

In a serious ETL project, where imports would be regular and automated, you would want to specify precisely the data types in each field of your file to reject bad data files early. For this quick-and-dirty task, the defaults are fine: everything is a string up to 50 characters long.

Click Next until you get to choose a destination.

Choose a destination

Follow these steps to configure the dialog:

  1. Select ‘SQL Server Native Client 11.0’ as your destination, or your nearest match if you’re not using SQL Server 2012.
  2. Set ‘.’ as your server name if you are working locally; otherwise put the name of your server instance here.
  3. Use Windows Authentication. (Unless you have to use SQL Server Authentication for some reason.)
  4. Select tempdb as your database.

The dialog should look like this:

[Screenshot: the ‘Choose a Destination’ panel]

Using tempdb indicates to users that the data is not a permanent part of any schema. It also means the table will no longer exist after the server instance is restarted, so you don’t need to remember to drop it on a development machine.

It’s not a good idea to create tables like this in a user database, because it will confuse other users and could end up becoming part of your version-controlled schema, depending on how you use your tools.
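
Once the import has run, you can check what has accumulated in tempdb with a quick catalog query; here’s a minimal one using the standard sys.tables view:

USE tempdb;

-- List the user tables currently living in tempdb, newest first.
SELECT name, create_date
FROM sys.tables
ORDER BY create_date DESC;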

Click Next to define the destination table.

Define a destination table

You should see the ‘Select Source Tables and Views’ panel, which looks like this:

[Screenshot: the ‘Select Source Tables and Views’ panel]

By default, the destination table has the same name as the file being imported. With a good query-writing tool like SQL Prompt, this should not cause any difficulties, but it might be cumbersome to read. If so, you can rename it to something shorter: for example, [dbo].[Largest Urban Areas of the European Union] could become [EU_Urban_Areas], as above.

To rename the destination table, click the name in the destination column and type a new name.

Click ‘Edit Mappings…’ to open the Column Mappings dialog.

Change the type of every column to nvarchar to preserve all the character data. If you don’t do this, the database will decode the bytes using its default code page, which will most likely mangle anything above code point 127 (limit of US-ASCII).
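
To see the mangling for yourself, compare the two casts below (a quick sketch; it assumes your server’s default collation uses a Latin code page, which cannot represent Greek letters):

-- The varchar cast forces the Unicode text through the default code page:
-- characters the code page cannot represent come back as '?'.
SELECT CAST(N'Αθήνα' AS varchar(10)) AS Mangled,
       CAST(N'Αθήνα' AS nvarchar(10)) AS Preserved;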

The default settings for everything else should be fine.

The dialog should look like this:

[Screenshot: the Column Mappings dialog]

Click OK.

Click Next on the main dialog to review your configuration.

Review configuration and error-handling

The review panel will warn about possible data loss or errors for each column using a yellow triangle. It looks like this:

[Screenshot: the review panel, with its yellow-triangle warnings]

In this example, it’s safe to ignore the warnings.

The default behaviour on an error or a truncation is to fail. You can opt to ignore either instead, globally or column by column. I like to get all the data or none of it, so I don’t change the default.

If your data is particularly poorly formatted, you might want to change the setting to ‘Ignore’ to import just the good rows, so that you have something to play with before devising a data-cleansing strategy.

Click Next to move on to the output options.

Import the data

You should see the ‘Save and Run Package’ dialog, which looks like this:

[Screenshot: the ‘Save and Run Package’ panel]

The wizard is about to run an SSIS package to perform the data import task. You could create the same package yourself using Business Intelligence Development Studio, but the wizard is much easier for one-off tasks like this.

If you wanted to adapt the wizard’s package into a component for a serious ETL project, you could save the package here.

But for this example, it’s okay just to run and forget.

Check ‘Run immediately’ and click Next to see a complete summary of the actions that the wizard will perform:

[Screenshot: the action summary]

Click Finish, finally, to import the data!

If all goes well, you should see a succession of green ticks appear in a list like this:

[Screenshot: the execution report, all green ticks]

You have successfully imported the data from the flat file.

Click Close to dispose of the dialog.

Inspecting the data

Start SQL Server Management Studio, connect to the destination database, and open a new query window using tempdb.

Run the following query to inspect the data:

SELECT *
FROM dbo.EU_Urban_Areas;

The result set should look like this:

[Screenshot: the result set in SSMS]

Success! You can now query the data using SQL.

You can see from the screenshot that everything is stored as a string, even the numerical data. To perform analytical work on this data, you’ll have to import it into a table with the correct data types. But that’s for another post!
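
If you can’t wait, here’s a taste of the conversion (a sketch only: the [Urban area] and [Population] column names and the comma-grouped digits are my assumptions about the imported file; TRY_CAST needs SQL Server 2012 or later and returns NULL for values it cannot convert):

SELECT [Urban area],
       TRY_CAST(REPLACE([Population], ',', '') AS int) AS Population
FROM dbo.EU_Urban_Areas;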


Time-Lapse Photography at Yosemite Park

Saw this awesome footage on Boing Boing today: hundreds of star-lovers gather in Yosemite Park to film the night sky. It’s hauntingly beautiful, but it looks unreal in some ways too.

This reminds me of my friend Dan’s time-lapse photography taken at Castle Sands in Edinburgh. With his permission, I’ll publish a photo here.

Can’t copy folder containing images using Windows Explorer

I’m trying to organize my music folder. I want to put my fwd.dj mixes, currently in the root of %UserProfile%\Music, in the new Mixes folder I just created.

To create a richer experience when browsing my own music collection, I also downloaded the profile pages of each of the mixes, with tracklists, images of the artists, and general liner notes gubbins.

Unfortunately, the images stop me copying the folder, apparently because Explorer has locked thumbs.db. This issue is well reported in a Microsoft TechNet thread.

After I applied Noel Carboni’s workaround to disable thumbnail caching and restarted Explorer, I was able to move the folder without a problem. For reference, here are the steps I followed:

  1. Click the Start orb.
  2. Enter gpedit.msc in the search box and hit Enter.
  3. Expand User Configuration – Administrative Templates – Windows Components.
  4. Click on Windows Explorer.
  5. Right-click the entry “Turn off the caching of thumbnails in hidden thumbs.db files” and choose Edit.
  6. Enable the setting.

PS fwd.dj seems to be down today. The front page shows the IIS 7 default welcome page, and a request for any other page receives IIS’s default 404 response. I hope that doesn’t last long!

What is big data?

Big is a relative term. Data is pretty general.

Ryan Shuttleworth said yesterday at the Turing Festival that “big data forces you to innovate about how you store it and process it”. On the presentation display there was a number line going from 100GB to 1EB (or something utterly huge).

The exact behaviour will vary by application and the data model, but basically all algorithms and storage methods will work well enough on tiny data sets. It’s when the data starts to grow that you’ll notice problems at scale.

At Microsoft’s SQL Server 2012 launch in Edinburgh, the presenter talked about big data as having the four V’s: volume, velocity, variety, and variability. But that’s just replacing one relative term with four others.

MCTS Learning Resources

A Microsoft-certified colleague today pointed me towards some learning resources that could be useful for the MCTS SQL Server 2008 Database Developer exam 70-433 that I’m going to sit next week.

Both Transcender and MeasureUp offer 30-day online access to mock exams. Allegedly, they are even harder than the Microsoft questions, so if you can pass these, then you really know your stuff.

They are expensive, though, and I’m not sure I can justify the cost. Prometric already charges £99 to sit the exam itself.

MeasureUp charges £52.00 for 150 questions covering the following SQL Server knowledge areas:

Implementing Tables and Views – 21 questions
Implementing Programming Objects – 24 questions
Working with Query Fundamentals – 31 questions
Applying Additional Query Techniques – 23 questions
Working with Additional SQL Server Components – 17 questions
Working with XML Data – 18 questions
Gathering Performance Information – 16 questions

Transcender charges $109.00, which today on XE.com is about £68, for a similar offering:

168 practice exam questions
298 free TranscenderFlash cards
3 year license
Pass Guarantee

The Pass Guarantee claims “Pass the Exam the First Time—Or Your Money Back!”.

Today Amazon is selling the official Microsoft self-paced training kit for £35.74. The official kit includes a CD with the same software used at the test centres, with sample questions set by Microsoft. I already have a copy of this, and think it’s all the resource I need to pass this exam specifically.

The questions on the exam demo CD show you the areas your knowledge has to cover. It’s up to you to discover the answers in external resources.

Books Online is usually the first stop, because it’s Microsoft’s official documentation for the product. For a deeper understanding of fundamentals, Itzik Ben-Gan’s T-SQL Fundamentals is essential.

COALESCE and Data Type Precedence

The COALESCE function is great for implementing elegant conditional logic in SQL Server queries and procedures.

But it can bite you. I didn’t fully understand the implications of this statement in the documentation until today:

Returns the data type of expression with the highest data type precedence.

SQL Server has many built-in data types. Books Online has a page that ranks them by precedence.

I found a blog post that documents this better than I can do just now. The gist is that COALESCE can fail if its arguments are of incomparable types, and the order of the parameters can determine whether two values are compared.
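
You can see the bite with two one-liners (my own literals, chosen to make the precedence rules visible):

-- Succeeds: int outranks varchar, so '1' is converted to int and the result is 1.
SELECT COALESCE('1', 2);

-- Fails with a conversion error: the result type is still int (highest precedence),
-- and 'abc' cannot be converted to int, even though it is the first argument.
SELECT COALESCE('abc', 1);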


Using Google Docs to scrape Wikipedia tables

Google Docs has an awesome function called ImportHtml. You can use it to import HTML tables into a spreadsheet. You can then export the data as a flat file to load into your favorite database.

For example, this expression in cell A1 of a new spreadsheet:

=ImportHtml("http://en.wikipedia.org/wiki/ISO_3166-2", "table", 2)

will fire up a scraper in the background to fill the cells around A1 with all the data from Wikipedia’s table of ISO 3166-2 country codes.

I exported the sheet as a CSV so you can see the results. The file contains 250 lines: the first line is the header, and the remaining lines are 249 records mapping country codes to their English names.

The table rows containing non-ASCII characters are slightly mangled in the output. For example, country AX is called “Aland Islands !Åland Islands” instead of just “Åland Islands”.

This seems to be because of the way Wikipedia hacks the table data to implement a sort order. The following HTML fragment renders the row for country AX:

<tr>
<td><a href="/wiki/ISO_3166-2:AX" title="ISO 3166-2:AX">AX</a></td>
<td><span style="display:none">Aland Islands !</span><span><a href="/wiki/%C3%85land_Islands" title="Åland Islands">Åland Islands</a></span></td>
<td>—</td>
</tr>

It seems that the real data is prefixed with a copy stripped of diacritic marks, followed by an exclamation mark. Is it significant that “!” is the first non-whitespace printable ASCII character?

Anyway, the hack is made invisible to users by CSS (style="display:none"), but the Google Docs scraper appears to be ignorant of the CSS. It apparently just extracts the values from all the text nodes of the table data element. This is reasonable behavior in a general-purpose table scraper.

So, the output of the ImportHtml function should go through some cleaning process before you can load the data into your base tables, but the function will definitely give you a head-start in building a data set with contributions from Wikipedia.
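
As a first pass at such cleaning, you could discard everything up to and including the exclamation mark (a sketch; dbo.ISO_3166_2_Import and CountryName are hypothetical names for wherever you staged the export):

SELECT CASE
           WHEN CHARINDEX('!', CountryName) > 0
           THEN LTRIM(SUBSTRING(CountryName, CHARINDEX('!', CountryName) + 1, LEN(CountryName)))
           ELSE CountryName
       END AS CleanName
FROM dbo.ISO_3166_2_Import;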

If I find the time, I’d like to ask someone at Wikipedia why they hack the table data like that.