Loading a flat file into a table

Follow these steps to load a Google Docs flat file export into a temporary table in an SQL Server database using the SQL Server Import and Export Wizard.

Flat files are a portable way to share data, but it’s difficult to analyze and play with the data until it’s in a database. This guide shows you how to do it using a SQL Server tool.

In this example, I’m going to load urban area population data scraped from Wikipedia.

Acquire the flat file

To prepare for this example, I already extracted data from the English Wikipedia’s article on the largest urban areas of the European Union using Google Docs. My guide on extracting data from Wikipedia using Google Docs shows you how to do the same.

On my own computer, the file is saved to “C:\Users\iain.CORP\Blog\Largest urban areas of the European Union – ImportHtmlOutput.tsv”. Remember your path, because you’ll need it soon.

For your convenience, I have uploaded a copy of the file. The XLSX extension is a workaround for WordPress.com’s stupid file extension filter. Remove the XLSX extension before working with this file.

Start the import process

Start the SQL Server Import and Export Wizard by running the following command:

DTSWizard

The SQL Server Import and Export Wizard dialog should appear. Continue to configure the first panel.

Choose a data source

Follow these steps to configure the first panel of the dialog:

  1. Select ‘Flat File Source’ as your data source and wait for the interface to load.
  2. Paste without quotes the full path of your flat file into the file name field.
  3. Select ‘English (United Kingdom)’ as your locale. (Default for me; others may work.)
  4. Specify the file encoding. For UTF-8 files, such as those exported from Google Docs, select ‘65001 (UTF-8)’ as your code page. Leave the Unicode box unchecked. (In Windows-speak, Unicode means little-endian UTF-16.)
  5. Select ‘Delimited’ as your format. (Default)
  6. Set ‘<none>’ as your text qualifier. (Default)
  7. Set ‘{LF}’ as the header row delimiter. (The default Windows line ending is CRLF, but Google uses the Unix default.)
  8. Set header rows to skip as 0.
  9. Check ‘Column names in the first data row’.

The dialog should look like this:

Import_Data_Source

In a serious ETL project, where imports would be regular and automated, you would want to specify precisely the data types in each field of your file to reject bad data files early. For this quick-and-dirty task, the defaults are fine: everything is a string up to 50 characters long.

Click Next until you get to choose a destination.

Choose a destination

Follow these steps to configure the dialog:

  1. Select ‘SQL Server Native Client 11.0’ as your destination, or your nearest match if you’re not using SQL Server 2012.
  2. Set ‘.’ as your server name if you are working locally; otherwise put the name of your server instance here.
  3. Use Windows Authentication. (Unless you have to use SQL Server Authentication for some reason.)
  4. Select tempdb as your database.

The dialog should look like this:

Import_Wizard_Destination

Using tempdb indicates to users that the data is not a permanent part of any schema. It also means the table will no longer exist after the server instance is restarted, so you don’t need to remember to drop it on a development machine.

It’s not a good idea to create tables like this in a user database, because it will confuse other users and could end up becoming part of your version-controlled schema, depending on how you use your tools.

Click Next to define the destination table.

Define a destination table

You should see the ‘Select Source Tables and Views’ panel, which looks like this:

Import_Destination_Table

By default, the destination table has the same name of the file being imported. With a good query-writing tool like SQL Prompt, this should not cause any difficulties, but it might be cumbersome to read. If so, you can rename it to something shorter: for example, [dbo].[Largest Urban Areas of the European Union] could become [EU_Urban_Areas], as above.

To rename the destination table, click the name in the destination column and type a new name.

Click ‘Edit Mappings…’ to open the Column Mappings dialog.

Change the type of every column to nvarchar to preserve all the character data. If you don’t do this, the database will decode the bytes using its default code page, which will most likely mangle anything above code point 127 (limit of US-ASCII).

The default settings for everything else should be fine.

The dialog should look like this:

Import_Column_Mappings

Click OK.

Click Next on the main dialog to review your configuration.

Review configuration and error-handling

The review panel will warn about possible data loss or errors for each column using a yellow triangle. It looks like this:

Import_Review_Configuration

In this example, it’s safe to ignore the warnings.

The default behaviour to handle and error or a truncation is to fail. You can opt to ignore this on a per-row basis. I like to get all the data or none of it, so I don’t change the default.

If your data is particularly poorly formatted, you might want change the setting to ‘Ignore’ to import just the good rows so that you can so you have something to play with before devising a data-cleansing strategy.

Click Next to output options.

Import the data

You should see the ‘Save and Run Package’ dialog, which looks like this:

Import_Run_And_Save

The wizard is about to run a SSIS package to perform the data import task. You could create the same package using the Business Intelligence Development Studio, but this the wizard is much easier for one-off tasks like this.

You might want to adapt the wizard’s package to create a component for a serious ETL project, so you can save the package.

But for this example, it’s okay just to run and forget.

Check ‘Run immediately’ and click Next to see a complete summary of the actions that the wizard will perform:

Import_Action_Summary

Click Finish, finally, to import the data!

If all goes well, you should see a succession of green ticks appear in a list like this:

Import_Successful

You have successfully imported the data from the flat file.

Click Close to dispose of the dialog.

Inspecting the data

Start SQL Server Management Studio, connect to the destination database, and open a new query window using tempdb.

Run the following query to inspect the data:

SELECT *

FROM dbo.EU_Urban_Areas;

The result set should look like this:

SSMS_Table_Inspection

Success! You can now query the data using SQL.

You can see from the screenshot that everything is stored as a string, even the numerical data. To perform analytical work on this data, you’ll have to import it into a table with the correct data types. But that’s for another post!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s