Use handle to disconnect external storage safely… sometimes

The problem

Yesterday I received my shiny Samsung M3 1TB Portable Hard Drive from Amazon to solve my storage problems (I hoard MP3s). It looks like this:

[Image: Samsung M3 1TB Slimline]

A static picture does little justice to the case. It’s not covered by a tacky plaid pattern; the surface is all angular. You should see how beautifully the light reflects when you wobble it about.

Today I plugged it into my workstation to transfer a few personal downloads. When I had finished, I played the responsible user and safely disconnected the device before yanking the USB cable from the socket.

In Windows 7, you do that by context-clicking the system tray icon that looks like a USB connector and choosing ‘Eject Samsung M3 Portable’ or whatever matches the name of your device:

[Screenshot: the ‘Eject Samsung M3 Portable’ context menu item]

Instead of rewarding me with a signal that I could now yank my device from the socket, Windows gave me an impudent error dialog that declared “This device is currently in use”:

[Screenshot: the ‘Problem Ejecting USB Mass Storage Device’ dialog]

The dialog indirectly warns me that I could potentially trash data by ejecting the device prematurely. However, the proposed resolution to “Close any programs or windows that might be using the device, and then try again.” is vague, and that sucks.

I multitask a lot at my workstation, and have open files all over the place. I don’t feel like wading through each window to find the one that won’t let go.

A solution

Thankfully, you can make up for the dialog’s shortcomings with Mark Russinovich’s awesome Handle utility. The following steps assume you have put handle.exe in your Windows folder so that it is on the path.

Start a new PowerShell session and use handle to search for processes that have a file open anywhere on the device. For me, the root directory of my device is I:\, so the command and output look like this:

PS Z:\> handle I:\

Handle v3.5
Copyright (C) 1997-2012 Mark Russinovich
Sysinternals - www.sysinternals.com

System             pid: 4      type: File          2990: I:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000002
System             pid: 4      type: File          3464: I:\$Extend\$RmMetadata\$Txf
System             pid: 4      type: File          3DB8: I:\$Extend\$RmMetadata\$TxfLog\$TxfLog.blf
System             pid: 4      type: File          3F9C: I:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000001
WINWORD.EXE        pid: 6748   type: File           100: I:\Iain Elder.docx

After the copyright notice, each line of output represents a file handle: a reference that a process holds to a file it has opened. Each line names the process, its process ID, the handle ID, and the file.

The last line of output is the useful one here. It means that the file I:\Iain Elder.docx, a copy of my CV, is open in Microsoft Word, whose executable name is WINWORD.EXE.
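If you would rather not read the output by eye, the lines are regular enough to parse. The following is only a sketch in Python: the pattern is inferred from the sample output above rather than from any documented format, and the names OpenHandle, parse_handle_output and blocking_handles are my own inventions.

```python
import re
from typing import NamedTuple


class OpenHandle(NamedTuple):
    process: str
    pid: int
    handle_type: str
    handle_id: str
    path: str


# Inferred from the sample output in this post, e.g.:
# WINWORD.EXE        pid: 6748   type: File           100: I:\Iain Elder.docx
LINE_PATTERN = re.compile(
    r"^(?P<process>.+?)\s+pid:\s*(?P<pid>\d+)\s+type:\s*(?P<type>\S+)\s+"
    r"(?P<handle>[0-9A-Fa-f]+):\s+(?P<path>.+)$"
)

SAMPLE_OUTPUT = r"""
Handle v3.5
Copyright (C) 1997-2012 Mark Russinovich
Sysinternals - www.sysinternals.com

System             pid: 4      type: File          2990: I:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000002
System             pid: 4      type: File          3464: I:\$Extend\$RmMetadata\$Txf
System             pid: 4      type: File          3DB8: I:\$Extend\$RmMetadata\$TxfLog\$TxfLog.blf
System             pid: 4      type: File          3F9C: I:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000001
WINWORD.EXE        pid: 6748   type: File           100: I:\Iain Elder.docx
"""


def parse_handle_output(text: str) -> list[OpenHandle]:
    """Turn handle's console output into records, skipping the banner."""
    handles = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            handles.append(
                OpenHandle(
                    process=match["process"],
                    pid=int(match["pid"]),
                    handle_type=match["type"],
                    handle_id=match["handle"],
                    path=match["path"],
                )
            )
    return handles


def blocking_handles(handles: list[OpenHandle]) -> list[OpenHandle]:
    """Keep only the handles held by something other than the System process."""
    return [h for h in handles if h.process != "System"]


if __name__ == "__main__":
    for handle in blocking_handles(parse_handle_output(SAMPLE_OUTPUT)):
        print(f"{handle.process} (pid {handle.pid}) has {handle.path} open")
```

Filtering out the System process leaves only the programs you can reasonably close yourself.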

Make sure you’re finished working with the file (I am) and then close the file. I’ve got no other documents open in Word, so I can just close Word from the task bar like this:

[Screenshot: the ‘Close window’ context menu item]

Go back to the PowerShell session and repeat the previous command. You should see one fewer line of output:

PS Z:\> handle I:\

Handle v3.5
Copyright (C) 1997-2012 Mark Russinovich
Sysinternals - www.sysinternals.com

System             pid: 4      type: File          2990: I:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000002
System             pid: 4      type: File          3464: I:\$Extend\$RmMetadata\$Txf
System             pid: 4      type: File          3DB8: I:\$Extend\$RmMetadata\$TxfLog\$TxfLog.blf
System             pid: 4      type: File          3F9C: I:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000001

The handle output now shows that only Windows’ internal System process has opened files on the device. I’m not sure what the files in I:\$Extend\$RmMetadata are for, but they look like something important to Windows.

If your output shows more lines than this, then continue to close files until only the System process holds open files.

If you try to eject the device again, you should see a popup indicating successful removal like this:

[Screenshot: the ‘Safe To Remove Hardware’ popup]

You can now safely yank the cable!

A problem with the solution

Sometimes, even though the System process is the only one holding open files, you will still see the error dialog when you try to eject the device.

Right now the same handle command produces similar output, but it shows that the System process now also has a handle on the root directory of the device:

PS Z:\> handle I:\

Handle v3.5
Copyright (C) 1997-2012 Mark Russinovich
Sysinternals - www.sysinternals.com

System             pid: 4      type: File          2200: I:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000001
System             pid: 4      type: File          314C: I:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000002
System             pid: 4      type: File          3670: I:\$Extend\$RmMetadata\$Txf
System             pid: 4      type: File          3AAC: I:\
System             pid: 4      type: File          5C80: I:\$Extend\$RmMetadata\$TxfLog\$TxfLog.blf

The System process is greedily and unjustifiably hogging my device. I want it back, even if I have to kill the System process to do that.

The System process starts before I get to control the computer, so it’s not really mine to kill. We can be kinder by using handle to break its fingers instead of killing it outright.

The handle utility lets you close a file handle by force if you give it a couple of magic numbers. In each line of output, the number after pid: is the ID of the process, and the number before the file location is the ID of the handle.

From the above output, you can close the file handle of the System process on the root directory of the device using handle like this:

PS Z:\> handle -c 3AAC -p 4

Handle v3.5
Copyright (C) 1997-2012 Mark Russinovich
Sysinternals - www.sysinternals.com

 3AAC: File  (RW-)   I:\
Close handle 3AAC in System (PID 4)? (y/n)

Because forcing a file handle to close is a potentially dangerous operation, handle asks you to confirm the action before committing it.
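The two magic numbers can be lifted straight out of a line of handle output. Here is a small Python sketch that builds the same close command from such a line. The function name close_command is hypothetical, the pattern is inferred from the output shown in this post, and the sketch only builds the command; it doesn’t run it.

```python
import re

# Picks out the process ID and the handle ID from one line of handle output.
HANDLE_LINE = re.compile(
    r"pid:\s*(?P<pid>\d+)\s+type:\s*\S+\s+(?P<handle>[0-9A-Fa-f]+):"
)


def close_command(handle_line: str) -> list[str]:
    """Build the 'handle -c <handle ID> -p <process ID>' command for one line."""
    match = HANDLE_LINE.search(handle_line)
    if match is None:
        raise ValueError(f"not a handle output line: {handle_line!r}")
    return ["handle", "-c", match["handle"], "-p", match["pid"]]


if __name__ == "__main__":
    line = "System             pid: 4      type: File          3AAC: I:\\"
    print(" ".join(close_command(line)))
```

Running the resulting command still leads to the same confirmation prompt, so nothing gets closed without your say-so.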

I’m feeling lucky, so I press y then return. The remaining output looks like this:

Error closing handle:

T

Bugger.

After a little searching, it seems that this is a well-known problem on TechArena and Talking Technical, and it’s existed since Windows XP.

I tried the close-and-reopen-Explorer workaround on TechArena. It didn’t affect the output of handle and I still can’t eject my device.

I tried searching for services as suggested on Talking Technical, but I couldn’t find any.

I don’t have time left to try anything else. The only thing left to do for now is to shut down the computer before removing the drive.

Kinda defeats the purpose of having a removable drive, eh?


RoadRunner

David Embley replied to say that the code for his table-understanding tool is not available.

A small disappointment, but there’s plenty more to discover. His effort is just one in a whole community of people working towards automatic data extraction.

RoadRunner is a tool developed by the database groups at Università di Roma Tre and Università della Basilicata for automating the generation of ‘wrappers’ (or scrapers) to extract data from HTML pages.

The source code of a prototype system is released under the GPL licence.

I’m looking forward to using RoadRunner with my growing dataset.

Of the 28,000 pages I predict I will collect, I have about 3,000.

I’ve realised that starting a new session of HTTrack overwrites the log of the previous session, so I’ve started copying out the logs to a safe place for later analysis.
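Copying the log by hand is easy to forget, so here is a rough Python sketch of how I might automate it. It assumes the log is a single file called hts-log.txt in the project folder; check your own project for the real name. The function name archive_log and the timestamped naming scheme are my own inventions.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def archive_log(project_dir, archive_dir, log_name="hts-log.txt",
                now: Optional[datetime] = None) -> Path:
    """Copy the session log to archive_dir under a timestamped name.

    log_name is an assumption about what HTTrack calls its log file.
    """
    source = Path(project_dir) / log_name
    if not source.is_file():
        raise FileNotFoundError(source)

    # A timestamp in the name stops one session's copy overwriting another's.
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target_dir = Path(archive_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{source.stem}-{stamp}{source.suffix}"

    shutil.copy2(source, target)  # copy2 also keeps the file's modified time
    return target
```

Calling archive_log before starting each new session would keep one copy of the log per session.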

A Table-Understanding Tool

HTTrack is solving my data-collection problem with aplomb, so I’ve started thinking about how to turn the raw HTML into something I can analyze more easily.

Computer scientists David Embley and Cui Tao in 2005 published a paper called “Automating the Extraction of Data from HTML Tables with Unknown Structure”.

They describe a technique that allows one to extract data from a semi-structured source with an unknown schema by mapping it to a known schema using an expert’s ontology of the information domain.

The paper shows impressive practical results for the “table-understanding tool”. If I can acquire a copy, I’ll be keen to try it out on my own collection of HTML tables.

Use a crawler instead of writing a custom scraper

My council tax band analysis project is moving again. I’m trying to tackle the extraction part of the project from a different angle using different tools.

My first attempt to solve the problem was to write a Python program using the Scrapy framework to crawl the source site and parse out all the data I wanted to analyse. My program was a little too aggressive for the source site, and I only got part of the data before getting shut out. So now I have to get more clever.

Today I chose to split the extraction task into a collection step and a parsing step. The collection step is complex, but good tools exist to handle the complexity. The parsing step requires site-specific code, and I’ve already written that.

Using HTTrack I created a site mirroring project that crawls the source site the same way as before. On Windows, HTTrack comes with a GUI for configuring the crawling engine that lets you specify which resources to look for, which to ignore, how it presents itself, and how aggressive it is.

After using this to collect the raw responses from the source, I should be able to use the parsing code of the Python program to turn a bunch of HTML files into a single CSV file, or perhaps a SQLite database.
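To give a feel for the parsing step, here is a minimal sketch using only the Python standard library. It is not my real parsing code, which is specific to the source site. It assumes, purely for illustration, that the interesting data sits in ordinary HTML table rows, and the names TableRowParser and html_dir_to_csv are made up.

```python
import csv
from html.parser import HTMLParser
from pathlib import Path


class TableRowParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._cell = [], None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row, self._cell = None, None


def html_dir_to_csv(html_dir, csv_path) -> int:
    """Write every table row found under html_dir to one CSV file.

    The first column is the name of the page the row came from.
    Returns the number of rows written.
    """
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for page in sorted(Path(html_dir).rglob("*.html")):
            parser = TableRowParser()
            parser.feed(page.read_text(encoding="utf-8", errors="replace"))
            parser.close()
            for row in parser.rows:
                writer.writerow([page.name, *row])
                count += 1
    return count
```

Because the collection step has already saved the pages to disk, this step can be rerun as often as needed without touching the source site again.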

Tonight I had more trouble from the source site, but I hope this might be because of nightly maintenance rather than that I’m still not being discreet enough.

The source presents chunks of data in a way that is straightforward to parse, but hard to aggregate. The source is optimized for human lookups at the postcode level and address level rather than for a bulk download of the data. I haven’t worked out a way of getting the source to emit all the data in fewer than hundreds of thousands of separate HTTP responses.

I wrote the scraper with the naive assumption that I could just collect data as fast as the network would allow. I didn’t implement any throttling or scheduling logic; just left the scraper running overnight and hoped for the best!

My Fiddler HTTP log shows that the server started responding strangely after about 3000 requests, with a mix of HTTP 500 status codes, slow responses, and explicit go-away messages like “You have exceeded your request limit for this period. Try again later”. These messages marked the end of any useful data my computer would receive for a while, so I had to cancel the job with no easy way to pick up from where I left off.

Scrapy lets you control the frequency of requests, so it would be a simple modification to limit my scraper to make one request every 30 seconds or so, set it off, and come back in a month.
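As a rough illustration, the throttling could look something like the following fragment of a Scrapy project’s settings.py. The three setting names are real Scrapy settings; the values are only my guess at what would be polite enough. The estimated_days helper is hypothetical, and the 100,000-request figure in the example is a stand-in for “hundreds of thousands”, not a measured number.

```python
# Fragment of a Scrapy project's settings.py (values are illustrative guesses).

# Wait about 30 seconds between requests to the same site.
DOWNLOAD_DELAY = 30

# Scrapy's default: vary each wait between 0.5x and 1.5x DOWNLOAD_DELAY,
# so the requests look a little less mechanical.
RANDOMIZE_DOWNLOAD_DELAY = True

# Never have more than one request to the site in flight at a time.
CONCURRENT_REQUESTS_PER_DOMAIN = 1


def estimated_days(request_count: int, delay_seconds: float = DOWNLOAD_DELAY) -> float:
    """Rough crawl duration in days, counting only the waits between requests."""
    seconds_per_day = 24 * 60 * 60
    return request_count * delay_seconds / seconds_per_day


if __name__ == "__main__":
    # 100,000 is an assumed, illustrative request count.
    print(f"{estimated_days(100_000):.1f} days")
```

At one request every 30 seconds, even the low end of “hundreds of thousands” of requests works out to about five weeks, which is where “come back in a month” comes from.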

The longer it takes to complete the extraction process, the more likely it is to be interrupted. I don’t really want to write code to handle all the maintenance of that state. I just want the data, dammit!

Unfortunately, I don’t have a spare computer that I can reliably donate to the process. Amazon has a suitable platform, but it is expensive. ScraperWiki is free and has all the right tools, but the platform is too constrained for a task of this size.

Using HTTrack still doesn’t solve my resource problem, but now I can focus on logistics rather than programming.