My council tax band analysis project is moving again. I’m trying to tackle the extraction part of the project from a different angle using different tools.
My first attempt to solve the problem was to write a Python program using the Scrapy framework to crawl the source site and parse out all the data I wanted to analyse. My program was a little too aggressive for the source site, and I only got part of the data before getting shut out. So now I have to get more clever.
Today I chose to split the extraction task into a collection step and a parsing step. The collection step is complex, but good tools exist to handle the complexity. The parsing step requires site-specific code, and I’ve already written that.
Using HTTrack, I created a site mirroring project that crawls the source the same way as before. On Windows, HTTrack comes with a GUI for configuring the crawling engine that lets you specify which resources to look for, which to ignore, how the crawler presents itself, and how aggressive it is.
After using this to collect the raw responses from the source, I should be able to use the parsing code of the Python program to turn a bunch of HTML files into a single CSV file, or perhaps a SQLite database.
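Here is a rough sketch of what that second step might look like, assuming the mirror ends up as a folder of `.html` files. The `parse_page` argument stands in for my existing site-specific parsing code, and the name `mirror_to_csv` and the column names in the usage note are made up for illustration.

```python
import csv
from pathlib import Path


def mirror_to_csv(mirror_dir, out_csv, parse_page, fieldnames):
    """Walk a mirrored site and write every parsed row to a single CSV file.

    parse_page is a placeholder for the site-specific parser: it takes the
    HTML text of one page and returns an iterable of dicts keyed by fieldnames.
    Returns the number of rows written.
    """
    count = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        # rglob finds HTML files in every subfolder of the mirror.
        for path in sorted(Path(mirror_dir).rglob("*.html")):
            html = path.read_text(encoding="utf-8", errors="replace")
            for row in parse_page(html):
                writer.writerow(row)
                count += 1
    return count
```

Calling something like `mirror_to_csv("mirror", "bands.csv", my_parser, ["postcode", "band"])` would then produce one file, where `my_parser` and the two column names are hypothetical.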
Tonight I had more trouble from the source site, but I hope this might be because of nightly maintenance rather than because I’m still not being discreet enough.
The source presents chunks of data in a way that is straightforward to parse, but hard to aggregate. The source is optimized for human lookups at the postcode level and address level rather than for a bulk download of the data. I haven’t worked out a way of getting the source to emit all the data in fewer than hundreds of thousands of separate HTTP responses.
I wrote the scraper with the naive assumption that I could just collect data as fast as the network would allow. I didn’t implement any throttling or scheduling logic; I just left the scraper running overnight and hoped for the best!
My Fiddler HTTP log shows that the server started responding strangely after about 3000 requests, with a mix of HTTP 500 status codes, slow responses, and explicit go-away messages like “You have exceeded your request limit for this period. Try again later”. These messages marked the end of any useful data my computer would receive for a while, so I had to cancel the job with no easy way to pick up from where I left off.
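With hindsight, the scraper could have noticed these signals and stopped itself. A minimal sketch of such a check, assuming the three symptoms from the log are the ones worth watching for; the function name and the 10-second "slow" threshold are my own guesses, not anything the site documents.

```python
# Start of the go-away message seen in the log.
THROTTLE_MARKER = "You have exceeded your request limit"


def looks_blocked(status, body, elapsed_seconds, slow_after=10.0):
    """Return True if a response looks like the server pushing back.

    slow_after is an assumed threshold in seconds, chosen for illustration.
    """
    if status >= 500:
        return True  # server errors such as HTTP 500
    if THROTTLE_MARKER.lower() in body.lower():
        return True  # explicit rate-limit message
    return elapsed_seconds > slow_after  # suspiciously slow response
```

A crawler could pause or stop as soon as this returns True, instead of burning through requests that will never return useful data.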
Scrapy lets you control the frequency of requests, so it would be a simple modification to limit my scraper to one request every 30 seconds or so, set it off, and come back in a month.
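For the record, a throttled configuration might look roughly like the fragment below, which would go in the Scrapy project's `settings.py`. The setting names are standard Scrapy settings; the values are just my 30-second guess, and the `estimated_days` helper is a made-up function for the back-of-envelope arithmetic.

```python
# Wait about 30 seconds between requests to the same site.
DOWNLOAD_DELAY = 30
# Vary each wait between 0.5x and 1.5x of DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True
# Never have more than one request in flight per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
# Let Scrapy slow down further when the server responds slowly.
AUTOTHROTTLE_ENABLED = True


def estimated_days(total_requests, delay_seconds=DOWNLOAD_DELAY):
    """Rough crawl duration in days, ignoring response time and retries."""
    return total_requests * delay_seconds / 86400
```

Assuming 100,000 requests, which is at the low end of "hundreds of thousands", `estimated_days(100_000)` comes out at roughly 35 days, so "a month" is optimistic if anything.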
The longer the extraction process takes, the more likely it is to be interrupted, and I don’t really want to write the code needed to track progress and resume from where it stopped. I just want the data, dammit!
Unfortunately, I don’t have a spare computer that I can reliably donate to the process. Amazon has a suitable platform, but it is expensive. ScraperWiki is free and has all the right tools, but the platform is too constrained for a task of this size.
Using HTTrack still doesn’t solve my resource problem, but now I can focus on logistics rather than programming.