David Embley replied to say that the code for his table-understanding tool is not available.

A small disappointment, but there’s plenty more to discover. His effort is just one in a whole community of people working towards automatic data extraction.

RoadRunner is a tool developed by the database groups at Università di Roma Tre and Università della Basilicata for automating the generation of ‘wrappers’ (or scrapers) to extract data from HTML pages.

The source code of a prototype system is released under the GPL licence.

I’m looking forward to using RoadRunner with my growing dataset.

Of the 28,000 pages I predict I will collect, I have about 3,000.

I’ve realised that starting a new session of HTTrack overwrites the log of the previous session, so I’ve started copying out the logs to a safe place for later analysis.
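
The copying step could be scripted along these lines. This is only a minimal sketch: the function name `archive_logs`, the folder names in the usage note, and the assumption that the log lives in a file called `hts-log.txt` at the top of the project folder are mine, so check them against your own HTTrack setup before relying on it.

```python
import shutil
import time
from pathlib import Path


def archive_logs(project_dir, archive_dir, names=("hts-log.txt",), stamp=None):
    """Copy each named log file from project_dir into archive_dir.

    Each copy gets a timestamp prefix so that logs from earlier
    sessions are never overwritten. The default file name is an
    assumption; pass different names if your logs are called
    something else.
    """
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)

    copied = []
    for name in names:
        source = Path(project_dir) / name
        if source.is_file():  # silently skip logs that do not exist
            target = archive / f"{stamp}-{name}"
            shutil.copy2(source, target)  # copy2 keeps the modification time
            copied.append(target)
    return copied
```

Calling something like `archive_logs("my-httrack-project", "log-archive")` before each new session would leave one timestamped copy per session in the archive folder (both folder names here are placeholders).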

