David Embley replied to say that the code for his table-understanding tool is not available.
A small disappointment, but there’s plenty more to discover. His effort is just one in a whole community of people working towards automatic data extraction.
RoadRunner is a tool developed by the database groups at Università di Roma Tre and Università della Basilicata for automating the generation of ‘wrappers’ (or scrapers) to extract data from HTML pages.
The source code of a prototype system is released under the GPL licence.
I’m looking forward to using RoadRunner with my growing dataset.
Of the 28,000 pages I expect to collect, I have about 3,000 so far.
I’ve realised that starting a new session of HTTrack overwrites the log of the previous session, so I’ve started copying out the logs to a safe place for later analysis.
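The log copying could be scripted rather than done by hand. What follows is only a minimal sketch in Python, not a record of what I actually run. It assumes the logs sit in the HTTrack project folder under names matching `hts-*.txt` (for example `hts-log.txt`); the function name `backup_logs`, the folder paths and the filename pattern are all illustrative and would need adjusting to the real setup.

```python
import shutil
import time
from pathlib import Path


def backup_logs(project_dir, backup_root, pattern="hts-*.txt"):
    """Copy matching log files into a new timestamped folder; return that folder."""
    project_dir = Path(project_dir)

    # One folder per backup, named after the current local time,
    # so logs from different sessions never overwrite each other.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / stamp

    # exist_ok=False: refuse to reuse a folder rather than silently clobber it.
    target.mkdir(parents=True, exist_ok=False)

    # The pattern is an assumption about how the log files are named.
    for log_file in sorted(project_dir.glob(pattern)):
        if log_file.is_file():
            # copy2 also keeps the file's modification time.
            shutil.copy2(log_file, target / log_file.name)

    return target
```

Calling something like `backup_logs("my-httrack-project", "log-backups")` before each new session (both folder names are placeholders) would leave the previous session's logs untouched in their own dated folder.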