A Table-Understanding Tool

HTTrack is solving my data-collection problem with aplomb, so I’ve started thinking about how to turn the raw HTML into something I can analyze more easily.

In 2005, computer scientists David Embley and Cui Tao published a paper called “Automating the Extraction of Data from HTML Tables with Unknown Structure”.

They describe a technique for extracting data from a semi-structured source with an unknown schema by mapping it to a known schema, guided by an expert’s ontology of the information domain.
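To make the idea concrete for myself, here is a rough sketch of what "mapping to a known schema" could look like. It is not the algorithm from the paper, only a toy keyword matcher. The `ONTOLOGY` dictionary, its field names, and the `extract` and `match_field` helpers are all my own hypothetical stand-ins, and the sketch assumes the first table row holds the column headers.

```python
import re
from html.parser import HTMLParser

# Hypothetical mini-"ontology": target field -> header words that may signal it.
# A real domain ontology would be far richer; this is only an illustration.
ONTOLOGY = {
    "make": {"make", "manufacturer", "brand"},
    "price": {"price", "cost"},
    "year": {"year", "yr"},
}


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def match_field(header):
    """Return the first ontology field sharing a word with the header, else None."""
    words = set(re.findall(r"[a-z]+", header.lower()))
    for field, keywords in ONTOLOGY.items():
        if words & keywords:
            return field
    return None


def extract(html):
    """Map each data row onto the known schema, dropping unrecognised columns."""
    parser = TableRows()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows  # assumption: first row is the header
    columns = [match_field(h) for h in header]
    return [
        {field: cell for field, cell in zip(columns, row) if field is not None}
        for row in body
    ]
```

Feeding `extract` a table whose headers read "Manufacturer", "Yr", and "Asking price" would yield records keyed by `make`, `year`, and `price`, whatever order the columns arrive in. Columns that match nothing in the ontology are simply ignored.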

The paper shows impressive practical results for the “table-understanding tool”. If I can acquire a copy, I’ll be keen to try it out on my own collection of HTML tables.


2 thoughts on “A Table-Understanding Tool”

  1. That’s really interesting. How do you rate NoSQL solutions like CouchDB, mongoose, etc.? And how do they fit into this picture?

    PS: When you use a phrase like ‘x solved my y problems’ can you hyperlink ‘y problems’ in the phrase to the relevant blog post? It’s a chore, but it helps both human and bot visitors to your site to contextualise your posts. 🙂

