HTTrack is solving my data-collection problem with aplomb, so I’ve started thinking about how to turn the raw HTML into something I can analyze more easily.
In 2005, computer scientists David Embley and Cui Tao published a paper called “Automating the Extraction of Data from HTML Tables with Unknown Structure”.
They describe a technique for extracting data from a semi-structured source with an unknown schema by mapping it to a known schema, using an expert’s ontology of the information domain.
The paper shows impressive practical results for the “table-understanding tool”. If I can acquire a copy of the tool, I’ll be keen to try it out on my own collection of HTML tables.
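
To get a feel for the general idea of mapping an unknown table layout onto a known schema, here is a toy sketch in Python. It is not the algorithm from the paper, which I haven't implemented. It only matches column headers against a hand-written synonym list, which stands in very loosely for a domain ontology. The `ONTOLOGY` dictionary, the `TableRows` and `map_to_schema` names, and the sample table are all made up for illustration, and the sketch assumes the first row of the table holds the headers.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a domain ontology: each field of the known
# schema lists header labels that might denote it in the wild.
ONTOLOGY = {
    "make": {"make", "manufacturer", "brand"},
    "model": {"model", "model name"},
    "price": {"price", "cost", "asking price"},
}


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def map_to_schema(html):
    """Return one dict per data row, keyed by the known schema's field names."""
    parser = TableRows()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows  # assumption: the first row is the header

    # Work out which column index corresponds to which known field.
    columns = {}
    for index, label in enumerate(header):
        for field, synonyms in ONTOLOGY.items():
            if label.lower() in synonyms:
                columns[index] = field

    # Columns with no match in the ontology (e.g. "Colour") are dropped.
    return [
        {field: row[index] for index, field in columns.items() if index < len(row)}
        for row in body
    ]


# Made-up sample table with headers that differ from the schema's field names.
SAMPLE = """
<table>
<tr><th>Manufacturer</th><th>Model Name</th><th>Colour</th><th>Cost</th></tr>
<tr><td>Honda</td><td>Civic</td><td>Red</td><td>$4,500</td></tr>
<tr><td>Ford</td><td>Focus</td><td>Blue</td><td>$3,900</td></tr>
</table>
"""

records = map_to_schema(SAMPLE)
for record in records:
    print(record)
```

Exact header matching like this will break on real-world tables with nested headers, merged cells, or labels the synonym list doesn't anticipate, which is presumably where a proper ontology-driven approach earns its keep.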