5th International Conference on Spoken Language Processing

Sydney, Australia
November 30 - December 4, 1998

SEMOLE: A Robust Framework For Gathering Information From The World Wide Web

Hyung-Jin Kim, Lee Hetherington

Spoken Language Systems Group - MIT Laboratory for Computer Science, USA

This paper describes seMole (se-mantic Mole), a robust framework for harvesting information from the World Wide Web. Unlike commercially available harvesting programs that use absolute addressing, seMole uses a semantic addressing scheme to gather information from HTML pages. Instead of relying on the HTML structure to locate data, semantic addressing relies on the relative positions of key/value pairs. This scheme abstracts away the underlying HTML structure of Web pages, so that information gathering depends only on the content of the pages, which for the most part does not change over time. We use this framework to gather information from various data sources, including Boston Sidewalk and the CNN Weather site. Through these experiments we find that seMole is more robust to changes in Web sites, and simpler to use and maintain, than systems that use absolute addressing.
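
The contrast between absolute and semantic addressing can be illustrated with a minimal sketch. This is not the seMole implementation, whose details the abstract does not give; it only assumes that a value can be found as the text fragment immediately following its key. The names `TextCollector` and `semantic_lookup`, and both sample pages, are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Flatten an HTML page into its visible text fragments, in document order."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_data(self, data):
        # Keep only non-empty text; tags and attributes are ignored entirely.
        text = data.strip()
        if text:
            self.fragments.append(text)


def semantic_lookup(html, key):
    """Return the text fragment that follows `key`, ignoring surrounding markup.

    Assumption for this sketch: the value is the next text fragment after the key.
    """
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    fragments = collector.fragments
    for i, fragment in enumerate(fragments[:-1]):
        if fragment.rstrip(":").lower() == key.lower():
            return fragments[i + 1]
    return None


# Two hypothetical layouts of the same weather page (invented for illustration).
OLD_PAGE = "<table><tr><td>Temperature:</td><td>72 F</td></tr></table>"
NEW_PAGE = "<div><p>Sponsored</p><b>Temperature</b> <span>72 F</span></div>"

print(semantic_lookup(OLD_PAGE, "temperature"))
print(semantic_lookup(NEW_PAGE, "temperature"))
```

An absolute address such as "second cell of the first table row" would locate the value in the first layout but fail on the second, where the table is gone. The key-relative lookup returns the same value for both, because it depends only on the page content.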

