This paper describes seMole (se-mantic Mole), a robust framework for harvesting information from the World Wide Web. Unlike commercially available harvesting programs that use absolute addressing, seMole uses a semantic addressing scheme to gather information from HTML pages. Instead of relying on the HTML structure to locate data, semantic addressing relies on the relative position of key/value pairs. This scheme abstracts away from the underlying HTML structure of Web pages, so that information gathering depends only on the content of the pages, which for the most part does not change over time. We use this framework to gather information from various data sources, including Boston Sidewalk and the CNN Weather Site. Through these experiments we find that seMole is more robust to changes in Web sites, and simpler to use and maintain, than systems that use absolute addressing.
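The idea of locating a value by its position relative to a key, rather than by its place in the markup, can be illustrated with a minimal Python sketch. This is not seMole's implementation: the names `TextChunks` and `lookup`, the rule "take the text chunk immediately after the key", and the two page fragments are assumptions made up for illustration.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect the text chunks of a page in order, ignoring all markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Normalize whitespace and drop chunks that are empty after cleanup.
        text = " ".join(data.split())
        if text:
            self.chunks.append(text)


def lookup(html, key):
    """Return the text chunk that follows `key`, or None if the key is absent.

    Hypothetical rule: the value is the chunk right after the key label.
    """
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    wanted = key.rstrip(":").lower()
    for i, chunk in enumerate(parser.chunks[:-1]):
        if chunk.rstrip(":").lower() == wanted:
            return parser.chunks[i + 1]
    return None


# Two invented layouts carrying the same content in different markup.
old_layout = "<table><tr><td>Temperature:</td><td>72 F</td></tr></table>"
new_layout = "<div><b>Temperature</b> <span class='v'>72 F</span></div>"

print(lookup(old_layout, "Temperature"))
print(lookup(new_layout, "Temperature"))
```

Both calls find the same value even though the markup differs. An absolute address such as "second cell of the first table row" would work only on the first layout.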
Cite as: Kim, H.-J., Hetherington, L. (1998) SEMOLE: a robust framework for gathering information from the world wide web. Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998), paper 1076, doi: 10.21437/ICSLP.1998-696
@inproceedings{kim98d_icslp,
  author={Hyung-Jin Kim and Lee Hetherington},
  title={{SEMOLE: a robust framework for gathering information from the world wide web}},
  year=1998,
  booktitle={Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998)},
  pages={paper 1076},
  doi={10.21437/ICSLP.1998-696}
}