Parallel text acquisition from the Web is an attractive way to augment statistical models (e.g., machine translation, cross-lingual document retrieval) with domain-representative data. The basis for obtaining such data is a collection of pairs of bilingual Web sites or pages. In this work, we propose a crawling strategy that locates bilingual Web sites by constraining the crawler's visitation policy to the graph neighborhood of bilingual sites on the Web. Subsequently, a novel mining technique recursively extracts text and links from the collection of bilingual Web sites obtained by the crawl. Our method avoids the computationally prohibitive combinatorial matching typical of previous work, which applies document retrieval techniques to pair up a collection of bilingual web pages. We demonstrate the efficacy of our approach on machine translation in the tourism and hospitality domain: the parallel text obtained with our crawling strategy yields a relative improvement of 21% in BLEU score (English-to-Spanish) over an out-of-domain seed translation model trained on the European parliamentary proceedings.
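The core crawling idea above, restricting the visitation policy to the graph neighborhood of bilingual sites, can be sketched with a simple two-queue focused crawler. The following is a minimal illustrative sketch, not the paper's implementation: the link graph is a toy in-memory dictionary, and `is_bilingual` is a hypothetical stand-in for whatever page-level bilinguality detector the crawler would actually use.

```python
from collections import deque

# Toy link graph standing in for the Web (illustrative only).
LINK_GRAPH = {
    "hotel.example/en": ["hotel.example/es", "blog.example"],
    "hotel.example/es": ["hotel.example/en"],
    "blog.example": ["news.example"],
    "news.example": [],
    "tours.example/en": ["tours.example/es"],
    "tours.example/es": [],
}

def is_bilingual(url):
    # Hypothetical detector: here we just look for a sibling page under
    # the paired language code; a real crawler would inspect content.
    lang_pair = {"/en": "/es", "/es": "/en"}
    for suffix, other in lang_pair.items():
        if url.endswith(suffix) and url[: -len(suffix)] + other in LINK_GRAPH:
            return True
    return False

def focused_crawl(seeds, budget=10):
    """Visit pages, expanding links of bilingual pages before all others,
    so the crawl stays in the graph neighborhood of bilingual sites."""
    frontier = deque(seeds)   # high priority: neighborhood of bilingual pages
    backlog = deque()         # low priority: everything else
    visited, bilingual = set(), []
    while (frontier or backlog) and len(visited) < budget:
        url = frontier.popleft() if frontier else backlog.popleft()
        if url in visited:
            continue
        visited.add(url)
        if is_bilingual(url):
            bilingual.append(url)
            frontier.extend(LINK_GRAPH.get(url, []))  # stay in neighborhood
        else:
            backlog.extend(LINK_GRAPH.get(url, []))
    return bilingual

print(focused_crawl(["hotel.example/en", "tours.example/en"]))
```

With the toy graph, the crawler reaches all four language-paired pages before spending budget on the monolingual blog and news pages, illustrating how the priority queue biases the crawl toward bilingual neighborhoods.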
Bibliographic reference: Sridhar, Vivek Kumar Rangarajan / Barbosa, Luciano / Bangalore, Srinivas (2011): "A scalable approach to building a parallel corpus from the web". In Proc. INTERSPEECH 2011, pp. 2113-2116.