12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

A Scalable Approach to Building a Parallel Corpus from the Web

Vivek Kumar Rangarajan Sridhar, Luciano Barbosa, Srinivas Bangalore

AT&T Labs Research, USA

Acquiring parallel text from the Web is an attractive way to augment statistical models (e.g., machine translation, cross-lingual document retrieval) with domain-representative data. The basis for obtaining such data is a collection of pairs of bilingual Web sites or pages. In this work, we propose a crawling strategy that locates bilingual Web sites by constraining the visitation policy of the crawler to the graph neighborhood of bilingual sites on the Web. Subsequently, we use a novel mining technique that recursively extracts text and links from the collection of bilingual Web sites obtained from the crawl. Our method avoids the computationally prohibitive combinatorial matching typical of previous work, which uses document retrieval techniques to match pages in a collection of bilingual Web pages. We demonstrate the efficacy of our approach in the context of machine translation in the tourism and hospitality domain. The parallel text obtained using our crawling strategy yields a relative improvement of 21% in BLEU score (English-to-Spanish) over an out-of-domain seed translation model trained on the European parliamentary proceedings.
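
As a rough illustration of what constraining a crawler to the graph neighborhood of bilingual sites might look like, the sketch below runs a breadth-first visit over a toy in-memory link graph and stops expanding links once a page is more than a fixed number of hops from a seed site. It is not the authors' implementation: the function name crawl_neighborhood, the max_hops parameter, the is_bilingual predicate, and the example host names are all assumptions made for illustration, and a real crawler would fetch pages and detect languages rather than look them up in a dictionary.

```python
from collections import deque


def crawl_neighborhood(link_graph, seeds, is_bilingual, max_hops=2):
    """Return bilingual sites found within max_hops links of any seed.

    link_graph   : dict mapping a site to the list of sites it links to
                   (a stand-in for fetching a page and extracting links).
    seeds        : sites already known to be bilingual (hypothetical input).
    is_bilingual : hypothetical predicate; a real system would need
                   language identification or similar.
    """
    seen = set(seeds)
    queue = deque((site, 0) for site in seeds)
    found = []
    while queue:
        site, hops = queue.popleft()
        if is_bilingual(site):
            found.append(site)
        if hops == max_hops:
            # Edge of the neighborhood: do not follow links any further.
            continue
        for neighbor in link_graph.get(site, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return found


# Toy data (entirely made up) to exercise the sketch.
toy_graph = {
    "hotel-a.example": ["hotel-b.example", "news.example"],
    "hotel-b.example": ["hotel-c.example"],
    "news.example": ["far.example"],
    "far.example": ["hotel-d.example"],
}
toy_bilingual = {
    "hotel-a.example",
    "hotel-b.example",
    "hotel-c.example",
    "hotel-d.example",
}

result = crawl_neighborhood(
    toy_graph,
    seeds=["hotel-a.example"],
    is_bilingual=lambda site: site in toy_bilingual,
    max_hops=2,
)
print(result)
```

In this toy graph, hotel-d.example is bilingual but sits three hops from the seed, so the constrained visit never reaches it. That is the trade-off such a policy makes: less of the Web is visited in exchange for a higher expected density of bilingual sites among the pages that are visited.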


Bibliographic reference. Sridhar, Vivek Kumar Rangarajan / Barbosa, Luciano / Bangalore, Srinivas (2011): "A scalable approach to building a parallel corpus from the web", In INTERSPEECH-2011, 2113-2116.