This paper addresses parallel data extraction from the quasi-parallel corpora generated in a crowd-sourcing project in which ordinary people watch TV shows and movies and transcribe or translate what they hear, creating document pools in different languages. Because the contributors follow no guidelines for naming or performing translations, it is often unclear which documents are translations of the same show or movie, and which sentences in a given document pair are translations of each other. We introduce a method for automatically pairing documents in two languages and extracting parallel sentences from the paired documents. The method consists of three steps: i) document pairing, ii) sentence pair alignment of the paired documents, and iii) context extrapolation to boost the sentence pair coverage. Human evaluation of the extracted data shows that 95% of the extracted sentences carry useful information for translation. Experimental results also show that using the extracted data provides significant gains over the baseline statistical machine translation system built with manually annotated data.
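The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the scoring functions, so the lexicon-based Jaccard overlap, the 0.5 threshold, the "same offset" neighbour rule in the extrapolation step, and all function and variable names below are assumptions chosen only to make the pipeline shape concrete.

```python
from typing import Dict, List, Set, Tuple

def translate_bag(sentence: str, lexicon: Dict[str, str]) -> Set[str]:
    """Map source words into the target vocabulary using a toy bilingual lexicon."""
    return {lexicon[w] for w in sentence.lower().split() if w in lexicon}

def overlap(src: str, tgt: str, lexicon: Dict[str, str]) -> float:
    """Jaccard overlap between translated source words and target words (assumed score)."""
    a = translate_bag(src, lexicon)
    b = set(tgt.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pair_documents(src_docs: Dict[str, List[str]],
                   tgt_docs: Dict[str, List[str]],
                   lexicon: Dict[str, str]) -> Dict[str, str]:
    """Step i: pair each source document with its best-scoring target document."""
    pairs = {}
    for s_name, s_sents in src_docs.items():
        scores = {t_name: overlap(" ".join(s_sents), " ".join(t_sents), lexicon)
                  for t_name, t_sents in tgt_docs.items()}
        pairs[s_name] = max(scores, key=scores.get)
    return pairs

def align_sentences(src: List[str], tgt: List[str],
                    lexicon: Dict[str, str],
                    threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Step ii: keep (i, j) index pairs whose overlap clears the (assumed) threshold."""
    anchors = []
    for i, s in enumerate(src):
        best_j = max(range(len(tgt)), key=lambda j: overlap(s, tgt[j], lexicon))
        if overlap(s, tgt[best_j], lexicon) >= threshold:
            anchors.append((i, best_j))
    return anchors

def extrapolate_context(anchors: List[Tuple[int, int]],
                        n_src: int, n_tgt: int,
                        window: int = 1) -> List[Tuple[int, int]]:
    """Step iii: add unaligned neighbours at the same offset from each anchor pair."""
    used_src = {i for i, _ in anchors}
    used_tgt = {j for _, j in anchors}
    extra = []
    for i, j in anchors:
        for d in range(-window, window + 1):
            ni, nj = i + d, j + d
            if d == 0 or not (0 <= ni < n_src and 0 <= nj < n_tgt):
                continue
            if ni not in used_src and nj not in used_tgt:
                extra.append((ni, nj))
                used_src.add(ni)
                used_tgt.add(nj)
    return sorted(anchors + extra)

# Toy data (hypothetical): one Spanish document, two English candidates.
LEXICON = {"hola": "hello", "amigo": "friend", "adios": "goodbye"}
SRC_DOCS = {"ep1_es": ["hola amigo", "que pasa aqui", "adios amigo"]}
TGT_DOCS = {
    "show_a_en": ["hello friend", "what is going on", "goodbye friend"],
    "show_b_en": ["the weather is nice"],
}
```

In this toy setup the middle sentence has no lexicon coverage, so step ii alone cannot align it; step iii recovers it because both of its neighbours are already aligned, which is the kind of coverage gain the extrapolation step is meant to provide.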
Bibliographic reference. Sarikaya, R. / Maskey, Sameer / Zhang, R. / Jan, Ea-Ee / Wang, D. / Ramabhadran, Bhuvana / Roukos, S. (2009): "Iterative sentence-pair extraction from quasi-parallel corpora for machine translation", In INTERSPEECH-2009, 432-435.