ISCA Archive Interspeech 2009

Iterative sentence-pair extraction from quasi-parallel corpora for machine translation

R. Sarikaya, Sameer Maskey, R. Zhang, Ea-Ee Jan, D. Wang, Bhuvana Ramabhadran, S. Roukos

This paper addresses parallel data extraction from the quasi-parallel corpora generated in a crowd-sourcing project where ordinary people watch TV shows and movies and transcribe or translate what they hear, creating document pools in different languages. Because the contributors have no guidelines for naming documents or performing translations, it is often unclear which documents are translations of the same show or movie, and which sentences are translations of each other within a given document pair. We introduce a method for automatically pairing documents in two languages and extracting parallel sentences from the paired documents. The method consists of three steps: i) document pairing, ii) sentence-pair alignment of the paired documents, and iii) context extrapolation to boost sentence-pair coverage. Human evaluation of the extracted data shows that 95% of the extracted sentences carry useful information for translation. Experimental results also show that using the extracted data provides significant gains over a baseline statistical machine translation system built with manually annotated data.
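The three steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the scoring functions or thresholds, so the code assumes a toy word-for-word lexicon, a simple lexical-overlap score, and hypothetical function names (`overlap`, `pair_documents`, `align_sentences`, `extrapolate_context`). In particular, reading "context extrapolation" as accepting the neighbours of confident sentence pairs under a relaxed threshold is an assumption made here for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: a toy bilingual lexicon (source word -> target word).
# The paper's actual similarity model is not described in the abstract.
Lexicon = Dict[str, str]


def overlap(src: str, tgt: str, lex: Lexicon) -> float:
    """Fraction of source words whose lexicon translation occurs in the target."""
    src_words = src.lower().split()
    tgt_words = set(tgt.lower().split())
    if not src_words:
        return 0.0
    hits = sum(1 for w in src_words if lex.get(w) in tgt_words)
    return hits / len(src_words)


def pair_documents(src_docs: Dict[str, List[str]],
                   tgt_docs: Dict[str, List[str]],
                   lex: Lexicon,
                   min_score: float = 0.3) -> List[Tuple[str, str]]:
    """Step i (sketch): pair each source document with its best-scoring target."""
    pairs = []
    for s_name, s_sents in src_docs.items():
        best_name, best_score = None, min_score
        for t_name, t_sents in tgt_docs.items():
            score = overlap(" ".join(s_sents), " ".join(t_sents), lex)
            if score >= best_score:
                best_name, best_score = t_name, score
        if best_name is not None:
            pairs.append((s_name, best_name))
    return pairs


def align_sentences(src: List[str], tgt: List[str], lex: Lexicon,
                    min_score: float = 0.5) -> List[Tuple[int, int]]:
    """Step ii (sketch): keep only confidently matching sentence index pairs."""
    aligned = []
    for i, s in enumerate(src):
        scores = [overlap(s, t, lex) for t in tgt]
        if scores:
            j = max(range(len(tgt)), key=scores.__getitem__)
            if scores[j] >= min_score:
                aligned.append((i, j))
    return aligned


def extrapolate_context(src: List[str], tgt: List[str],
                        anchors: List[Tuple[int, int]], lex: Lexicon,
                        relaxed: float = 0.2) -> List[Tuple[int, int]]:
    """Step iii (sketch): grow outward from confident pairs, accepting adjacent
    sentence pairs under a relaxed threshold until no new pair is added."""
    pairs = set(anchors)
    used_src = {i for i, _ in pairs}
    used_tgt = {j for _, j in pairs}
    changed = True
    while changed:
        changed = False
        for i, j in sorted(pairs):
            for d in (-1, 1):
                ni, nj = i + d, j + d
                if (0 <= ni < len(src) and 0 <= nj < len(tgt)
                        and ni not in used_src and nj not in used_tgt
                        and overlap(src[ni], tgt[nj], lex) >= relaxed):
                    pairs.add((ni, nj))
                    used_src.add(ni)
                    used_tgt.add(nj)
                    changed = True
    return sorted(pairs)
```

In this sketch, a sentence pair that scores too low to be aligned on its own can still be recovered in step iii if it sits next to a confident pair, which is one plausible way such a step could raise coverage.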

doi: 10.21437/Interspeech.2009-156

Cite as: Sarikaya, R., Maskey, S., Zhang, R., Jan, E.-E., Wang, D., Ramabhadran, B., Roukos, S. (2009) Iterative sentence-pair extraction from quasi-parallel corpora for machine translation. Proc. Interspeech 2009, 432-435, doi: 10.21437/Interspeech.2009-156

  author={R. Sarikaya and Sameer Maskey and R. Zhang and Ea-Ee Jan and D. Wang and Bhuvana Ramabhadran and S. Roukos},
  title={{Iterative sentence-pair extraction from quasi-parallel corpora for machine translation}},
  booktitle={Proc. Interspeech 2009},