International Workshop on Spoken Language Translation (IWSLT) 2008

Honolulu, Hawaii, USA
October 20-21, 2008

Investigations on Large-Scale Lightly-Supervised Training for Statistical Machine Translation

Holger Schwenk

LIUM, University of Le Mans, France

Sentence-aligned bilingual texts are a crucial resource for building statistical machine translation (SMT) systems. In this paper we propose applying lightly-supervised training to produce additional parallel data. The idea is to translate large amounts of monolingual data (up to 275M words) with an SMT system and to use the resulting automatic translations as additional training data. Results are reported for translation from French into English. We consider two setups: in the first, the initial SMT system is trained on only a very limited amount of human-produced translations; in the second, it is trained on more than 100 million words. In both conditions, lightly-supervised training achieves significant improvements in BLEU score.
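
The data-augmentation idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `lightly_supervised_corpus` and `toy_translate`, the tiny lexicon, and the choice to translate French sentences into English are all assumptions made for illustration, and a real setup would use a trained SMT system and might filter the automatic translations before reuse.

```python
from typing import Callable, Iterable, List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def lightly_supervised_corpus(
    human_pairs: List[SentencePair],
    monolingual: Iterable[str],
    translate: Callable[[str], str],
) -> List[SentencePair]:
    """Extend a human-translated corpus with automatically translated pairs.

    `translate` stands in for the initial SMT system (hypothetical interface).
    """
    # Translate every monolingual sentence with the initial system.
    automatic_pairs = [(sentence, translate(sentence)) for sentence in monolingual]
    # The augmented corpus is what a new system would then be trained on.
    return list(human_pairs) + automatic_pairs


def toy_translate(sentence: str) -> str:
    """Word-by-word lookup used only as a stand-in for a real SMT system."""
    lexicon = {"bonjour": "hello", "monde": "world"}  # assumed toy lexicon
    return " ".join(lexicon.get(word, word) for word in sentence.split())


human = [("bonjour", "hello")]
mono = ["bonjour monde"]
augmented = lightly_supervised_corpus(human, mono, toy_translate)
```

Here `augmented` holds the original human-produced pair followed by the automatically produced one; retraining on this larger corpus is the step the abstract refers to as lightly-supervised training.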

Bibliographic reference.  Schwenk, Holger (2008): "Investigations on large-scale lightly-supervised training for statistical machine translation", In IWSLT-2008, 182-189.