ISCA Archive IWSLT 2010

The pay-offs of preprocessing for German-English statistical machine translation

Ilknur Durgar El-Kahlout, Francois Yvon

In this paper, we present the results of our work on improving preprocessing for German-English statistical machine translation. We implemented and tested various improvements aimed at i) converting German texts to the new orthographic conventions; ii) performing a new tokenization for German; iii) normalizing lexical redundancy with the help of POS tagging and morphological analysis; iv) splitting German compound words with a frequency-based algorithm; and v) reducing singletons and out-of-vocabulary (OOV) words. All these steps are performed during preprocessing on the German side. Combining all these processes, we reduced the number of singletons by 10% and of OOV words by 2%, and obtained a 1.5-point absolute (7% relative) BLEU improvement on the WMT 2010 German-to-English News translation task.
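
To illustrate step iv), the following is a minimal sketch of one common frequency-based splitting heuristic: among all segmentations of a word into known corpus words, keep the one whose parts have the highest geometric-mean frequency. The abstract does not specify the authors' exact algorithm, so the scoring rule, the function name `split_compound`, the `min_len` threshold, the linking-element list `fillers`, and the toy frequency table are all assumptions made for illustration.

```python
from math import prod


def split_compound(word, freq, min_len=3, fillers=("", "s", "es")):
    """Return the segmentation of `word` with the highest geometric-mean
    part frequency. `freq` maps lowercased corpus words to counts.
    Illustrative heuristic only, not the paper's confirmed method."""
    word = word.lower()

    def candidates(rest):
        # The unsplit remainder is a candidate if it occurs in the corpus.
        if freq.get(rest, 0) > 0:
            yield [rest]
        # Try every split point that leaves at least min_len chars per side.
        for i in range(min_len, len(rest) - min_len + 1):
            head, tail = rest[:i], rest[i:]
            for filler in fillers:
                # Optionally strip a linking element (e.g. "s") from the head.
                if filler and not head.endswith(filler):
                    continue
                stem = head[:len(head) - len(filler)] if filler else head
                if len(stem) >= min_len and freq.get(stem, 0) > 0:
                    for tail_split in candidates(tail):
                        yield [stem] + tail_split

    def score(parts):
        # Geometric mean of the part frequencies.
        return prod(freq[p] for p in parts) ** (1.0 / len(parts))

    options = list(candidates(word))
    if not options:
        return [word]  # unknown word with no known parts: leave it unsplit
    return max(options, key=score)


# Toy counts, invented for this example.
toy_freq = {
    "aktien": 50, "gesellschaft": 80, "aktiengesellschaft": 10,
    "arbeit": 100, "zimmer": 60,
    "freitag": 500, "frei": 30, "tag": 40,
}
print(split_compound("Aktiengesellschaft", toy_freq))
print(split_compound("Arbeitszimmer", toy_freq))
print(split_compound("Freitag", toy_freq))
```

Because the unsplit word competes with its own segmentations, a frequent word such as "Freitag" stays intact even though "frei" and "tag" are both known words, while a rare compound is broken into its more frequent parts.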


Cite as: El-Kahlout, I.D., Yvon, F. (2010) The pay-offs of preprocessing for German-English statistical machine translation. Proc. International Workshop on Spoken Language Translation (IWSLT 2010), 251-258

@inproceedings{elkahlout10_iwslt,
  author={Ilknur Durgar El-Kahlout and Francois Yvon},
  title={{The pay-offs of preprocessing for German-English statistical machine translation}},
  year=2010,
  booktitle={Proc. International Workshop on Spoken Language Translation (IWSLT 2010)},
  pages={251--258}
}