International Workshop on Spoken Language Translation (IWSLT) 2005

Pittsburgh, PA, USA
October 24-25, 2005

Low Cost Portability for Statistical Machine Translation based on N-gram Frequency and TF-IDF

Matthias Eck, Stephan Vogel, Alex Waibel

Interactive Systems Laboratories, Carnegie Mellon University, Pittsburgh, PA, USA

Statistical machine translation relies heavily on the available training data. In some cases, the amount of training data that can be created for, or actually used by, a system must be limited. We introduce weighting schemes that allow us to sort sentences based on the frequency of unseen n-grams. A second approach uses TF-IDF to rank the sentences. After sorting, we select smaller training corpora and show that systems trained on much less data achieve very competitive performance compared to baseline systems trained on all available data.
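
The idea of sorting sentences by the frequency of their unseen n-grams can be illustrated with a minimal Python sketch. The abstract does not give the exact weighting scheme, so everything specific here is an assumption: the names `ngrams` and `select_sentences`, the greedy selection loop, whitespace tokenization, the use of n-grams up to order 2, and the score itself (summed corpus frequency of a sentence's not-yet-covered n-grams, divided by sentence length). It is not the authors' implementation.

```python
from collections import Counter


def ngrams(tokens, max_n=2):
    """Return all n-grams of order 1..max_n as tuples."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def select_sentences(sentences, budget, max_n=2):
    """Greedily pick up to `budget` sentences, preferring frequent unseen n-grams.

    Assumed weighting (not taken from the paper): the summed corpus
    frequency of a sentence's not-yet-covered n-grams, divided by its length.
    """
    tokenized = [s.split() for s in sentences]
    grams = [set(ngrams(t, max_n)) for t in tokenized]
    # Corpus-wide n-gram frequencies, counted over every occurrence.
    freq = Counter(g for t in tokenized for g in ngrams(t, max_n))

    seen = set()                      # n-grams covered by sentences chosen so far
    remaining = set(range(len(sentences)))
    chosen = []
    while remaining and len(chosen) < budget:
        def score(i):
            unseen = grams[i] - seen
            return sum(freq[g] for g in unseen) / max(len(tokenized[i]), 1)

        # Iterating in index order makes ties resolve to the earliest sentence.
        best = max(sorted(remaining), key=score)
        chosen.append(best)
        seen |= grams[best]
        remaining.remove(best)
    return [sentences[i] for i in chosen]


corpus = ["a b", "a b", "c d", "a c"]
print(select_sentences(corpus, 2))
```

In this toy corpus the duplicate "a b" contributes no unseen n-grams once the first copy is selected, so the second pick goes to a sentence with new material. Rescoring after every pick is quadratic in the corpus size, which is acceptable for a sketch but would need a more efficient data structure on a real training corpus.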


Bibliographic reference. Eck, Matthias / Vogel, Stephan / Waibel, Alex (2005): "Low Cost Portability for Statistical Machine Translation based on N-gram Frequency and TF-IDF", in IWSLT-2005, 61-67.