ISCA Archive IWSLT 2005

Low cost Portability for statistical machine translation based on n-gram frequency and TF-IDF

Matthias Eck, Stephan Vogel, Alex Waibel

Statistical machine translation relies heavily on the available training data. In some cases it is necessary to limit the amount of training data that can be created for, or actually used by, the systems. We introduce weighting schemes that allow us to sort sentences based on the frequency of unseen n-grams. A second approach uses TF-IDF to rank the sentences. After sorting, we can select smaller training corpora, and we show that systems trained on much less training data achieve very competitive performance compared to baseline systems using all available training data.
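The abstract does not spell out the weighting schemes, so the following is only a minimal sketch of one plausible reading of "sorting sentences based on the frequency of unseen n-grams". It assumes whitespace tokenization, a greedy selection loop, and a weight equal to the summed corpus frequency of a sentence's not-yet-covered n-grams divided by its length. The names `ngrams`, `select_sentences`, `budget`, and `max_n` are hypothetical and not taken from the paper.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Return the set of distinct n-grams (as tuples) of order 1..max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def select_sentences(sentences, budget, max_n=2):
    """Greedily pick up to `budget` sentences by unseen n-gram frequency.

    Assumed weighting (not the paper's exact formula): sum of corpus
    frequencies of the sentence's n-grams that no selected sentence has
    covered yet, normalized by sentence length in tokens.
    """
    tokenized = [s.split() for s in sentences]

    # Corpus frequency of every n-gram, counting each occurrence.
    freq = Counter()
    for toks in tokenized:
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                freq[tuple(toks[i:i + n])] += 1

    seen = set()                      # n-grams covered by selected sentences
    remaining = set(range(len(sentences)))
    selected = []

    def weight(idx):
        toks = tokenized[idx]
        if not toks:
            return 0.0
        unseen = ngrams(toks, max_n) - seen
        return sum(freq[g] for g in unseen) / len(toks)

    while remaining and len(selected) < budget:
        # Ties are broken in favor of the lowest sentence index.
        best = max(sorted(remaining), key=weight)
        selected.append(best)
        remaining.remove(best)
        seen |= ngrams(tokenized[best], max_n)

    return [sentences[i] for i in selected]
```

Because the set of seen n-grams grows after each pick, a duplicate of an already selected sentence drops to weight zero, which is what lets a small subset cover most of the frequent n-grams. The TF-IDF ranking mentioned in the abstract is a separate approach and is not shown here.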


Cite as: Eck, M., Vogel, S., Waibel, A. (2005) Low cost Portability for statistical machine translation based on n-gram frequency and TF-IDF. Proc. International Workshop on Spoken Language Translation (IWSLT 2005), 61-67

@inproceedings{eck05b_iwslt,
  author={Matthias Eck and Stephan Vogel and Alex Waibel},
  title={{Low cost Portability for statistical machine translation based on n-gram frequency and TF-IDF}},
  year=2005,
  booktitle={Proc. International Workshop on Spoken Language Translation (IWSLT 2005)},
  pages={61--67}
}