International Workshop on Spoken Language Translation (IWSLT) 2005
Pittsburgh, PA, USA
Statistical machine translation relies heavily on the available training data. In some cases, the amount of training data that can be created for, or actually used by, a system must be limited. We introduce weighting schemes that allow us to sort sentences based on the frequency of their previously unseen n-grams. A second approach uses TF-IDF to rank the sentences. After sorting, we select smaller training corpora and show that systems trained on much less data achieve performance that is very competitive with baseline systems trained on all available data.
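To make the first idea concrete, the following is a minimal sketch of one plausible reading of the n-gram weighting scheme, not the authors' implementation. It assumes whitespace tokenization, a sentence weight equal to the summed corpus frequency of the sentence's not-yet-covered n-grams divided by sentence length, and a greedy selection that re-scores the remaining sentences after each pick. The function names (`ngrams`, `rank_by_unseen_ngrams`), the `max_n` parameter, and the length normalization are all illustrative assumptions. The TF-IDF ranking is not covered here.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Return the set of distinct n-grams (as tuples) of length 1..max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def rank_by_unseen_ngrams(sentences, max_n=2):
    """Greedily order sentence indices by the weight of their unseen n-grams.

    Assumed weighting (hypothetical, for illustration only): the sum of the
    corpus frequencies of the sentence's n-grams that no previously selected
    sentence contains, divided by the sentence length in tokens.
    """
    tokenized = [s.split() for s in sentences]

    # Corpus frequency of every n-gram, counting all occurrences.
    freq = Counter()
    for toks in tokenized:
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                freq[tuple(toks[i:i + n])] += 1

    seen = set()                       # n-grams covered by selected sentences
    remaining = set(range(len(sentences)))
    order = []

    def weight(idx):
        toks = tokenized[idx]
        if not toks:
            return 0.0
        unseen = ngrams(toks, max_n) - seen
        return sum(freq[g] for g in unseen) / len(toks)

    while remaining:
        # Ties are broken in favor of the lowest sentence index.
        best = max(sorted(remaining), key=weight)
        order.append(best)
        seen |= ngrams(tokenized[best], max_n)
        remaining.remove(best)
    return order


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat sat", "a dog barked loudly"]
    ranking = rank_by_unseen_ngrams(corpus, max_n=1)
    # Keep only the top-ranked sentences as the reduced training corpus.
    print([corpus[i] for i in ranking[:2]])
```

Truncating the returned ordering at a chosen size gives a smaller training corpus. In this sketch a duplicate sentence drops to the end of the ranking, because it contributes no new n-grams once its twin has been selected. The quadratic re-scoring loop is kept for readability and would need a more efficient update for corpora of realistic size.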
Bibliographic reference. Eck, Matthias / Vogel, Stephan / Waibel, Alex (2005): "Low cost Portability for statistical machine translation based on n-gram frequency and TF-IDF", In IWSLT-2005, 61-67.