International Workshop on Spoken Language Translation (IWSLT) 2009
We address the complex morphology of Turkish by applying different schemes of morphological word segmentation to the training and test data of a phrase-based statistical machine translation system. These techniques considerably reduce the size of the training dictionary and lower the out-of-vocabulary rate of the test set. By minimizing the differences between the lexical granularities of Turkish and English, we can produce more refined alignments and a better model of the translation task. Morphological segmentation is highly language dependent and requires a fair amount of linguistic knowledge in its development phase. Yet it is fast and lightweight, does not involve syntax, and appears to benefit our IWSLT09 system: our best segmentation scheme, combined with a simple lexical approximation technique, achieved a 50% reduction in out-of-vocabulary rate and an improvement of over 5 BLEU points over the baseline.
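As a rough illustration of why segmentation can lower the out-of-vocabulary rate, the following Python sketch compares the rate before and after splitting words into stem and suffix tokens. It is a toy under stated assumptions: the suffix list, the `segment` function, and the tiny corpora are hypothetical and are not the segmentation schemes evaluated in the paper, which rely on linguistic knowledge of Turkish morphology rather than plain suffix matching.

```python
# Toy illustration only: the suffix list and the data below are hypothetical
# and do not reproduce the segmentation schemes evaluated in the paper.
SUFFIXES = ["lerde", "larda", "ler", "lar", "de", "da", "te", "ta"]  # longest first


def segment(word):
    """Split off at most one known suffix, marking it with a leading '+'."""
    for suffix in SUFFIXES:
        # Require a stem of at least two characters.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return [word[: -len(suffix)], "+" + suffix]
    return [word]


def segment_corpus(sentences):
    """Apply segment() to every word of every sentence."""
    return [[tok for word in sent for tok in segment(word)] for sent in sentences]


def oov_rate(train, test):
    """Fraction of test tokens that never occur in the training data."""
    vocab = {tok for sent in train for tok in sent}
    tokens = [tok for sent in test for tok in sent]
    unseen = sum(1 for tok in tokens if tok not in vocab)
    return unseen / len(tokens)


train = [["evler", "okulda"], ["kitaplar", "evde"]]
test = [["okullar", "kitapta"]]

print("unsegmented OOV rate:", oov_rate(train, test))
print("segmented OOV rate:  ",
      oov_rate(segment_corpus(train), segment_corpus(test)))
```

On this toy data both unsegmented test words are unseen in training (rate 1.0), while after segmentation only the suffix token `+ta` remains unseen (rate 0.25), because the stems and most suffixes already occur in other training words.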
Bibliographic reference. Bisazza, Arianna / Federico, Marcello (2009): "Morphological pre-processing for Turkish to English statistical machine translation", In IWSLT-2009, 129-135.