SLTU-2008 - First International Workshop on Spoken Languages Technologies for Under-Resourced Languages

Hanoi, Vietnam
May 5-7, 2008

Translation of Unknown Words in Phrase-Based Statistical Machine Translation for Languages of Rich Morphology

Karunesh Arora (1), Michael Paul (2), Eiichiro Sumita (2)

(1) CDAC, Noida, India; (2) NICT/ATR, Keihanna Science City, Kyoto, Japan

This paper proposes a method for handling out-of-vocabulary (OOV) words that cannot be translated by conventional phrase-based statistical machine translation (SMT) systems. For a given OOV word, lexical approximation techniques are used to identify spelling and inflectional word variants that occur in the training data. All OOV words in the source sentence are replaced with appropriate word variants found in the training corpus, thus reducing the number of OOV words in the input. Moreover, to increase the coverage of such word translations, the SMT translation model is extended by adding new phrase translations for all source language words that do not have a single-word entry in the original phrase-table, but only appear in the context of larger phrases. The effectiveness of the proposed method is investigated for Hindi-to-Japanese translation. The methodology can easily be extended to other language pairs involving morphologically rich languages.
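
The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: string similarity from Python's `difflib` stands in for the paper's lexical approximation techniques, the function names `replace_oov` and `words_without_single_entry`, the `cutoff` value, and the dictionary layout of the phrase table are all hypothetical, and the toy data is English rather than Hindi.

```python
from difflib import get_close_matches


def replace_oov(tokens, vocabulary, cutoff=0.8):
    """Replace each OOV token with its closest in-vocabulary variant.

    Assumption: plain string similarity approximates the spelling and
    inflectional variant lookup; the paper's own techniques may differ.
    Tokens with no sufficiently similar variant are left unchanged.
    """
    vocab_list = sorted(vocabulary)
    result = []
    for tok in tokens:
        if tok in vocabulary:
            result.append(tok)
            continue
        matches = get_close_matches(tok, vocab_list, n=1, cutoff=cutoff)
        result.append(matches[0] if matches else tok)
    return result


def words_without_single_entry(phrase_table):
    """Return source words that occur only inside multi-word phrases.

    Assumption: the phrase table is a dict keyed by the source phrase
    as a space-separated string. These words are the candidates for
    which new single-word phrase translations would be added.
    """
    single = {src for src in phrase_table if len(src.split()) == 1}
    all_words = {word for src in phrase_table for word in src.split()}
    return all_words - single


# Toy data, purely illustrative.
vocabulary = {"the", "walk", "walked", "house"}
replaced = replace_oov(["the", "walks", "houses", "zzz"], vocabulary)

phrase_table = {"red house": ["x"], "house": ["y"], "big red": ["z"]}
candidates = words_without_single_entry(phrase_table)
```

In this sketch, `walks` and `houses` are mapped to the in-vocabulary forms `walk` and `house`, while `zzz` has no close variant and stays OOV. How translations for the candidate words are actually derived from the larger phrases is not specified in the abstract and is left out here.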

Index Terms— statistical MT, out-of-vocabulary words, lexical approximation, phrase-table extension

Bibliographic reference. Arora, Karunesh / Paul, Michael / Sumita, Eiichiro (2008): "Translation of unknown words in phrase-based statistical machine translation for languages of rich morphology", In SLTU-2008, 70-75.