Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

Language Model Adaptation for Resource Deficient Languages Using Translated Data

Arnar Thor Jensson, Edward W. D. Whittaker, Koji Iwano, Sadaoki Furui

Tokyo Institute of Technology, Japan

Text corpus size is an important issue when building a language model (LM), particularly for languages where little data is available. This paper introduces a technique to improve an LM built from a small amount of task-dependent text with the help of a machine-translated text corpus. Perplexity experiments were performed using data machine translated (MT) from English to French on a sentence-by-sentence basis and, using dictionary lookup, on a word-by-word basis. Perplexity and word error rate experiments were then performed using MT data translated from English to Icelandic on a word-by-word basis. For the latter, the baseline word error rate was 44.0%; LM interpolation reduced the word error rate significantly, to 39.2%.
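The abstract does not spell out how the two models are combined, so the following is only a minimal sketch of one common form of LM interpolation: a linear mixture of a task-dependent model and a model trained on translated text, evaluated by perplexity. The unigram tables, the weight `lam`, and the function names are all hypothetical illustrations, not the authors' models or settings.

```python
import math

def interpolate(p_task, p_mt, lam):
    """Linearly mix two word-probability tables.

    lam is the weight on the task-dependent model (an assumed
    free parameter; the abstract does not report its value).
    """
    vocab = set(p_task) | set(p_mt)
    return {
        w: lam * p_task.get(w, 0.0) + (1.0 - lam) * p_mt.get(w, 0.0)
        for w in vocab
    }

def perplexity(model, words):
    """Perplexity of a unigram model on a word sequence."""
    log_sum = sum(math.log2(model[w]) for w in words)
    return 2.0 ** (-log_sum / len(words))

# Toy distributions, invented for illustration only.
P_TASK = {"a": 0.7, "b": 0.2, "c": 0.1}   # small task-dependent corpus
P_MT = {"a": 0.3, "b": 0.3, "c": 0.4}     # machine-translated corpus

MIXED = interpolate(P_TASK, P_MT, lam=0.5)
TEST_WORDS = ["a", "b", "c", "c"]
MIXED_PPL = perplexity(MIXED, TEST_WORDS)
TASK_PPL = perplexity(P_TASK, TEST_WORDS)
```

In practice the weight would typically be tuned on held-out data, and the models would be n-gram models with smoothing rather than bare unigram tables; the sketch only shows the mixing step and the perplexity measure.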

Bibliographic reference. Jensson, Arnar Thor / Whittaker, Edward W. D. / Iwano, Koji / Furui, Sadaoki (2005): "Language model adaptation for resource deficient languages using translated data", In INTERSPEECH-2005, 1329-1332.