9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Rich Morphology Based N-Gram Language Models for Arabic

Ahmad Emami, Imed Zitouni, Lidia Mangu

IBM T.J. Watson Research Center, USA

In this paper we investigate the use of rich morphological information, such as word segmentation, part-of-speech tagging and diacritic restoration, to improve Arabic language modeling. We enrich the context by performing morphological analysis on the word history. We use neural network models to integrate this additional information because of their ability to handle long and enriched dependencies. We experimented with models that use progressively richer sets of morphological features, starting with Arabic segmentation and later adding part-of-speech labels as well as words with restored diacritics. Experiments on Arabic broadcast news and broadcast conversation data showed significant improvements in perplexity, reducing the perplexities of the baseline N-gram model and the neural network N-gram model by 35% and 31%, respectively.
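
The idea of enriching the word history can be pictured with a minimal sketch. It is not the authors' implementation. It assumes each history word has already been analyzed into four features (surface word, segmentation, part-of-speech tag, diacritized form) and shows how those features might be flattened into one index vector of the kind a neural network language model could take as input. The names MorphToken, build_vocabs and enriched_context are hypothetical, and the transliterated tokens are toy values made up for illustration.

```python
from typing import Dict, List, NamedTuple


class MorphToken(NamedTuple):
    """One history word plus its (assumed, precomputed) morphological analysis."""
    word: str          # surface form
    segmentation: str  # segmented form, e.g. prefix+ stem
    pos: str           # part-of-speech label
    diacritized: str   # form with restored diacritics


FEATURES = MorphToken._fields  # ('word', 'segmentation', 'pos', 'diacritized')


def build_vocabs(tokens: List[MorphToken]) -> Dict[str, Dict[str, int]]:
    """Build one index map per feature type; index 0 is reserved for unknowns."""
    vocabs = {name: {"<unk>": 0} for name in FEATURES}
    for tok in tokens:
        for name in FEATURES:
            value = getattr(tok, name)
            vocabs[name].setdefault(value, len(vocabs[name]))
    return vocabs


def enriched_context(history: List[MorphToken], n: int,
                     vocabs: Dict[str, Dict[str, int]]) -> List[int]:
    """Flatten the features of the last n-1 history tokens into index form.

    Padding for histories shorter than n-1 is omitted for brevity.
    """
    if n < 2:
        raise ValueError("n must be at least 2 to have a history")
    indices = []
    for tok in history[-(n - 1):]:
        for name in FEATURES:
            # Unseen feature values fall back to the <unk> index 0.
            indices.append(vocabs[name].get(getattr(tok, name), 0))
    return indices


# Toy two-word history (illustrative transliterations, not data from the paper).
history = [
    MorphToken("wktAb", "w+ ktAb", "CONJ+NOUN", "wakitAb"),
    MorphToken("Aljdyd", "Al+ jdyd", "DET+ADJ", "Aljadiyd"),
]
vocabs = build_vocabs(history)
features = enriched_context(history, n=3, vocabs=vocabs)
```

In this sketch a trigram context of two words yields eight input indices instead of two, which is the sense in which the context is "enriched". Dropping entries from FEATURES would correspond to the smaller feature sets mentioned in the abstract, such as segmentation only.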

Bibliographic reference. Emami, Ahmad / Zitouni, Imed / Mangu, Lidia (2008): "Rich morphology based n-gram language models for Arabic", in INTERSPEECH-2008, 829-832.