ISCA Archive Interspeech 2008

Rich morphology based n-gram language models for Arabic

Ahmad Emami, Imed Zitouni, Lidia Mangu

In this paper we investigate the use of rich morphological information, such as word segmentation, part-of-speech tagging, and diacritic restoration, to improve Arabic language modeling. We enrich the context by performing morphological analysis on the word history. We use neural network models to integrate this additional information because of their ability to handle long and enriched dependencies. We experimented with models that use progressively richer morphological features, starting with Arabic segmentation and later adding part-of-speech labels as well as words with restored diacritics. Experiments on Arabic broadcast news and broadcast conversation data showed significant improvements in perplexity, reducing the perplexities of the baseline n-gram model and the neural network n-gram model by 35% and 31%, respectively.
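As a rough illustration of the idea of an enriched history, the sketch below flattens the last few tokens, each carrying a segment, a part-of-speech label, and a diacritized form, into one feature tuple that a neural network model could take as input. It also shows how perplexity and a relative perplexity reduction are computed. The `Token` type, the function names, the padding symbol, and the toy values are all assumptions made for illustration; the abstract does not specify the authors' feature encoding or network architecture.

```python
import math
from typing import List, NamedTuple, Sequence, Tuple


class Token(NamedTuple):
    """One history position with its morphological annotations (hypothetical layout)."""
    segment: str      # morphological segment, e.g. a prefix, stem, or suffix
    pos: str          # part-of-speech label (tag set is an assumption)
    diacritized: str  # form with restored diacritics


# Hypothetical padding token used when the history is shorter than the context.
PAD = Token("<s>", "<s>", "<s>")


def enriched_context(history: List[Token], order: int) -> Tuple[str, ...]:
    """Flatten the last (order - 1) tokens into a single feature tuple."""
    n = order - 1
    recent = history[-n:] if n > 0 else []
    padding = [PAD] * (n - len(recent))
    features: List[str] = []
    for token in padding + recent:
        features.extend(token)  # segment, POS, diacritized form
    return tuple(features)


def perplexity(probabilities: Sequence[float]) -> float:
    """Perplexity from the per-token probabilities a model assigns to a text."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which perplexity drops relative to the baseline."""
    return (baseline - improved) / baseline


# Toy history with placeholder annotations (not taken from the paper).
history = [Token("w", "CONJ", "wa"), Token("ktb", "VERB", "kataba")]
context = enriched_context(history, order=3)
```

In this sketch, adding a feature type only widens the tuple per history position, which is one plausible reason a neural network model, with its distributed input representation, is easier to extend this way than a count-based n-gram model.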

doi: 10.21437/Interspeech.2008-252

Cite as: Emami, A., Zitouni, I., Mangu, L. (2008) Rich morphology based n-gram language models for Arabic. Proc. Interspeech 2008, 829-832, doi: 10.21437/Interspeech.2008-252

  author={Ahmad Emami and Imed Zitouni and Lidia Mangu},
  title={{Rich morphology based n-gram language models for Arabic}},
  booktitle={Proc. Interspeech 2008},