10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Investigating the Use of Morphological Decomposition and Diacritization for Improving Arabic LVCSR

Amr El-Desoky, Christian Gollan, David Rybach, Ralf Schlüter, Hermann Ney

RWTH Aachen University, Germany

One of the challenges in large-vocabulary Arabic speech recognition is the rich morphology of the Arabic language, which leads to both high out-of-vocabulary (OOV) rates and high language model (LM) perplexities. Another challenge is the absence of short vowels (diacritics) from written Arabic transcripts, which causes a large mismatch between the spoken and written language and thus weakens the connection between the acoustic and language models. In this work, we address these two challenges by introducing both morphological decomposition and diacritization into Arabic language modeling. With this approach, we obtain about a 3.7% relative reduction in word error rate (WER) on our test set compared with a comparable non-diacritized full-word system.
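As a rough illustration of why decomposition can lower the OOV rate, and of what a relative WER reduction means, here is a minimal Python sketch. It is not the system described in the paper: the romanized toy words, the prefix list, the naive one-prefix splitting rule, and the function names `oov_rate`, `decompose`, and `relative_reduction` are all assumptions made for this example, and the WER figures in the last line are hypothetical rather than results from the paper.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of running tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    misses = sum(1 for t in tokens if t not in vocabulary)
    return misses / len(tokens)


def decompose(word, prefixes):
    """Naively split off at most one known prefix, marked with a trailing '+'.

    Hypothetical rule for illustration only; a real morphological analyzer
    is far more careful about when a leading string is truly a prefix.
    """
    for p in sorted(prefixes, key=len, reverse=True):
        if word.startswith(p) and len(word) > len(p):
            return [p + "+", word[len(p):]]
    return [word]


def relative_reduction(baseline, new):
    """Relative reduction of an error rate: (baseline - new) / baseline."""
    return (baseline - new) / baseline


# Toy romanized forms (assumed for illustration): stems plus attached prefixes.
prefixes = ["w", "Al"]
test_words = ["wktAb", "Albyt", "ktAb", "wbyt"]

# Full-word system: the vocabulary holds only the bare stems.
full_vocab = {"ktAb", "byt"}
full_oov = oov_rate(test_words, full_vocab)

# Decomposed system: the vocabulary holds stems and prefix units.
morph_vocab = full_vocab | {p + "+" for p in prefixes}
morph_tokens = [unit for w in test_words for unit in decompose(w, prefixes)]
morph_oov = oov_rate(morph_tokens, morph_vocab)

print(f"full-word OOV rate:  {full_oov:.2f}")
print(f"decomposed OOV rate: {morph_oov:.2f}")

# Hypothetical WER values, NOT taken from the paper.
print(f"relative reduction:  {relative_reduction(20.0, 19.0):.3f}")
```

In this toy setting a small set of stems and prefix units covers word forms that a full-word vocabulary of the same stems misses, which is the intuition behind the lower OOV rates mentioned above.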

Bibliographic reference. El-Desoky, Amr / Gollan, Christian / Rybach, David / Schlüter, Ralf / Ney, Hermann (2009): "Investigating the use of morphological decomposition and diacritization for improving Arabic LVCSR", in INTERSPEECH-2009, 2679-2682.