One challenge in large-vocabulary Arabic speech recognition is the rich morphology of Arabic, which leads to both high out-of-vocabulary (OOV) rates and high language model (LM) perplexities. Another is the absence of short vowels (diacritics) from written Arabic transcripts, which creates a large mismatch between the spoken and written language and thus weakens the connection between the acoustic and language models. In this work, we address these two challenges by introducing both morphological decomposition and diacritization into Arabic language modeling. We obtain a relative reduction in word error rate (WER) of about 3.7% with respect to a comparable non-diacritized full-word system on our test set.
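The two quantities the abstract relies on, OOV rate and relative WER reduction, can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names, the transliterated toy words, and the hand-made prefix split (`wa+`) are assumptions chosen for illustration only, and no figures from the paper are reproduced.

```python
def oov_rate(tokens, vocab):
    """Fraction of running tokens that are not in the recognition vocabulary."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    unseen = sum(1 for t in tokens if t not in vocab)
    return unseen / len(tokens)


def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction: (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical toy data (Latin transliteration, not taken from the paper).
# Full-word system: the inflected form "wakitab" is missing from the vocabulary.
full_vocab = {"kitab", "qalam"}
full_test = ["wakitab", "qalam"]

# Decomposed system: the same text with the prefix split off by hand,
# so both resulting units are covered by a small vocabulary.
morph_vocab = {"wa+", "kitab", "qalam"}
morph_test = ["wa+", "kitab", "qalam"]

full_oov = oov_rate(full_test, full_vocab)
morph_oov = oov_rate(morph_test, morph_vocab)
```

In this toy setting, splitting the prefix from the stem lets a small vocabulary cover a word form it would otherwise miss, which is the intuition behind using morphological decomposition to lower OOV rates. `relative_reduction` expresses an improvement as a fraction of the baseline WER, which is how a figure such as "about 3.7% relative" is to be read.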
Bibliographic reference. El-Desoky, Amr / Gollan, Christian / Rybach, David / Schlüter, Ralf / Ney, Hermann (2009): "Investigating the use of morphological decomposition and diacritization for improving Arabic LVCSR", In INTERSPEECH-2009, 2679-2682.