ISCA Archive Interspeech 2009

Investigating the use of morphological decomposition and diacritization for improving Arabic LVCSR

Amr El-Desoky, Christian Gollan, David Rybach, Ralf Schlüter, Hermann Ney

One of the challenges in large-vocabulary Arabic speech recognition is the rich morphology of the Arabic language, which leads to both high out-of-vocabulary (OOV) rates and high language model (LM) perplexities. Another challenge is the absence of short vowels (diacritics) from written Arabic transcripts, which causes a large mismatch between the spoken and written language and thus a weaker connection between the acoustic and language models. In this work, we address these two challenges by introducing both morphological decomposition and diacritization into Arabic language modeling. With this approach, we obtain a relative reduction in word error rate (WER) of about 3.7% on our test set, compared with a comparable non-diacritized full-word system.
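As a rough illustration of the two quantities mentioned in the abstract, the sketch below computes an OOV rate over running tokens and a relative reduction of an error metric such as WER. The function names, the toy transliterated tokens, the "w+" prefix marker, and all numbers are hypothetical and are not taken from the paper; the sketch only shows why splitting off a frequent prefix can lower the OOV rate for a fixed vocabulary.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of running tokens that are not covered by the vocabulary."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


def relative_reduction(baseline, improved):
    """Relative reduction of an error metric, as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical toy data: full-word tokens versus the same text with the
# conjunction prefix split off as a separate "w+" unit.
full_word_vocab = {"ktb", "qAl"}
full_word_text = ["ktb", "wktb", "wqAl"]

decomposed_vocab = {"ktb", "qAl", "w+"}
decomposed_text = ["ktb", "w+", "ktb", "w+", "qAl"]

full_word_oov = oov_rate(full_word_text, full_word_vocab)
decomposed_oov = oov_rate(decomposed_text, decomposed_vocab)

# Hypothetical WER values in percent, chosen only to exercise the formula.
example_reduction = relative_reduction(20.0, 19.0)
```

In this toy setup two of the three full-word tokens are out of vocabulary, while the decomposed text is fully covered by a vocabulary that is only one unit larger. A relative reduction is always measured against the baseline value, so it should not be read as an absolute difference in WER points.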


doi: 10.21437/Interspeech.2009-123

Cite as: El-Desoky, A., Gollan, C., Rybach, D., Schlüter, R., Ney, H. (2009) Investigating the use of morphological decomposition and diacritization for improving Arabic LVCSR. Proc. Interspeech 2009, 2679-2682, doi: 10.21437/Interspeech.2009-123

@inproceedings{eldesoky09_interspeech,
  author={Amr El-Desoky and Christian Gollan and David Rybach and Ralf Schlüter and Hermann Ney},
  title={{Investigating the use of morphological decomposition and diacritization for improving Arabic LVCSR}},
  year=2009,
  booktitle={Proc. Interspeech 2009},
  pages={2679--2682},
  doi={10.21437/Interspeech.2009-123}
}