INTERSPEECH 2004 - ICSLP
This paper describes recent research on transcribing Modern Standard Arabic broadcast news data. The Arabic language presents a number of challenges for speech recognition, arising in part from the significant differences between the spoken and written forms, in particular the conventional non-vowelized form of written texts. Arabic is a highly inflected language in which articles and affixes are added to roots to change a word's meaning. A corpus of 50 hours of audio data from 7 television and radio sources and 200M words of newspaper texts were used to train the acoustic and language models. The transcription system based on these models and a vowelized dictionary obtains an average word error rate of about 18% on a test set comprising 12 hours of data from 8 sources.
Bibliographic reference. Messaoudi, Abdel / Lamel, Lori / Gauvain, Jean-Luc (2004): "Transcription of Arabic broadcast news", In INTERSPEECH-2004, 1701-1704.