8th International Conference on Spoken Language Processing

Jeju Island, Korea
October 4-8, 2004

Transcription of Arabic Broadcast News

Abdel. Messaoudi (1), Lori Lamel (2), Jean-Luc Gauvain (2)

(1) CNRS-LIMSI, France
(2) Spoken Language Processing Group, France

This paper describes recent research on transcribing Modern Standard Arabic broadcast news data. The Arabic language presents a number of challenges for speech recognition, arising in part from the significant differences between the spoken and written forms, in particular the convention that written texts are non-vowelized. Arabic is a highly inflected language in which articles and affixes are added to roots in order to change the word's meaning. A corpus of 50 hours of audio data from 7 television and radio sources and 200M words of newspaper texts were used to train the acoustic and language models. The transcription system based on these models and a vowelized dictionary obtains an average word error rate of about 18% on a test set comprising 12 hours of data from 8 sources.


Bibliographic reference. Messaoudi, Abdel. / Lamel, Lori / Gauvain, Jean-Luc (2004): "Transcription of Arabic broadcast news", In INTERSPEECH-2004, 1701-1704.