10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Morphological Analysis and Decomposition for Arabic Speech-to-Text Systems

F. Diehl, M. J. F. Gales, M. Tomalin, P. C. Woodland

University of Cambridge, UK

Language modelling for a morphologically complex language such as Arabic is a challenging task. Its agglutinative structure results in data sparsity problems and high out-of-vocabulary rates. In this work, these problems are tackled by applying the MADA tools to the Arabic text. In addition to morphological decomposition, MADA performs context-dependent stem normalisation. Thus, if word-level system combination or scoring is required, this normalisation must be reversed. To address this, a novel context-sensitive method for morpheme-to-word conversion is introduced. The performance of the MADA-decomposed system was evaluated on an Arabic broadcast transcription task. The MADA-based system outperformed the word-based system, with both the morphological decomposition and the stem normalisation found to be important.
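
The morpheme-to-word conversion described in the abstract can be illustrated with a toy sketch. It is not the method of the paper, whose details are not given here. It assumes a hypothetical marker convention in which prefixes end in "+" and suffixes begin with "+", and it reverses normalisation with a simple lookup keyed on the previous normalised word, backing off to the most frequent original form. The function names and the Buckwalter-like strings are invented for illustration.

```python
from collections import Counter, defaultdict


def join_morphemes(tokens):
    """Glue prefixes ('w+') and suffixes ('+h') back onto their stems.

    The '+' marker convention is an assumption made for this sketch.
    """
    words = []
    pending = ""  # prefixes waiting for the stem that follows them
    for tok in tokens:
        if tok.endswith("+"):
            pending += tok[:-1]
        elif tok.startswith("+") and words and not pending:
            words[-1] += tok[1:]
        else:
            words.append(pending + tok.lstrip("+"))
            pending = ""
    if pending:  # dangling prefix at the end of the utterance
        words.append(pending)
    return words


def train_table(pairs):
    """Count original forms per normalised word, with and without context.

    pairs: iterable of (original_words, normalised_words), aligned one-to-one.
    """
    ctx = defaultdict(Counter)  # (previous normalised word, word) -> originals
    uni = defaultdict(Counter)  # word -> originals (context-free backoff)
    for orig, norm in pairs:
        prev = "<s>"
        for o, n in zip(orig, norm):
            ctx[(prev, n)][o] += 1
            uni[n][o] += 1
            prev = n
    return ctx, uni


def denormalise(words, ctx, uni):
    """Map each normalised word to its most likely original form."""
    out = []
    prev = "<s>"
    for w in words:
        if (prev, w) in ctx:
            out.append(ctx[(prev, w)].most_common(1)[0][0])
        elif w in uni:
            out.append(uni[w].most_common(1)[0][0])
        else:
            out.append(w)  # unseen word: leave unchanged
        prev = w
    return out


# Toy aligned data (invented strings, not real corpus output).
pairs = [
    (["qAl", "Ely"], ["qAl", "ElY"]),
    (["jA'", "ElY"], ["jA'", "ElY"]),
    (["qAl", "Ely"], ["qAl", "ElY"]),
]
ctx, uni = train_table(pairs)
joined = join_morphemes(["w+", "ktAb", "+h", "jdyd"])
restored = denormalise(["jA'", "ElY"], ctx, uni)
```

In this sketch the same normalised word "ElY" is restored differently depending on the word before it, which is the sense in which the lookup is context-sensitive. A real system would likely need richer context and smoothing than a single preceding word.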

Bibliographic reference. Diehl, F. / Gales, M. J. F. / Tomalin, M. / Woodland, P. C. (2009): "Morphological analysis and decomposition for Arabic speech-to-text systems", in INTERSPEECH-2009, 2675-2678.