ISCA Archive Interspeech 2009

Morphological analysis and decomposition for Arabic speech-to-text systems

F. Diehl, M. J. F. Gales, M. Tomalin, P. C. Woodland

Language modelling for a morphologically complex language such as Arabic is a challenging task. Its agglutinative structure results in data sparsity problems and high out-of-vocabulary rates. In this work, these problems are tackled by applying the MADA tools to the Arabic text. In addition to morphological decomposition, MADA performs context-dependent stem normalisation. Thus, if word-level system combination or scoring is required, this normalisation must be reversed. To address this, a novel context-sensitive method for morpheme-to-word conversion is introduced. The performance of the MADA-decomposed system was evaluated on an Arabic broadcast transcription task. The MADA-based system outperformed the word-based system, with both the morphological decomposition and the stem normalisation found to be important.


doi: 10.21437/Interspeech.2009-122

Cite as: Diehl, F., Gales, M.J.F., Tomalin, M., Woodland, P.C. (2009) Morphological analysis and decomposition for Arabic speech-to-text systems. Proc. Interspeech 2009, 2675-2678, doi: 10.21437/Interspeech.2009-122

@inproceedings{diehl09_interspeech,
  author={F. Diehl and M. J. F. Gales and M. Tomalin and P. C. Woodland},
  title={{Morphological analysis and decomposition for Arabic speech-to-text systems}},
  year=2009,
  booktitle={Proc. Interspeech 2009},
  pages={2675--2678},
  doi={10.21437/Interspeech.2009-122}
}