Colloquialising Modern Standard Arabic Text for Improved Speech Recognition

Sarah Al-Shareef, Thomas Hain


Modern Standard Arabic (MSA) is the official language of spoken and written Arabic media. Colloquial Arabic (CA) is the set of spoken variants of modern Arabic that exist in the form of regional dialects. CA is used in informal, everyday conversation, while MSA is used for formal communication. An Arabic speaker switches between the two variants according to the situation. Developing an automatic speech recognition system always requires a large collection of transcribed speech or text, and for CA dialects this is an issue. Unlike MSA, CA has limited textual resources because it exists only as a spoken language, without a standardised written form. This paper focuses on the data sparsity issue in CA textual resources and proposes a strategy that emulates a native speaker in colloquialising MSA text, using a machine translation (MT) framework, so that the result can be used to train CA language models (LMs). The empirical results on Levantine CA show that LMs estimated from colloquialised MSA data outperformed MSA LMs, with a perplexity reduction of up to 68% relative. In addition, interpolating colloquialised MSA LMs with a CA LM improved speech recognition performance by 4% relative.
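The two quantities reported in the abstract, a relative perplexity reduction and an interpolated LM, can be illustrated with a minimal sketch. It assumes simple linear interpolation and uses toy unigram tables as stand-ins for real n-gram LMs; the abstract does not specify the interpolation scheme or the model order. The function names, the weight `lam`, the tokens, and all numbers are hypothetical and are not taken from the paper.

```python
import math


def interpolate(p_ca, p_col, lam):
    """Linearly mix two word-probability tables: lam * p_ca + (1 - lam) * p_col."""
    vocab = set(p_ca) | set(p_col)
    return {
        w: lam * p_ca.get(w, 0.0) + (1.0 - lam) * p_col.get(w, 0.0)
        for w in vocab
    }


def perplexity(model, words):
    """Perplexity of a unigram model on a word sequence (all words must be in the model)."""
    log_sum = sum(math.log(model[w]) for w in words)
    return math.exp(-log_sum / len(words))


def relative_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline` (0.68 means 68% relative)."""
    return (baseline - improved) / baseline


# Toy distributions over placeholder tokens (invented for illustration only).
p_ca = {"a": 0.5, "b": 0.25, "c": 0.25}    # stands in for a CA LM
p_col = {"a": 0.25, "b": 0.5, "c": 0.25}   # stands in for a colloquialised MSA LM

mixed = interpolate(p_ca, p_col, lam=0.5)
ppl_mixed = perplexity(mixed, ["a", "b", "c"])

# Invented perplexities, chosen only to show the arithmetic behind "68% relative".
reduction = relative_reduction(baseline=100.0, improved=32.0)
```

In practice the interpolation weight would typically be tuned on held-out CA data, and the same relative-change formula applies to word error rate when comparing recognition performance.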


DOI: 10.21437/Interspeech.2016-788

Cite as

Al-Shareef, S., Hain, T. (2016) Colloquialising Modern Standard Arabic Text for Improved Speech Recognition. Proc. Interspeech 2016, 1345-1349.

BibTeX
@inproceedings{Al-Shareef+2016,
author={Sarah Al-Shareef and Thomas Hain},
title={Colloquialising Modern Standard Arabic Text for Improved Speech Recognition},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-788},
url={http://dx.doi.org/10.21437/Interspeech.2016-788},
pages={1345--1349}
}