Arabic is a language with great dialectal variety, with Modern Standard
Arabic (MSA) being the only standardized dialect. Spoken Arabic is
characterized by frequent code-switching between MSA and Dialectal
Arabic (DA). DA varieties are typically differentiated by region, but
despite their widespread usage, they are under-resourced and lack the
viable corpora and tools necessary for speech recognition and natural
language processing. Existing DA speech corpora are limited in scope,
consisting mainly of telephone conversations and scripted speech.
In this paper, we describe our use of crowdsourcing to create a labeled multi-dialectal speech corpus. We obtained utterance-level dialect labels for 57 hours of high-quality audio from Al Jazeera covering four major varieties of DA: Egyptian, Levantine, Gulf, and North African. Using speaker linking to identify utterances spoken by the same speaker, together with measures of how likely a label is to be accurate based on annotator behavior, we automatically labeled an additional 94 hours. The complete corpus contains 850 hours, of which approximately 18% is DA speech.
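The label propagation step can be illustrated with a minimal sketch. It is not the authors' implementation: the record fields (`id`, `speaker`, `label`, `confidence`), the dialect codes, the `min_confidence` threshold, and the weighted-vote rule are all assumptions made for illustration. The idea is that an unlabeled utterance inherits the dialect of its linked speaker, but only from crowd labels judged sufficiently reliable.

```python
from collections import defaultdict


def propagate_labels(utterances, min_confidence=0.8):
    """Spread trusted crowd labels to unlabeled utterances of the same speaker.

    Each utterance is a dict with hypothetical keys:
      'id'         - utterance identifier
      'speaker'    - speaker cluster assigned by speaker linking
      'label'      - crowd-sourced dialect label, or None if unlabeled
      'confidence' - estimated likelihood that the crowd label is accurate
    """
    # Accumulate confidence-weighted votes per speaker, trusted labels only.
    votes = defaultdict(lambda: defaultdict(float))
    for u in utterances:
        if u["label"] is not None and u["confidence"] >= min_confidence:
            votes[u["speaker"]][u["label"]] += u["confidence"]

    result = {}
    for u in utterances:
        if u["label"] is not None:
            # Crowd-labeled utterances keep their own label.
            result[u["id"]] = u["label"]
        elif u["speaker"] in votes:
            # Unlabeled utterances take the speaker's highest-weighted dialect.
            speaker_votes = votes[u["speaker"]]
            result[u["id"]] = max(speaker_votes, key=speaker_votes.get)
        else:
            # No trusted label exists for this speaker; leave unlabeled.
            result[u["id"]] = None
    return result


utterances = [
    {"id": "u1", "speaker": "s1", "label": "EGY", "confidence": 0.9},
    {"id": "u2", "speaker": "s1", "label": None, "confidence": 0.0},
    {"id": "u3", "speaker": "s2", "label": "GLF", "confidence": 0.5},
    {"id": "u4", "speaker": "s2", "label": None, "confidence": 0.0},
]
labels = propagate_labels(utterances)
```

In this toy input, `u2` inherits the label of `u1` because both belong to speaker `s1`, while `u4` stays unlabeled because the only label for speaker `s2` falls below the assumed confidence threshold.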
Bibliographic reference. Wray, Samantha / Ali, Ahmed (2015): "Crowdsource a little to label a lot: labeling a speech corpus of dialectal Arabic", In INTERSPEECH-2015, 2824-2828.