Third Workshop on Spoken Language Technologies for Under-resourced Languages
Cape Town, South Africa
This article describes our efforts to provide ASR resources for Swahili, a Bantu language spoken across a wide area of East Africa. We begin with an overview of the language's situation at both the linguistic and the digital level. We then report the strategies selected to develop a text corpus, a pronunciation dictionary, and a speech corpus for this under-resourced language. We explore methodologies such as crowdsourcing and a collaborative transcription process. In addition, we take advantage of linguistic characteristics of the language, such as its rich morphology and its vocabulary shared with English, to improve the performance of our baseline Swahili ASR system on a broadcast speech transcription task.
Index Terms: Swahili, under-resourced languages, automatic speech recognition, speech resources
Bibliographic reference. Gelas, Hadrien / Besacier, Laurent / Pellegrino, François (2012): "Developments of Swahili resources for an automatic speech recognition system", In SLTU-2012, 94-101.