Third Workshop on Spoken Language Technologies for Under-resourced Languages

Cape Town, South Africa
May 7-9, 2012

Developments of Swahili Resources for an Automatic Speech Recognition System

Hadrien Gelas (1,2), Laurent Besacier (2), François Pellegrino (1)

(1) Laboratoire Dynamique Du Langage, CNRS - Université de Lyon, France
(2) Laboratoire Informatique de Grenoble, CNRS - Université Joseph Fourier, Grenoble, France

This article describes our efforts to provide ASR resources for Swahili, a Bantu language spoken across a wide area of East Africa. We start with an overview of the language situation at both the linguistic and the digital level. We then report the strategies selected to develop a text corpus, a pronunciation dictionary and a speech corpus for this under-resourced language. We explore methodologies such as crowdsourcing and a collaborative transcription process. In addition, we take advantage of linguistic characteristics of the language, such as its rich morphology and its vocabulary shared with English, to improve the performance of our baseline Swahili ASR system in a broadcast speech transcription task.

Index Terms: Swahili, under-resourced languages, automatic speech recognition, speech resources

Bibliographic reference. Gelas, Hadrien / Besacier, Laurent / Pellegrino, François (2012): "Developments of Swahili resources for an automatic speech recognition system", in SLTU-2012, 94-101.