10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

ASR Corpus Design for Resource-Scarce Languages

Etienne Barnard, Marelie Davel, Charl van Heerden

CSIR, South Africa

We investigate the number of speakers and the amount of data required to develop usable speaker-independent speech-recognition systems for resource-scarce languages. Our experiments employ the Lwazi corpus, which contains speech in the eleven official languages of South Africa. We find that a surprisingly small number of speakers (fewer than 50) and around 10 to 20 hours of speech per language are sufficient for acceptable phone-based recognition.

Bibliographic reference. Barnard, Etienne / Davel, Marelie / Heerden, Charl van (2009): "ASR corpus design for resource-scarce languages", in INTERSPEECH-2009, 2847-2850.