ISCA Archive Interspeech 2009

ASR corpus design for resource-scarce languages

Etienne Barnard, Marelie Davel, Charl van Heerden

We investigate the number of speakers and the amount of data required to develop usable speaker-independent speech-recognition systems in resource-scarce languages. Our experiments employ the Lwazi corpus, which contains speech in the eleven official languages of South Africa. We find that a surprisingly small number of speakers (fewer than 50) and around 10 to 20 hours of speech per language are sufficient for acceptable phone-based recognition.

doi: 10.21437/Interspeech.2009-727

Cite as: Barnard, E., Davel, M., Heerden, C.v. (2009) ASR corpus design for resource-scarce languages. Proc. Interspeech 2009, 2847-2850, doi: 10.21437/Interspeech.2009-727

