ISCA Archive SLTU 2014

The NCHLT speech corpus of the South African languages

Etienne Barnard, Marelie H. Davel, Charl van Heerden, Febe de Wet, Jaco Badenhorst

The NCHLT speech corpus contains wide-band speech from approximately 200 speakers per language, in each of the eleven official languages of South Africa. We describe the design and development processes undertaken to develop the corpus, and report on associated materials, such as orthographic transcriptions and pronunciation dictionaries, that were released as part of the corpus. To benchmark speech-recognition performance on the corpus, we have also developed both phone-recognition and word-recognition systems for all eleven languages; we find that high accuracies can be achieved for these speaker-independent but vocabulary-dependent recognition tasks in all languages.


Cite as: Barnard, E., Davel, M.H., Heerden, C.v., Wet, F.d., Badenhorst, J. (2014) The NCHLT speech corpus of the South African languages. Proc. 4th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2014), 194-200

@inproceedings{barnard14_sltu,
  author={Etienne Barnard and Marelie H. Davel and Charl van Heerden and Febe de Wet and Jaco Badenhorst},
  title={{The NCHLT speech corpus of the South African languages}},
  year=2014,
  booktitle={Proc. 4th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2014)},
  pages={194--200}
}