SLTU-2008 - First International Workshop on Spoken Languages Technologies for Under-Resourced Languages

Hanoi, Vietnam
May 5-7, 2008

The Systematic Collection of Speech Corpora for all Eleven Official South African Languages

Marissa van Rooyen, Cecile van Zyl, Nico Oosthuizen

Centre for Text Technology (CTexT), North-West University (Potchefstroom Campus), Potchefstroom, South Africa

In this paper we outline methods and best practices for collecting speech data for under-resourced languages. The discussion focuses on ways of improving both the quality of the collection and the turnaround time. We show how to deal with matters concerning assistants and technical problems, and suggest techniques for optimising data management. The paper aims to give the reader a comprehensive overview of the improvements made during the course of a real data collection project with tangible problems and results.

Bibliographic reference. Rooyen, Marissa van / Zyl, Cecile van / Oosthuizen, Nico (2008): "The systematic collection of speech corpora for all eleven official South African languages", In SLTU-2008, 58-62.