Third Workshop on Spoken Language Technologies for Under-resourced Languages

Cape Town, South Africa
May 7-9, 2012

Resource Development and Experiments in Automatic South African Broadcast News Transcription

Herman Kamper (1), Febe de Wet (1,2), Thomas Hain (3), Thomas Niesler (1)

(1) Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa
(2) Human Language Technology Competency Area, CSIR Meraka Institute, Pretoria, South Africa
(3) Department of Computer Science, University of Sheffield, UK

We describe the development and evaluation of a first South African broadcast news transcription system. A number of speech resources were collected in the resource-scarce South African environment for system development: a 20-hour corpus of South African English (SAE) broadcast news; a 109M-word corpus of South African newspaper text for language modelling; and a 60k-word SAE pronunciation dictionary. The system is modelled on similar state-of-the-art broadcast news transcription systems and uses cross-word triphone HMMs, MF-PLP features, and per-segment cepstral mean and per-bulletin cepstral variance normalisation. Our final system achieves a word error rate of 24.6%. Performance is reasonable on newsreader speech but poor on the spontaneous and telephone speech in our test data. Finally, we consider the recognition of MP3-compressed audio and show that performance deteriorates only at low bit-rates.
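
For readers unfamiliar with the metric quoted above, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the recogniser output into the reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the scoring tool used for the reported 24.6%, and the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Illustrative only: whitespace tokenisation, no text normalisation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("the news at nine", "the views at nine"))
```

In practice the figure is usually accumulated over a whole test set (total errors divided by total reference words) rather than averaged per utterance, and it can exceed 100% when the hypothesis contains many insertions.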

Index Terms: Broadcast news transcription, South African English, under-resourced languages, English accents

Bibliographic reference. Kamper, Herman / Wet, Febe de / Hain, Thomas / Niesler, Thomas (2012): "Resource development and experiments in automatic South African broadcast news transcription", In SLTU-2012, 102-106.