SLTU-2008 - First International Workshop on Spoken Languages Technologies for Under-Resourced Languages

Hanoi, Vietnam
May 5-7, 2008

Are Audio or Textual Training Data More Important for ASR in Less-Represented Languages?

Thomas Pellegrini, Lori Lamel

LIMSI-CNRS, Orsay, France

State-of-the-art speech recognizers are typically trained on very large amounts of data, both transcribed speech and texts. With the recent growing interest in developing speech technologies for languages for which only small amounts of data are accessible, collecting appropriate data is a key issue in building new speech recognition systems. This article reports on an experimental study assessing the performance of a speech recognizer for a less-represented language, as a function of the quantity of texts and transcribed speech data available for model training. The experimental results show that for supervised training with only 2 hours of manually transcribed data, the acoustic models are the weak point. With 10 hours or more of transcribed audio data, the quantity of texts has a larger effect on the error rate than the quantity of speech.

Index Terms: automatic speech recognition, less-represented languages, broadcast news transcription


Bibliographic reference.  Pellegrini, Thomas / Lamel, Lori (2008): "Are audio or textual training data more important for ASR in less-represented languages?", In SLTU-2008, 2-6.