15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Exploiting Vocal-Source Features to Improve ASR Accuracy for Low-Resource Languages

Raul Fernandez (1), Jia Cui (1), Andrew Rosenberg (2), Bhuvana Ramabhadran (1), Xiaodong Cui (1)

(1) IBM T.J. Watson Research Center, USA
(2) CUNY Queens College, USA

A traditional framework in speech production describes the output speech as an interaction between a source excitation and a vocal tract configured by the speaker to impart segmental characteristics. This simplification has generally led to approaches where systems focused on phonetic-segment tasks (e.g., speech recognition) use a front-end that extracts features designed to distinguish between different vocal-tract configurations, while the excitation signal has received more attention in speaker-characterization tasks. In this work we augment the front-end of a recognition system with vocal-source features, motivated by our work with low-resource languages whose phonology and phonetics suggest the need for approaches complementary to classical ASR features. We demonstrate that the additional use of such features provides improvements over a state-of-the-art system for low-resource languages from the BABEL Program.
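The abstract's core idea, appending vocal-source (excitation) features to a standard spectral front-end, can be illustrated with a minimal sketch. The paper does not specify its exact feature set here, so the following assumes a crude autocorrelation-based F0 estimate as the source feature and a placeholder 13-dimensional spectral feature matrix standing in for MFCCs; all function names are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def autocorr_f0(frame, sr=16000, fmin=60.0, fmax=400.0):
    """Illustrative autocorrelation pitch estimate (not the paper's method)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # search plausible pitch lags
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

def augment_features(spectral_feats, f0_track):
    """Append the source feature column-wise to the spectral front-end."""
    return np.hstack([spectral_feats, f0_track[:, None]])

# Toy usage: a 1-s 120 Hz tone stands in for voiced speech.
sr = 16000
x = np.sin(2 * np.pi * 120 * np.arange(sr) / sr)
frames = frame_signal(x)
f0 = np.array([autocorr_f0(f, sr=sr) for f in frames])
mfcc_like = np.random.randn(len(frames), 13)  # placeholder spectral features
augmented = augment_features(mfcc_like, f0)   # shape: (n_frames, 14)
```

The point of the sketch is only the frame-wise concatenation: each acoustic frame gains one or more excitation-derived dimensions before being passed to the acoustic model.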


Bibliographic reference.  Fernandez, Raul / Cui, Jia / Rosenberg, Andrew / Ramabhadran, Bhuvana / Cui, Xiaodong (2014): "Exploiting vocal-source features to improve ASR accuracy for low-resource languages", In INTERSPEECH-2014, 805-809.