15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Combining Tandem and Hybrid Systems for Improved Speech Recognition and Keyword Spotting on Low Resource Languages

Shakti P. Rath, Kate M. Knill, Anton Ragni, Mark J. F. Gales

University of Cambridge, UK

In recent years there has been significant interest in Automatic Speech Recognition (ASR) and KeyWord Spotting (KWS) systems for low resource languages. One of the driving forces for this research direction is the IARPA Babel project. This paper examines the performance gains that can be obtained by combining two forms of deep neural network ASR systems, Tandem and Hybrid, for both ASR and KWS using data released under the Babel project. Baseline systems are described for the five option period 1 languages: Assamese, Bengali, Haitian Creole, Lao, and Zulu. All the ASR systems share common attributes, for example deep neural network configurations, and decision trees based on rich phonetic questions and state-position root nodes. The baseline ASR and KWS performance of the Hybrid and Tandem systems is compared for both the “full” language packs (approximately 80 hours of training data) and the limited language packs (approximately 10 hours of training data). Combining the two systems yields consistent performance gains for KWS in all configurations.

Bibliographic reference. Rath, Shakti P. / Knill, Kate M. / Ragni, Anton / Gales, Mark J. F. (2014): "Combining tandem and hybrid systems for improved speech recognition and keyword spotting on low resource languages", In INTERSPEECH-2014, 835-839.