14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

Modular Combination of Deep Neural Networks for Acoustic Modeling

Jonas Gehring (1), Wonkyum Lee (2), Kevin Kilgour (1), Ian Lane (2), Yajie Miao (2), Alex Waibel (1)

(1) KIT, Germany
(2) Carnegie Mellon University, USA

In this work, we propose a modular combination of two popular applications of neural networks to large-vocabulary continuous speech recognition. First, a deep neural network is trained to extract bottleneck features from frames of mel-scale filterbank coefficients. As is commonly done for GMM/HMM systems, this network is then applied as a non-linear discriminative feature-space transformation, here for a hybrid setup in which acoustic modeling is performed by a deep belief network. The result is effectively a very large network in which the layers of the bottleneck network are fixed and applied to successive windows of feature frames in a time-delay fashion. We show that bottleneck features improve the recognition performance of DBN/HMM hybrids, and that the modular combination enables the acoustic model to benefit from a larger temporal context. Our architecture is evaluated on a recently released and challenging Tagalog corpus containing conversational telephone speech.
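
The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: every dimension (40 filterbank coefficients, context radii of 5 and 4 frames, a 42-dimensional bottleneck, 1000 output states), the helper names `stack_context`, `mlp_forward` and `random_layers`, and the random weights standing in for trained ones are assumptions made purely for illustration. The sketch only shows how a fixed bottleneck network, applied to successive windows of frames, feeds a second network that stacks its outputs and thereby sees a wider temporal context.

```python
import numpy as np

def stack_context(frames, radius):
    """Concatenate each frame with its +/- radius neighbours (edges are repeated)."""
    num_frames = frames.shape[0]
    padded = np.pad(frames, ((radius, radius), (0, 0)), mode="edge")
    windows = [padded[offset:offset + num_frames] for offset in range(2 * radius + 1)]
    return np.concatenate(windows, axis=1)

def mlp_forward(inputs, layers):
    """Forward pass with sigmoid hidden layers; the final layer is left linear."""
    hidden = inputs
    for index, (weights, bias) in enumerate(layers):
        hidden = hidden @ weights + bias
        if index < len(layers) - 1:
            hidden = 1.0 / (1.0 + np.exp(-hidden))
    return hidden

def random_layers(rng, sizes):
    """Random weights as placeholders for trained parameters (assumption)."""
    return [(rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
NUM_FRAMES, NUM_FILTERS = 200, 40
BN_RADIUS, BN_DIM = 5, 42
HYBRID_RADIUS, NUM_STATES = 4, 1000

filterbank = rng.standard_normal((NUM_FRAMES, NUM_FILTERS))

# Stage 1: bottleneck network, kept fixed. Only the layers up to the
# bottleneck are represented here; it is applied to every window of frames.
bn_layers = random_layers(rng, [NUM_FILTERS * (2 * BN_RADIUS + 1), 512, BN_DIM])
bn_features = mlp_forward(stack_context(filterbank, BN_RADIUS), bn_layers)

# Stage 2: the hybrid acoustic model stacks successive bottleneck outputs,
# so the same fixed network is effectively applied in a time-delay fashion.
hybrid_input = stack_context(bn_features, HYBRID_RADIUS)
hybrid_layers = random_layers(rng, [BN_DIM * (2 * HYBRID_RADIUS + 1), 512, NUM_STATES])
logits = mlp_forward(hybrid_input, hybrid_layers)

# Softmax over the (hypothetical) HMM states.
logits = logits - logits.max(axis=1, keepdims=True)
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Number of filterbank frames that influence a single output frame.
effective_context = 2 * (BN_RADIUS + HYBRID_RADIUS) + 1
```

Under these assumed radii, each output frame depends on 2 * (5 + 4) + 1 filterbank frames, which illustrates how stacking bottleneck outputs widens the temporal context available to the acoustic model without enlarging the input window of the bottleneck network itself.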

Bibliographic reference. Gehring, Jonas / Lee, Wonkyum / Kilgour, Kevin / Lane, Ian / Miao, Yajie / Waibel, Alex (2013): "Modular combination of deep neural networks for acoustic modeling", in INTERSPEECH-2013, 94-98.