15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Modeling Long Temporal Contexts for Robust DNN-Based Speech Recognition

Bo Li, Khe Chai Sim

National University of Singapore, Singapore

Deep Neural Networks (DNNs) have been shown to outperform traditional Gaussian Mixture Models in many Automatic Speech Recognition tasks. In this work, we investigate the potential of modeling long temporal acoustic contexts using DNNs. The complete temporal context is split into several sub-contexts. Multiple sub-context DNNs initialized with the same set of Restricted Boltzmann Machines are fine-tuned independently, and their last hidden layer activations are combined to jointly predict the desired state posteriors through a single softmax output layer. In preliminary experiments on the Aurora2 multi-style training task, our proposed system models a 65-frame temporal window of speech signals and yields a 4.4% WER, a relative improvement of 12.0% over the best single DNN. With the local independence assumption, both training and testing of the sub-context DNNs can be done in parallel. Moreover, our system achieves a relative 48.2% reduction in parameters compared to a single DNN with the same number of hidden units.
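The combination scheme described above can be sketched in a few lines: several sub-context networks each process a slice of the full temporal window, and their last hidden layer activations are concatenated and fed through one shared softmax output layer. This is a minimal pure-Python sketch under illustrative assumptions; the layer sizes, the two-way split, the sigmoid hidden units, and the random weights are stand-ins, not the paper's actual configuration or training procedure.

```python
import math
import random

random.seed(0)

def sigmoid_layer(x, W, b):
    """Affine transform followed by element-wise sigmoid."""
    return [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + bi)))
            for row, bi in zip(W, b)]

def softmax(z):
    """Numerically stable softmax over logits z."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

# Illustrative sizes (NOT the paper's configuration): two sub-contexts,
# each seeing half of the full input window.
SUB_IN, HID, STATES = 20, 16, 10

# Each sub-context DNN keeps its own hidden layer (in the paper these are
# fine-tuned independently; here they are just randomly initialized).
sub_nets = [(rand_matrix(HID, SUB_IN), [0.0] * HID) for _ in range(2)]

# A single shared softmax output layer over the concatenated last hidden
# activations of all sub-context DNNs.
W_out = rand_matrix(STATES, HID * 2)
b_out = [0.0] * STATES

def forward(full_window):
    """full_window: 2*SUB_IN floats covering the whole temporal context."""
    slices = [full_window[:SUB_IN], full_window[SUB_IN:]]
    hidden = []
    for (W, b), x in zip(sub_nets, slices):
        hidden.extend(sigmoid_layer(x, W, b))  # concatenate last hidden layers
    logits = [sum(w * h for w, h in zip(row, hidden)) + bi
              for row, bi in zip(W_out, b_out)]
    return softmax(logits)  # joint state posteriors

posteriors = forward([random.gauss(0.0, 1.0) for _ in range(2 * SUB_IN)])
```

Because each sub-context network only ever sees its own input slice, the forward passes through `sub_nets` are independent and could run in parallel, mirroring the parallel training and decoding claim in the abstract.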


Bibliographic reference.  Li, Bo / Sim, Khe Chai (2014): "Modeling long temporal contexts for robust DNN-based speech recognition", In INTERSPEECH-2014, 353-357.