15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Autoregressive Product of Multi-Frame Predictions Can Improve the Accuracy of Hybrid Models

Navdeep Jaitly (1), Vincent Vanhoucke (2), Geoffrey Hinton (1)

(1) University of Toronto, Canada
(2) Google, USA

We describe a simple but effective way of using multi-frame targets to improve the accuracy of Artificial Neural Network-Hidden Markov Model (ANN-HMM) hybrid systems. In this approach, a Deep Neural Network (DNN) is trained to predict the forced-alignment states of multiple frames, using a separate softmax for each frame. This contrasts with the usual method of training a DNN to predict only the state of the central frame. By itself, this does not significantly improve the accuracy of the system. However, if we average the predictions for each frame across the different contexts it is associated with, we achieve state-of-the-art results on TIMIT using a fully connected Deep Neural Network, without convolutional architectures or dropout training. On a 14-hour subset of Wall Street Journal (WSJ), using a context-dependent DNN-HMM system, the method yields a relative improvement of 6.4% on the dev set (test-dev93) and 9.3% on the test set (test-eval92).
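The averaging step can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which kind of average is used, so the sketch assumes a geometric mean (an average in the log domain, followed by renormalization), which is one reading of the "product" in the title. The function name `combine_multiframe_predictions`, the array layout `preds[t][j][s]`, and the smoothing constant `eps` are all hypothetical choices made for illustration.

```python
import math


def combine_multiframe_predictions(preds, eps=1e-12):
    """Combine per-context softmax outputs into one distribution per frame.

    Assumed (hypothetical) layout: preds[t][j][s] is the probability of
    state s for frame t + (j - K), as predicted by the network whose input
    window is centred on frame t. There are 2K + 1 softmaxes per context.
    Returns combined[t][s], one normalized distribution per frame.
    """
    num_frames = len(preds)
    num_offsets = len(preds[0])   # 2K + 1, assumed odd
    num_states = len(preds[0][0])
    K = num_offsets // 2

    combined = []
    for frame in range(num_frames):
        log_sum = [0.0] * num_states
        count = 0
        for j in range(num_offsets):
            # Context whose softmax j (offset j - K) targets this frame.
            centre = frame - (j - K)
            if 0 <= centre < num_frames:
                for s in range(num_states):
                    log_sum[s] += math.log(preds[centre][j][s] + eps)
                count += 1
        # Geometric mean (assumption): average the log-probabilities,
        # then renormalize. count >= 1 because the central softmax of
        # the context centred on this frame always contributes.
        mean_log = [v / count for v in log_sum]
        peak = max(mean_log)
        unnorm = [math.exp(v - peak) for v in mean_log]
        total = sum(unnorm)
        combined.append([u / total for u in unnorm])
    return combined


if __name__ == "__main__":
    # Three frames, K = 1 (three softmaxes per context), two states.
    toy = [[[0.5, 0.5]] * 3 for _ in range(3)]
    toy[1][1] = [0.9, 0.1]  # central prediction of the context at frame 1
    for row in combine_multiframe_predictions(toy):
        print([round(p, 3) for p in row])
```

Frames near the utterance boundaries are covered by fewer contexts, so the sketch averages over however many predictions are available. Switching to an arithmetic mean would only require summing the probabilities directly instead of their logarithms.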


Bibliographic reference.  Jaitly, Navdeep / Vanhoucke, Vincent / Hinton, Geoffrey (2014): "Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models", In INTERSPEECH-2014, 1905-1909.