15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Unfolded Recurrent Neural Networks for Speech Recognition

George Saon, Hagen Soltau, Ahmad Emami, Michael Picheny

IBM T.J. Watson Research Center, USA

We introduce recurrent neural networks (RNNs) for acoustic modeling that are unfolded in time for a fixed number of time steps. The proposed models are feedforward networks in which the unfolded layers corresponding to the recurrent layer have time-shifted inputs and tied weight matrices. Besides the temporal depth due to unfolding, hierarchical processing depth is added by means of several non-recurrent hidden layers inserted between the unfolded layers and the output layer. The training of these models: (a) has a complexity comparable to that of deep neural networks (DNNs) with the same number of layers; (b) can be done on frame-randomized minibatches; (c) can be implemented efficiently through matrix-matrix operations on GPU architectures, which makes it scalable to large tasks. Experimental results on the Switchboard 300-hour English conversational telephony task show a 5% relative improvement in word error rate over state-of-the-art DNNs trained on FMLLR features with i-vector speaker adaptation and Hessian-free sequence discriminative training.
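The architecture described in the abstract can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation: the sigmoid activation, the zero initial hidden state, the layer sizes, the random initialization, and all names (`unfolded_rnn_forward`, `W_in`, `W_rec`, `dense_layers`, and so on) are assumptions chosen for illustration. The sketch shows only the two properties the abstract states: each unfolded layer receives a time-shifted input frame, and all unfolded layers share the same weight matrices.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def unfolded_rnn_forward(frames, W_in, W_rec, b_rec, dense_layers, W_out, b_out):
    """Forward pass over a fixed window of frames (oldest first).

    The loop body is one unfolded layer. W_in, W_rec and b_rec are reused
    at every step (tied weights); only the input frame changes.
    """
    h = [0.0] * len(b_rec)  # assumed zero initial state
    for x in frames:
        pre = [a + r + b
               for a, r, b in zip(matvec(W_in, x), matvec(W_rec, h), b_rec)]
        h = sigmoid(pre)
    # Non-recurrent hidden layers between the unfolded part and the output.
    for W, b in dense_layers:
        h = sigmoid([a + c for a, c in zip(matvec(W, h), b)])
    logits = [a + c for a, c in zip(matvec(W_out, h), b_out)]
    return softmax(logits)


# Hypothetical toy dimensions, far smaller than any real acoustic model.
rng = random.Random(0)
T, feat_dim, hidden, n_out = 4, 3, 5, 2
frames = [[rng.uniform(-1.0, 1.0) for _ in range(feat_dim)] for _ in range(T)]
W_in = rand_matrix(hidden, feat_dim, rng)
W_rec = rand_matrix(hidden, hidden, rng)
b_rec = [0.0] * hidden
dense_layers = [(rand_matrix(hidden, hidden, rng), [0.0] * hidden)]
W_out = rand_matrix(n_out, hidden, rng)
b_out = [0.0] * n_out

posteriors = unfolded_rnn_forward(frames, W_in, W_rec, b_rec,
                                  dense_layers, W_out, b_out)
```

Because the window length `T` is fixed, each training example is a self-contained block of frames, which is consistent with the abstract's point that training can use frame-randomized minibatches. In a batched setting the per-frame `matvec` calls would become matrix-matrix products.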


Bibliographic reference. Saon, George / Soltau, Hagen / Emami, Ahmad / Picheny, Michael (2014): "Unfolded recurrent neural networks for speech recognition", in INTERSPEECH-2014, 343-347.