Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI

Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, Sanjeev Khudanpur

In this paper we describe a method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training. We use the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI. To make its computation feasible, we use a phone n-gram language model in place of the word language model. To further reduce its space and time complexity, we compute the objective function using neural network outputs at one third the standard frame rate. These changes enable us to perform the computation for the forward-backward algorithm on GPUs. Further, the reduced output frame rate also provides a significant speed-up during decoding.
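As a rough illustration of the kind of computation involved, the sketch below evaluates an MMI-style objective for a single utterance as the difference between two forward-algorithm log-likelihoods: one over a "numerator" graph restricted to the reference, and one over a "denominator" graph covering all competing sequences. This is a toy under stated assumptions, not the paper's GPU implementation: the dense two-state graphs, the frame scores, and the names `forward_logprob` and `mmi_objective` are all hypothetical, and a small fully connected graph stands in for the graph built from the phone n-gram language model.

```python
import math

NEG_INF = -math.inf


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_logprob(log_obs, log_trans, log_init):
    """Log of the total weight of all paths through a small dense graph.

    log_obs[t][s]   : log score of state s at output frame t
    log_trans[i][j] : log weight of the transition i -> j (-inf if absent)
    log_init[s]     : log initial weight of state s (-inf if not a start state)
    """
    n = len(log_init)
    alpha = [log_init[s] + log_obs[0][s] for s in range(n)]
    for t in range(1, len(log_obs)):
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(n)])
            + log_obs[t][j]
            for j in range(n)
        ]
    return logsumexp(alpha)


def mmi_objective(log_obs, num_graph, den_graph):
    """Numerator log-likelihood minus denominator log-likelihood."""
    return forward_logprob(log_obs, *num_graph) - forward_logprob(log_obs, *den_graph)


# Hypothetical toy data: three output frames, two states.
log_obs = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.6), math.log(0.4)],
    [math.log(0.2), math.log(0.8)],
]
LOG_HALF = math.log(0.5)

# Denominator graph: every state sequence is allowed.
den_graph = ([[LOG_HALF, LOG_HALF], [LOG_HALF, LOG_HALF]], [LOG_HALF, LOG_HALF])
# Numerator graph: only left-to-right paths that start in state 0.
num_graph = ([[LOG_HALF, LOG_HALF], [NEG_INF, LOG_HALF]], [LOG_HALF, NEG_INF])

objective = mmi_objective(log_obs, num_graph, den_graph)
print(objective)
```

In this toy, the numerator paths are a subset of the denominator paths with the same weights, so the objective is never positive; training would push it toward zero. The forward pass costs on the order of T times the number of transitions for T output frames, so producing outputs at one third the frame rate cuts the number of steps to roughly a third.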

We present results on 5 different LVCSR tasks with training data ranging from 100 to 2100 hours. Models trained with LF-MMI provide a relative word error rate reduction of ~11.5% over those trained with the cross-entropy objective function, and ~8% over those trained with the cross-entropy and sMBR objective functions. A further relative reduction of ~2.5% can be obtained by fine-tuning these models with the word-lattice-based sMBR objective function.
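For readers less familiar with how relative reductions are quoted, the short sketch below shows the arithmetic. The function name `relative_wer_reduction` and the WER values are purely illustrative assumptions and are not figures from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent, measured against the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, chosen only to illustrate the calculation.
example = relative_wer_reduction(20.0, 17.7)
print(round(example, 1))
```

A relative reduction is always expressed against the baseline system's error rate, so the same absolute gain corresponds to a larger relative gain when the baseline WER is lower.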

DOI: 10.21437/Interspeech.2016-595

Cite as

Povey, D., Peddinti, V., Galvez, D., Ghahremani, P., Manohar, V., Na, X., Wang, Y., Khudanpur, S. (2016) Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI. Proc. Interspeech 2016, 2751-2755.

author={Daniel Povey and Vijayaditya Peddinti and Daniel Galvez and Pegah Ghahremani and Vimal Manohar and Xingyu Na and Yiming Wang and Sanjeev Khudanpur},
title={Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI},
booktitle={Interspeech 2016},