We introduce an encoder-decoder recurrent neural network model called Recurrent Neural Aligner (RNA) that can be used for sequence-to-sequence mapping tasks. Like connectionist temporal classification (CTC) models, RNA defines a probability distribution over target label sequences, including blank labels, with one label corresponding to each time step in the input. The probability of a label sequence is calculated by marginalizing over all possible blank label positions. Unlike CTC, RNA does not make a conditional independence assumption for label predictions; it uses the predicted label at time t-1 as an additional input to the recurrent model when predicting the label at time t. We apply this model to end-to-end speech recognition. RNA is capable of streaming recognition since the decoder does not employ an attention mechanism. The model is trained on transcribed acoustic data to predict graphemes, and no external language or pronunciation models are used for decoding. We employ an approximate dynamic programming method to optimize negative log likelihood, and a sampling-based sequence discriminative training technique to fine-tune the model to minimize expected word error rate. We show that the model achieves competitive accuracy without using an external language model or beam search decoding.
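As a rough illustration of marginalizing over blank label positions, the following Python sketch sums the probability of every alignment that places the target labels, in order, among the input time steps and fills the remaining steps with blanks. It is a brute-force enumeration for a toy case, not the approximate dynamic programming method used in the paper. The names `BLANK`, `alignments`, `sequence_prob`, and `uniform_step_prob` are hypothetical, and `step_prob` stands in for the trained network's per-step output distribution, conditioned here only on the previous step's symbol.

```python
from itertools import combinations

BLANK = "_"  # hypothetical symbol for the blank label


def alignments(target, num_frames):
    """Yield each length-num_frames sequence that contains the target
    labels in order and blanks at every other time step."""
    for positions in combinations(range(num_frames), len(target)):
        seq = [BLANK] * num_frames
        for pos, label in zip(positions, target):
            seq[pos] = label
        yield seq


def sequence_prob(target, num_frames, step_prob):
    """Sum the probabilities of all alignments of target (brute force)."""
    total = 0.0
    for seq in alignments(target, num_frames):
        p = 1.0
        prev = None  # assumption: no previous prediction at the first step
        for t, symbol in enumerate(seq):
            # Unlike CTC, each step is conditioned on the previous symbol.
            p *= step_prob(t, prev, symbol)
            prev = symbol
        total += p
    return total


def uniform_step_prob(t, prev, symbol, vocab=("a", "b", BLANK)):
    """Toy stand-in for the network: uniform over labels plus blank."""
    return 1.0 / len(vocab)


if __name__ == "__main__":
    print(sequence_prob("ab", 4, uniform_step_prob))
```

With the uniform toy model every alignment has the same probability, so the result is just the number of alignments times that probability. A real model would make `step_prob` depend on the acoustic input and on `prev`, and enumeration would quickly become infeasible for realistic sequence lengths.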
Cite as: Sak, H., Shannon, M., Rao, K., Beaufays, F. (2017) Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping. Proc. Interspeech 2017, 1298-1302, doi: 10.21437/Interspeech.2017-1705
@inproceedings{sak17_interspeech,
  author={Haşim Sak and Matt Shannon and Kanishka Rao and Françoise Beaufays},
  title={{Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1298--1302},
  doi={10.21437/Interspeech.2017-1705}
}