A Comparison of Sequence-to-Sequence Models for Speech Recognition

Rohit Prabhavalkar, Kanishka Rao, Tara N. Sainath, Bo Li, Leif Johnson, Navdeep Jaitly


In this work, we conduct a detailed evaluation of various all-neural, end-to-end trained, sequence-to-sequence models applied to the task of speech recognition. Notably, each of these systems directly predicts graphemes in the written domain, without using an external pronunciation lexicon or a separate language model. We examine several sequence-to-sequence models, including connectionist temporal classification (CTC), the recurrent neural network (RNN) transducer, an attention-based model, and a model which augments the RNN transducer with an attention mechanism.

We find that the sequence-to-sequence models are competitive with traditional state-of-the-art approaches on dictation test sets, although the baseline, which uses a separate pronunciation and language model, outperforms these models on voice-search test sets.
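To make the CTC criterion mentioned above concrete: CTC scores a grapheme sequence by summing the probabilities of all frame-level alignments that collapse to it (repeated labels merged, blanks removed), computed with a forward dynamic program. The sketch below is an illustrative pure-Python implementation of that forward pass, not code from the paper; the function name, probability-space arithmetic, and tiny example are all hypothetical choices for clarity.

```python
def ctc_forward_prob(probs, target, blank=0):
    """Total CTC probability of `target`, summing over all frame-level
    alignments via the forward algorithm (probability space, for clarity).

    probs:  per-frame distributions over symbols, shape [T][num_symbols]
    target: label indices with no blanks, e.g. grapheme ids
    """
    # Interleave blanks: target [a, b] -> extended [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    # alpha[s] = probability of having emitted ext[:s+1] after frame 0.
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]          # start with a blank ...
    if S > 1:
        alpha[1] = probs[0][ext[1]]      # ... or with the first label
    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]                 # stay on the same extended symbol
            if s > 0:
                a += alpha[s - 1]        # advance by one
            # Skip over a blank, allowed only between distinct labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new
    # End on the final label or the trailing blank.
    return alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)
```

For example, with two frames, two symbols (blank and one grapheme), and uniform 0.5 probabilities, the alignments collapsing to the single grapheme are (blank, g), (g, blank), and (g, g), giving a total probability of 0.75. In practice this recursion is run in log space for numerical stability.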


DOI: 10.21437/Interspeech.2017-233

Cite as: Prabhavalkar, R., Rao, K., Sainath, T.N., Li, B., Johnson, L., Jaitly, N. (2017) A Comparison of Sequence-to-Sequence Models for Speech Recognition. Proc. Interspeech 2017, 939-943, DOI: 10.21437/Interspeech.2017-233.


@inproceedings{Prabhavalkar2017,
  author={Rohit Prabhavalkar and Kanishka Rao and Tara N. Sainath and Bo Li and Leif Johnson and Navdeep Jaitly},
  title={A Comparison of Sequence-to-Sequence Models for Speech Recognition},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={939--943},
  doi={10.21437/Interspeech.2017-233},
  url={http://dx.doi.org/10.21437/Interspeech.2017-233}
}