An Analysis of “Attention” in Sequence-to-Sequence Models

Rohit Prabhavalkar, Tara N. Sainath, Bo Li, Kanishka Rao, Navdeep Jaitly


In this paper, we conduct a detailed investigation of attention-based models for automatic speech recognition (ASR). First, we explore different types of attention, including “online” and “full-sequence” attention. Second, we explore different subword units to see how much of the end-to-end ASR process can reasonably be captured by an attention model. In experimental evaluations, we find that although attention is typically focused over a small region of the acoustics during each step of next-label prediction, “full-sequence” attention outperforms “online” attention; this gap can, however, be significantly reduced by increasing the length of the segments over which attention is computed. Furthermore, we find that context-independent phonemes are a reasonable subword unit for attention models. When used in a second pass to rescore N-best hypotheses, these models provide a relative improvement in word error rate of over 10%.
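The contrast between “full-sequence” attention and attention computed over a limited segment can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product scoring function, the toy two-dimensional frames, and the names `attend`, `start`, and `end` are all assumptions made for illustration. The abstract does not specify how scores are computed or how segments are chosen.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, frames, start=0, end=None):
    """Dot-product attention over frames[start:end] (hypothetical helper).

    Returns (weights, context), where weights has one entry per frame in
    `frames` and is zero outside the attended segment.
    """
    end = len(frames) if end is None else end
    window = frames[start:end]
    # Assumed scoring function: plain dot product between query and frame.
    scores = [sum(q * f for q, f in zip(query, frame)) for frame in window]
    weights = softmax(scores)
    dim = len(query)
    # Context vector: attention-weighted sum of the frames in the segment.
    context = [
        sum(w * frame[d] for w, frame in zip(weights, window))
        for d in range(dim)
    ]
    padded = [0.0] * start + weights + [0.0] * (len(frames) - end)
    return padded, context


# Toy "acoustic" frames and a decoder query (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 2.0], [2.0, 0.0]]
query = [1.0, 0.5]

# Full-sequence attention: every frame can receive weight.
full_weights, full_context = attend(query, frames)

# Segment-restricted attention: only frames 2 and 3 are visible.
seg_weights, seg_context = attend(query, frames, start=2, end=4)
```

In this sketch, widening the `start`/`end` segment moves the restricted weights toward the full-sequence ones, which is the intuition behind the observation that longer segments narrow the gap between the two variants.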


DOI: 10.21437/Interspeech.2017-232

Cite as: Prabhavalkar, R., Sainath, T.N., Li, B., Rao, K., Jaitly, N. (2017) An Analysis of “Attention” in Sequence-to-Sequence Models. Proc. Interspeech 2017, 3702-3706, DOI: 10.21437/Interspeech.2017-232.


@inproceedings{Prabhavalkar2017,
  author={Rohit Prabhavalkar and Tara N. Sainath and Bo Li and Kanishka Rao and Navdeep Jaitly},
  title={An Analysis of “Attention” in Sequence-to-Sequence Models},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3702--3706},
  doi={10.21437/Interspeech.2017-232},
  url={http://dx.doi.org/10.21437/Interspeech.2017-232}
}