An Analysis of Local Monotonic Attention Variants

André Merboldt, Albert Zeyer, Ralf Schlüter, Hermann Ney


Speech recognition using attention-based models is an effective approach to transcribing audio directly to text within an integrated end-to-end architecture. Global attention approaches compute a weighting over the complete input sequence, whereas local attention mechanisms are restricted to a localized window of the sequence. For speech, the latter approach supports the monotonicity property of the speech-text alignment. We therefore review several variants of such models and provide a comprehensive comparison, which has so far been missing in the literature. Additionally, we introduce a simple technique to implement windowed attention, which can be applied on top of an existing global attention model: the otherwise unchanged attention mechanism is restricted to a local window placed at the temporal position of the maximum attention energy from the most recent decoder step, thereby turning the global model into a local one. We test this method on Switchboard and LibriSpeech and show that the proposed model can even be trained from random initialization and achieves results comparable to the global attention baseline.
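Since the windowed mechanism reuses the attention energies of the global model, the core idea can be illustrated in a few lines. The following is a minimal NumPy sketch, assuming the energies for the current decoder step are already computed by an unchanged global attention scorer; the exact window placement (here centered at the previous argmax) and names such as window_radius are illustrative assumptions, not taken from the paper.

# Minimal sketch of windowed attention on top of precomputed energies.
# Window centering and all parameter names are illustrative assumptions.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def windowed_attention(energies, prev_weights, window_radius=5):
    # Restrict the otherwise unchanged attention to a local window around
    # the most active position of the previous decoder step's attention.
    T = energies.shape[0]
    center = int(np.argmax(prev_weights))      # peak of the previous step
    lo = max(0, center - window_radius)
    hi = min(T, center + window_radius + 1)
    weights = np.zeros(T)
    weights[lo:hi] = softmax(energies[lo:hi])  # renormalize inside the window
    return weights

# Toy usage with random energies standing in for a global attention scorer.
rng = np.random.default_rng(0)
energies = rng.normal(size=30)
prev_weights = softmax(rng.normal(size=30))    # previous step's attention
alpha = windowed_attention(energies, prev_weights, window_radius=3)
assert abs(alpha.sum() - 1.0) < 1e-6           # still a valid distribution

Because the softmax is renormalized inside the window, energies outside the window are effectively masked out. This is what allows the mechanism to be applied on top of an already-trained global attention model, or, as the abstract notes, trained from random initialization.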


DOI: 10.21437/Interspeech.2019-2879

Cite as: Merboldt, A., Zeyer, A., Schlüter, R., Ney, H. (2019) An Analysis of Local Monotonic Attention Variants. Proc. Interspeech 2019, 1398-1402, DOI: 10.21437/Interspeech.2019-2879.


@inproceedings{Merboldt2019,
  author={André Merboldt and Albert Zeyer and Ralf Schlüter and Hermann Ney},
  title={{An Analysis of Local Monotonic Attention Variants}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={1398--1402},
  doi={10.21437/Interspeech.2019-2879},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2879}
}