Survey Talk: When Attention Meets Speech Applications: Speech & Speaker Recognition Perspective

Kyu J. Han, Ramon Prieto, Tao Ma


Attention lets neural layers focus on what is relevant to a given task while down-weighting what is less important, and since its introduction in 2015 for machine translation, it has been successfully applied to speech applications in a number of different forms. This survey presents how attention mechanisms have been applied to speech and speaker recognition tasks. The attention mechanism was first applied to sequence-to-sequence speech recognition and later became a critical part of Google’s well-known Listen, Attend and Spell ASR system. Within hybrid DNN/HMM approaches and CTC-based ASR systems, attention has recently gained traction in the form of self-attention. From a speaker recognition perspective, attention mechanisms have been utilized to improve how well neural network outputs represent speaker characteristics, mostly in the form of attentive pooling. In this survey we detail the attention strategies that have been successful in both speech and speaker recognition tasks, and discuss challenging issues that arise in practice.
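To make the attentive pooling idea concrete, below is a minimal NumPy sketch (the function and parameter names are hypothetical, not taken from the talk): frame-level features are scored by a small learned network, the scores are normalized with a softmax, and the result is a weighted average over time that serves as an utterance-level speaker embedding.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_pooling(H, w, b, v):
    """Collapse frame-level features H (T x D) into one utterance-level
    embedding using learned attention weights.

    H : (T, D) frame-level hidden activations
    w : (D, A) projection, b : (A,) bias, v : (A,) scoring vector
    (hypothetical parameters, learned jointly with the network)
    """
    scores = np.tanh(H @ w + b) @ v   # (T,) one relevance score per frame
    alpha = softmax(scores)           # attention weights, sum to 1 over time
    return alpha @ H                  # (D,) attention-weighted average

# Toy usage: 200 frames of 256-dim features, 64-dim attention space
rng = np.random.default_rng(0)
T, D, A = 200, 256, 64
H = rng.standard_normal((T, D))
w, b, v = rng.standard_normal((D, A)), np.zeros(A), rng.standard_normal(A)
embedding = attentive_pooling(H, w, b, v)
print(embedding.shape)  # (256,)
```

Compared with the uniform average of conventional statistics pooling, the learned weights let the network emphasize the frames that carry the most speaker-discriminative information.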


Cite as: Han, K.J., Prieto, R., Ma, T. (2019) Survey Talk: When Attention Meets Speech Applications: Speech & Speaker Recognition Perspective. Proc. Interspeech 2019.


@inproceedings{Han2019,
  author={Kyu J. Han and Ramon Prieto and Tao Ma},
  title={{Survey Talk: When Attention Meets Speech Applications: Speech \& Speaker Recognition Perspective}},
  year=2019,
  booktitle={Proc. Interspeech 2019}
}