Multi-Stride Self-Attention for Speech Recognition

Kyu J. Han, Jing Huang, Yun Tang, Xiaodong He, Bowen Zhou

In contrast to the huge success of self-attention-based neural networks in various NLP tasks, the efficacy of self-attention in speech applications is still limited. This is partly because the full effectiveness of the self-attention mechanism cannot be achieved in speech tasks without a proper down-sampling scheme. To address this issue, we propose a new self-attention mechanism suited to speech recognition, namely multi-stride self-attention. The proposed approach lets each group of heads in self-attention process speech frames with a unique stride over neighboring frames. The attention mechanism as a whole is therefore not confined to a fixed frame shift and can draw on diverse contextual views of a given frame to determine attention weights more effectively. To validate our proposal, we evaluated it on various English and Chinese speech recognition corpora and observed a consistent improvement, especially in substitution and deletion errors, without increasing model complexity. TDNNs with the multi-stride self-attention layer achieved an average relative word error rate (WER) improvement of 7.5% over the baseline TDNN model, which shows the effectiveness of the proposed mechanism.
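The idea of giving each head group its own stride over neighboring frames can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the context width, the stride values, or the projection layers, so the names `strided_context`, `multi_stride_attention`, and `half_width`, the use of one head per stride, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import math


def strided_context(t, num_frames, stride, half_width):
    """Frame indices at offsets -half_width*stride .. +half_width*stride
    around frame t, dropping any index outside the utterance."""
    idx = [t + k * stride for k in range(-half_width, half_width + 1)]
    return [i for i in idx if 0 <= i < num_frames]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multi_stride_attention(frames, strides, half_width=2):
    """Toy multi-stride self-attention (illustrative assumption only).

    frames: list of T feature vectors, each a list of D floats.
    strides: one stride per head; D must be divisible by len(strides).
    Each head sees its own slice of the feature dimensions and attends
    only to neighbors sampled with its own stride. Learned projections
    are omitted to keep the sketch short.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    num_heads = len(strides)
    assert dim % num_heads == 0, "feature dim must split evenly across heads"
    head_dim = dim // num_heads

    outputs = []
    for t in range(num_frames):
        out_t = []
        for h, stride in enumerate(strides):
            lo, hi = h * head_dim, (h + 1) * head_dim
            query = frames[t][lo:hi]
            context = strided_context(t, num_frames, stride, half_width)
            keys = [frames[i][lo:hi] for i in context]
            # Scaled dot-product scores over this head's strided neighbors.
            weights = softmax(
                [dot(query, k) / math.sqrt(head_dim) for k in keys]
            )
            # Keys double as values in this simplified sketch.
            head_out = [
                sum(w * k[d] for w, k in zip(weights, keys))
                for d in range(head_dim)
            ]
            out_t.extend(head_out)
        outputs.append(out_t)
    return outputs
```

With `strides=[1, 2]`, for example, the first head attends to adjacent frames while the second skips every other frame, so the two heads cover different temporal spans around the same center frame at the same cost per head.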

DOI: 10.21437/Interspeech.2019-1973

Cite as: Han, K.J., Huang, J., Tang, Y., He, X., Zhou, B. (2019) Multi-Stride Self-Attention for Speech Recognition. Proc. Interspeech 2019, 2788-2792, DOI: 10.21437/Interspeech.2019-1973.

  author={Kyu J. Han and Jing Huang and Yun Tang and Xiaodong He and Bowen Zhou},
  title={{Multi-Stride Self-Attention for Speech Recognition}},
  booktitle={Proc. Interspeech 2019},