Self-Attention Transducers for End-to-End Speech Recognition

Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengqi Wen


Recurrent neural network transducers (RNN-T) have been successfully applied in end-to-end speech recognition. However, the recurrent structure makes parallelization difficult. In this paper, we propose a self-attention transducer (SA-T) for speech recognition. RNNs are replaced with self-attention blocks, which are powerful at modeling long-term dependencies within sequences and can be parallelized efficiently. Furthermore, a path-aware regularization is proposed to help the SA-T learn alignments and improve performance. Additionally, a chunk-flow mechanism is utilized to achieve online decoding. All experiments are conducted on the Mandarin Chinese dataset AISHELL-1. The results demonstrate that our proposed approach achieves a 21.3% relative reduction in character error rate compared with the baseline RNN-T. In addition, the SA-T with the chunk-flow mechanism can perform online decoding with only slight performance degradation.
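The chunk-flow idea described above can be illustrated as a chunk-wise attention mask: each position attends only to positions in its own chunk and a fixed number of preceding chunks, so decoding can proceed without seeing the full future context. The sketch below is a minimal, illustrative implementation in plain Python; the function names, the single-head formulation, and the exact masking rule are assumptions for demonstration, not the paper's actual configuration.

```python
import math

def chunk_flow_mask(seq_len, chunk_size, left_chunks):
    """Illustrative chunk-wise mask: position i may attend to position j only
    if j's chunk is i's own chunk or one of the `left_chunks` chunks before it.
    (The exact windowing used in the paper may differ.)"""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            diff = i // chunk_size - j // chunk_size
            mask[i][j] = 0 <= diff <= left_chunks
    return mask

def masked_self_attention(x, mask):
    """Single-head scaled dot-product self-attention over row vectors x,
    with disallowed positions masked out before the softmax."""
    d = len(x[0])
    out = []
    for i, qi in enumerate(x):
        # Scaled dot-product scores, with masked positions set to -inf.
        scores = []
        for j, kj in enumerate(x):
            s = sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
            scores.append(s if mask[i][j] else float("-inf"))
        # Numerically stable softmax over the allowed positions.
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [wi / z for wi in w]
        # Weighted sum of value vectors (values tied to inputs here).
        out.append([sum(wi * vj[k] for wi, vj in zip(w, x)) for k in range(d)])
    return out
```

With `chunk_size=2` and `left_chunks=1`, frame 4 (chunk 2) can attend to frames 2-5 but not to frames 0-1, and never to future chunks; such a mask keeps the per-frame computation bounded, which is what enables streaming decoding.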


 DOI: 10.21437/Interspeech.2019-2203

Cite as: Tian, Z., Yi, J., Tao, J., Bai, Y., Wen, Z. (2019) Self-Attention Transducers for End-to-End Speech Recognition. Proc. Interspeech 2019, 4395-4399, DOI: 10.21437/Interspeech.2019-2203.


@inproceedings{Tian2019,
  author={Zhengkun Tian and Jiangyan Yi and Jianhua Tao and Ye Bai and Zhengqi Wen},
  title={{Self-Attention Transducers for End-to-End Speech Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4395--4399},
  doi={10.21437/Interspeech.2019-2203},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2203}
}