Monaural Multi-Talker Speech Recognition with Attention Mechanism and Gated Convolutional Networks

Xuankai Chang, Yanmin Qian, Dong Yu


To improve speech recognition accuracy in the multi-talker scenario, we propose a novel model architecture that incorporates the attention mechanism and gated convolutional networks (GCN) into our previously developed permutation invariant training based multi-talker speech recognition system (PIT-ASR). The new architecture has three components: an encoding transformer, an attention module, and a frame-level senone predictor. The encoding transformer first transforms a mixed-speech sequence into a sequence of embedding vectors. The attention mechanism then extracts an individual context vector from this embedding sequence for each speaker source. Finally, the predictor generates the senone posteriors for all speaker sources independently, using the knowledge from the context vectors. To obtain better embedding representations, we explore gated convolutional networks in the encoding transformer. Experimental results on the artificially mixed two-talker WSJ0 corpus show that the proposed model reduces the word error rate (WER) by more than 15% relative compared to our previous PIT-ASR system.
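The core building block of a gated convolutional network is the gated linear unit (GLU): two parallel convolutions over the input, one of which is passed through a sigmoid and used to gate the other. As a rough illustration of that mechanism only (not the paper's actual layer sizes, kernel widths, or training setup, which are not given in the abstract), a minimal NumPy sketch might look like this:

```python
import numpy as np

def conv1d(x, w, b):
    # Valid 1-D convolution over the time axis.
    # x: (time, in_dim), w: (kernel, in_dim, out_dim), b: (out_dim,)
    k = w.shape[0]
    T = x.shape[0] - k + 1
    return np.stack([
        np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1])) + b
        for t in range(T)
    ])

def glu_layer(x, w_a, b_a, w_b, b_b):
    """Gated linear unit as used in gated convolutional networks:
    h = conv_a(x) * sigmoid(conv_b(x)).
    The sigmoid path acts as a per-feature gate in (0, 1) that
    controls how much of the linear path is passed through."""
    a = conv1d(x, w_a, b_a)
    g = 1.0 / (1.0 + np.exp(-conv1d(x, w_b, b_b)))  # sigmoid gate
    return a * g
```

In the encoder, stacking such gated layers replaces a plain convolution-plus-nonlinearity, letting the network learn which frame-level features of the mixed speech to propagate.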


DOI: 10.21437/Interspeech.2018-1547

Cite as: Chang, X., Qian, Y., Yu, D. (2018) Monaural Multi-Talker Speech Recognition with Attention Mechanism and Gated Convolutional Networks. Proc. Interspeech 2018, 1586-1590, DOI: 10.21437/Interspeech.2018-1547.


@inproceedings{Chang2018,
  author={Xuankai Chang and Yanmin Qian and Dong Yu},
  title={Monaural Multi-Talker Speech Recognition with Attention Mechanism and Gated Convolutional Networks},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={1586--1590},
  doi={10.21437/Interspeech.2018-1547},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1547}
}