ISCA Archive Interspeech 2021

MIMO Self-Attentive RNN Beamformer for Multi-Speaker Speech Separation

Xiyun Li, Yong Xu, Meng Yu, Shi-Xiong Zhang, Jiaming Xu, Bo Xu, Dong Yu

Recently, our proposed recurrent neural network (RNN) based all-deep-learning minimum variance distortionless response (ADL-MVDR) beamformer yielded superior performance over the conventional MVDR by replacing the matrix inversion and eigenvalue decomposition with two RNNs. In this work, we present a self-attentive RNN beamformer that further improves our previous RNN-based beamformer by leveraging the powerful modeling capability of self-attention. A temporal-spatial self-attention module is proposed to better learn the beamforming weights from the speech and noise spatial covariance matrices. The temporal self-attention module helps the RNN learn global statistics of the covariance matrices, while the spatial self-attention module is designed to attend to the cross-channel correlations in the covariance matrices. Furthermore, a model with multi-channel input, multi-speaker directional features, and multi-speaker speech separation outputs (MIMO) is developed to improve inference efficiency. The evaluations demonstrate that our proposed MIMO self-attentive RNN beamformer improves both automatic speech recognition (ASR) accuracy and perceptual evaluation of speech quality (PESQ) scores over prior art.
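The temporal self-attention idea described above can be illustrated as scaled dot-product self-attention applied across time frames of flattened covariance features, so that each frame's beamforming cue is informed by global statistics. The sketch below is a minimal NumPy illustration under assumed shapes, not the authors' implementation; the function name, projection matrices, and feature layout are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(cov_feats, wq, wk, wv):
    """Scaled dot-product self-attention over time frames.

    cov_feats: (T, D) array of per-frame flattened covariance features
               (hypothetical layout; D might be, e.g., the real/imag parts
               of an M x M spatial covariance matrix stacked together).
    wq, wk, wv: (D, D) projection matrices (learned jointly in practice).
    """
    q, k, v = cov_feats @ wq, cov_feats @ wk, cov_feats @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, T) frame-to-frame affinities
    attn = softmax(scores, axis=-1)           # each row sums to 1
    return attn @ v                           # every frame mixes global context

# Toy usage: 50 frames, D = 18 (e.g., a flattened 3x3 complex covariance).
rng = np.random.default_rng(0)
T, D = 50, 18
x = rng.standard_normal((T, D))
wq, wk, wv = (0.1 * rng.standard_normal((D, D)) for _ in range(3))
out = temporal_self_attention(x, wq, wk, wv)  # shape (50, 18)
```

The spatial self-attention module would apply the same mechanism across the channel (rather than time) axis of the covariance features, attending to cross-channel correlations instead of temporal context.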

doi: 10.21437/Interspeech.2021-570

Cite as: Li, X., Xu, Y., Yu, M., Zhang, S.-X., Xu, J., Xu, B., Yu, D. (2021) MIMO Self-Attentive RNN Beamformer for Multi-Speaker Speech Separation. Proc. Interspeech 2021, 1119-1123, doi: 10.21437/Interspeech.2021-570

@inproceedings{li21_interspeech,
  author={Xiyun Li and Yong Xu and Meng Yu and Shi-Xiong Zhang and Jiaming Xu and Bo Xu and Dong Yu},
  title={{MIMO Self-Attentive RNN Beamformer for Multi-Speaker Speech Separation}},
  booktitle={Proc. Interspeech 2021},
  pages={1119--1123},
  doi={10.21437/Interspeech.2021-570},
  year={2021}
}