Multi-Channel Training for End-to-End Speaker Recognition Under Reverberant and Noisy Environment

Danwei Cai, Xiaoyi Qin, Ming Li


Despite the significant improvements in speaker recognition enabled by deep neural networks, performance remains unsatisfactory in far-field scenarios due to long-range fading, room reverberation, and environmental noise. In this study, we focus on far-field speaker recognition with a microphone array. We propose a multi-channel training framework for the deep speaker embedding neural network on noisy and reverberant data. The proposed framework processes time, frequency, and channel information simultaneously to learn a robust deep speaker embedding. Based on 2-dimensional and 3-dimensional convolution layers, we investigate different multi-channel training schemes. Experiments on simulated multi-channel reverberant and noisy data show that the proposed method obtains significant improvements over single-channel trained deep speaker embedding systems with front-end speech enhancement or multi-channel embedding fusion.
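The distinction between the 2-dimensional and 3-dimensional convolution schemes mentioned in the abstract can be sketched with naive NumPy convolutions over a stacked multi-channel spectrogram. The array shapes, kernel sizes, and helper functions below are illustrative assumptions for exposition, not the paper's actual implementation: a 2-D layer treats the microphone channels as input feature maps and collapses the channel axis, while a 3-D layer also slides a kernel along the channel axis and preserves cross-channel structure.

```python
import numpy as np

def conv3d_valid(x, k):
    """Naive valid-mode 3-D convolution (correlation) of x with kernel k."""
    cx, tx, fx = x.shape  # (channels, time frames, frequency bins)
    ck, tk, fk = k.shape
    out = np.zeros((cx - ck + 1, tx - tk + 1, fx - fk + 1))
    for c in range(out.shape[0]):
        for t in range(out.shape[1]):
            for f in range(out.shape[2]):
                out[c, t, f] = np.sum(x[c:c + ck, t:t + tk, f:f + fk] * k)
    return out

def conv2d_multichannel(x, k):
    """2-D scheme: the kernel spans all microphone channels at once,
    so the channel axis is summed out and the output is a 2-D map."""
    assert x.shape[0] == k.shape[0], "kernel must cover every channel"
    return conv3d_valid(x, k)[0]  # channel axis reduces to size 1

# Hypothetical 4-microphone input: 100 time frames x 64 frequency bins.
rng = np.random.default_rng(0)
spec = rng.standard_normal((4, 100, 64))

# 2-D scheme: one (4, 3, 3) kernel -> a single time-frequency feature map.
map_2d = conv2d_multichannel(spec, rng.standard_normal((4, 3, 3)))

# 3-D scheme: a (2, 3, 3) kernel also slides across the 4 microphone
# channels, keeping a (reduced) channel axis in the output.
map_3d = conv3d_valid(spec, rng.standard_normal((2, 3, 3)))

print(map_2d.shape)  # (98, 62)
print(map_3d.shape)  # (3, 98, 62)
```

In practice a framework layer (e.g. a 3-D convolution in a deep learning toolkit) would learn many such kernels; the sketch only makes the shape difference between the two schemes concrete.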


DOI: 10.21437/Interspeech.2019-1437

Cite as: Cai, D., Qin, X., Li, M. (2019) Multi-Channel Training for End-to-End Speaker Recognition Under Reverberant and Noisy Environment. Proc. Interspeech 2019, 4365-4369, DOI: 10.21437/Interspeech.2019-1437.


@inproceedings{Cai2019,
  author={Danwei Cai and Xiaoyi Qin and Ming Li},
  title={{Multi-Channel Training for End-to-End Speaker Recognition Under Reverberant and Noisy Environment}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4365--4369},
  doi={10.21437/Interspeech.2019-1437},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1437}
}