Deep Learning Based Multi-Channel Speaker Recognition in Noisy and Reverberant Environments

Hassan Taherian, Zhong-Qiu Wang, DeLiang Wang


Despite successful applications of multi-channel signal processing in robust automatic speech recognition (ASR), relatively little research has examined the effectiveness of such techniques for robust speaker recognition. This paper introduces time-frequency (T-F) masking-based beamforming to address text-independent speaker recognition in conditions where strong diffuse noise and reverberation are both present. We examine several masking-based beamformers, including the parameterized multi-channel Wiener filter, the generalized eigenvalue (GEV) beamformer, and the minimum variance distortionless response (MVDR) beamformer, and evaluate their performance in terms of speaker recognition accuracy for i-vector and x-vector based systems. In addition, we present an alternative formulation for estimating steering vectors from speech covariance matrices. We show that a rank-1 approximation of the speech covariance matrix based on generalized eigenvalue decomposition yields the best results for the masking-based MVDR beamformer. Experiments on the recently introduced NIST SRE 2010 retransmitted corpus show that the MVDR beamformer with rank-1 approximation achieves an absolute reduction of 5.55% in equal error rate compared with a standard masking-based MVDR beamformer.
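To make the key idea concrete, the following is a minimal NumPy/SciPy sketch of a masking-based MVDR beamformer with a rank-1, GEVD-based steering-vector estimate, operating on a single frequency bin. This is an illustration of the general technique, not the authors' implementation: the function names, the crude binary mask, and the simulated data are all assumptions for the demo. The steering vector is taken (up to scale) as d = Φ_n v, where v is the principal generalized eigenvector of the mask-estimated speech and noise covariance matrices (Φ_s, Φ_n); the MVDR weights w = Φ_n⁻¹ d / (dᴴ Φ_n⁻¹ d) then simplify because Φ_n⁻¹ d = v.

```python
import numpy as np
from scipy.linalg import eigh

def masked_covariance(Y, mask):
    """Mask-weighted spatial covariance at one frequency bin.

    Y:    (C, T) complex STFT coefficients (C channels, T frames)
    mask: (T,) real T-F mask values in [0, 1] for this bin
    """
    phi = np.einsum('t,ct,dt->cd', mask, Y, Y.conj())
    return phi / max(mask.sum(), 1e-8)

def mvdr_rank1_gevd(phi_s, phi_n):
    """MVDR weights with a rank-1 GEVD steering-vector estimate.

    The steering vector is d = phi_n @ v, where v is the principal
    generalized eigenvector of (phi_s, phi_n), i.e. a rank-1 model of
    the speech covariance. Since phi_n^{-1} d = v, the MVDR solution
    w = phi_n^{-1} d / (d^H phi_n^{-1} d) reduces to v / (d^H v).
    """
    _, vecs = eigh(phi_s, phi_n)   # generalized eigenvalues, ascending
    v = vecs[:, -1]                # principal generalized eigenvector
    d = phi_n @ v                  # rank-1 steering vector (up to scale)
    w = v / np.vdot(d, v)          # np.vdot conjugates d, giving d^H v
    return w, d
```

Given `w` for a bin, the beamformed output over frames is `np.einsum('c,ct->t', w.conj(), Y)`; by construction the weights satisfy the distortionless constraint wᴴd = 1 in the estimated steering direction.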


DOI: 10.21437/Interspeech.2019-1428

Cite as: Taherian, H., Wang, Z., Wang, D. (2019) Deep Learning Based Multi-Channel Speaker Recognition in Noisy and Reverberant Environments. Proc. Interspeech 2019, 4070-4074, DOI: 10.21437/Interspeech.2019-1428.


@inproceedings{Taherian2019,
  author={Hassan Taherian and Zhong-Qiu Wang and DeLiang Wang},
  title={{Deep Learning Based Multi-Channel Speaker Recognition in Noisy and Reverberant Environments}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4070--4074},
  doi={10.21437/Interspeech.2019-1428},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1428}
}