Speaker recognition (SR) is inevitably affected by noise in real-life scenarios, resulting in decreased recognition accuracy. In this paper, we introduce a novel regularization method, the variational information bottleneck (VIB), into speaker recognition to extract robust speaker embeddings. VIB encourages the neural network to discard as much speaker-identity-irrelevant information as possible. We also propose a more effective network, VoVNet with an ultra-lightweight subspace attention module (ULSAM), as a feature extractor. ULSAM infers a separate attention map for each feature-map subspace, enabling efficient learning of cross-channel information along with multi-scale and multi-frequency feature representations. The experimental results demonstrate that our proposed framework outperforms the ResNet-based baseline with an 11.4% reduction in equal error rate (EER), and the VIB regularization yields a further 18.9% EER reduction.
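The core idea of VIB regularization can be illustrated with a minimal sketch, assuming a PyTorch implementation: the embedding layer predicts a mean and log-variance, samples the embedding via the reparameterization trick, and adds a KL penalty against a standard normal prior to the task loss. Layer sizes, the beta weight, and all module and variable names below are illustrative assumptions, not the paper's actual code.

import torch
import torch.nn as nn

class VIBLayer(nn.Module):
    """Variational information bottleneck over a pooled utterance feature."""
    def __init__(self, in_dim: int, emb_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, emb_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(in_dim, emb_dim)   # log-variance of q(z|x)

    def forward(self, h: torch.Tensor):
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
        # so sampling stays differentiable with respect to mu and logvar.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # KL(q(z|x) || N(0, I)), averaged over the batch; this term penalizes
        # any information kept in z beyond what the speaker-ID loss demands.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=1).mean()
        return z, kl

# Hypothetical usage with a frontend (e.g., a VoVNet feature extractor)
# producing a pooled feature h of size in_dim:
#   z, kl = vib(h)
#   loss = criterion(classifier(z), labels) + beta * kl
# where beta trades off task accuracy against how much information z retains.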
Cite as: Wang, D., Dong, Y., Li, Y., Zi, Y., Zhang, Z., Li, X., Xiong, S. (2021) Variational Information Bottleneck Based Regularization for Speaker Recognition. Proc. Interspeech 2021, 1054-1058, doi: 10.21437/Interspeech.2021-482
@inproceedings{wang21j_interspeech,
  author={Dan Wang and Yuanjie Dong and Yaxing Li and Yunfei Zi and Zhihui Zhang and Xiaoqi Li and Shengwu Xiong},
  title={{Variational Information Bottleneck Based Regularization for Speaker Recognition}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={1054--1058},
  doi={10.21437/Interspeech.2021-482}
}