Analysis of Complementary Information Sources in the Speaker Embeddings Framework

Mahesh Kumar Nandwana, Mitchell McLaren, Diego Castan, Julien van Hout, Aaron Lawson


Deep neural network (DNN)-based speaker embeddings have resulted in new state-of-the-art text-independent speaker recognition technology. However, very limited effort has been made to understand DNN speaker embeddings. In this study, our aim is to analyze the behavior of speaker recognition systems based on speaker embeddings with respect to different front-end features, including the standard Mel-frequency cepstral coefficients (MFCC), as well as power-normalized cepstral coefficients (PNCC) and perceptual linear prediction (PLP). Using a speaker recognition system based on DNN speaker embeddings and probabilistic linear discriminant analysis (PLDA), we compared different approaches to leveraging complementary information using score-, embedding-, and feature-level combination. We report our results on the Speakers in the Wild (SITW) and NIST SRE 2016 datasets. We found that the first and second embedding layers are complementary in nature. By applying score- and embedding-level fusion, we demonstrate relative improvements in equal error rate of 17% on NIST SRE 2016 and 10% on SITW over the baseline system.
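The two fusion strategies that yield the reported gains can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the fusion weight is fixed rather than learned via calibration, and cosine similarity stands in for the PLDA back-end used in the paper.

```python
import numpy as np

def score_fusion(scores_a, scores_b, w=0.5):
    """Score-level fusion: weighted sum of per-trial scores from two
    systems (in practice the weight is learned by calibration)."""
    return w * np.asarray(scores_a) + (1.0 - w) * np.asarray(scores_b)

def embedding_fusion(emb_layer1, emb_layer2):
    """Embedding-level fusion: concatenate embeddings extracted from two
    DNN layers before back-end scoring."""
    return np.concatenate([emb_layer1, emb_layer2], axis=-1)

def cosine_score(enroll, test):
    """Simple cosine-similarity scorer, standing in for PLDA."""
    e = enroll / np.linalg.norm(enroll)
    t = test / np.linalg.norm(test)
    return float(np.dot(e, t))
```

Score-level fusion combines systems after scoring, so each system keeps its own back-end; embedding-level fusion combines representations earlier, letting a single back-end model the joint space of the two layers.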


DOI: 10.21437/Interspeech.2018-1102

Cite as: Nandwana, M.K., McLaren, M., Castan, D., van Hout, J., Lawson, A. (2018) Analysis of Complementary Information Sources in the Speaker Embeddings Framework. Proc. Interspeech 2018, 3568-3572, DOI: 10.21437/Interspeech.2018-1102.


@inproceedings{Nandwana2018,
  author={Mahesh Kumar Nandwana and Mitchell McLaren and Diego Castan and Julien {van Hout} and Aaron Lawson},
  title={Analysis of Complementary Information Sources in the Speaker Embeddings Framework},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3568--3572},
  doi={10.21437/Interspeech.2018-1102},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1102}
}