Deeply Fused Speaker Embeddings for Text-Independent Speaker Verification

Gautam Bhattacharya, Md Jahangir Alam, Vishwa Gupta, Patrick Kenny


Recently there has been a surge of interest in learning speaker embeddings using deep neural networks. These models ingest time-frequency representations of speech and can be trained to discriminate between a known set of speakers. While embeddings learned in this way perform well, they typically require a large amount of training data. In this work we propose deeply fused speaker embeddings - speaker representations that combine neural speaker embeddings with i-vectors. We show that by combining the two speaker representations we are able to learn robust speaker embeddings in a computationally efficient manner. We compare several fusion strategies and find that the resulting speaker embeddings differ significantly in verification performance. To this end we propose a novel fusion approach that uses an attention model to combine i-vectors with neural speaker embeddings. Our best performing embedding achieves an error rate of 3.17% using a simple cosine distance classifier. Combining our embeddings with a powerful Joint Bayesian classifier further improves the error rate to 2.22%, a 7.8% relative improvement over the baseline i-vector system.
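The abstract describes attention-based fusion of an i-vector with a neural speaker embedding, scored with cosine distance. As a minimal sketch of what such a fusion could look like (the projection matrices `W_i`, `W_d` and scoring vector `v` are hypothetical placeholders, not the paper's actual parameters):

```python
import numpy as np

def attention_fuse(ivector, dnn_embedding, W_i, W_d, v):
    """Fuse an i-vector and a neural embedding with scalar attention.

    Hypothetical parameters: W_i and W_d project each representation into
    a shared space; v assigns each projected vector a relevance score.
    A softmax over the two scores yields attention weights for a convex
    combination of the projections.
    """
    z_i = np.tanh(W_i @ ivector)         # projected i-vector
    z_d = np.tanh(W_d @ dnn_embedding)   # projected neural embedding
    scores = np.array([v @ z_i, v @ z_d])
    scores -= scores.max()               # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax
    return weights[0] * z_i + weights[1] * z_d

def cosine_score(e1, e2):
    """Cosine similarity, used here as the verification score."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))
```

In practice the projections and attention parameters would be learned jointly with the speaker-discrimination objective; this sketch only illustrates the fusion mechanics at inference time.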


DOI: 10.21437/Interspeech.2018-1688

Cite as: Bhattacharya, G., Alam, M.J., Gupta, V., Kenny, P. (2018) Deeply Fused Speaker Embeddings for Text-Independent Speaker Verification. Proc. Interspeech 2018, 3588-3592, DOI: 10.21437/Interspeech.2018-1688.


@inproceedings{Bhattacharya2018,
  author={Gautam Bhattacharya and Md Jahangir Alam and Vishwa Gupta and Patrick Kenny},
  title={Deeply Fused Speaker Embeddings for Text-Independent Speaker Verification},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3588--3592},
  doi={10.21437/Interspeech.2018-1688},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1688}
}