Triplet Loss Based Cosine Similarity Metric Learning for Text-independent Speaker Recognition

Sergey Novoselov, Vadim Shchemelinin, Andrey Shulipa, Alexandr Kozlov, Ivan Kremnev


Deep neural network (DNN) based speaker embeddings have become increasingly popular for the text-independent speaker recognition task. In contrast to a generatively trained i-vector extractor, a DNN speaker embedding extractor is usually trained discriminatively in a closed-set classification scenario using a softmax objective. The problem addressed in this paper is choosing a backend solution for speaker verification scoring in the DNN embedding space. There are several options for performing speaker verification in this space. One of them is using a simple heuristic speaker similarity metric for scoring (e.g. the cosine metric). As in i-vector based systems, standard Linear Discriminant Analysis (LDA) followed by Probabilistic Linear Discriminant Analysis (PLDA) can be used to segregate speaker information. As an alternative, a discriminative metric learning approach can be considered. This work demonstrates that the performance of deep speaker embedding based systems can be improved by using Cosine Similarity Metric Learning (CSML) with a triplet loss training scheme. Results obtained on the Speakers in the Wild and NIST SRE 2016 evaluation sets demonstrate the superiority and robustness of CSML based systems.
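To illustrate the idea of triplet loss training in the cosine-similarity space, a minimal NumPy sketch follows. The exact loss formulation and the margin value are assumptions for illustration (a standard hinge-style triplet loss with a hypothetical margin of 0.2), not the paper's precise CSML objective: it pushes the same-speaker (anchor–positive) cosine similarity above the different-speaker (anchor–negative) similarity by at least the margin.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_cosine_loss(anchor, positive, negative, margin=0.2):
    # Hinge-style triplet loss in the cosine-similarity space
    # (illustrative; margin value is a hypothetical choice):
    # loss = max(0, margin - s(anchor, positive) + s(anchor, negative)).
    s_ap = cosine_similarity(anchor, positive)
    s_an = cosine_similarity(anchor, negative)
    return max(0.0, margin - s_ap + s_an)

# A well-separated triplet incurs zero loss; a violated one is penalized.
a = np.array([1.0, 0.0])
good = triplet_cosine_loss(a, np.array([0.9, 0.1]), np.array([0.0, 1.0]))
bad = triplet_cosine_loss(a, np.array([0.0, 1.0]), np.array([1.0, 0.0]))
```

In a full system, the gradient of this loss with respect to the embeddings (or a learned linear transform applied to them) would drive the metric learning updates.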


 DOI: 10.21437/Interspeech.2018-1209

Cite as: Novoselov, S., Shchemelinin, V., Shulipa, A., Kozlov, A., Kremnev, I. (2018) Triplet Loss Based Cosine Similarity Metric Learning for Text-independent Speaker Recognition. Proc. Interspeech 2018, 2242-2246, DOI: 10.21437/Interspeech.2018-1209.


@inproceedings{Novoselov2018,
  author={Sergey Novoselov and Vadim Shchemelinin and Andrey Shulipa and Alexandr Kozlov and Ivan Kremnev},
  title={Triplet Loss Based Cosine Similarity Metric Learning for Text-independent Speaker Recognition},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2242--2246},
  doi={10.21437/Interspeech.2018-1209},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1209}
}