Deep Discriminative Embeddings for Duration Robust Speaker Verification

Na Li, Deyi Tuo, Dan Su, Zhifeng Li, Dong Yu


Embedding-based deep convolutional neural networks (CNNs) have proven effective for text-independent speaker verification with short utterances. However, the duration robustness of existing deep CNN based algorithms has not been investigated for utterances of arbitrary duration. To improve the robustness of embedding-based deep CNNs to longer utterances, we propose a novel algorithm that learns more discriminative utterance-level embeddings based on an Inception-ResNet speaker classifier. Specifically, the discriminability of the embeddings is enhanced by reducing intra-speaker variation with the center loss while simultaneously increasing inter-speaker discrepancy with the softmax loss. To further improve performance when long utterances are available, at the test stage long utterances are segmented into shorter ones, and utterance-level speaker embeddings are extracted by an average pooling layer. Experimental results show that, with cosine distance as the similarity measure for a trial, the proposed method outperforms the i-vector/PLDA framework for short utterances and remains effective for long utterances.
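The training objective and the scoring procedure described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the center-loss weight `lam=0.01`, and the use of per-class center vectors follow the standard center-loss formulation (Wen et al.), and the trial scoring simply averages segment-level embeddings before taking cosine similarity.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Softmax loss: pushes inter-speaker discrepancy apart."""
    # numerically stable log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def center_loss(embeddings, labels, centers):
    """Center loss: penalizes distance of each embedding to its
    speaker's center, shrinking intra-speaker variation."""
    diff = embeddings - centers[labels]
    return 0.5 * (diff ** 2).sum(axis=1).mean()

def joint_loss(logits, embeddings, labels, centers, lam=0.01):
    # lam balances the two terms; 0.01 is an illustrative value,
    # not a setting reported in the paper
    return softmax_cross_entropy(logits, labels) + lam * center_loss(
        embeddings, labels, centers)

def score_trial(enroll_segments, test_segments):
    """Average segment-level embeddings into utterance-level ones
    (the average pooling step), then score with cosine similarity."""
    e = enroll_segments.mean(axis=0)
    t = test_segments.mean(axis=0)
    return float(e @ t / (np.linalg.norm(e) * np.linalg.norm(t)))
```

In a real system the embeddings and logits would come from the Inception-ResNet classifier, and the centers would be updated during training alongside the network parameters.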


DOI: 10.21437/Interspeech.2018-1769

Cite as: Li, N., Tuo, D., Su, D., Li, Z., Yu, D. (2018) Deep Discriminative Embeddings for Duration Robust Speaker Verification. Proc. Interspeech 2018, 2262-2266, DOI: 10.21437/Interspeech.2018-1769.


@inproceedings{Li2018,
  author={Na Li and Deyi Tuo and Dan Su and Zhifeng Li and Dong Yu},
  title={Deep Discriminative Embeddings for Duration Robust Speaker Verification},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={2262--2266},
  doi={10.21437/Interspeech.2018-1769},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1769}
}