End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances

Chunlei Zhang, Kazuhito Koishida


Text-independent speaker verification on short utterances remains challenging despite recent advances in speaker recognition with the i-vector framework. In general, obtaining a robust i-vector representation requires a sufficient amount of data in the MAP adaptation step, a requirement that is hard to meet under short-duration constraints. To overcome this, we present an end-to-end system that directly learns a mapping from speech features to a compact, fixed-length, speaker-discriminative embedding, where the Euclidean distance is employed to measure similarity within trials. To learn the feature mapping, a modified Inception Net with residual blocks is proposed to optimize the triplet loss function. The input to our end-to-end system is a fixed-length spectrogram converted from an utterance of arbitrary length. Experiments show that our system consistently outperforms a conventional i-vector system on short-duration speaker verification tasks. To test its limits under various duration conditions, we also demonstrate how the end-to-end system behaves with durations ranging from 2s to 4s.
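As a rough illustration of the training objective and scoring described in the abstract (a minimal sketch, not the authors' implementation), the Python snippet below computes a hinge-style triplet loss on L2-normalized embeddings and scores a verification trial by Euclidean distance; the margin and decision threshold values are placeholder assumptions, not taken from the paper.

import numpy as np

def l2_normalize(x, eps=1e-12):
    # Project an embedding onto the unit hypersphere.
    return x / (np.linalg.norm(x) + eps)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor/positive come from the same speaker, negative from a
    # different speaker; margin=0.2 is an assumed value.
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    d_ap = np.sum((a - p) ** 2)   # squared distance to same-speaker embedding
    d_an = np.sum((a - n) ** 2)   # squared distance to different-speaker embedding
    return max(0.0, d_ap - d_an + margin)

def verify(emb1, emb2, threshold=0.8):
    # At test time a trial is scored by the Euclidean distance between
    # the two utterance embeddings; threshold=0.8 is an assumed value.
    d = np.linalg.norm(l2_normalize(emb1) - l2_normalize(emb2))
    return d < threshold

In practice the loss would be minimized over mini-batches of triplets drawn from the training speakers, pulling same-speaker embeddings together and pushing different-speaker embeddings apart by at least the margin.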


 DOI: 10.21437/Interspeech.2017-1608

Cite as: Zhang, C., Koishida, K. (2017) End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances. Proc. Interspeech 2017, 1487-1491, DOI: 10.21437/Interspeech.2017-1608.


@inproceedings{Zhang2017,
  author={Chunlei Zhang and Kazuhito Koishida},
  title={End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1487--1491},
  doi={10.21437/Interspeech.2017-1608},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1608}
}