Text-independent speaker verification on short utterances remains challenging despite recent advances in speaker recognition with the i-vector framework. In general, a robust i-vector representation requires a sufficient amount of data in the MAP adaptation step, a requirement that is hard to satisfy under short-duration constraints. To overcome this, we present an end-to-end system that directly learns a mapping from speech features to a compact, fixed-length, speaker-discriminative embedding, where Euclidean distance is used to measure the similarity of a trial. To learn this mapping, a modified Inception net with residual blocks is proposed to optimize a triplet loss function. The input to our end-to-end system is a fixed-length spectrogram converted from an utterance of arbitrary length. Experiments show that our system consistently outperforms a conventional i-vector system on short-duration speaker verification tasks. To probe its limits under various duration conditions, we also demonstrate how the end-to-end system behaves as utterance duration varies from 2 s to 4 s.
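The paper itself does not include code; the following is a minimal sketch of the triplet loss and Euclidean-distance trial scoring described in the abstract. The margin value, function names, and thresholding scheme are illustrative assumptions, not taken from the paper.

import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on speaker embeddings.

    anchor and positive are embeddings from the same speaker;
    negative is from a different speaker. The margin value here
    is an illustrative choice, not the one used in the paper.
    """
    d_ap = np.sum((anchor - positive) ** 2, axis=-1)  # squared Euclidean distance
    d_an = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_ap - d_an + margin, 0.0).mean()

def score_trial(emb_a, emb_b):
    """Score a verification trial; higher means more similar.

    Accept the trial if the score exceeds a threshold tuned on a
    development set (the threshold is not specified in the abstract).
    """
    return -np.linalg.norm(emb_a - emb_b)

In training, the loss pushes same-speaker pairs closer than different-speaker pairs by at least the margin, so the resulting embedding space supports verification by simple distance comparison.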
Cite as: Zhang, C., Koishida, K. (2017) End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances. Proc. Interspeech 2017, 1487-1491, doi: 10.21437/Interspeech.2017-1608
@inproceedings{zhang17d_interspeech,
  author={Chunlei Zhang and Kazuhito Koishida},
  title={{End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1487--1491},
  doi={10.21437/Interspeech.2017-1608}
}