An End-to-End Text-Independent Speaker Identification System on Short Utterances

Ruifang Ji, Xinyuan Cai, Xu Bo


In speaker recognition, text-independent speaker identification on short utterances remains a challenging task, since it is difficult to extract robust and discriminative speaker features from short-duration speech. This paper explores an end-to-end speaker identification system that maps utterances to a speaker identity subspace where the similarity of speakers can be measured by Euclidean distance. Specifically, we apply GRU architectures to extract utterance-level features. We then assume that a speaker’s various utterances can be viewed as transformations of a single object in an ideal speaker identity subspace. Based on this assumption, a ResCNN architecture is used to model the transformation, and the whole system is jointly optimized with a speaker identity subspace loss. Experimental results demonstrate the effectiveness of the proposed system and its superiority over previous methods. For example, the GRU-learned features yield a 27.53% relative reduction in equal error rate, and the speaker identity subspace loss brings a further 7.22% relative reduction compared to softmax loss.
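The idea of comparing speakers by Euclidean distance in an embedding subspace can be illustrated with a minimal sketch. It is not the authors' implementation: the GRU and ResCNN stages are omitted entirely, the two-dimensional vectors stand in for learned utterance embeddings, and the names `centroid`, `identify`, `speaker_a`, and `speaker_b` are hypothetical. Averaging enrollment embeddings into one centroid per speaker is also an assumption made here for illustration.

```python
import math

def centroid(vectors):
    """Average a list of equal-length embedding vectors component-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def identify(embedding, enrolled):
    """Return the enrolled speaker whose centroid is closest in Euclidean distance."""
    return min(enrolled, key=lambda speaker: math.dist(embedding, enrolled[speaker]))

# Toy enrollment data: made-up 2-D embeddings, two utterances per speaker.
enrolled = {
    "speaker_a": centroid([[0.9, 0.1], [1.1, -0.1]]),
    "speaker_b": centroid([[-1.0, 0.8], [-0.8, 1.0]]),
}

# A made-up embedding for a short test utterance.
result = identify([0.7, 0.2], enrolled)
print(result)
```

With these toy values the test embedding lies much closer to the first centroid than to the second, so the nearest-centroid rule selects `speaker_a`.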


DOI: 10.21437/Interspeech.2018-1058

Cite as: Ji, R., Cai, X., Bo, X. (2018) An End-to-End Text-Independent Speaker Identification System on Short Utterances. Proc. Interspeech 2018, 3628-3632, DOI: 10.21437/Interspeech.2018-1058.


@inproceedings{Ji2018,
  author={Ruifang Ji and Xinyuan Cai and Xu Bo},
  title={An End-to-End Text-Independent Speaker Identification System on Short Utterances},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3628--3632},
  doi={10.21437/Interspeech.2018-1058},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1058}
}