Training Utterance-level Embedding Networks for Speaker Identification and Verification

Heewoong Park, Sukhyun Cho, Kyubyong Park, Namju Kim, Jonghun Park


Encoding speaker-specific characteristics from speech signals into fixed-length vectors is a key component of speaker identification and verification systems. This paper presents a deep neural network architecture for speaker embedding models in which the similarity between embedded utterance vectors explicitly approximates the similarity between the vocal patterns of speakers. The proposed architecture contains an additional speaker embedding lookup table that is used to compute a loss based on embedding similarities. Furthermore, we propose a new feature sampling method for data augmentation. Experiments on two databases demonstrate that our model is more effective at speaker identification and verification than a fully connected classifier and an end-to-end verification model.
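The idea of a loss computed from similarities between utterance embeddings and a speaker lookup table can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the loss, so the choices below are assumptions made only for illustration: cosine similarity, a softmax cross-entropy over speakers, and a hypothetical `scale` factor. The function names `l2_normalize` and `similarity_loss` are likewise hypothetical and do not come from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def similarity_loss(utt_emb, speaker_table, speaker_ids, scale=10.0):
    """Cross-entropy over cosine similarities (an assumed loss, not the paper's).

    utt_emb:       (batch, dim) utterance embeddings from the network
    speaker_table: (num_speakers, dim) speaker embedding lookup table
    speaker_ids:   (batch,) integer index of the true speaker per utterance
    scale:         hypothetical factor that sharpens the softmax
    """
    u = l2_normalize(utt_emb)
    s = l2_normalize(speaker_table)
    logits = scale * (u @ s.T)                       # (batch, num_speakers)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(speaker_ids)), speaker_ids].mean()


# Toy data: utterance embeddings that lie close to their speaker's table entry.
rng = np.random.default_rng(0)
speaker_table = rng.normal(size=(4, 8))
speaker_ids = np.array([0, 2, 3])
utt_emb = speaker_table[speaker_ids] + 0.01 * rng.normal(size=(3, 8))

loss_matched = similarity_loss(utt_emb, speaker_table, speaker_ids)
loss_mismatched = similarity_loss(utt_emb, speaker_table, np.array([1, 0, 2]))
```

In this toy setup the loss is lower when each utterance is paired with its own speaker entry than when it is paired with a different one. Under the same assumption, verification could compare two utterance embeddings directly by their cosine similarity.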


DOI: 10.21437/Interspeech.2018-1044

Cite as: Park, H., Cho, S., Park, K., Kim, N., Park, J. (2018) Training Utterance-level Embedding Networks for Speaker Identification and Verification. Proc. Interspeech 2018, 3563-3567, DOI: 10.21437/Interspeech.2018-1044.


@inproceedings{Park2018,
  author={Heewoong Park and Sukhyun Cho and Kyubyong Park and Namju Kim and Jonghun Park},
  title={Training Utterance-level Embedding Networks for Speaker Identification and Verification},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3563--3567},
  doi={10.21437/Interspeech.2018-1044},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1044}
}