Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification

Yingke Zhu, Tom Ko, David Snyder, Brian Mak, Daniel Povey


This paper introduces a new method to extract speaker embeddings from a deep neural network (DNN) for text-independent speaker verification. Usually, speaker embeddings are extracted from a speaker-classification DNN that averages the hidden vectors over the frames of a speaker; the hidden vectors produced from all the frames are assumed to be equally important. We relax this assumption and compute the speaker embedding as a weighted average of a speaker's frame-level hidden vectors, whose weights are automatically determined by a self-attention mechanism. The effect of multiple attention heads is also investigated, to capture different aspects of a speaker's input speech. Finally, a PLDA classifier is used to compare pairs of embeddings. The proposed self-attentive speaker embedding system is compared with a strong DNN embedding baseline on NIST SRE 2016. We find that the self-attentive embeddings achieve superior performance. Moreover, the improvement produced by the self-attentive speaker embeddings is consistent across both short and long test utterances.
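The pooling step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes a structured self-attention form in which per-frame scores come from a small feed-forward transform (`W`, `v` stand in for the learnable parameters), a softmax over frames yields the weights, and each attention head produces its own weighted average of the frame-level hidden vectors, with head outputs concatenated. All dimension names and the random inputs are illustrative.

```python
import numpy as np

def self_attentive_pooling(H, W, v):
    """Pool frame-level hidden vectors H (T x d) into one embedding.

    Sketch of self-attentive pooling (illustrative, not the paper's code):
    per-frame scores = tanh(H W^T) v^T, softmaxed over the T frames.
    With multiple rows in v (attention heads), each head computes its own
    weighted average of the frames; head outputs are concatenated.
    """
    scores = np.tanh(H @ W.T) @ v.T            # (T, heads)
    scores -= scores.max(axis=0, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=0, keepdims=True)          # softmax over frames, per head
    pooled = A.T @ H                           # (heads, d) weighted averages
    return pooled.reshape(-1)                  # concatenate heads -> (heads*d,)

# Illustrative shapes: 50 frames, 8-dim hidden vectors, 2 attention heads.
rng = np.random.default_rng(0)
T, d, da, heads = 50, 8, 16, 2
H = rng.normal(size=(T, d))        # frame-level hidden vectors
W = rng.normal(size=(da, d))       # attention projection (learned in practice)
v = rng.normal(size=(heads, da))   # per-head scoring vectors (learned in practice)
emb = self_attentive_pooling(H, W, v)
print(emb.shape)
```

With a single head and a constant scoring vector, the softmax weights become uniform and the operation reduces to the plain frame averaging that the paper relaxes.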


DOI: 10.21437/Interspeech.2018-1158

Cite as: Zhu, Y., Ko, T., Snyder, D., Mak, B., Povey, D. (2018) Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification. Proc. Interspeech 2018, 3573-3577, DOI: 10.21437/Interspeech.2018-1158.


@inproceedings{Zhu2018,
  author={Yingke Zhu and Tom Ko and David Snyder and Brian Mak and Daniel Povey},
  title={Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3573--3577},
  doi={10.21437/Interspeech.2018-1158},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1158}
}