What Does the Speaker Embedding Encode?

Shuai Wang, Yanmin Qian, Kai Yu

Developing a good speaker embedding has received tremendous interest in the speech community. Speaker representations such as the i-vector and d-vector have shown their superiority in speaker recognition, speaker adaptation and other related tasks. However, not much is known about exactly which properties are encoded in these speaker embeddings. In this work, we make an in-depth investigation of three kinds of speaker embeddings: the i-vector, the d-vector and the RNN/LSTM based sequence-vector (s-vector). Classification tasks are carefully designed to facilitate a better understanding of these encoded speaker representations. Their abilities to encode different properties, such as speaker identity, gender, speaking rate, text content and channel information, are revealed and compared. Moreover, a new architecture is proposed to integrate different speaker embeddings so that their advantages can be combined. The resulting speaker embedding (i-s-vector) outperforms the others and shows an EER reduction of more than 50% compared to the i-vector baseline on the RSR2015 content mismatch trials.

DOI: 10.21437/Interspeech.2017-1125

Cite as: Wang, S., Qian, Y., Yu, K. (2017) What Does the Speaker Embedding Encode?. Proc. Interspeech 2017, 1497-1501, DOI: 10.21437/Interspeech.2017-1125.

  author={Shuai Wang and Yanmin Qian and Kai Yu},
  title={What Does the Speaker Embedding Encode?},
  booktitle={Proc. Interspeech 2017},