Information Preservation Pooling for Speaker Embedding

Min Hyun Han, Woo Hyun Kang, Sung Hwan Mun, Nam Soo Kim


Many recent studies on speaker embedding have focused on the pooling technique. In speaker recognition, pooling plays the important role of summarizing variable-length inputs into a fixed-dimensional output. One of the most popular pooling methods for text-independent speaker verification systems is attention-based pooling, which uses an attention mechanism to assign a different weight to each frame. Utterance-level features are then generated by computing the weighted means and standard deviations of the frame-level features. However, useful information in the frame-level features can be compromised during the pooling step. In this paper, we propose an information preservation pooling method that exploits a mutual information neural estimator to preserve local information in the frame-level features during the pooling step. We conducted the evaluation on the VoxCeleb datasets, which shows that the proposed method reduces the equal error rate by 14.6% compared with the conventional method.
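The attention-based pooling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `softmax` and `attentive_stats_pooling` are hypothetical, the attention scores are assumed to be given as input (in practice they would come from a small trainable network), and the proposed mutual information term is not modeled here.

```python
import math


def softmax(scores):
    """Normalize raw attention scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(frames, scores, eps=1e-12):
    """Pool T frame-level vectors into one fixed-size utterance-level vector.

    frames: list of T frame-level features, each a list of D floats.
    scores: list of T unnormalized attention scores (assumed given here).
    Returns a list of 2*D floats: weighted means followed by weighted
    standard deviations.
    """
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted mean of each feature dimension over all frames.
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    # Weighted variance around that mean, then the standard deviation.
    var = [
        sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, frames))
        for d in range(dim)
    ]
    std = [math.sqrt(max(v, 0.0) + eps) for v in var]
    return mean + std
```

Calling `attentive_stats_pooling` on utterances with different numbers of frames returns vectors of the same length (twice the frame dimension), which is the fixed-dimensional summary the abstract refers to. With equal scores the weights are uniform and the result reduces to plain statistics pooling.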


DOI: 10.21437/Odyssey.2020-9

Cite as: Han, M.H., Kang, W.H., Mun, S.H., Kim, N.S. (2020) Information Preservation Pooling for Speaker Embedding. Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 60-66, DOI: 10.21437/Odyssey.2020-9.


@inproceedings{Han2020,
  author={Min Hyun Han and Woo Hyun Kang and Sung Hwan Mun and Nam Soo Kim},
  title={{Information Preservation Pooling for Speaker Embedding}},
  year=2020,
  booktitle={Proc. Odyssey 2020 The Speaker and Language Recognition Workshop},
  pages={60--66},
  doi={10.21437/Odyssey.2020-9},
  url={http://dx.doi.org/10.21437/Odyssey.2020-9}
}