Learning Mixture Representation for Deep Speaker Embedding Using Attention

Weiwei Lin, Man Wai Mak, Lu Yi


Almost all speaker recognition systems involve a step that converts a sequence of frame-level features into a fixed-dimensional representation. In the context of deep neural networks, this step is referred to as statistics pooling. In state-of-the-art speaker recognition systems, statistics pooling is implemented by concatenating the mean and standard deviation of a sequence of frame-level features. However, a single mean and standard deviation are very limited descriptive statistics for an acoustic sequence, even with a powerful feature extractor such as a convolutional neural network. In this paper, we propose a novel statistics pooling method that produces more descriptive statistics through a mixture representation. Our method is inspired by the expectation-maximization (EM) algorithm for Gaussian mixture models (GMMs). However, unlike GMMs, the mixture assignments are given by an attention mechanism rather than by the Euclidean distances between frame-level features and explicit centers. Applying the proposed attention mechanism to a 121-layer DenseNet, we achieve an EER of 1.1% on VoxCeleb1 and an EER of 4.77% on the VOiCES 2019 evaluation set.
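The idea of replacing a single mean/standard-deviation pair with per-component attention-weighted statistics can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `mixture_attention_pooling`, the single linear attention matrix `W`, and the normalization details are assumptions for the sake of the example.

```python
import numpy as np

def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_attention_pooling(X, W):
    """Attention-based mixture statistics pooling (illustrative sketch).

    X: (T, D) frame-level features.
    W: (C, D) hypothetical learnable attention weights, one row per mixture
       component; in practice the attention scores would come from a small
       trained network.
    Returns a fixed-dimensional 2*C*D vector: per-component weighted means
    and standard deviations, concatenated.
    """
    scores = X @ W.T                      # (T, C) frame-to-component affinities
    A = softmax(scores, axis=1)           # soft assignment of each frame to components
    A = A / A.sum(axis=0, keepdims=True)  # normalize over time: weights sum to 1 per component
    means = A.T @ X                       # (C, D) attention-weighted component means
    second = A.T @ (X ** 2)               # (C, D) weighted second moments
    stds = np.sqrt(np.clip(second - means ** 2, 1e-10, None))
    return np.concatenate([means.ravel(), stds.ravel()])
```

With a single component (C = 1) and uniform attention, this reduces to the standard mean-plus-standard-deviation statistics pooling, which makes the mixture version a strict generalization.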


DOI: 10.21437/Odyssey.2020-30

Cite as: Lin, W., Mak, M.W., Yi, L. (2020) Learning Mixture Representation for Deep Speaker Embedding Using Attention. Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 210-214, DOI: 10.21437/Odyssey.2020-30.


@inproceedings{Lin2020,
  author={Weiwei Lin and Man Wai Mak and Lu Yi},
  title={{Learning Mixture Representation for Deep Speaker Embedding Using Attention}},
  year=2020,
  booktitle={Proc. Odyssey 2020 The Speaker and Language Recognition Workshop},
  pages={210--214},
  doi={10.21437/Odyssey.2020-30},
  url={http://dx.doi.org/10.21437/Odyssey.2020-30}
}