Tied Mixture of Factor Analyzers Layer to Combine Frame Level Representations in Neural Speaker Embeddings

Nanxin Chen, Jesús Villalba, Najim Dehak


In this paper, a novel neural network layer is proposed to combine frame-level representations into an utterance-level representation for speaker modeling. We followed the assumption that the frame-level outputs of the speaker embedding (a.k.a. x-vector) encoder are multi-modal. Therefore, we modeled the frame-level information as a mixture of factor analyzers with a latent variable (the utterance embedding) tied across frames and mixture components, in a similar way as in the i-vector approach. We denote this layer as the Tied Mixture of Factor Analyzers (TMFA) layer. The optimal value of the embedding is obtained by minimizing the reconstruction error of the frame-level representations given the embedding and the TMFA model parameters. However, the TMFA layer parameters (factor loading matrices, means, and precisions) were trained with the cross-entropy loss, as the rest of the parameters of the network. We experimented on the Speaker Recognition Evaluation 2016 (SRE16) Cantonese condition as well as on the Speakers in the Wild (SITW) dataset. The proposed pooling layer improved over mean-plus-standard-deviation pooling (the standard in the x-vector approach) in most of the conditions evaluated, and obtained competitive performance w.r.t. the recently proposed learnable dictionary encoding pooling method, which also assumes multi-modal frame-level representations.
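The abstract states that the embedding is the minimizer of the reconstruction error of the frame-level representations under the TMFA model. A minimal sketch of that closed-form solve is given below, assuming the usual factor-analysis statistics: frame representations x_t, per-component responsibilities γ_tc, means μ_c, factor loading matrices T_c, and precisions Λ_c. The function name `tmfa_embedding` and the exact parameterization are illustrative, not the paper's implementation; the actual layer also backpropagates through these parameters via the cross-entropy loss, which is omitted here.

```python
import numpy as np

def tmfa_embedding(X, gamma, mu, T, Lam):
    """Closed-form TMFA-style embedding (illustrative sketch).

    Minimizes sum_{t,c} gamma[t,c] * (x_t - mu_c - T_c w)^T Lam_c (x_t - mu_c - T_c w)
    over the tied latent variable w, i.e. a precision-weighted least-squares solve.

    X:     (n_frames, d)   frame-level representations
    gamma: (n_frames, C)   mixture responsibilities
    mu:    (C, d)          component means
    T:     (C, d, q)       factor loading matrices
    Lam:   (C, d, d)       component precision matrices
    Returns w of shape (q,).
    """
    C, d, q = T.shape
    A = np.zeros((q, q))
    b = np.zeros(q)
    for c in range(C):
        Nc = gamma[:, c].sum()           # zeroth-order statistic
        Fc = gamma[:, c] @ (X - mu[c])   # first-order statistic, shape (d,)
        TLc = T[c].T @ Lam[c]            # (q, d)
        A += Nc * (TLc @ T[c])
        b += TLc @ Fc
    # Normal equations: A w = b. An i-vector-style standard-normal prior
    # on w would add the identity to A before solving.
    return np.linalg.solve(A, b)
```

When the frames are generated exactly as μ_c + T_c w for some w, this solve recovers that w, which makes the connection to i-vector point estimation explicit.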


 DOI: 10.21437/Interspeech.2019-1782

Cite as: Chen, N., Villalba, J., Dehak, N. (2019) Tied Mixture of Factor Analyzers Layer to Combine Frame Level Representations in Neural Speaker Embeddings. Proc. Interspeech 2019, 2948-2952, DOI: 10.21437/Interspeech.2019-1782.


@inproceedings{Chen2019,
  author={Nanxin Chen and Jesús Villalba and Najim Dehak},
  title={{Tied Mixture of Factor Analyzers Layer to Combine Frame Level Representations in Neural Speaker Embeddings}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2948--2952},
  doi={10.21437/Interspeech.2019-1782},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1782}
}