Probabilistic linear discriminant analysis (PLDA) is the de facto standard backend for i-vector speaker recognition. If we extend the PLDA paradigm with non-linear models, e.g., deep neural networks, the posterior distributions of the latent variables and the marginal likelihood become intractable. In this paper, we propose to approach this problem using stochastic gradient variational Bayes. We generalize the PLDA model to let i-vectors depend non-linearly on the latent factors. We approximate the evidence lower bound (ELBO) by Monte Carlo sampling using the reparametrization trick. This enables us to optimize the ELBO with backpropagation, jointly estimating the parameters that define the model and the approximate posteriors of the latent factors. We also present a reformulation of the likelihood ratio, which we call Q-scoring. Q-scoring makes it possible to efficiently score speaker verification trials with this model. Experimental results on NIST SRE10 suggest that more data might be required to exploit the full potential of this method.
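To make the construction concrete, the following is a minimal sketch of a tied VAE backend, assuming a PyTorch implementation. All names (TiedVAE, ivector_dim, latent_dim, elbo, the pooling strategy) are illustrative assumptions, not the authors' code. It shows the two ingredients the abstract describes: a decoder that generalizes PLDA's linear map x = Vy + eps to a neural network, and a single-sample Monte Carlo ELBO made differentiable via the reparametrization trick; Q-scoring is not shown.

import math
import torch
import torch.nn as nn

class TiedVAE(nn.Module):
    def __init__(self, ivector_dim=600, latent_dim=200, hidden_dim=512):
        super().__init__()
        # Encoder: pooled speaker statistics -> approximate posterior q(y | x_1..x_n)
        self.encoder = nn.Sequential(nn.Linear(ivector_dim, hidden_dim), nn.Tanh())
        self.q_mu = nn.Linear(hidden_dim, latent_dim)
        self.q_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder: latent speaker factor y -> mean of p(x | y); a non-linear
        # generalization of PLDA's x = V y + eps
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.Tanh(),
                                     nn.Linear(hidden_dim, ivector_dim))
        self.log_noise_var = nn.Parameter(torch.zeros(ivector_dim))

    def elbo(self, x):
        # x: (n, ivector_dim), all i-vectors of one speaker, tied to a single y.
        # Pooling by the mean is an assumption made for this sketch.
        h = self.encoder(x.mean(dim=0, keepdim=True))
        mu, logvar = self.q_mu(h), self.q_logvar(h)
        # Reparametrization trick: y = mu + sigma * eps with eps ~ N(0, I),
        # so the Monte Carlo ELBO estimate is differentiable w.r.t. mu, logvar
        eps = torch.randn_like(mu)
        y = mu + (0.5 * logvar).exp() * eps
        x_mean = self.decoder(y)
        # Gaussian log-likelihood of every i-vector given the shared y
        nv = self.log_noise_var
        log_px = -0.5 * (((x - x_mean) ** 2) / nv.exp() + nv
                         + math.log(2 * math.pi)).sum()
        # KL(q(y|x) || N(0, I)) in closed form
        kl = -0.5 * (1.0 + logvar - mu ** 2 - logvar.exp()).sum()
        return log_px - kl  # single-sample Monte Carlo ELBO

Training would maximize this ELBO, summed over speakers, by backpropagation, e.g. loss = -model.elbo(x) followed by loss.backward() on each speaker's stack of i-vectors; scoring a verification trial would then use the paper's Q-scoring reformulation of the likelihood ratio rather than this training objective.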
Cite as: Villalba, J., Brümmer, N., Dehak, N. (2017) Tied Variational Autoencoder Backends for i-Vector Speaker Recognition. Proc. Interspeech 2017, 1004-1008, doi: 10.21437/Interspeech.2017-1018
@inproceedings{villalba17_interspeech,
  author={Jesús Villalba and Niko Brümmer and Najim Dehak},
  title={{Tied Variational Autoencoder Backends for i-Vector Speaker Recognition}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1004--1008},
  doi={10.21437/Interspeech.2017-1018}
}