Odyssey 2010: The Speaker and Language Recognition Workshop
Brno, Czech Republic
We describe a new approach to speaker verification which, like Joint Factor Analysis, is based on a generative model of speaker and channel effects but differs from Joint Factor Analysis in several respects. Firstly, each utterance is represented by a low-dimensional feature vector, rather than by a high-dimensional set of Baum-Welch statistics. Secondly, heavy-tailed distributions are used in place of Gaussian distributions in formulating the model, so that the effect of outlying data is diminished, both in training the model and at recognition time. Thirdly, the likelihood ratio used for making verification decisions is calculated (using variational Bayes) in a way that is fully consistent with the modeling assumptions and the rules of probability. Finally, experimental results show that, in the case of telephone speech, these likelihood ratios do not need to be normalized in order to set a trial-independent threshold for verification decisions. We report results on female speakers for several conditions in the NIST 2008 speaker recognition evaluation data, including microphone as well as telephone speech. As measured both by equal error rates and by the minimum values of the NIST detection cost function, the results on telephone speech are about 30% better than we have achieved using Joint Factor Analysis.
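To illustrate why heavy-tailed distributions diminish the effect of outlying data, the following minimal sketch compares the log-density that a unit Gaussian and a Student's t distribution assign to a point far from the mean. It is a one-dimensional toy example, not the model of the paper: the choice of Student's t as the heavy-tailed family, the function names, and the values of `nu` and `x` are all assumptions made for illustration.

```python
import math


def gaussian_logpdf(x, mu=0.0, sigma=1.0):
    """Log-density of a univariate Gaussian with mean mu and scale sigma."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def student_t_logpdf(x, nu, mu=0.0, sigma=1.0):
    """Log-density of a univariate Student's t with nu degrees of freedom.

    Smaller nu gives heavier tails; as nu grows the density
    approaches the Gaussian with the same mu and sigma.
    """
    z = (x - mu) / sigma
    return (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
        - (nu + 1.0) / 2.0 * math.log1p(z * z / nu)
    )


if __name__ == "__main__":
    outlier = 6.0  # hypothetical observation six scale units from the mean
    print("Gaussian log-density:   ", gaussian_logpdf(outlier))
    print("Student's t (nu=3):     ", student_t_logpdf(outlier, nu=3.0))
```

Under the Gaussian, the log-density falls off quadratically with distance from the mean, so a single outlier can dominate a likelihood computation. Under the Student's t it falls off only logarithmically, so the same outlier carries far less weight.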
Bibliographic reference. Kenny, Patrick (2010): "Bayesian Speaker Verification with Heavy-Tailed Priors", In Odyssey-2010, paper 014 (abstract).