Joint Learning of J-Vector Extractor and Joint Bayesian Model for Text Dependent Speaker Verification

Ziqiang Shi, Liu Liu, Huibin Lin, Rujie Liu


J-vector and joint Bayesian analysis have proven very effective in text-dependent speaker verification with short-duration speech. However, current state-of-the-art frameworks often train the J-vector extractor and the joint Bayesian classifier separately. Such an approach loses information during J-vector learning and fails to exploit an end-to-end framework. In this paper we present an integrated approach to text-dependent speaker verification, which consists of a siamese deep neural network that takes two variable-length speech segments and maps them to a likelihood score and speaker/phrase labels, where the likelihood score, which serves as a loss guide, is computed by a variant of the joint Bayesian model. The likelihood loss guide constrains the J-vector extractor to improve verification performance. Because the strengths of J-vector and joint Bayesian analysis appear complementary, joint learning significantly outperforms the traditional separate training scheme. Our experiments on the public RSR2015 Part I data corpus demonstrate that this new training scheme produces more discriminative J-vectors and leads to improved performance on the speaker verification task.
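To make the likelihood score concrete, the following is a minimal sketch of the standard joint Bayesian log-likelihood ratio for a pair of embeddings. It is not the variant model used in the paper, which additionally handles speaker/phrase labels and is trained jointly with the network. It assumes zero-mean vectors decomposed as identity plus within-class variation; the names `joint_bayesian_llr`, `s_mu` (between-class covariance), and `s_eps` (within-class covariance) are hypothetical and chosen here for illustration.

```python
import numpy as np


def gaussian_logpdf_zero_mean(z, cov):
    """Log density of a zero-mean multivariate Gaussian at point z."""
    d = z.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    quad = z @ np.linalg.solve(cov, z)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)


def joint_bayesian_llr(x1, x2, s_mu, s_eps):
    """Log-likelihood ratio log p(x1, x2 | same) - log p(x1, x2 | different).

    Assumption (standard joint Bayesian, not the paper's variant):
    x = mu + eps with mu ~ N(0, s_mu) and eps ~ N(0, s_eps).
    """
    total = s_mu + s_eps
    zeros = np.zeros_like(s_mu)
    # Same class: the two vectors share mu, so their cross-covariance is s_mu.
    cov_same = np.block([[total, s_mu], [s_mu, total]])
    # Different classes: the two vectors are independent.
    cov_diff = np.block([[total, zeros], [zeros, total]])
    z = np.concatenate([x1, x2])
    return gaussian_logpdf_zero_mean(z, cov_same) - gaussian_logpdf_zero_mean(z, cov_diff)


# Toy usage with made-up covariances (illustrative values, not from the paper).
s_mu = 4.0 * np.eye(2)
s_eps = 0.5 * np.eye(2)
x = np.array([1.5, -2.0])
score_same = joint_bayesian_llr(x, x, s_mu, s_eps)
score_opposite = joint_bayesian_llr(x, -x, s_mu, s_eps)
```

In a jointly trained system, a score of this kind would be computed on the outputs of the two siamese branches and fed into the loss, so that gradients from the verification objective reach the extractor. The sketch above only covers scoring with fixed covariances.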


 DOI: 10.21437/Interspeech.2018-1500

Cite as: Shi, Z., Liu, L., Lin, H., Liu, R. (2018) Joint Learning of J-Vector Extractor and Joint Bayesian Model for Text Dependent Speaker Verification. Proc. Interspeech 2018, 1076-1080, DOI: 10.21437/Interspeech.2018-1500.


@inproceedings{Shi2018,
  author={Ziqiang Shi and Liu Liu and Huibin Lin and Rujie Liu},
  title={Joint Learning of J-Vector Extractor and Joint Bayesian Model for Text Dependent Speaker Verification},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1076--1080},
  doi={10.21437/Interspeech.2018-1500},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1500}
}