The JHU Speaker Recognition System for the VOiCES 2019 Challenge

David Snyder, Jesús Villalba, Nanxin Chen, Daniel Povey, Gregory Sell, Najim Dehak, Sanjeev Khudanpur

This paper describes the systems developed by the JHU team for the speaker recognition track of the 2019 VOiCES from a Distance Challenge. On this far-field task, we achieved good performance using systems based on state-of-the-art deep neural network (DNN) embeddings. In this paradigm, a DNN maps variable-length speech segments to speaker embeddings, called x-vectors, which are then classified using probabilistic linear discriminant analysis (PLDA). Our submissions were composed of three x-vector-based systems that differed primarily in DNN architecture, temporal pooling mechanism, and training objective function. On the evaluation set, our best single-system submission used an extended time-delay architecture and achieved an actual DCF of 0.435, where actual DCF is the primary evaluation metric. Our primary submission was a fusion of all three x-vector systems, which obtained an actual DCF of 0.362.
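To illustrate how a temporal pooling layer lets a network turn a variable-length segment into a fixed-size representation, here is a minimal sketch of statistics pooling (per-dimension mean and standard deviation over time). It is an assumption for illustration only: the abstract does not say which pooling mechanisms the three systems used, and the function name `stats_pooling` and the toy frame values are hypothetical.

```python
import math
from typing import List, Sequence


def stats_pooling(frames: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a (num_frames x dim) sequence into one vector of length 2 * dim.

    The output is the per-dimension mean followed by the per-dimension
    (population) standard deviation, so its size does not depend on num_frames.
    Illustrative only; not taken from the systems described in the abstract.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    # Mean of each feature dimension across all frames.
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # Population standard deviation of each feature dimension.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Two toy "segments" of different lengths (2 frames vs. 4 frames, dim = 2).
short_segment = [[1.0, 2.0], [3.0, 6.0]]
long_segment = [[1.0, 2.0], [3.0, 6.0], [1.0, 2.0], [3.0, 6.0]]

pooled_short = stats_pooling(short_segment)
pooled_long = stats_pooling(long_segment)
```

Both segments yield a vector of length 4 regardless of how many frames they contain. In an embedding system of the kind the abstract describes, a pooled vector like this would pass through further layers to produce the embedding that a back end such as PLDA then scores.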

DOI: 10.21437/Interspeech.2019-2979

Cite as: Snyder, D., Villalba, J., Chen, N., Povey, D., Sell, G., Dehak, N., Khudanpur, S. (2019) The JHU Speaker Recognition System for the VOiCES 2019 Challenge. Proc. Interspeech 2019, 2468-2472, DOI: 10.21437/Interspeech.2019-2979.

  author={David Snyder and Jesús Villalba and Nanxin Chen and Daniel Povey and Gregory Sell and Najim Dehak and Sanjeev Khudanpur},
  title={{The JHU Speaker Recognition System for the VOiCES 2019 Challenge}},
  booktitle={Proc. Interspeech 2019},