Spoken Language Recognition using X-vectors

David Snyder, Daniel Garcia-Romero, Alan McCree, Gregory Sell, Daniel Povey, Sanjeev Khudanpur


In this paper, we apply x-vectors to the task of spoken language recognition. This framework consists of a deep neural network that maps sequences of speech features to fixed-dimensional embeddings, called x-vectors. Long-term language characteristics are captured in the network by a temporal pooling layer that aggregates information across time. Once extracted, x-vectors are classified with the same technology developed for i-vectors. In the 2017 NIST language recognition evaluation, x-vectors achieved excellent results and outperformed our state-of-the-art i-vector systems. In the post-evaluation analysis presented here, we experiment with several variations of the x-vector framework, and find that the best-performing system uses multilingual bottleneck features, data augmentation, and a discriminative Gaussian classifier.
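
As an illustration of how a temporal pooling layer can turn a variable-length sequence of frame-level vectors into a fixed-dimensional one, here is a minimal sketch. It assumes the pooling concatenates the per-dimension mean and standard deviation over time; the abstract does not specify the pooling function, and the name `stats_pool` and the toy inputs are hypothetical.

```python
import math
from typing import List, Sequence


def stats_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Pool a (T x D) sequence of frame vectors into one vector of size 2*D.

    Assumption: pooling is mean + standard deviation over the time axis.
    The output size depends only on D, not on the number of frames T.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    # Per-dimension mean over all frames.
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # Per-dimension (population) standard deviation over all frames.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Toy utterances of different lengths (2 frames vs. 100 frames, D = 2).
short_utt = [[1.0, 2.0], [3.0, 2.0]]
long_utt = short_utt * 50

# Both pooled vectors have the same length, 2 * D = 4.
print(len(stats_pool(short_utt)), len(stats_pool(long_utt)))
```

In a full system the pooled vector would pass through further network layers to produce the embedding; this sketch covers only the aggregation step.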


DOI: 10.21437/Odyssey.2018-15

Cite as: Snyder, D., Garcia-Romero, D., McCree, A., Sell, G., Povey, D., Khudanpur, S. (2018) Spoken Language Recognition using X-vectors. Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 105-111, DOI: 10.21437/Odyssey.2018-15.


@inproceedings{Snyder2018,
  author={David Snyder and Daniel Garcia-Romero and Alan McCree and Gregory Sell and Daniel Povey and Sanjeev Khudanpur},
  title={Spoken Language Recognition using X-vectors},
  year=2018,
  booktitle={Proc. Odyssey 2018 The Speaker and Language Recognition Workshop},
  pages={105--111},
  doi={10.21437/Odyssey.2018-15},
  url={http://dx.doi.org/10.21437/Odyssey.2018-15}
}