In this paper, we apply x-vectors to the task of spoken language recognition. The framework consists of a deep neural network that maps sequences of speech features to fixed-dimensional embeddings, called x-vectors. A temporal pooling layer aggregates information across time, which allows the network to capture long-term language characteristics. Once extracted, x-vectors are classified with the same technology developed for i-vectors. In the 2017 NIST language recognition evaluation, x-vectors achieved excellent results and outperformed our state-of-the-art i-vector systems. In the post-evaluation analysis presented here, we experiment with several variations of the x-vector framework and find that the best-performing system uses multilingual bottleneck features, data augmentation, and a discriminative Gaussian classifier.
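To illustrate how a temporal pooling layer can turn a sequence of any length into a fixed-dimensional vector, the following is a minimal sketch. It assumes the layer computes the per-dimension mean and standard deviation of frame-level activations over time; the abstract does not specify the pooling function, so this choice is an assumption. The function name `stats_pooling`, the feature dimension, and the random inputs are hypothetical and serve only as illustration.

```python
import numpy as np


def stats_pooling(frames: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Pool a (T, D) array of frame-level activations into one (2 * D,) vector.

    Assumption: pooling is mean plus standard deviation over the time axis.
    """
    mean = frames.mean(axis=0)
    # eps keeps the square root stable when a dimension has zero variance.
    std = np.sqrt(frames.var(axis=0) + eps)
    return np.concatenate([mean, std])


# Hypothetical inputs: two utterances of different lengths, 4 activations per frame.
rng = np.random.default_rng(0)
short_utt = rng.standard_normal((200, 4))
long_utt = rng.standard_normal((900, 4))

pooled_short = stats_pooling(short_utt)
pooled_long = stats_pooling(long_utt)

# Both utterances yield vectors of the same size regardless of duration.
print(pooled_short.shape, pooled_long.shape)
```

Because the output size depends only on the activation dimension and not on the number of frames, layers after the pooling step can operate on the whole utterance at once.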
Cite as: Snyder, D., Garcia-Romero, D., McCree, A., Sell, G., Povey, D., Khudanpur, S. (2018) Spoken Language Recognition using X-vectors. Proc. The Speaker and Language Recognition Workshop (Odyssey 2018), 105-111, doi: 10.21437/Odyssey.2018-15
@inproceedings{snyder18_odyssey,
  author={David Snyder and Daniel Garcia-Romero and Alan McCree and Gregory Sell and Daniel Povey and Sanjeev Khudanpur},
  title={{Spoken Language Recognition using X-vectors}},
  year=2018,
  booktitle={Proc. The Speaker and Language Recognition Workshop (Odyssey 2018)},
  pages={105--111},
  doi={10.21437/Odyssey.2018-15}
}