The i-vector technique combined with deep neural networks has been successfully applied in spoken language identification systems. Neural network modeling has proven effective for both discriminative feature transformation and classification in many tasks, particularly when a large training data set is available. On a small data set, however, neural networks suffer from overfitting, which degrades performance. Many strategies have been investigated and used to improve the regularization of deep neural networks, for example, weight decay, dropout, and data augmentation. In this paper, we study and use conditional generative adversarial nets (GANs) as a classifier for the spoken language identification task. Unlike previous work on GANs for image generation, our purpose is to improve the regularization of the neural network by jointly optimizing the “Real/Fake” objective function and the categorical objective function. Compared with dropout and data augmentation methods, the proposed method obtained relative improvements of 29.7% and 31.8% on the NIST 2015 i-vector challenge data set for spoken language identification.
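As a rough illustration of what jointly optimizing a “Real/Fake” objective and a categorical objective can look like, the sketch below sums a binary cross-entropy term and a softmax cross-entropy term in NumPy. It is not the authors' implementation: the function names, the `weight` parameter, the plain weighted sum, and the toy inputs are all assumptions made for illustration, and the abstract does not specify how the two objectives are actually combined.

```python
import numpy as np

EPS = 1e-12  # small constant to keep log() finite


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def categorical_loss(class_logits, labels):
    """Mean cross-entropy over language classes (the categorical objective)."""
    probs = softmax(class_logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + EPS)))


def real_fake_loss(rf_logits, is_real):
    """Mean binary cross-entropy for the Real/Fake decision."""
    p_real = 1.0 / (1.0 + np.exp(-rf_logits))
    terms = is_real * np.log(p_real + EPS) + (1.0 - is_real) * np.log(1.0 - p_real + EPS)
    return float(-np.mean(terms))


def joint_loss(class_logits, labels, rf_logits, is_real, weight=1.0):
    """Hypothetical joint objective: Real/Fake loss plus weighted categorical loss."""
    return real_fake_loss(rf_logits, is_real) + weight * categorical_loss(class_logits, labels)


# Toy batch: 4 samples, 3 hypothetical language classes.
rng = np.random.default_rng(0)
class_logits = rng.normal(size=(4, 3))     # classifier-head outputs
labels = np.array([0, 2, 1, 0])            # language labels
rf_logits = rng.normal(size=4)             # Real/Fake-head outputs
is_real = np.array([1.0, 1.0, 0.0, 0.0])   # 1 = real i-vector, 0 = generated

print(joint_loss(class_logits, labels, rf_logits, is_real))
```

In a sketch like this, the Real/Fake term acts as an additional training signal alongside the classification term, which is the sense in which the abstract describes it as regularization.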
Cite as: Shen, P., Lu, X., Li, S., Kawai, H. (2017) Conditional Generative Adversarial Nets Classifier for Spoken Language Identification. Proc. Interspeech 2017, 2814-2818, doi: 10.21437/Interspeech.2017-553
@inproceedings{shen17b_interspeech, author={Peng Shen and Xugang Lu and Sheng Li and Hisashi Kawai}, title={{Conditional Generative Adversarial Nets Classifier for Spoken Language Identification}}, year=2017, booktitle={Proc. Interspeech 2017}, pages={2814--2818}, doi={10.21437/Interspeech.2017-553} }