Data Augmentation Using Variational Autoencoder for Embedding Based Speaker Verification

Zhanghao Wu, Shuai Wang, Yanmin Qian, Kai Yu


Domain or environment mismatch between training and testing data, such as varying noise and channel conditions, is a major challenge for speaker verification. In this paper, a variational autoencoder (VAE) is designed to learn the patterns of speaker embeddings (both i-vectors and x-vectors) extracted from noisy speech segments, and to generate embeddings with greater diversity to improve the robustness of speaker verification systems with a probabilistic linear discriminant analysis (PLDA) back-end. The approach is evaluated on the standard NIST SRE 2016 dataset. Compared to manual and generative adversarial network (GAN) based augmentation approaches, the proposed VAE based augmentation achieves slightly better performance for i-vectors on Tagalog and Cantonese, with EERs of 15.54% and 7.84%, and a more significant improvement for x-vectors on the same two languages, with EERs of 11.86% and 4.20%.
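
The sketch below (in PyTorch) illustrates the general idea the abstract describes: train a VAE on fixed-dimensional speaker embeddings, then sample its decoder to produce synthetic embeddings that augment the PLDA back-end's training set. The layer sizes, latent dimension, and the 512-dimensional embedding size (typical for x-vectors) are illustrative assumptions, not the paper's actual configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingVAE(nn.Module):
    """A minimal VAE over fixed-dimensional speaker embeddings."""
    def __init__(self, emb_dim=512, hidden=256, latent=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)       # posterior mean
        self.logvar = nn.Linear(hidden, latent)   # posterior log-variance
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, emb_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior.
    rec = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

# After training, draw latent samples and decode them into synthetic
# embeddings; these augment the data used to train the PLDA back-end.
model = EmbeddingVAE()
with torch.no_grad():
    fake_embeddings = model.dec(torch.randn(100, 64))

Sampling the prior (rather than perturbing observed embeddings) is what gives the generated embeddings their diversity; in practice one would train per class or condition the model on speaker labels so that the synthetic embeddings can be assigned to speakers for PLDA training.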


DOI: 10.21437/Interspeech.2019-2248

Cite as: Wu, Z., Wang, S., Qian, Y., Yu, K. (2019) Data Augmentation Using Variational Autoencoder for Embedding Based Speaker Verification. Proc. Interspeech 2019, 1163-1167, DOI: 10.21437/Interspeech.2019-2248.


@inproceedings{Wu2019,
  author={Zhanghao Wu and Shuai Wang and Yanmin Qian and Kai Yu},
  title={{Data Augmentation Using Variational Autoencoder for Embedding Based Speaker Verification}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={1163--1167},
  doi={10.21437/Interspeech.2019-2248},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2248}
}