Cycle-consistent generative adversarial network (CycleGAN) and variational autoencoder (VAE) based models have recently gained popularity in non-parallel voice conversion. However, they often suffer from a difficult training process and unsatisfactory results. In this paper, we propose a contrastive learning-based adversarial approach for voice conversion, namely contrastive voice conversion (CVC). Compared to previous CycleGAN-based methods, CVC requires only an efficient one-way GAN training by taking advantage of contrastive learning. For non-parallel one-to-one voice conversion, CVC performs on par with or better than CycleGAN and VAE while substantially reducing training time. CVC further demonstrates superior performance in many-to-one voice conversion, enabling conversion from unseen speakers.
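The contrastive objective the abstract refers to is, at its core, an InfoNCE-style loss: a feature from the converted speech (the query) is pulled toward the feature at the same position in the source speech (the positive) and pushed away from features at other positions (the negatives), which removes the need for a second GAN and a cycle-consistency term. The following is a minimal NumPy sketch of such a loss; the function name, shapes, and temperature value are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.07):
    """InfoNCE contrastive loss for a single query feature.

    query:     (d,)   feature from the converted speech
    positive:  (d,)   feature at the same position in the source speech
    negatives: (n, d) features from other positions (the negatives)

    All vectors are L2-normalized, so dot products are cosine
    similarities; the loss is cross-entropy over the logits with
    the positive as the correct class (index 0).
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    q = normalize(query)
    pos = normalize(positive)
    negs = normalize(negatives)

    # Logit 0 is the positive pair; the rest are negatives.
    logits = np.concatenate([[q @ pos], negs @ q]) / temperature
    logits -= logits.max()  # numerical stability before softmax
    log_softmax = logits - np.log(np.exp(logits).sum())
    return -log_softmax[0]

# Illustrative check: the loss is lower when query and positive match.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
negs = rng.normal(size=(16, 8))
matched = info_nce_loss(q, q, negs)         # positive aligned with query
mismatched = info_nce_loss(q, negs[0], negs)  # positive is a random vector
```

In the CVC setting, minimizing this loss encourages the converted utterance to preserve the linguistic content of the source at each position, while the (single) adversarial discriminator handles target-speaker similarity.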
Cite as: Li, T., Liu, Y., Hu, C., Zhao, H. (2021) CVC: Contrastive Learning for Non-Parallel Voice Conversion. Proc. Interspeech 2021, 1324-1328, doi: 10.21437/Interspeech.2021-137
@inproceedings{li21d_interspeech,
  author={Tingle Li and Yichen Liu and Chenxu Hu and Hang Zhao},
  title={{CVC: Contrastive Learning for Non-Parallel Voice Conversion}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={1324--1328},
  doi={10.21437/Interspeech.2021-137}
}