This paper presents a description of our submitted system for the Voice Conversion Challenge (VCC) 2020: a vector-quantization variational autoencoder (VQ-VAE) with WaveNet as the decoder, i.e., VQ-VAE-WaveNet. VQ-VAE-WaveNet is a non-parallel VAE-based voice conversion model that reconstructs the acoustic features while separating the linguistic information from the speaker identity. The model is further improved by adopting WaveNet as the decoder to generate high-quality speech waveforms, since WaveNet, as an autoregressive neural vocoder, has achieved state-of-the-art results in waveform generation. Although our system can be built with the VCC 2020 dataset for both Task 1 (intra-lingual) and Task 2 (cross-lingual), we only submitted it for the intra-lingual voice conversion task. The results of VCC 2020 show that our VQ-VAE-WaveNet system achieves a mean opinion score (MOS) of 3.04 in naturalness and an average score of 3.28 in similarity (a speaker similarity percentage (Sim) of 75.99%) for Task 1. Our system also performs well in the objective evaluations: it achieved an average score of 3.95 in automatic naturalness prediction and ranked 6th and 8th in ASV-based speaker similarity and spoofing countermeasures, respectively.
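To make the core idea concrete, the sketch below shows a minimal VQ-VAE voice-conversion bottleneck with speaker conditioning: an encoder maps acoustic features to latents, a codebook discretizes them (which encourages the latents to keep linguistic content and discard speaker identity), and a decoder reconstructs the features from the quantized latents plus a target-speaker embedding. This is an illustrative assumption, not the authors' exact implementation; in particular, the simple convolutional decoder here stands in for the WaveNet decoder used in the paper, and all module names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQVAEVoiceConversion(nn.Module):
    """Minimal sketch of a VQ-VAE bottleneck for voice conversion (assumed, not the paper's code)."""

    def __init__(self, n_mels=80, z_dim=64, codebook_size=256, n_speakers=4, spk_dim=32):
        super().__init__()
        # Encoder: acoustic features -> continuous latents
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv1d(256, z_dim, 3, padding=1),
        )
        # Codebook: discretizing the latents is the mechanism that separates
        # linguistic content from speaker identity.
        self.codebook = nn.Embedding(codebook_size, z_dim)
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)
        # Decoder: quantized latents + speaker embedding -> acoustic features
        # (a WaveNet decoder is used in the paper; a plain conv stack here).
        self.decoder = nn.Sequential(
            nn.Conv1d(z_dim + spk_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv1d(256, n_mels, 3, padding=1),
        )

    def quantize(self, z):
        # z: (B, z_dim, T) -> nearest codebook entry per frame
        flat = z.transpose(1, 2).reshape(-1, z.size(1))      # (B*T, z_dim)
        dist = torch.cdist(flat, self.codebook.weight)       # (B*T, K)
        idx = dist.argmin(dim=1)
        q = self.codebook(idx).view(z.size(0), z.size(2), -1).transpose(1, 2)
        # Straight-through estimator: gradients flow from q back to z.
        q_st = z + (q - z).detach()
        return q_st, q

    def forward(self, mels, speaker_id):
        z = self.encoder(mels)                                # (B, z_dim, T)
        q_st, q = self.quantize(z)
        spk = self.spk_emb(speaker_id).unsqueeze(-1).expand(-1, -1, z.size(2))
        recon = self.decoder(torch.cat([q_st, spk], dim=1))
        # Reconstruction + codebook + commitment losses (0.25 is a typical weight).
        loss = (F.l1_loss(recon, mels)
                + F.mse_loss(q, z.detach())
                + 0.25 * F.mse_loss(z, q.detach()))
        return recon, loss
```

At conversion time, the source speaker's features are encoded and quantized, and the decoder is conditioned on the target speaker's embedding instead of the source speaker's, so the reconstructed features carry the source linguistic content in the target voice.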
Cite as: Zhang, H. (2020) The NeteaseGames System for Voice Conversion Challenge 2020 with Vector-quantization Variational Autoencoder and WaveNet. Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 175-179, doi: 10.21437/VCCBC.2020-27
@inproceedings{zhang20b_vccbc,
  author    = {Haitong Zhang},
  title     = {{The NeteaseGames System for Voice Conversion Challenge 2020 with Vector-quantization Variational Autoencoder and WaveNet}},
  year      = {2020},
  booktitle = {Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020},
  pages     = {175--179},
  doi       = {10.21437/VCCBC.2020-27}
}