The goal of voice conversion is to convert the input voice to match the target speaker’s voice while keeping text and prosody intact. Voice conversion is commonly used in entertainment and speaking-aid systems, as well as for speech data generation and augmentation. The development of any-to-any voice conversion systems, which are capable of generating voices unseen during training, is of particular interest to both researchers and the industry. Despite recent progress, any-to-any conversion quality is still inferior to natural speech.

In this work, we propose a new any-to-any voice conversion pipeline. To the best of our knowledge, it is the first use of an ASR encoder with a GAN training objective in a voice conversion system. We also implement a joint conditional decoder-vocoder model, which simplifies training and improves performance. According to multiple subjective and objective evaluations, our method outperforms modern systems in terms of voice quality, similarity, and consistency.
Cite as: Kashkin, A., Karpukhin, I., Shishkin, S. (2023) HiFi-VC: High Quality ASR-based Voice Conversion. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 100-105, doi: 10.21437/SSW.2023-16
@inproceedings{kashkin23_ssw,
  author={Anton Kashkin and Ivan Karpukhin and Svyatoslav Shishkin},
  title={{HiFi-VC: High Quality ASR-based Voice Conversion}},
  year=2023,
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={100--105},
  doi={10.21437/SSW.2023-16}
}