ISCA Archive SSW 2023

HiFi-VC: High Quality ASR-based Voice Conversion

Anton Kashkin, Ivan Karpukhin, Svyatoslav Shishkin

The goal of voice conversion is to convert the input voice to match the target speaker’s voice while keeping text and prosody intact. Voice conversion is usually used in entertainment and speaking-aid systems, as well as applied for speech data generation and augmentation. The development of any-to-any voice conversion systems, which are capable of generating voices unseen during training, is of particular interest to both researchers and the industry. Despite recent progress, any-to-any conversion quality is still inferior to natural speech.

In this work, we propose a new any-to-any voice conversion pipeline. To the best of our knowledge, it is the first use of an ASR encoder with a GAN training objective in the voice conversion system. We also implement a joint conditional decoder-vocoder model, which simplifies training and improves performance. According to multiple subjective and objective evaluations, our method outperforms modern systems in terms of voice quality, similarity, and consistency.

doi: 10.21437/SSW.2023-16

Cite as: Kashkin, A., Karpukhin, I., Shishkin, S. (2023) HiFi-VC: High Quality ASR-based Voice Conversion. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 100-105, doi: 10.21437/SSW.2023-16

  author={Anton Kashkin and Ivan Karpukhin and Svyatoslav Shishkin},
  title={{HiFi-VC: High Quality ASR-based Voice Conversion}},
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},