ISCA Archive Interspeech 2021

StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding Voice Conversion

Yinghao Aaron Li, Ali Zare, Nima Mesgarani

We present an unsupervised, non-parallel, many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2. Using a combination of adversarial source classifier loss and perceptual loss, our model significantly outperforms previous VC models. Although our model is trained on only 20 English speakers, it generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion. Using a style encoder, our framework can also convert plain reading speech into stylistic speech, such as emotional and falsetto speech. Subjective and objective evaluation experiments on a non-parallel many-to-many voice conversion task revealed that our model produces natural-sounding voices, close to the sound quality of state-of-the-art text-to-speech (TTS) based voice conversion methods, without the need for text labels. Moreover, our model is completely convolutional and, with a faster-than-real-time vocoder such as Parallel WaveGAN, can perform real-time voice conversion.
doi: 10.21437/Interspeech.2021-319

Cite as: Li, Y.A., Zare, A., Mesgarani, N. (2021) StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding Voice Conversion. Proc. Interspeech 2021, 1349-1353, doi: 10.21437/Interspeech.2021-319

@inproceedings{li21e_interspeech,
  author={Yinghao Aaron Li and Ali Zare and Nima Mesgarani},
  title={{StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding Voice Conversion}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1349--1353},
  doi={10.21437/Interspeech.2021-319}
}