ISCA Archive SSW 2023

StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings

Arnab Das, Suhita Ghosh, Tim Polzehl, Ingo Siegert, Sebastian Stober

Voice conversion (VC) transforms an utterance to sound like another person without changing the linguistic content. A recently proposed generative adversarial network-based VC method, StarGANv2-VC is very successful in generating natural-sounding conversions. However, the method fails to preserve the emotion of the source speaker in the converted samples. Emotion preservation is necessary for natural human-computer interaction. In this paper, we show that StarGANv2-VC fails to disentangle the speaker and emotion representations, pertinent to preserve emotion. Specifically, there is an emotion leakage from the reference audio used to capture the speaker embeddings while training. To counter the problem, we propose novel emotion-aware losses and an unsupervised method which exploits emotion supervision through latent emotion representations. The objective and subjective evaluations prove the efficacy of the proposed strategy over diverse datasets, emotions, gender, etc.


doi: 10.21437/SSW.2023-13

Cite as: Das, A., Ghosh, S., Polzehl, T., Siegert, I., Stober, S. (2023) StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 81-87, doi: 10.21437/SSW.2023-13

@inproceedings{das23_ssw,
  author={Arnab Das and Suhita Ghosh and Tim Polzehl and Ingo Siegert and Sebastian Stober},
  title={{StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings}},
  year=2023,
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={81--87},
  doi={10.21437/SSW.2023-13}
}