Generative Adversarial Networks for Singing Voice Conversion with and without Parallel Data

Berrak Sisman, Haizhou Li


Singing voice conversion (SVC) is the task of converting one singer's voice to sound like that of another without changing the lyrical content. Singing conveys both lexical and emotional information through words and tones, and this information needs to be transferred from the source to the target. In this paper, we propose novel solutions to SVC based on Generative Adversarial Networks (GANs), with and without parallel training data. With parallel data, we employ GANs to minimize the difference between the distributions of the original target parameters and the generated singing parameters. With non-parallel training data, we employ CycleGANs to estimate an optimal pseudo pair between source and target singers. Moreover, the proposed solutions perform well with a limited amount of training data. The experiments show that (1) GANs outperform other state-of-the-art voice conversion methods when parallel training data are available, and (2) CycleGANs achieve competitive voice conversion quality without the need for parallel training data.
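To illustrate how a CycleGAN can learn from non-parallel data, the sketch below computes a cycle-consistency term, the part of the generic CycleGAN objective that asks a frame mapped to the other singer and back again to match the original. It is a minimal illustration in plain Python, not the paper's implementation: the names `l1_distance`, `cycle_consistency_loss`, `g_src_to_tgt`, and `g_tgt_to_src` are hypothetical, the "generators" are toy functions rather than neural networks, and the adversarial terms of the objective are omitted.

```python
from typing import Callable, List, Sequence

Frame = List[float]                    # one acoustic feature frame (toy example)
Generator = Callable[[Frame], Frame]   # hypothetical stand-in for a trained network


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length feature frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same dimensionality")
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_consistency_loss(
    source_frames: Sequence[Frame],
    target_frames: Sequence[Frame],
    g_src_to_tgt: Generator,
    g_tgt_to_src: Generator,
) -> float:
    """Average L1 reconstruction error over both conversion cycles.

    The two frame collections need not be aligned or equal in length,
    which is why no parallel recordings are required for this term.
    """
    # source -> target -> source should reproduce the source frame
    forward = [l1_distance(x, g_tgt_to_src(g_src_to_tgt(x))) for x in source_frames]
    # target -> source -> target should reproduce the target frame
    backward = [l1_distance(y, g_src_to_tgt(g_tgt_to_src(y))) for y in target_frames]
    return sum(forward) / len(forward) + sum(backward) / len(backward)


# Toy data: two source frames and three target frames (deliberately unpaired).
source = [[0.0, 2.0], [1.0, 3.0]]
target = [[4.0, 6.0], [5.0, 7.0], [6.0, 8.0]]


def shift_up(frame: Frame) -> Frame:
    """Toy source-to-target mapping: add 1.0 to every feature."""
    return [v + 1.0 for v in frame]


def shift_down(frame: Frame) -> Frame:
    """Toy target-to-source mapping that exactly inverts shift_up."""
    return [v - 1.0 for v in frame]


def shift_down_half(frame: Frame) -> Frame:
    """Toy target-to-source mapping that only partially inverts shift_up."""
    return [v - 0.5 for v in frame]


consistent_loss = cycle_consistency_loss(source, target, shift_up, shift_down)
inconsistent_loss = cycle_consistency_loss(source, target, shift_up, shift_down_half)
```

When the two toy mappings invert each other, the loss is zero; when they do not, every round trip leaves a residual that the loss penalizes. In an actual CycleGAN this term is combined with adversarial losses from discriminators for each singer.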


DOI: 10.21437/Odyssey.2020-34

Cite as: Sisman, B., Li, H. (2020) Generative Adversarial Networks for Singing Voice Conversion with and without Parallel Data. Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 238-244, DOI: 10.21437/Odyssey.2020-34.


@inproceedings{Sisman2020,
  author={Berrak Sisman and Haizhou Li},
  title={{Generative Adversarial Networks for Singing Voice Conversion with and without Parallel Data}},
  year=2020,
  booktitle={Proc. Odyssey 2020 The Speaker and Language Recognition Workshop},
  pages={238--244},
  doi={10.21437/Odyssey.2020-34},
  url={http://dx.doi.org/10.21437/Odyssey.2020-34}
}