Personalized Singing Voice Generation Using WaveRNN

Xiaoxue Gao, Xiaohai Tian, Yi Zhou, Rohan Kumar Das, Haizhou Li


In this paper, we formulate a personalized singing voice generation (SVG) framework using WaveRNN with non-parallel training data. We develop an average singing voice generation model using WaveRNN trained on the vocals of multiple singers. The model maps singing Phonetic PosteriorGrams (PPGs) and prosody features from a singing template to time-domain singing samples, while a speaker i-vector extracted from the target speech controls the speaker identity of the generated singing. At run-time, a singing template and target speech samples are used to generate the target singing vocal. Notably, neither the content nor the speaker identity of the target speech needs to match that of the singing template. Experimental results on the NUS-48E and NUS-HLT-SLS corpora suggest that the personalized SVG framework outperforms the traditional conversion-vocoder pipeline in both subjective and objective evaluations.
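As a rough illustration of the conditioning described above, the sketch below assembles a per-frame input by concatenating frame-level PPG and prosody features with an utterance-level speaker i-vector repeated on every frame. The function name `build_conditioning`, the feature dimensions, and the use of simple concatenation are assumptions made for illustration; the abstract does not specify how the features are combined or what their sizes are.

```python
from typing import List, Sequence


def build_conditioning(
    ppg: Sequence[Sequence[float]],
    prosody: Sequence[Sequence[float]],
    ivector: Sequence[float],
) -> List[List[float]]:
    """Hypothetical helper: build per-frame conditioning vectors.

    ppg      -- frame-level Phonetic PosteriorGrams from the singing template
    prosody  -- frame-level prosody features from the singing template
    ivector  -- one utterance-level speaker vector from the target speech
    """
    if len(ppg) != len(prosody):
        raise ValueError("PPG and prosody must have the same number of frames")
    # The speaker vector is constant over time, so it is repeated on each frame
    # and appended to that frame's content and prosody features.
    return [list(p) + list(q) + list(ivector) for p, q in zip(ppg, prosody)]


# Toy example with made-up sizes: 3 frames, 4-dim PPG, 2-dim prosody, 5-dim i-vector.
frames = build_conditioning(
    ppg=[[0.25] * 4 for _ in range(3)],
    prosody=[[1.0, 0.5] for _ in range(3)],
    ivector=[0.1, 0.2, 0.3, 0.4, 0.5],
)
```

In this sketch, swapping in an i-vector from a different speaker changes only the trailing part of each frame, which is one simple way to picture how identity could be controlled independently of the template's content.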


DOI: 10.21437/Odyssey.2020-36

Cite as: Gao, X., Tian, X., Zhou, Y., Das, R.K., Li, H. (2020) Personalized Singing Voice Generation Using WaveRNN. Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 252-258, DOI: 10.21437/Odyssey.2020-36.


@inproceedings{Gao2020,
  author={Xiaoxue Gao and Xiaohai Tian and Yi Zhou and Rohan Kumar Das and Haizhou Li},
  title={{Personalized Singing Voice Generation Using WaveRNN}},
  year=2020,
  booktitle={Proc. Odyssey 2020 The Speaker and Language Recognition Workshop},
  pages={252--258},
  doi={10.21437/Odyssey.2020-36},
  url={http://dx.doi.org/10.21437/Odyssey.2020-36}
}