Jointly Trained Conversion Model and WaveNet Vocoder for Non-Parallel Voice Conversion Using Mel-Spectrograms and Phonetic Posteriorgrams

Songxiang Liu, Yuewen Cao, Xixin Wu, Lifa Sun, Xunying Liu, Helen Meng


The N10 system in the Voice Conversion Challenge 2018 (VCC 2018) achieved high voice conversion (VC) performance in terms of speech naturalness and speaker similarity. We believe that further improvements can be gained from joint optimization (instead of separate optimization) of the conversion model and the WaveNet vocoder, as well as from leveraging information in the acoustic representation of the speech waveform, e.g., Mel-spectrograms. In this paper, we propose a VC architecture that jointly trains a conversion model mapping phonetic posteriorgrams (PPGs) to Mel-spectrograms and a WaveNet vocoder. The conversion model has a bottleneck layer, whose outputs are concatenated with PPGs before being fed into the WaveNet vocoder as local conditioning. A weighted sum of a Mel-spectrogram prediction loss and a WaveNet loss serves as the objective function for jointly optimizing the parameters of the conversion model and the WaveNet vocoder. Objective and subjective evaluation results show that the proposed approach achieves significantly improved speech naturalness and speaker similarity of the converted speech for both cross-gender and intra-gender conversions.
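
Below is a minimal, runnable PyTorch sketch of the joint objective described in the abstract: a weighted sum of a Mel-spectrogram prediction loss and a WaveNet loss, with bottleneck outputs concatenated to PPGs as local conditioning. The toy module definitions, dimensions, the L1 choice for the Mel loss, the mu-law class count, and the weight alpha are illustrative assumptions, not the paper's exact configuration; in particular, the stand-in vocoder omits the autoregressive waveform input of a real WaveNet.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed dimensions (illustrative, not from the paper):
PPG_DIM, BN_DIM, MEL_DIM, N_CLASSES = 144, 32, 80, 256

class ConversionModel(nn.Module):
    """Toy PPG-to-Mel model with a bottleneck layer (stand-in for the paper's)."""
    def __init__(self):
        super().__init__()
        self.bottleneck = nn.Linear(PPG_DIM, BN_DIM)
        self.out = nn.Linear(BN_DIM, MEL_DIM)

    def forward(self, ppg):                      # ppg: (B, T, PPG_DIM)
        bn = torch.tanh(self.bottleneck(ppg))    # bottleneck features
        return self.out(bn), bn                  # Mel prediction and bottleneck

class ToyVocoder(nn.Module):
    """Stand-in vocoder: per-frame mu-law class logits from local conditioning."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(BN_DIM + PPG_DIM, N_CLASSES)

    def forward(self, cond):                     # cond: (B, T, BN_DIM + PPG_DIM)
        return self.proj(cond)                   # logits: (B, T, N_CLASSES)

def joint_loss(conv_model, vocoder, ppg, mel_target, wav_target, alpha=1.0):
    mel_pred, bn = conv_model(ppg)
    mel_loss = F.l1_loss(mel_pred, mel_target)   # Mel prediction loss (L1 assumed)
    # Bottleneck outputs concatenated with PPGs as local conditioning:
    cond = torch.cat([bn, ppg], dim=-1)
    logits = vocoder(cond)
    # WaveNet loss: cross-entropy over quantized sample classes
    wn_loss = F.cross_entropy(logits.transpose(1, 2), wav_target)
    # Weighted sum; gradients flow through both models jointly
    return alpha * mel_loss + wn_loss

# Usage on dummy data:
B, T = 2, 100
loss = joint_loss(ConversionModel(), ToyVocoder(),
                  torch.randn(B, T, PPG_DIM),
                  torch.randn(B, T, MEL_DIM),
                  torch.randint(0, N_CLASSES, (B, T)))
loss.backward()

Because a single backward pass covers both terms, the vocoder's gradient also reaches the conversion model through the bottleneck features, which is the joint-optimization effect the paper contrasts with training the two components separately.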


 DOI: 10.21437/Interspeech.2019-1316

Cite as: Liu, S., Cao, Y., Wu, X., Sun, L., Liu, X., Meng, H. (2019) Jointly Trained Conversion Model and WaveNet Vocoder for Non-Parallel Voice Conversion Using Mel-Spectrograms and Phonetic Posteriorgrams. Proc. Interspeech 2019, 714-718, DOI: 10.21437/Interspeech.2019-1316.


@inproceedings{Liu2019,
  author={Songxiang Liu and Yuewen Cao and Xixin Wu and Lifa Sun and Xunying Liu and Helen Meng},
  title={{Jointly Trained Conversion Model and WaveNet Vocoder for Non-Parallel Voice Conversion Using Mel-Spectrograms and Phonetic Posteriorgrams}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={714--718},
  doi={10.21437/Interspeech.2019-1316},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1316}
}