A Speaker-Dependent WaveNet for Voice Conversion with Non-Parallel Data

Xiaohai Tian, Eng Siong Chng, Haizhou Li


In a typical voice conversion system, a vocoder is commonly used for speech-to-features analysis and features-to-speech synthesis. However, the vocoder can be a source of speech quality degradation. This paper presents a novel approach to voice conversion with WaveNet for non-parallel training data. Instead of reconstructing speech from intermediate features, the proposed approach uses WaveNet to map Phonetic PosteriorGrams (PPGs) directly to waveform samples. In this way, we avoid the estimation errors arising from vocoding and feature conversion. Additionally, as the PPG is assumed to be speaker independent, the proposed approach also reduces the feature mismatch problem in WaveNet-vocoder-based solutions. Experiments conducted on the CMU-ARCTIC database show that the proposed approach significantly outperforms the traditional vocoder and WaveNet vocoder baselines in terms of speech quality.
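The core idea, conditioning a WaveNet-style model on frame-level PPGs rather than vocoder features, can be sketched as follows. This is a minimal illustrative toy, not the paper's architecture: the layer sizes, hop length, and the linear stand-in for dilated convolution are all assumptions chosen for brevity. PPG frames are upsampled to the waveform sample rate and injected as local conditioning at every causal layer.

```python
import numpy as np

def upsample_ppg(ppg, hop):
    """Repeat each PPG frame `hop` times so conditioning matches the sample rate."""
    return np.repeat(ppg, hop, axis=0)                 # (frames * hop, ppg_dim)

def causal_gated_layer(x, cond, w_x, w_c, dilation):
    """Toy gated residual layer: output at t depends only on x[<= t] and cond[t].

    A real WaveNet layer uses dilated 1-D convolutions; here a shift by
    `dilation` samples plus a linear map stands in for that (assumption).
    """
    pad = np.zeros((dilation, x.shape[1]))
    x_shift = np.vstack([pad, x])[: len(x)]            # look back `dilation` samples
    z = x @ w_x + x_shift @ w_x + cond @ w_c           # combine signal + conditioning
    gate = 1.0 / (1.0 + np.exp(-z))                    # sigmoid gate
    return x + np.tanh(z) * gate                       # gated activation + residual

rng = np.random.default_rng(0)
n_frames, hop, ppg_dim, channels = 4, 80, 42, 8        # hypothetical sizes
ppg = rng.random((n_frames, ppg_dim))                  # one posterior vector per frame
cond = upsample_ppg(ppg, hop)                          # (320, 42)
x = rng.standard_normal((n_frames * hop, channels))    # one hidden vector per sample

for d in (1, 2, 4, 8):                                 # exponentially growing dilations
    w_x = rng.standard_normal((channels, channels)) * 0.01
    w_c = rng.standard_normal((ppg_dim, channels)) * 0.01
    x = causal_gated_layer(x, cond, w_x, w_c, d)

print(x.shape)                                         # (320, 8): one vector per sample
```

Because the PPG is taken to be speaker independent, the same conditioning pipeline works for non-parallel data: the model only ever needs (PPG, waveform) pairs from the target speaker.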


DOI: 10.21437/Interspeech.2019-1514

Cite as: Tian, X., Chng, E.S., Li, H. (2019) A Speaker-Dependent WaveNet for Voice Conversion with Non-Parallel Data. Proc. Interspeech 2019, 201-205, DOI: 10.21437/Interspeech.2019-1514.


@inproceedings{Tian2019,
  author={Xiaohai Tian and Eng Siong Chng and Haizhou Li},
  title={{A Speaker-Dependent WaveNet for Voice Conversion with Non-Parallel Data}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={201--205},
  doi={10.21437/Interspeech.2019-1514},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1514}
}