ISCA Archive Interspeech 2017

Statistical Voice Conversion with WaveNet-Based Waveform Generation

Kazuhiro Kobayashi, Tomoki Hayashi, Akira Tamamori, Tomoki Toda

This paper presents a statistical voice conversion (VC) technique with WaveNet-based waveform generation. VC based on a Gaussian mixture model (GMM) makes it possible to convert the speaker identity of a source speaker into that of a target speaker. However, in the conventional vocoding process, various factors such as F0 extraction errors, parameterization errors, and over-smoothing of the converted feature trajectory cause modeling errors in the speech waveform, which usually degrade the sound quality of the converted voice. To address this issue, we apply a direct waveform generation technique based on a WaveNet vocoder to VC. In the proposed method, the acoustic features of the source speaker are first converted into those of the target speaker using the GMM. Then, the waveform samples of the converted voice are generated by the WaveNet vocoder conditioned on the converted acoustic features. To investigate the modeling accuracy of the converted speech waveform, we compare several types of acoustic features for training the WaveNet vocoder and for synthesizing with it. The experimental results confirm that the proposed VC technique achieves higher conversion accuracy for speaker individuality than the conventional VC technique, with comparable sound quality.
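The first stage described in the abstract, mapping source acoustic features to target features with a GMM, can be illustrated with a minimal sketch. It assumes a joint-density GMM over stacked source and target frames and a frame-wise minimum mean-square-error mapping E[y | x]; the abstract does not specify the conversion rule, so the paper's actual procedure may differ (for example, by using trajectory-level estimation). The function name `gmm_convert` and all parameter names are hypothetical, and the WaveNet vocoder stage is not shown.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Map one source frame x to the target feature space.

    Assumed (hypothetical) parameter layout for an M-component joint GMM:
      weights: (M,)       mixture weights
      mu_x:    (M, Dx)    source-part means
      mu_y:    (M, Dy)    target-part means
      cov_xx:  (M, Dx, Dx) source covariance blocks
      cov_yx:  (M, Dy, Dx) target-source cross-covariance blocks
    Returns the frame-wise MMSE estimate E[y | x], shape (Dy,).
    """
    n_mix = len(weights)
    log_post = np.empty(n_mix)
    cond_means = np.empty((n_mix, mu_y.shape[1]))
    for m in range(n_mix):
        diff = x - mu_x[m]
        solved = np.linalg.solve(cov_xx[m], diff)  # cov_xx^{-1} (x - mu_x)
        _, logdet = np.linalg.slogdet(cov_xx[m])
        # log(weight * N(x; mu_x, cov_xx)) up to a constant shared by all m
        log_post[m] = np.log(weights[m]) - 0.5 * (logdet + diff @ solved)
        # Conditional mean of y given x for component m
        cond_means[m] = mu_y[m] + cov_yx[m] @ solved
    # Normalize posteriors P(m | x) in a numerically stable way
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return post @ cond_means

# Toy single-component example with made-up parameters
weights = np.array([1.0])
mu_x = np.array([[0.0, 0.0]])
mu_y = np.array([[1.0, 2.0]])
cov_xx = np.array([2.0 * np.eye(2)])
cov_yx = np.array([np.eye(2)])
converted = gmm_convert(np.array([2.0, 4.0]), weights, mu_x, mu_y, cov_xx, cov_yx)
```

In this toy case the single component gives `converted` equal to [2, 4]: the target mean [1, 2] plus the cross-covariance times the whitened offset [1, 2]. In the pipeline the abstract describes, a sequence of such converted frames would then serve as the conditioning input to the WaveNet vocoder.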

doi: 10.21437/Interspeech.2017-986

Cite as: Kobayashi, K., Hayashi, T., Tamamori, A., Toda, T. (2017) Statistical Voice Conversion with WaveNet-Based Waveform Generation. Proc. Interspeech 2017, 1138-1142, doi: 10.21437/Interspeech.2017-986

  author={Kazuhiro Kobayashi and Tomoki Hayashi and Akira Tamamori and Tomoki Toda},
  title={{Statistical Voice Conversion with WaveNet-Based Waveform Generation}},
  booktitle={Proc. Interspeech 2017},