High-quality Voice Conversion Using Spectrogram-Based WaveNet Vocoder

Kuan Chen, Bo Chen, Jiahao Lai, Kai Yu


The waveform generator is a key component in voice conversion. Recently, WaveNet waveform generators conditioned on the Mel-cepstrum (Mcep) have shown better quality than standard vocoders. In this paper, an enhanced WaveNet model based on the spectrogram is proposed to further improve voice conversion performance. Here, the Mel-frequency spectrogram is converted from the source speaker to the target speaker using an LSTM-RNN-based frame-to-frame feature mapping. To evaluate performance, the proposed approach is compared to an Mcep-based LSTM-RNN voice conversion system. Both the STRAIGHT vocoder and an Mcep-based WaveNet vocoder are selected to produce the converted speech for the Mcep conversion system. The fundamental frequency (F0) of the converted speech in the different systems is analyzed. Naturalness, similarity and intelligibility are evaluated with subjective measures. Results show that the spectrogram-based WaveNet waveform generator achieves better voice conversion quality than traditional WaveNet approaches. Mel-spectrogram-based voice conversion achieves a significant improvement in speaker similarity and inherent F0 conversion.
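As background on the Mel-frequency spectrogram used as the conversion feature, the following is a minimal sketch of how filter centre frequencies can be spaced evenly on the mel scale. It assumes the common HTK-style formula mel = 2595 log10(1 + f/700). The abstract does not state which mel formula, number of bands, or frequency range the authors used, so the function names and the values 80 bands and 0 to 8000 Hz are illustrative assumptions only.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels, f_min, f_max):
    """Return n_mels centre frequencies in Hz, equally spaced on the mel
    scale and lying strictly between f_min and f_max."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]


# Hypothetical settings, not taken from the paper.
centers = mel_center_frequencies(n_mels=80, f_min=0.0, f_max=8000.0)
```

Because the spacing is uniform in mels, the bands are narrow at low frequencies and wide at high frequencies, which is why a Mel-frequency spectrogram keeps more low-frequency (including harmonic) detail than a representation with the same number of linearly spaced bands.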


DOI: 10.21437/Interspeech.2018-1528

Cite as: Chen, K., Chen, B., Lai, J., Yu, K. (2018) High-quality Voice Conversion Using Spectrogram-Based WaveNet Vocoder. Proc. Interspeech 2018, 1993-1997, DOI: 10.21437/Interspeech.2018-1528.


@inproceedings{Chen2018,
  author={Kuan Chen and Bo Chen and Jiahao Lai and Kai Yu},
  title={High-quality Voice Conversion Using Spectrogram-Based WaveNet Vocoder},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1993--1997},
  doi={10.21437/Interspeech.2018-1528},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1528}
}