A Voice Conversion Framework with Tandem Feature Sparse Representation and Speaker-Adapted WaveNet Vocoder

Berrak Sisman, Mingyang Zhang, Haizhou Li


A voice conversion system typically consists of two modules: a feature conversion module followed by a vocoder. Exemplar-based sparse representation has proven successful for feature conversion when only a very limited amount of training data is available. Parametric vocoders, however, are generally designed to simulate the mechanics of human speech production under certain simplifying assumptions, so they do not work consistently well for all target applications. In this paper, we study two effective ways to make use of a limited amount of training data for voice conversion. First, we study a novel technique for sparse representation that augments the spectral features with phonetic information, referred to as the Tandem Feature. Second, we study the use of a WaveNet vocoder that can be trained on multi-speaker and target-speaker data to improve the vocoding quality. We evaluate the proposed strategy with the Tandem Feature and the WaveNet vocoder, and show that it consistently improves performance over the traditional sparse representation framework in both objective and subjective evaluations.
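
As a rough illustration of the feature conversion idea summarized in the abstract, the following is a minimal sketch of exemplar-based sparse representation with a stacked spectral-plus-phonetic source dictionary. It is not the authors' implementation. The function names (`make_tandem`, `estimate_activations`, `convert`), the feature dimensions, the `weight` and `sparsity` parameters, and the choice of multiplicative KL-divergence updates are all assumptions made for the example, and random matrices stand in for real spectral features and phonetic posteriors.

```python
import numpy as np


def make_tandem(spec, ppg, weight=1.0):
    """Stack spectral features and phonetic features frame by frame.

    spec: (D_spec, N) non-negative spectral features.
    ppg:  (D_ppg, N) non-negative phonetic features (e.g. posteriors).
    weight is a hypothetical scaling factor for the phonetic part.
    """
    return np.vstack([spec, weight * ppg])


def estimate_activations(X, A, sparsity=0.1, n_iter=200, eps=1e-12):
    """Estimate non-negative activations H with X ~= A @ H, keeping A fixed.

    Uses multiplicative updates for the KL divergence with an L1 penalty
    on H (an assumed, commonly used choice for exemplar-based methods).
    """
    rng = np.random.default_rng(0)
    H = rng.random((A.shape[1], X.shape[1])) + eps
    # Denominator of the update: column sums of A plus the sparsity weight.
    denom = A.sum(axis=0)[:, None] + sparsity
    for _ in range(n_iter):
        ratio = X / (A @ H + eps)
        H *= (A.T @ ratio) / denom
    return H


def convert(X_src, A_src, B_tgt, **kwargs):
    """Activate the source dictionary, then apply the paired target dictionary."""
    H = estimate_activations(X_src, A_src, **kwargs)
    return B_tgt @ H


# Toy data: 50 paired exemplars, 20 source frames to convert.
rng = np.random.default_rng(1)
n_exemplars, n_frames = 50, 20
A_tandem = make_tandem(rng.random((24, n_exemplars)), rng.random((10, n_exemplars)))
B_spec = rng.random((24, n_exemplars))  # target exemplars, spectral part only
X_tandem = make_tandem(rng.random((24, n_frames)), rng.random((10, n_frames)))

Y = convert(X_tandem, A_tandem, B_spec)
print(Y.shape)
```

In this sketch the phonetic features only influence which exemplars are activated; the converted output is built from the target spectral exemplars alone, and would then be passed to a vocoder to generate the waveform.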


DOI: 10.21437/Interspeech.2018-1131

Cite as: Sisman, B., Zhang, M., Li, H. (2018) A Voice Conversion Framework with Tandem Feature Sparse Representation and Speaker-Adapted WaveNet Vocoder. Proc. Interspeech 2018, 1978-1982, DOI: 10.21437/Interspeech.2018-1131.


@inproceedings{Sisman2018,
  author={Berrak Sisman and Mingyang Zhang and Haizhou Li},
  title={A Voice Conversion Framework with Tandem Feature Sparse Representation and Speaker-Adapted WaveNet Vocoder},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1978--1982},
  doi={10.21437/Interspeech.2018-1131},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1131}
}