The HCCL-CUHK System for the Voice Conversion Challenge 2018

Songxiang Liu, Lifa Sun, Xixin Wu, Xunying Liu, Helen Meng


This paper presents our system for the Voice Conversion Challenge 2018 (VCC 2018), which is mainly characterized by performing Voice Conversion (VC) with non-parallel training data using Phonetic PosteriorGrams (PPGs). Although conventional vocoders such as STRAIGHT are used in many VC systems, the synthesized speech degrades in naturalness and similarity, and the synthesis process is slow. We propose to use Short-Time Fourier Transform Magnitudes (STFTMs) to synthesize converted speech waveforms with the Griffin-Lim algorithm. To fully exploit the different harmonic structures across frequencies in the STFTMs, we partition the full-band STFTMs into multiple overlapping frequency bands. Deep Bidirectional LSTM-based RNNs (DBLSTMs) have been shown to successfully model the nonlinear mapping from PPGs to acoustic features in VC systems. However, training and conversion with such RNN models are very slow. To tackle this, the proposed system adopts Convolutional Neural Networks with Gated Linear Units (GatedCNNs) in place of DBLSTMs. The VCC 2018 perceptual results show that the proposed system achieves higher naturalness and similarity than the average performance in the non-parallel VC task.
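The abstract relies on Griffin-Lim phase recovery to turn predicted STFT magnitudes into a waveform. As a minimal illustration only (not the authors' implementation; the FFT size, hop, and iteration count below are arbitrary placeholders, not the challenge configuration), a NumPy/SciPy sketch of the iterative magnitude-consistency loop could look like:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_fft=512, hop=128, n_iter=50, seed=0):
    """Estimate a phase for an STFT magnitude `mag` (freq bins x frames)
    by Griffin-Lim iteration and return a time-domain waveform."""
    rng = np.random.default_rng(seed)
    # Start from the target magnitude with a random phase.
    spec = mag * np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        # Back to the time domain, then re-analyze.
        _, x = istft(spec, nperseg=n_fft, noverlap=n_fft - hop)
        _, _, rebuilt = stft(x, nperseg=n_fft, noverlap=n_fft - hop)
        # Keep the re-estimated phase, re-impose the target magnitude.
        ang = np.angle(rebuilt)[:, :mag.shape[1]]
        if ang.shape[1] < mag.shape[1]:  # guard against off-by-one frame counts
            ang = np.pad(ang, ((0, 0), (0, mag.shape[1] - ang.shape[1])))
        spec = mag * np.exp(1j * ang)
    _, x = istft(spec, nperseg=n_fft, noverlap=n_fft - hop)
    return x
```

Each iteration enforces the target magnitude while the inverse/forward STFT pair pushes the complex spectrogram toward one that corresponds to a real waveform; in the paper's system this step would be applied per frequency band before combining.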


 DOI: 10.21437/Odyssey.2018-35

Cite as: Liu, S., Sun, L., Wu, X., Liu, X., Meng, H. (2018) The HCCL-CUHK System for the Voice Conversion Challenge 2018. Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 248-254, DOI: 10.21437/Odyssey.2018-35.


@inproceedings{Liu2018,
  author={Songxiang Liu and Lifa Sun and Xixin Wu and Xunying Liu and Helen Meng},
  title={The HCCL-CUHK System for the Voice Conversion Challenge 2018},
  year=2018,
  booktitle={Proc. Odyssey 2018 The Speaker and Language Recognition Workshop},
  pages={248--254},
  doi={10.21437/Odyssey.2018-35},
  url={http://dx.doi.org/10.21437/Odyssey.2018-35}
}