Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance

Songxiang Liu, Jinghua Zhong, Lifa Sun, Xixin Wu, Xunying Liu, Helen Meng


Developing a voice conversion (VC) system for a particular speaker typically requires considerable data from both the source and target speakers. This paper aims to effectuate VC across arbitrary speakers, which we call any-to-any VC, with only a single target-speaker utterance. Two systems are studied: (1) the i-vector-based VC (IVC) system and (2) the speaker-encoder-based VC (SEVC) system. Phonetic PosteriorGrams are adopted as speaker-independent linguistic features extracted from speech samples. Both systems train a multi-speaker deep bidirectional long-short term memory (DBLSTM) VC model, taking in additional inputs that encode speaker identities, in order to generate the outputs. In the IVC system, the speaker identity of a new target speaker is represented by i-vectors. In the SEVC system, the speaker identity is represented by speaker embedding predicted from a separately trained model. Experiments verify the effectiveness of both systems in achieving VC based only on a single target-speaker utterance. Furthermore, the IVC approach is superior to SEVC, in terms of the quality of the converted speech and its similarity to the utterance produced by the genuine target speaker.


 DOI: 10.21437/Interspeech.2018-1504

Cite as: Liu, S., Zhong, J., Sun, L., Wu, X., Liu, X., Meng, H. (2018) Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance. Proc. Interspeech 2018, 496-500, DOI: 10.21437/Interspeech.2018-1504.


@inproceedings{Liu2018,
  author={Songxiang Liu and Jinghua Zhong and Lifa Sun and Xixin Wu and Xunying Liu and Helen Meng},
  title={Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={496--500},
  doi={10.21437/Interspeech.2018-1504},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1504}
}