One-Shot Voice Conversion with Global Speaker Embeddings

Hui Lu, Zhiyong Wu, Dongyang Dai, Runnan Li, Shiyin Kang, Jia Jia, Helen Meng

Building a voice conversion (VC) system for a new target speaker typically requires a large amount of speech data from that speaker. This paper investigates a method to build a VC system for an arbitrary target speaker from a single given utterance, without any adaptation training. Inspired by global style tokens (GSTs), which have recently been shown to be effective in controlling the style of synthetic speech, we propose global speaker embeddings (GSEs) to control the conversion target of the VC system. Speaker-independent phonetic posteriorgrams (PPGs) serve as the local condition input to a conditional WaveNet synthesizer that generates the target speaker's waveform. Meanwhile, spectrograms extracted from the given utterance are fed into a reference encoder; the resulting reference embedding is then used as the attention query over the GSEs to produce a speaker embedding, which in turn serves as the global condition input to the WaveNet synthesizer and controls the speaker identity of the generated waveform. In experiments, compared with an adaptation-training-based any-to-any VC system, the proposed GSE-based VC approach performs equally well or better in both speech naturalness and speaker similarity, while offering clearly higher flexibility than the comparison system.
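The core of the GSE mechanism described above is an attention step: the reference embedding queries a learned table of global speaker embedding tokens, and the attention weights mix those tokens into a single speaker embedding. A minimal single-head sketch of that step is shown below; the function name, dimensions, and projection matrices (`W_q`, `W_k`) are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def gse_speaker_embedding(ref_embedding, tokens, W_q, W_k):
    """Attend over global speaker embedding tokens with a reference query.

    ref_embedding: (d_ref,)        reference encoder output for the utterance
    tokens:        (n_tokens, d_tok) learned GSE table (hypothetical shape)
    W_q:           (d_ref, d_att)  query projection (illustrative)
    W_k:           (d_tok, d_att)  key projection (illustrative)
    Returns a (d_tok,) speaker embedding: a convex combination of the tokens.
    """
    q = ref_embedding @ W_q                      # project query, (d_att,)
    k = tokens @ W_k                             # project keys, (n_tokens, d_att)
    scores = k @ q / np.sqrt(q.shape[0])         # scaled dot-product scores
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ tokens                      # weighted sum of tokens
```

The resulting vector would then be broadcast over time as the global condition of the WaveNet synthesizer, alongside the frame-level PPG local conditions.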

DOI: 10.21437/Interspeech.2019-2365

Cite as: Lu, H., Wu, Z., Dai, D., Li, R., Kang, S., Jia, J., Meng, H. (2019) One-Shot Voice Conversion with Global Speaker Embeddings. Proc. Interspeech 2019, 669-673, DOI: 10.21437/Interspeech.2019-2365.

@inproceedings{lu19_interspeech,
  author={Hui Lu and Zhiyong Wu and Dongyang Dai and Runnan Li and Shiyin Kang and Jia Jia and Helen Meng},
  title={{One-Shot Voice Conversion with Global Speaker Embeddings}},
  booktitle={Proc. Interspeech 2019},
  year={2019},
  pages={669--673},
  doi={10.21437/Interspeech.2019-2365}
}