One-Shot Voice Conversion with Disentangled Representations by Leveraging Phonetic Posteriorgrams

Seyed Hamidreza Mohammadi, Taehwan Kim


We propose a voice conversion model from an arbitrary source speaker to an arbitrary target speaker with disentangled representations. Voice conversion is the task of converting a spoken utterance from a source speaker's voice to a target speaker's voice. Most prior work requires knowing either the source speaker, the target speaker, or both at training time, using either a parallel or non-parallel corpus. Instead, we study voice conversion with non-parallel speech corpora in a one-shot learning setting: we convert arbitrary sentences from an arbitrary source speaker to a target speaker given only one or a few target-speaker training utterances. To achieve this, we propose to use disentangled representations of speaker identity and linguistic content. We use a recurrent neural network (RNN) encoder for the speaker embedding and a phonetic posteriorgram as the linguistic content encoding, along with an RNN decoder to generate converted utterances. Our model is simpler than prior approaches, requiring no adversarial training or hierarchical design, and is thus more efficient. In subjective tests, our approach achieved significantly better results than the baseline in terms of speaker similarity.
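The conversion pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the RNN speaker encoder is replaced by mean-pooling, the RNN decoder by a single linear layer, and all dimensions (number of phonetic classes, mel channels, embedding size) are hypothetical. The point is the conditioning scheme: each output frame depends on that frame's phonetic posteriorgram (linguistic content) plus a single utterance-level speaker embedding broadcast over time.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_speaker(target_frames):
    """Stand-in for the RNN speaker encoder: collapse a variable-length
    target utterance (T, n_mels) into one fixed-size speaker embedding.
    (Mean-pooling here; the paper uses an RNN.)"""
    return target_frames.mean(axis=0)

def convert(ppg, speaker_emb, W):
    """Stand-in for the RNN decoder: each output frame is conditioned on
    the frame's phonetic posteriorgram concatenated with the
    time-broadcast speaker embedding. (A linear map here; the paper
    uses an RNN decoder.)"""
    T = ppg.shape[0]
    cond = np.concatenate([ppg, np.tile(speaker_emb, (T, 1))], axis=1)
    return cond @ W  # (T, n_mels) converted acoustic frames

# Hypothetical dimensions, not taken from the paper:
n_phones, n_mels = 40, 80
ppg = rng.random((120, n_phones))      # source-utterance PPG, T = 120 frames
target_utt = rng.random((95, n_mels))  # one target-speaker utterance (one-shot)
W = rng.standard_normal((n_phones + n_mels, n_mels)) * 0.01

speaker_emb = encode_speaker(target_utt)   # speaker identity only
converted = convert(ppg, speaker_emb, W)   # content from source, voice from target
print(converted.shape)  # (120, 80)
```

Because the PPG carries only linguistic content and the embedding carries only speaker identity, swapping in a different target utterance changes the voice without touching the spoken content, which is what enables one-shot conversion.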


DOI: 10.21437/Interspeech.2019-1798

Cite as: Mohammadi, S.H., Kim, T. (2019) One-Shot Voice Conversion with Disentangled Representations by Leveraging Phonetic Posteriorgrams. Proc. Interspeech 2019, 704-708, DOI: 10.21437/Interspeech.2019-1798.


@inproceedings{Mohammadi2019,
  author={Seyed Hamidreza Mohammadi and Taehwan Kim},
  title={{One-Shot Voice Conversion with Disentangled Representations by Leveraging Phonetic Posteriorgrams}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={704--708},
  doi={10.21437/Interspeech.2019-1798},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1798}
}