Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion

Shaojin Ding, Ricardo Gutierrez-Osuna


This paper proposes a Group Latent Embedding for the Vector Quantized Variational Autoencoder (VQ-VAE) used in non-parallel Voice Conversion (VC). Previous studies have shown that VQ-VAE can generate high-quality VC syntheses when paired with a powerful decoder. However, in a conventional VQ-VAE, adjacent atoms in the embedding dictionary can represent entirely different phonetic content, so the VC syntheses can contain mispronunciations and distortions whenever the output of the encoder is quantized to an atom with mismatched phonetic content. To address this issue, we propose an approach that divides the embedding dictionary into groups and uses the weighted average of the atoms in the nearest group as the latent embedding. We conducted both objective and subjective experiments on the non-parallel CSTR VCTK corpus. Results show that the proposed approach significantly improves the acoustic quality of the VC syntheses over the conventional VQ-VAE (13.7% relative improvement) while retaining the voice identity of the target speaker.
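The grouped quantization described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the group-selection rule (the group containing the atom nearest to the encoder output) and the softmax weighting over negative distances are assumptions made for the sketch.

```python
import numpy as np

def group_latent_embedding(z, dictionary, num_groups):
    """Quantize encoder output z to a weighted average of the atoms
    in the nearest dictionary group.

    z          : (dim,) encoder output
    dictionary : (num_atoms, dim) embedding dictionary, assumed to be
                 split evenly into num_groups contiguous groups
    """
    atoms_per_group = dictionary.shape[0] // num_groups
    groups = dictionary.reshape(num_groups, atoms_per_group, -1)

    # Distance from z to every atom, organized by group.
    dists = np.linalg.norm(groups - z, axis=-1)  # (num_groups, atoms_per_group)

    # Assumed selection rule: the group holding the single nearest atom.
    g = int(np.argmin(dists.min(axis=1)))

    # Assumed weighting: softmax over negative distances within the group,
    # so closer atoms contribute more to the latent embedding.
    w = np.exp(-dists[g])
    w /= w.sum()
    return w @ groups[g]
```

Because the output is a convex combination of atoms within a single group, the latent embedding stays inside that group's region of the embedding space, which is the mechanism the paper uses to avoid quantizing to an atom with mismatched phonetic content.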


DOI: 10.21437/Interspeech.2019-1198

Cite as: Ding, S., Gutierrez-Osuna, R. (2019) Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion. Proc. Interspeech 2019, 724-728, DOI: 10.21437/Interspeech.2019-1198.


@inproceedings{Ding2019,
  author={Shaojin Ding and Ricardo Gutierrez-Osuna},
  title={{Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={724--728},
  doi={10.21437/Interspeech.2019-1198},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1198}
}