Using a Manifold Vocoder for Spectral Voice and Style Conversion

Tuan Dinh, Alexander Kain, Kris Tjaden


We propose a new type of spectral feature that is both compact and interpolable, and thus ideally suited for regression approaches that involve averaging. The feature is realized by means of a speaker-independent variational autoencoder (VAE), which learns a latent space based on the low-dimensional manifold of high-resolution speech spectra. In vocoding experiments, we showed that using a 12-dimensional VAE feature (VAE-12) resulted in significantly better perceived speech quality than a 12-dimensional mel-cepstral (MCEP) feature. In voice conversion experiments, using VAE-12 resulted in significantly better perceived speech quality than 40-dimensional MCEPs, with similar speaker accuracy. In habitual-to-clear speaking-style conversion experiments, using a custom skip-connection deep neural network, we significantly improved speech intelligibility for one of three speakers, with average keyword recall accuracy increasing from 24% to 46%.
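To illustrate why a compact latent feature suits averaging-based regression, the sketch below shows a VAE with a 12-dimensional latent space mapping a high-resolution spectrum to a code and back, and linearly interpolating two codes. This is a minimal NumPy sketch with randomly initialized weights and assumed layer sizes (257 spectral bins, one 128-unit hidden layer); it is not the trained architecture from the paper, only a shape-level demonstration of the encode / reparameterize / decode / interpolate pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: 257-bin magnitude spectrum -> 12-dim latent (VAE-12).
# Layer sizes and the single hidden layer are illustrative, not from the paper.
D_SPEC, D_HID, D_LAT = 257, 128, 12

# Random weights stand in for a trained speaker-independent VAE.
W_enc = rng.normal(0, 0.05, (D_SPEC, D_HID))
W_mu = rng.normal(0, 0.05, (D_HID, D_LAT))
W_lv = rng.normal(0, 0.05, (D_HID, D_LAT))
W_dec1 = rng.normal(0, 0.05, (D_LAT, D_HID))
W_dec2 = rng.normal(0, 0.05, (D_HID, D_SPEC))

def encode(x):
    """Map a spectrum to the mean and log-variance of q(z|x)."""
    h = np.tanh(x @ W_enc)
    return h @ W_mu, h @ W_lv

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps (the VAE reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z):
    """Map a 12-dim latent code back to a spectrum."""
    return np.tanh(z @ W_dec1) @ W_dec2

# Two synthetic spectral frames; in practice these would be real
# high-resolution speech spectra.
x_a = rng.random(D_SPEC)
x_b = rng.random(D_SPEC)

z_a = reparameterize(*encode(x_a))
z_b = reparameterize(*encode(x_b))

# Because the latent space follows the manifold of speech spectra,
# averaging two codes (as regression approaches implicitly do) yields
# another valid 12-dim code that decodes to a spectrum.
z_mid = 0.5 * (z_a + z_b)
x_mid = decode(z_mid)

print(z_mid.shape, x_mid.shape)  # → (12,) (257,)
```

Interpolating the 12-dimensional codes rather than raw 257-bin spectra is the property the abstract refers to as "interpolable": the average of two latent codes stays on the learned manifold, whereas averaging raw spectra generally does not.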


DOI: 10.21437/Interspeech.2019-1176

Cite as: Dinh, T., Kain, A., Tjaden, K. (2019) Using a Manifold Vocoder for Spectral Voice and Style Conversion. Proc. Interspeech 2019, 1388-1392, DOI: 10.21437/Interspeech.2019-1176.


@inproceedings{Dinh2019,
  author={Tuan Dinh and Alexander Kain and Kris Tjaden},
  title={{Using a Manifold Vocoder for Spectral Voice and Style Conversion}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1388--1392},
  doi={10.21437/Interspeech.2019-1176},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1176}
}