Modeling and Transforming Speech Using Variational Autoencoders

Merlijn Blaauw, Jordi Bonada


Latent generative models can learn higher-level underlying factors from complex data in an unsupervised manner. Such models can be used in a wide range of speech processing applications, including synthesis, transformation and classification. While there have been many advances in this field in recent years, the application of the resulting models to speech processing tasks is generally not explicitly considered. In this paper we apply the variational autoencoder (VAE) to the task of modeling frame-wise spectral envelopes. The VAE model has many attractive properties, such as continuous latent variables, a prior over these latent variables, a tractable lower bound on the marginal log likelihood, both generative and recognition models, and end-to-end training of deep models. We consider different aspects of training such models for speech data and compare them to more conventional models such as the Restricted Boltzmann Machine (RBM). While evaluating generative models is difficult, we aim for a balanced picture by reporting both reconstruction error and performance on a series of modeling and transformation tasks that probe the quality of the learned features.
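The tractable lower bound the abstract mentions is the standard VAE evidence lower bound (ELBO), whose KL term has a closed form for a diagonal-Gaussian posterior and a standard-normal prior, and whose sampling step uses the reparameterization trick. Below is a minimal, dependency-free sketch of those two ingredients; it is illustrative only and not the authors' spectral-envelope model (function names and the pure-Python style are my own).

```python
import math
import random

def kl_diag_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over latent dimensions. This is the regularization
    term of the VAE's evidence lower bound."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng=random):
    """Reparameterization trick: sample z = mu + sigma * eps with
    eps ~ N(0, I), so that in an autodiff framework the sample stays
    differentiable with respect to the encoder outputs (mu, log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

# When the posterior equals the prior (mu = 0, log_var = 0), the KL term is zero.
print(kl_diag_gaussian([0.0, 0.0], [0.0, 0.0]))
```

In a full model, the ELBO is the expected reconstruction log-likelihood of the spectral-envelope frame minus this KL term, and both encoder and decoder are trained end-to-end by maximizing it.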


DOI: 10.21437/Interspeech.2016-1183

Cite as

Blaauw, M., Bonada, J. (2016) Modeling and Transforming Speech Using Variational Autoencoders. Proc. Interspeech 2016, 1770-1774.

Bibtex
@inproceedings{Blaauw+2016,
  author={Merlijn Blaauw and Jordi Bonada},
  title={Modeling and Transforming Speech Using Variational Autoencoders},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-1183},
  url={http://dx.doi.org/10.21437/Interspeech.2016-1183},
  pages={1770--1774}
}