Predictive Auxiliary Variational Autoencoder for Representation Learning of Global Speech Characteristics

Sebastian Springenberg, Egor Lakomkin, Cornelius Weber, Stefan Wermter


Unsupervised learning represents an important opportunity for obtaining useful speech representations. Recently, variational autoencoders (VAEs) have been shown to extract useful representations in an unsupervised manner. These models are usually not designed to explicitly disentangle specific sources of information. However, when processing sequential data that involves information at multiple timescales, disentanglement can be beneficial. In this paper we address this issue by developing a predictive auxiliary variational autoencoder to obtain speech representations at different timescales. We present an auxiliary lower bound, which is used to develop a model that we call the Predictive Aux-VAE. The model is designed to disentangle global from local information by capturing the global information in a dedicated auxiliary variable. Learned representations are analysed with respect to their ability to capture global speech characteristics. We observe that representations of individual speakers are well separated in the latent space and can be used successfully in a subsequent speaker identification task, where they achieve high classification accuracy, comparable to that of a fully supervised model. Moreover, manipulating the global variable allows global characteristics to be changed while the local content is retained during generation, which demonstrates that our model successfully disentangles global from local information.


DOI: 10.21437/Interspeech.2019-2845

Cite as: Springenberg, S., Lakomkin, E., Weber, C., Wermter, S. (2019) Predictive Auxiliary Variational Autoencoder for Representation Learning of Global Speech Characteristics. Proc. Interspeech 2019, 934-938, DOI: 10.21437/Interspeech.2019-2845.


@inproceedings{Springenberg2019,
  author={Sebastian Springenberg and Egor Lakomkin and Cornelius Weber and Stefan Wermter},
  title={{Predictive Auxiliary Variational Autoencoder for Representation Learning of Global Speech Characteristics}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={934--938},
  doi={10.21437/Interspeech.2019-2845},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2845}
}