Unsupervised Learning for Expressive Speech Synthesis

Igor Jauk

This article describes the PhD thesis of the same name, completed at the Universitat Politècnica de Catalunya. The main goal of the thesis was to investigate unsupervised methods for training expressive voices for tasks such as audiobook reading. The experiments were conducted in the acoustic and semantic domains. In the acoustic domain, the goal was to find a feature set suitable for representing expressiveness in speech. This set was based on i-vectors. The proposed feature set outperformed state-of-the-art feature sets extracted with OpenSmile. In the semantic domain, the goal was first to predict acoustic features from semantic embeddings of text for expressive speech, and then to use the predicted vectors as acoustic cluster centroids for voice adaptation. The result was a system that automatically reads paragraphs with an expressive voice, and a second system that can be regarded as an expressive search engine and used to train voices with specific expressions. The third experiment moved to neural-network-based speech synthesis and the use of sentiment embeddings, which were provided as an additional input to the synthesis system. The system was evaluated in a preference test, which showed the success of the approach.

DOI: 10.21437/IberSPEECH.2018-38

Cite as: Jauk, I. (2018) Unsupervised Learning for Expressive Speech Synthesis. Proc. IberSPEECH 2018, 189-193, DOI: 10.21437/IberSPEECH.2018-38.

  author={Igor Jauk},
  title={{Unsupervised Learning for Expressive Speech Synthesis}},
  booktitle={Proc. IberSPEECH 2018},