Enhance the Word Vector with Prosodic Information for the Recurrent Neural Network Based TTS System

Xin Wang, Shinji Takaki, Junichi Yamagishi


Word embedding, a dense and low-dimensional vector representation of a word, has recently been used to replace the conventional prosodic context as an input feature to the acoustic model of a TTS system. However, word vectors trained from text data may encode insufficient information related to speech. This paper presents a post-filtering approach that enhances the raw word vectors with prosodic information for the TTS task. Based on a publicly available speech corpus with manual prosodic annotation, a post-filter is trained to transform the raw word vectors. Experiments show that using the enhanced word vectors as input to the neural network-based acoustic model improves the accuracy of the predicted F0 trajectory. We also show that the enhanced vectors provide better initial values than the raw vectors for error back-propagation through the network, which results in further improvement.
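The abstract does not describe the post-filter architecture in detail. As an illustration only, the sketch below assumes a small feed-forward post-filter trained to predict manual prosodic tags (e.g. pitch-accent labels from the annotated corpus), with its hidden-layer output taken as the enhanced word vector; the dimensions, tag count, and auxiliary classification objective are assumptions made for this example, not the authors' published design.

```python
# A minimal sketch (not the authors' exact model): a feed-forward "post-filter"
# that maps raw, text-trained word vectors to prosody-enhanced vectors. It is
# trained to predict an (assumed) prosodic tag, so the transformed vector is
# forced to encode prosodic information.
import torch
import torch.nn as nn

RAW_DIM = 300         # dimensionality of the raw word vectors (assumed)
ENHANCED_DIM = 300    # dimensionality of the enhanced word vectors (assumed)
NUM_PROSODY_TAGS = 5  # number of prosodic classes in the annotation (assumed)

class PostFilter(nn.Module):
    def __init__(self):
        super().__init__()
        # transform layer: raw word vector -> enhanced word vector
        self.transform = nn.Sequential(
            nn.Linear(RAW_DIM, ENHANCED_DIM),
            nn.Tanh(),
        )
        # auxiliary classifier used only while training the post-filter
        self.classifier = nn.Linear(ENHANCED_DIM, NUM_PROSODY_TAGS)

    def forward(self, raw_vec):
        enhanced = self.transform(raw_vec)   # vector later fed to the TTS acoustic model
        logits = self.classifier(enhanced)   # prosodic-tag prediction for training
        return enhanced, logits

# Toy training loop on random data standing in for (word vector, prosodic tag) pairs.
model = PostFilter()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

raw_vectors = torch.randn(64, RAW_DIM)                     # batch of raw word vectors
prosody_tags = torch.randint(0, NUM_PROSODY_TAGS, (64,))   # manual prosodic labels

for _ in range(10):
    optimizer.zero_grad()
    enhanced, logits = model(raw_vectors)
    loss = loss_fn(logits, prosody_tags)
    loss.backward()
    optimizer.step()

# After training, model.transform(raw_vec) gives the enhanced word vector,
# which can replace the raw vector as input to the RNN-based acoustic model
# or serve as its initial embedding values for further fine-tuning.
```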


DOI: 10.21437/Interspeech.2016-390

Cite as

Wang, X., Takaki, S., Yamagishi, J. (2016) Enhance the Word Vector with Prosodic Information for the Recurrent Neural Network Based TTS System. Proc. Interspeech 2016, 2856-2860.

Bibtex
@inproceedings{Wang+2016,
  author={Xin Wang and Shinji Takaki and Junichi Yamagishi},
  title={Enhance the Word Vector with Prosodic Information for the Recurrent Neural Network Based TTS System},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-390},
  url={http://dx.doi.org/10.21437/Interspeech.2016-390},
  pages={2856--2860}
}