Recent studies have shown that word vectors are effective in DNN-based speech synthesis. However, word vectors trained on large amounts of text generally carry semantic information rather than prosodic information, which is important for speech synthesis. Word vectors that take prosodic information into account would therefore be expected to improve the quality of synthesized speech. In this paper, we propose a novel prosody-aware word-level encoder for obtaining such word-level vectors. The novelty of the proposed technique is that the word-level encoder is trained on a large speech corpus constructed for automatic speech recognition. The encoder is trained to estimate the F0 contour of each word from the input word sequence, and the outputs of the bottleneck layer of the trained encoder are used as the word-level vectors. Because the encoder learns the relationship between words and their prosodic information from a large speech corpus, the bottleneck-layer outputs are expected to contain prosodic information. The results of objective and subjective experiments indicate that the proposed technique can synthesize speech with improved naturalness.
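The data flow described in the abstract (word sequence in, per-word F0 contour out, bottleneck activations reused as word-level vectors) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the class names, layer sizes, the fixed number of F0 samples per word, the omission of bias terms, and the random untrained weights are all assumptions made for illustration, and training against F0 targets is left out entirely.

```python
import math
import random


def make_matrix(rows, cols, rng):
    # Small random weights; a real model would learn these from speech data.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """Bias-free LSTM cell (hypothetical helper, for illustration only)."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # One weight matrix per gate, applied to [input, previous hidden].
        self.w = {g: make_matrix(hid_dim, in_dim + hid_dim, rng) for g in "ifog"}

    def step(self, x, h, c):
        z = x + h  # list concatenation
        i = [sigmoid(a) for a in matvec(self.w["i"], z)]
        f = [sigmoid(a) for a in matvec(self.w["f"], z)]
        o = [sigmoid(a) for a in matvec(self.w["o"], z)]
        g = [math.tanh(a) for a in matvec(self.w["g"], z)]
        c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
        h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
        return h_new, c_new

    def run(self, xs):
        h = [0.0] * self.hid_dim
        c = [0.0] * self.hid_dim
        outputs = []
        for x in xs:
            h, c = self.step(x, h, c)
            outputs.append(h)
        return outputs


class ProsodyAwareWordEncoder:
    """Word IDs -> BLSTM -> bottleneck -> per-word F0 contour (assumed shapes)."""

    def __init__(self, vocab_size, emb_dim=8, hid_dim=6,
                 bottleneck_dim=4, f0_points=5, seed=0):
        rng = random.Random(seed)
        self.embedding = make_matrix(vocab_size, emb_dim, rng)
        self.fwd = LSTMCell(emb_dim, hid_dim, rng)
        self.bwd = LSTMCell(emb_dim, hid_dim, rng)
        self.w_bottleneck = make_matrix(bottleneck_dim, 2 * hid_dim, rng)
        self.w_f0 = make_matrix(f0_points, bottleneck_dim, rng)

    def forward(self, word_ids):
        xs = [self.embedding[i] for i in word_ids]
        h_fwd = self.fwd.run(xs)
        # Run the backward direction on the reversed sequence, then realign.
        h_bwd = self.bwd.run(xs[::-1])[::-1]
        bottleneck = [
            [math.tanh(a) for a in matvec(self.w_bottleneck, hf + hb)]
            for hf, hb in zip(h_fwd, h_bwd)
        ]
        # Linear output layer: assumed fixed-length F0 contour for each word.
        f0 = [matvec(self.w_f0, b) for b in bottleneck]
        return f0, bottleneck


encoder = ProsodyAwareWordEncoder(vocab_size=10)
f0_contours, word_vectors = encoder.forward([3, 1, 4, 1, 5])
```

In this sketch, `word_vectors` holds one bottleneck vector per input word; after training, vectors of this kind are what would be passed to the speech synthesis model. Because the recurrent layers read the whole sequence in both directions, the same word ID at two positions yields different vectors.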
Cite as: Ijima, Y., Hojo, N., Masumura, R., Asami, T. (2017) Prosody Aware Word-Level Encoder Based on BLSTM-RNNs for DNN-Based Speech Synthesis. Proc. Interspeech 2017, 764-768, doi: 10.21437/Interspeech.2017-521
@inproceedings{ijima17_interspeech,
  author={Yusuke Ijima and Nobukatsu Hojo and Ryo Masumura and Taichi Asami},
  title={{Prosody Aware Word-Level Encoder Based on BLSTM-RNNs for DNN-Based Speech Synthesis}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={764--768},
  doi={10.21437/Interspeech.2017-521}
}