Prosodic structure generation from text plays an important role in Chinese text-to-speech (TTS) synthesis and greatly influences the naturalness and intelligibility of the synthesized speech. This paper proposes a multi-task learning method for prosodic structure generation using a bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) with a structured output layer (SOL). Unlike traditional methods, where prerequisites such as lexicon words or even syntactic trees are usually required as input, the proposed method predicts prosodic boundary labels directly from Chinese characters. The BLSTM RNN captures the bidirectional contextual dependencies of prosodic boundary labels. The SOL further models the correlations among prosodic structures, lexicon words and part-of-speech (POS) tags, where the prediction of prosodic boundary labels is conditioned upon word tokenization and POS tagging results. Experimental results demonstrate the effectiveness of the proposed method.
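To make the architecture described in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of a character-level BLSTM with a structured output layer: auxiliary heads predict word tokenization and POS tags, and the prosodic boundary head is conditioned on their (soft) outputs, as described above. All layer sizes, label-set sizes and names (e.g. `BLSTMWithSOL`, `num_seg_tags`) are illustrative assumptions.

```python
# Sketch of multi-task BLSTM with a structured output layer (SOL).
# Assumptions: per-character tagging for segmentation, POS and prosodic
# boundaries; sizes are placeholders, not values from the paper.
import torch
import torch.nn as nn

class BLSTMWithSOL(nn.Module):
    def __init__(self, num_chars, emb_dim=256, hidden=256,
                 num_seg_tags=4, num_pos_tags=32, num_boundary_tags=4):
        super().__init__()
        self.embed = nn.Embedding(num_chars, emb_dim)
        # Shared character-level BLSTM encoder.
        self.blstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                             bidirectional=True)
        # Auxiliary tasks: word tokenization (segmentation) and POS tagging.
        self.seg_head = nn.Linear(2 * hidden, num_seg_tags)
        self.pos_head = nn.Linear(2 * hidden, num_pos_tags)
        # Structured output layer: the prosodic boundary head also sees the
        # segmentation and POS posteriors at each character position.
        self.boundary_head = nn.Linear(
            2 * hidden + num_seg_tags + num_pos_tags, num_boundary_tags)

    def forward(self, char_ids):
        h, _ = self.blstm(self.embed(char_ids))        # (B, T, 2H)
        seg_logits = self.seg_head(h)                  # (B, T, seg)
        pos_logits = self.pos_head(h)                  # (B, T, pos)
        sol_input = torch.cat(
            [h, seg_logits.softmax(-1), pos_logits.softmax(-1)], dim=-1)
        boundary_logits = self.boundary_head(sol_input) # conditioned output
        return seg_logits, pos_logits, boundary_logits

if __name__ == "__main__":
    # Toy usage: a batch of 2 sentences, 20 characters each.
    model = BLSTMWithSOL(num_chars=6000)
    chars = torch.randint(0, 6000, (2, 20))
    seg, pos, bnd = model(chars)
    print(seg.shape, pos.shape, bnd.shape)
```

In such a multi-task setup, training would typically minimize a weighted sum of the three per-character cross-entropy losses; the weighting scheme here is an assumption, not taken from the paper.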
Cite as: Huang, Y., Wu, Z., Li, R., Meng, H., Cai, L. (2017) Multi-Task Learning for Prosodic Structure Generation Using BLSTM RNN with Structured Output Layer. Proc. Interspeech 2017, 779-783, doi: 10.21437/Interspeech.2017-949
@inproceedings{huang17c_interspeech,
  author={Yuchen Huang and Zhiyong Wu and Runnan Li and Helen Meng and Lianhong Cai},
  title={{Multi-Task Learning for Prosodic Structure Generation Using BLSTM RNN with Structured Output Layer}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={779--783},
  doi={10.21437/Interspeech.2017-949}
}