ISCA Archive Interspeech 2008

Two-stage prosody prediction for emotional text-to-speech synthesis

Hao Tang, Xi Zhou, Matthias Odisio, Mark Hasegawa-Johnson, Thomas S. Huang

In this paper, we adopt a difference approach to prosody prediction for emotional text-to-speech synthesis, where the prosodic variations between emotional and neutral speech are decomposed into global and local prosodic variations and predicted using a two-stage model. The global prosodic variations are modeled by the means and standard deviations of the prosodic parameters, while the local prosodic variations are modeled by the classification and regression tree (CART) and dynamic programming. The proposed two-stage prosody prediction model has been successfully implemented as a prosodic module in an emotional text-to-speech synthesis system based on the Festival-MBROLA architecture, which is able to synthesize highly intelligible, natural and expressive speech.
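
As a rough illustration of the global stage only, the sketch below shifts and scales a neutral pitch contour so that its mean and standard deviation match target values for an emotion. The abstract states only that global variations are modeled by means and standard deviations; the specific mean/variance normalization formula, the function name `global_transform`, and all numeric values here are assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def global_transform(neutral_values, target_mean, target_std):
    """Shift and scale a neutral contour to a target mean and standard deviation.

    Assumed formula (not taken from the paper):
        y = target_mean + target_std * (x - source_mean) / source_std
    """
    src_mean = mean(neutral_values)
    src_std = pstdev(neutral_values)  # population standard deviation
    if src_std == 0:
        # A flat contour has no variation to rescale; only the mean can be set.
        return [target_mean for _ in neutral_values]
    return [
        target_mean + target_std * (v - src_mean) / src_std
        for v in neutral_values
    ]


# Hypothetical neutral F0 values in Hz and hypothetical emotional targets.
neutral_f0 = [110.0, 120.0, 130.0, 125.0, 115.0]
emotional_f0 = global_transform(neutral_f0, target_mean=180.0, target_std=20.0)
```

The local stage (CART plus dynamic programming) is not sketched here, since the abstract does not give enough detail to illustrate it without guessing.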


doi: 10.21437/Interspeech.2008-554

Cite as: Tang, H., Zhou, X., Odisio, M., Hasegawa-Johnson, M., Huang, T.S. (2008) Two-stage prosody prediction for emotional text-to-speech synthesis. Proc. Interspeech 2008, 2138-2141, doi: 10.21437/Interspeech.2008-554

@inproceedings{tang08b_interspeech,
  author={Hao Tang and Xi Zhou and Matthias Odisio and Mark Hasegawa-Johnson and Thomas S. Huang},
  title={{Two-stage prosody prediction for emotional text-to-speech synthesis}},
  year=2008,
  booktitle={Proc. Interspeech 2008},
  pages={2138--2141},
  doi={10.21437/Interspeech.2008-554}
}