International Symposium on Chinese Spoken Language Processing (ISCSLP 2002)

Taipei, Taiwan
August 23-24, 2002

A Statistical Model with Hierarchical Structure for Predicting Prosody in a Mandarin Text-to-Speech System

Ming-Shing Yu, Neng-Huang Pan, Ming-Jer Wu

National Chung-Hsing University, Taichung, Taiwan

In this paper we propose a statistical prosody model with a hierarchical structure for a Mandarin Text-to-Speech (TTS) system. The model has four levels: syllable level, word level, breath group (prosodic phrase) level, and utterance level. Here "hierarchy" means that each lower-level unit is a subset of a higher-level unit. Prosodic information is first estimated at each level by calculating the means of syllables that share the same condition, and the results from all levels are then combined to obtain the predicted prosody. Because each level has only a few parameters, the training corpus need not be very large. This relieves the data sparsity problem that is often encountered with other models, such as neural nets or CART (Classification and Regression Trees). A smaller training corpus also saves training time and disk space. Our prosody generator predicts syllable duration, pause, energy, and pitch contour. The experimental results show that the predicted prosodic values match their original values very well.
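
The per-level averaging described above can be illustrated with a minimal sketch. The abstract does not state how the levels are combined, so the additive scheme below (a global mean plus one mean offset per level, each fitted on the residual left by the previous levels) is an assumption for illustration only, not the authors' method. The names `LEVELS`, `train`, and `predict`, and the condition labels used in the example, are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# The four levels named in the abstract, ordered from lowest to highest.
LEVELS = ("syllable", "word", "breath_group", "utterance")


def train(samples):
    """Fit one table of mean offsets per level.

    samples: list of (conditions, value) pairs, where `conditions` maps each
    level name to a hashable condition label and `value` is one prosodic
    measurement for a syllable (for example, duration in ms).
    ASSUMPTION: levels are combined additively; the abstract does not say so.
    """
    global_mean = mean(value for _, value in samples)
    residuals = [value - global_mean for _, value in samples]
    tables = {}
    for level in LEVELS:
        # Group the current residuals by this level's condition label.
        groups = defaultdict(list)
        for (cond, _), res in zip(samples, residuals):
            groups[cond[level]].append(res)
        # The level's parameter for each condition is the mean residual.
        table = {label: mean(values) for label, values in groups.items()}
        tables[level] = table
        # Remove what this level explains before fitting the next level.
        residuals = [res - table[cond[level]]
                     for (cond, _), res in zip(samples, residuals)]
    return global_mean, tables


def predict(model, cond):
    """Combine the per-level offsets; unseen conditions contribute 0."""
    global_mean, tables = model
    return global_mean + sum(tables[level].get(cond[level], 0.0)
                             for level in LEVELS)


if __name__ == "__main__":
    # Made-up toy data: two syllables that differ only in syllable condition.
    shared = {"word": "w0", "breath_group": "bg0", "utterance": "u0"}
    data = [({"syllable": "tone1", **shared}, 180.0),
            ({"syllable": "tone4", **shared}, 220.0)]
    model = train(data)
    print(predict(model, {"syllable": "tone1", **shared}))
```

Because each level stores only one mean per condition label, the number of parameters grows with the number of distinct labels rather than with their combinations, which is the property the abstract credits for the small corpus requirement.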



Bibliographic reference.  YU, Ming-Shing / PAN, Neng-Huang / WU, Ming-Jer (2002): "A statistical model with hierarchical structure for predicting prosody in a Mandarin text-to-speech system", In ISCSLP 2002, paper 20.