International Symposium on Chinese Spoken Language Processing
August 23-24, 2002
A Statistical Model with Hierarchical Structure for Predicting Prosody in a Mandarin Text-to-Speech System
Ming-Shing Yu, Neng-Huang Pan, Ming-Jer Wu
National Chung-Hsing University, Taichung, Taiwan
In this paper we propose a statistical prosody model with a
hierarchical structure for a Mandarin Text-to-Speech (TTS) system.
The model has four levels: syllable, word, breath group (prosodic
phrase), and utterance. Here "hierarchy" means that each unit at a
lower level is contained in a unit at the next higher level. Prosodic
information is first estimated at each level: we calculate the mean
prosodic values of the syllables that share the same condition. The
results of all levels are then combined to give the predicted
prosody. Because each level has only a few parameters, the training
corpus need not be very large. This relieves the data sparsity
problem that is often encountered with other models, such as neural
networks or CART (Classification and Regression Trees). A smaller
training corpus also saves training time and disk space. Our prosody
generator predicts syllable duration, pause, energy, and pitch
contour. The experimental results show that the predicted prosodic
values closely match the original values.
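
The per-level averaging described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the levels are combined or which conditions each level uses, so the following are assumptions made only for illustration: the levels are combined additively, each level averages the residual left by the levels before it, and the field names (`syllable`, `word`, `breath_group`, `utterance`, `duration`) and the function names are hypothetical. It is not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical level order and field names; the abstract only names the four levels.
LEVELS = ("syllable", "word", "breath_group", "utterance")


def fit_level_means(samples, levels=LEVELS, target="duration"):
    """Fit one table of per-condition means per level.

    Assumption: each level averages the residual left over after the
    previous levels, so the level contributions can simply be summed.
    """
    residuals = [float(s[target]) for s in samples]
    tables = []
    for level in levels:
        sums = defaultdict(float)
        counts = defaultdict(int)
        # Accumulate the residuals of all syllables sharing a condition.
        for sample, res in zip(samples, residuals):
            sums[sample[level]] += res
            counts[sample[level]] += 1
        table = {cond: sums[cond] / counts[cond] for cond in sums}
        tables.append(table)
        # Remove what this level explains before fitting the next level.
        residuals = [res - table[sample[level]]
                     for sample, res in zip(samples, residuals)]
    return tables


def predict(tables, conditions, levels=LEVELS):
    """Sum the per-level means; a condition unseen in training adds 0.0."""
    return sum(table.get(conditions[level], 0.0)
               for level, table in zip(levels, tables))
```

With this layout each level stores one mean per condition, so the number of parameters grows with the number of distinct conditions per level rather than with the number of their combinations, which is the property the abstract credits for the small corpus requirement.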
YU, Ming-Shing / PAN, Neng-Huang / WU, Ming-Jer (2002):
"A statistical model with hierarchical structure for predicting prosody in a Mandarin text-to-speech system",
In ISCSLP 2002, paper 20.