
International Symposium on Chinese Spoken Language Processing
(ISCSLP 2002)
Taipei, Taiwan
August 23-24, 2002

A Statistical Model with Hierarchical Structure for Predicting Prosody in a Mandarin Text-to-Speech System
Ming-Shing Yu, Neng-Huang Pan, Ming-Jer Wu
National Chung-Hsing University, Taichung, Taiwan
In this paper we propose a statistical prosody model with a
hierarchical structure for a Mandarin Text-to-Speech (TTS) system.
There are four levels in our model: syllable level, word level, breath
group (prosodic phrase) level, and utterance level. Here
"hierarchy" means that each lower level is a subset of a higher
level. Prosodic information is first estimated at each level, and the
per-level results are then combined to obtain the predicted prosody.
Since there are only a few parameters at each level, the training corpus
need not be very large. Thus the data sparsity problem, which is
often encountered with other models such as neural nets
or CART (Classification and Regression Tree), can be relieved.
In addition, a smaller training corpus reduces training time
and disk space. At each level, we calculate the means over syllables
that share the same condition, and these per-level means are what we
combine in the final step. Our prosody generator can predict
syllable duration, pause, energy, and pitch contour. The experimental
results show that the predicted prosodic values closely match their
original values.
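
The abstract does not say how the per-level means are combined, so the following Python sketch is only an illustration under stated assumptions: it treats the combination as additive, fits each level's condition means on the residual left by the levels before it, and uses hypothetical condition labels (tone, word position, and so on) that do not come from the paper. The names `train`, `predict`, and `LEVELS` are likewise invented for this example.

```python
from collections import defaultdict
from statistics import mean

# The four levels named in the abstract, from lowest to highest.
LEVELS = ("syllable", "word", "breath_group", "utterance")


def train(samples, levels=LEVELS):
    """Fit one table of condition means per level.

    samples: list of (conditions, value) pairs, where `conditions` maps
    each level name to a hashable condition label and `value` is one
    prosodic measurement for a syllable (e.g. duration in ms).

    ASSUMPTION: additive combination with residual fitting; the paper's
    actual combination rule is not given in the abstract.
    """
    global_mean = mean(value for _, value in samples)
    residuals = [value - global_mean for _, value in samples]
    tables = {}
    for level in levels:
        # Group the current residuals by this level's condition label.
        groups = defaultdict(list)
        for (cond, _), res in zip(samples, residuals):
            groups[cond[level]].append(res)
        table = {label: mean(vals) for label, vals in groups.items()}
        tables[level] = table
        # Remove what this level explains before fitting the next level.
        residuals = [
            res - table[cond[level]]
            for (cond, _), res in zip(samples, residuals)
        ]
    return global_mean, tables


def predict(model, cond, levels=LEVELS):
    """Sum the global mean and each level's mean offset for `cond`.

    Conditions never seen in training contribute an offset of 0.0.
    """
    global_mean, tables = model
    return global_mean + sum(
        tables[level].get(cond[level], 0.0) for level in levels
    )
```

Because each level stores only one mean per condition, the number of parameters grows with the number of conditions per level rather than with their product, which is the property the abstract credits for easing data sparsity.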
Bibliographic reference.
YU, Ming-Shing / PAN, Neng-Huang / WU, Ming-Jer (2002):
"A statistical model with hierarchical structure for predicting prosody in a Mandarin text-to-speech system",
In ISCSLP 2002, paper 20.