5th International Conference on Spoken Language Processing
Generating natural, human-sounding prosody for text-to-speech (TTS) has historically been one of the most challenging problems facing researchers and developers, and TTS systems have become notorious for their "robotic" intonation. This paper describes an approach to this problem that endeavors to capture as much detail as possible from speech data while avoiding the "black boxes" typical of neural networks and some vector clustering algorithms. Unlike those methods, our approach may give feedback on exactly which parameters are crucial in determining the successful choice of pattern. Focusing on the notion of prosody templates, we confirmed that a representative F0 and duration pattern can be extracted, based on stress pattern, for a target proper noun that occurs in sentence-initial position.
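The idea of a prosody template keyed by stress pattern can be illustrated with a minimal sketch. The abstract does not describe the extraction procedure, so everything below is an assumption made for illustration: the stress-pattern strings (one digit per syllable, such as "10"), the length normalization by linear interpolation, the use of a plain arithmetic mean, and the names `resample` and `build_templates` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def resample(contour, n_points):
    """Linearly interpolate an F0 contour to a fixed number of points.

    Assumption: contours of different lengths are made comparable by
    simple length normalization; the paper may do this differently.
    """
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    if len(contour) == 1:
        return [float(contour[0])] * n_points
    out = []
    for i in range(n_points):
        pos = i * (len(contour) - 1) / (n_points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out


def build_templates(tokens, n_points=5):
    """Group word tokens by stress pattern and average F0 and durations.

    Each token is a tuple (stress_pattern, f0_values_hz, syllable_durations_s).
    Assumption: every token with the same stress pattern has the same number
    of syllable durations, one per digit of the pattern.
    """
    groups = defaultdict(list)
    for stress, f0, durations in tokens:
        groups[stress].append((resample(f0, n_points), list(durations)))

    templates = {}
    for stress, items in groups.items():
        # Average point by point across all tokens sharing this pattern.
        f0_avg = [mean(col) for col in zip(*(f for f, _ in items))]
        dur_avg = [mean(col) for col in zip(*(d for _, d in items))]
        templates[stress] = {
            "f0": f0_avg,
            "durations": dur_avg,
            "count": len(items),
        }
    return templates
```

Calling `build_templates` on a list of measured tokens returns one entry per stress pattern, each holding an averaged F0 contour, averaged syllable durations, and the number of tokens that contributed. Because each template is a plain average over an explicit grouping, it stays inspectable, in keeping with the abstract's emphasis on avoiding "black boxes".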
Bibliographic reference. Holm, Frode / Hata, Kazue (1998): "Common patterns in word level prosody", In ICSLP-1998, paper 1038.