Eighth ISCA Workshop on Speech Synthesis
Barcelona, Catalonia, Spain
The pitch contour in speech carries information about different linguistic units at several distinct temporal scales. At the finest level, microprosodic cues are purely segmental in nature, whereas at coarser time scales lexical tones, word accents, and phrase accents appear with both linguistic and paralinguistic functions. Consequently, pitch movements occur on different temporal scales: segmental perturbations are faster than typical pitch accents, and so forth. In the HMM-based speech synthesis paradigm, the slower intonation patterns are difficult to model. The statistical procedure of decision tree clustering emphasizes the more common instances, resulting in good reproduction of microprosody and declination, but with less variation at the word and phrase levels than in human speech. Here we present a system that uses wavelets to decompose the pitch contour into five temporal scales ranging from microprosody to the utterance level. Each component is then trained individually within the HMM framework and combined in a superpositional manner at the synthesis stage. The resulting system is compared to a baseline in which only one decision tree is trained to generate the pitch contour.
Index Terms: HMM-based synthesis, intonation modeling, wavelet decomposition
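The core idea of the abstract — split the pitch contour into a handful of temporal scales, model each separately, and sum them back at synthesis time — can be sketched in a few lines. The snippet below is a simplified illustration only: it uses a 5-level Haar multiresolution decomposition as a stand-in for the wavelet analysis the paper describes (the abstract does not specify the mother wavelet), and the `f0` contour is a synthetic example combining declination, an accent-like bump, and fast jitter.

```python
# A minimal sketch of scale decomposition with superpositional reconstruction,
# assuming a 5-level Haar transform as a simple stand-in for the paper's
# wavelet analysis. The "f0" contour below is purely illustrative.
import math

def haar_analyze(signal, levels):
    """Split a signal into one coarse approximation and `levels` detail bands."""
    approx, details = list(signal), []
    for _ in range(levels):
        a = [(approx[2*i] + approx[2*i+1]) / 2 for i in range(len(approx) // 2)]
        d = [(approx[2*i] - approx[2*i+1]) / 2 for i in range(len(approx) // 2)]
        details.append(d)   # details[0] is the finest (microprosody-like) band
        approx = a
    return approx, details

def haar_synthesize(approx, details):
    """Invert haar_analyze by interleaving sums and differences back."""
    x = list(approx)
    for d in reversed(details):
        x = [v for a, dd in zip(x, d) for v in (a + dd, a - dd)]
    return x

def scale_components(signal, levels):
    """Reconstruct each temporal scale on its own, so the bands sum to the input."""
    approx, details = haar_analyze(signal, levels)
    comps = []
    for j in range(levels):
        only_j = [d if k == j else [0.0] * len(d) for k, d in enumerate(details)]
        comps.append(haar_synthesize([0.0] * len(approx), only_j))
    # the coarsest band carries the slow, utterance-level trend
    comps.append(haar_synthesize(approx, [[0.0] * len(d) for d in details]))
    return comps

# Illustrative contour: declination + accent-like bump + fast segmental jitter.
n = 64
f0 = [120 - 0.3*t + 15*math.exp(-((t - 30)**2) / 40) + 2*math.sin(1.9*t)
      for t in range(n)]
components = scale_components(f0, levels=5)     # 5 detail bands + 1 trend
recon = [sum(c[t] for c in components) for t in range(n)]  # superposition
```

Because the transform is linear, the per-scale components sum back to the original contour exactly; in the paper's setting, each such component would be modeled by its own stream in the HMM framework before being superposed at synthesis.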
Bibliographic reference. Suni, Antti / Aalto, Daniel / Raitio, Tuomo / Alku, Paavo / Vainio, Martti (2013): "Wavelets for intonation modeling in HMM speech synthesis", In SSW8, 285-290.