5th International Conference on Spoken Language Processing

Sydney, Australia
November 30 - December 4, 1998

Modeling Vowel Duration for Japanese Text-to-Speech Synthesis

Jennifer J. Venditti (1), Jan P. H. van Santen (2)

(1) Bell Labs and Ohio State Univ, USA
(2) Bell Labs, USA

Accurate estimation of segmental durations is crucial for natural-sounding text-to-speech (TTS) synthesis. This paper presents a model of vowel duration used in the Bell Labs Japanese TTS system. We describe the constraints on vowel devoicing and the effects of factors such as phone identity, surrounding phone identities, accentuation, syllabic structure, and phrasal position on the duration of both long and short vowels. A Sum-of-Products approach is used to model key interactions observed in the data and to predict durations for factor combinations not found in the speech database. We report root mean squared deviations between observed and predicted durations ranging from 8 to 15 ms, and an overall correlation of 0.89.
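As a rough illustration of the Sum-of-Products idea, the sketch below computes a duration as a sum of terms, each term being a product of per-factor parameters, and then evaluates predictions with the two measures quoted above (root mean squared deviation and correlation). The choice of terms, the factor names, the levels, and every numeric value are assumptions invented for this example; none of them are taken from the paper or from the Bell Labs system.

```python
import math

# Hypothetical model: two product terms. All numbers are invented for illustration.
TERMS = [
    # Term 1: vowel identity scaled multiplicatively by phrasal position.
    {"vowel": {"a": 80.0, "i": 60.0}, "position": {"medial": 1.0, "final": 1.5}},
    # Term 2: an additive contribution (in ms) that depends only on accentuation.
    {"accent": {"unaccented": 0.0, "accented": 5.0}},
]


def predict_duration(features, terms=TERMS):
    """Sum over terms of the product of each term's factor parameters."""
    total = 0.0
    for term in terms:
        product = 1.0
        for factor, scale in term.items():
            product *= scale[features[factor]]
        total += product
    return total


def rmsd(observed, predicted):
    """Root mean squared deviation between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # This combination need not occur in the training data: the prediction is
    # assembled from parameters estimated separately for each factor level.
    unseen = {"vowel": "i", "position": "final", "accent": "accented"}
    print(predict_duration(unseen))
```

Because each factor level has its own parameter, a prediction can be formed for a combination of levels that never appears in the database, which is the property the abstract attributes to the approach. In practice the parameters would be estimated from recorded speech rather than written by hand.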


Bibliographic reference. Venditti, Jennifer J. / Santen, Jan P. H. van (1998): "Modeling vowel duration for Japanese text-to-speech synthesis", in ICSLP-1998, paper 0786.