5th International Conference on Spoken Language Processing
Accurate estimation of segmental durations is crucial for natural-sounding text-to-speech (TTS) synthesis. This paper presents a model of vowel duration used in the Bell Labs Japanese TTS system. We describe the constraints on vowel devoicing and the effects of factors such as phone identity, surrounding phone identities, accentuation, syllabic structure, and phrasal position on the duration of both long and short vowels. A Sum-of-Products approach is used to model key interactions observed in the data and to predict values for factor combinations not found in the speech database. We report root mean squared deviations between observed and predicted durations ranging from 8 to 15 ms, and an overall correlation of 0.89.
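As an illustration of the general sum-of-products idea, the following minimal Python sketch predicts a duration as a sum of terms, where each term is a product of per-factor parameters. It is not the model reported in the paper: the factor names, levels, term structure, and all numeric values are hypothetical and chosen only for readability, and the names `TERMS`, `predict_duration`, and `rmsd` are introduced here for the example.

```python
import math

# Hypothetical parameter tables (illustrative values only, not from the paper).
# Each term maps a factor name to a table of per-level parameters.
TERMS = [
    # Term 1: vowel identity interacts multiplicatively with phrasal position.
    {
        "vowel": {"a": 80.0, "i": 60.0},
        "position": {"medial": 1.0, "final": 1.3},
    },
    # Term 2: an additive offset that depends on accentuation alone.
    {
        "accent": {"unaccented": 0.0, "accented": 5.0},
    },
]


def predict_duration(features, terms=TERMS):
    """Return the sum over terms of the product of per-factor parameters (ms)."""
    total = 0.0
    for term in terms:
        product = 1.0
        for factor, table in term.items():
            product *= table[features[factor]]
        total += product
    return total


def rmsd(observed, predicted):
    """Root mean squared deviation between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

Because each parameter is tied to a single factor level rather than to a full combination of levels, a call such as `predict_duration({"vowel": "i", "position": "final", "accent": "unaccented"})` returns a value even if that exact combination never occurred in the data used to estimate the parameters. The `rmsd` helper shows how a deviation measure of the kind quoted in the abstract could be computed from paired observed and predicted durations.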
Bibliographic reference. Venditti, Jennifer J. / Santen, Jan P. H. van (1998): "Modeling vowel duration for Japanese text-to-speech synthesis", In ICSLP-1998, paper 0786.