ISCA Archive SSW 1998

Modeling segmental durations for Japanese text-to-speech synthesis

Jennifer J. Venditti, Jan P. H. van Santen

Accurate estimation of segmental durations is crucial for natural-sounding text-to-speech (TTS) synthesis. This paper presents a model of segmental duration used in the Bell Labs Japanese TTS system. We describe the constraints on vowel devoicing and the effects of factors such as phone identity, surrounding phone identities, accentuation, syllabic structure, and phrasal position on the duration of both consonants and vowels. A Sum-of-Products approach is used to model key interactions observed in the data and to predict values for factor combinations not found in the speech database. We report overall observed-predicted correlations of 0.88 for vowels (RMSdev: 16.8 ms) and 0.94 for consonants (RMSdev: 12.5 ms).
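
As a rough illustration of the Sum-of-Products idea mentioned in the abstract, the sketch below computes a duration as a sum of terms, each term being a product of per-factor scale values. The function name `sop_duration`, the factor names, and every numeric value are hypothetical and chosen only for illustration; they are not the paper's fitted parameters, and the paper's actual model structure may differ.

```python
from math import prod

def sop_duration(factors, terms):
    """Sum-of-products prediction: sum over terms of the product of
    per-factor scale values looked up for the given factor levels."""
    return sum(
        prod(table[factors[name]] for name, table in term.items())
        for term in terms
    )

# Hypothetical parameters (milliseconds and unitless multipliers).
TERMS = [
    # Term 1: intrinsic phone duration scaled by phrasal position.
    {
        "phone": {"a": 80.0, "i": 60.0},
        "phrase_final": {False: 1.0, True: 1.5},
    },
    # Term 2: additive contribution of accentuation.
    {
        "accented": {False: 0.0, True: 8.0},
    },
]

# A combination assumed to be present in the training data.
seen = sop_duration(
    {"phone": "a", "phrase_final": True, "accented": True}, TERMS
)

# A combination assumed to be absent from the data: the per-factor
# parameters still yield a prediction.
unseen = sop_duration(
    {"phone": "i", "phrase_final": True, "accented": False}, TERMS
)
```

Because each factor level has its own parameter, a prediction is available for any combination of levels whose individual parameters were estimated, even if that exact combination never occurred in the speech database.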


Cite as: Venditti, J.J., van Santen, J.P.H. (1998) Modeling segmental durations for Japanese text-to-speech synthesis. Proc. 3rd ESCA/COCOSDA Workshop on Speech Synthesis (SSW 3), 31-36

@inproceedings{venditti98_ssw,
  author={Jennifer J. Venditti and Jan P. H. van Santen},
  title={{Modeling segmental durations for Japanese text-to-speech synthesis}},
  year=1998,
  booktitle={Proc. 3rd ESCA/COCOSDA Workshop on Speech Synthesis (SSW 3)},
  pages={31--36}
}