Third International Conference on Spoken Language Processing (ICSLP 94)
This paper discusses synthesis units for text-to-speech synthesis. The choice of synthesis unit and the extraction of units are very important problems in speech synthesis by rule. In general, a synthesis method based on longer units reduces the difficulty of realizing coarticulation. VCV, CVC, demi-syllable, and triphone units are typical for Japanese text-to-speech systems. Recently, non-uniform units and context-dependent units have been proposed, and good results have been reported. However, such units take only phonetic context into account for the realization of coarticulation. Phonetic context is indeed one of the most important factors in the variation of spectral features, but prosodic features are also important. Our basic idea is to introduce prosodic features into the control of spectral features in order to produce natural-sounding synthetic speech. In this paper, we report the results of basic analytic experiments. The results show that there is a clear relation between prosodic features and spectral features, and that a spectral control method that considers not only phonetic context but also prosodic features can improve the quality of synthetic speech in a text-to-speech system.
Bibliographic reference. Ishikawa, Yasushi / Nakajima, Kunio (1994): "On synthesis units for Japanese text-to-speech synthesis", In ICSLP-1994, 1751-1754.