Sixth European Conference on Speech Communication and Technology

Budapest, Hungary
September 5-9, 1999

A Japanese Text-to-Speech System Based on Multi-form Units with Consideration of Frequency Distribution in Japanese

Kimihito Tanaka, Hideyuki Mizuno, Masanobu Abe, Shin'ya Nakajima

NTT Cyber Space Laboratories, Japan

This paper proposes a new text-to-speech (TTS) system that concatenates large numbers of speech segments to produce very natural and intelligible synthetic speech. One novel point of the system is its new synthesis unit, which has three notable characteristics:

- The synthesis units contain all Japanese syllables together with all possible vowel sequences, so very smooth synthetic speech is produced.
- Both preceding and succeeding phoneme environments are considered when speech segments are concatenated, so natural-sounding transitions from a vowel to a consonant, which is the only concatenation point with the proposed unit, are present in the synthetic speech.
- Each unit has various fundamental frequency (F0) contours. Therefore, F0 modification rates are very small in any synthesis event, and the F0 modification process causes only minor distortion.

To develop a unit database efficiently and effectively, we analyzed 4,850,000 Japanese phrases (breath groups) containing 87,810,000 phonemes and ranked the units in order of appearance frequency. Listening tests confirm the high intelligibility and naturalness of speech produced by the new TTS system. The system uses the 50,000 most frequent units, which cover over 77% of Japanese texts.
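The frequency ranking and coverage figure mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how units were extracted or how coverage was measured, so everything here is an assumption: the corpus is taken to be already segmented into synthesis units, coverage is taken to be the fraction of unit occurrences accounted for by the top-ranked units, and the names `rank_units`, `coverage`, and `toy_units` are hypothetical.

```python
from collections import Counter

def rank_units(units):
    """Return (unit, count) pairs sorted by descending appearance frequency."""
    return Counter(units).most_common()

def coverage(ranked, top_n):
    """Fraction of all unit occurrences covered by the top_n ranked units."""
    total = sum(count for _, count in ranked)
    if total == 0:
        return 0.0
    covered = sum(count for _, count in ranked[:top_n])
    return covered / total

# Toy, hand-made unit sequence (not data from the paper).
toy_units = ["ka", "ka", "ka", "to", "to", "ai", "shi", "ka", "to", "ne"]
ranked = rank_units(toy_units)
top2_coverage = coverage(ranked, 2)  # share of occurrences covered by the 2 most frequent units
```

On a real corpus, the same computation with `top_n=50000` would give a figure comparable in kind to the coverage reported above, provided the unit definition and counting method match those of the authors.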


