Third International Conference on Spoken Language Processing (ICSLP 94)

Yokohama, Japan
September 18-22, 1994

Generation of Prosody in Speech Synthesis Using Large Speech Data-Base

Naohiro Sakurai, Takemi Mochida, Tetsunori Kobayashi, Katsuhiko Shirai

Department of Electrical Engineering, Waseda University, Tokyo, Japan

To improve the naturalness of synthetic speech in Japanese text-to-speech and concept-to-speech conversion, we introduce a new scheme that synthesizes arbitrary sentences using a data-base of natural sentence speech. In our synthesis method, a series of synthesis parameters is generated from patterns extracted from natural speech waveforms. In the first step, a basic sentence is selected from the data-base for a target sentence. The factors for this selection are the phrase dependency structure (separation degree), the number of morae, the accent type, and the phonemic labels. In the second step, if necessary, a basic accent-phrase is selected from the same data-base for each target accent-phrase. The factors considered in selecting each accent-phrase are again the separation degree, the number of morae, the accent type, and the phonemic labels. In the third step, the pitch pattern is generated from the waveform units selected in the first and second steps. In the last step, the phonemic parameters are generated. Because the phonemic parameters for several morae have already been extracted in the preceding three steps, this step only has to replace the phonemic parameters of ill-suited morae. Because the pitch pattern is generated from patterns extracted directly from real speech, it is expected to be more natural than a pattern estimated by a model. We have so far examined this method on Japanese sentence speech and confirmed that the synthetic speech preserves human-like features fairly well.
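The first two selection steps described above can be illustrated with a minimal sketch. The abstract names the four selection factors but does not say how they are combined, so everything below is an assumption made for illustration: the `AccentPhrase` record, the weighted mismatch cost, the weight values in `WEIGHTS`, and the `max_phrase_cost` threshold that decides when a phrase from the basic sentence is "ill-suited" and must be re-selected on its own. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AccentPhrase:
    """Hypothetical record holding the four factors named in the abstract."""
    separation: int             # separation degree (phrase dependency structure)
    morae: int                  # number of morae
    accent_type: int            # accent type
    phonemes: Tuple[str, ...]   # phonemic labels


# Assumed weights; the abstract does not specify any weighting.
WEIGHTS = {"separation": 4.0, "morae": 2.0, "accent_type": 2.0, "phonemes": 1.0}


def phrase_cost(target: AccentPhrase, cand: AccentPhrase) -> float:
    """Weighted mismatch between a target phrase and a candidate (assumed form)."""
    label_mismatch = sum(a != b for a, b in zip(target.phonemes, cand.phonemes))
    label_mismatch += abs(len(target.phonemes) - len(cand.phonemes))
    return (WEIGHTS["separation"] * abs(target.separation - cand.separation)
            + WEIGHTS["morae"] * abs(target.morae - cand.morae)
            + WEIGHTS["accent_type"] * (target.accent_type != cand.accent_type)
            + WEIGHTS["phonemes"] * label_mismatch)


def sentence_cost(target: List[AccentPhrase], cand: List[AccentPhrase]) -> float:
    """Sum of phrase costs; sentences with a different phrase count never match."""
    if len(target) != len(cand):
        return float("inf")
    return sum(phrase_cost(t, c) for t, c in zip(target, cand))


def select_units(target: List[AccentPhrase],
                 database: List[List[AccentPhrase]],
                 max_phrase_cost: float = 3.0) -> List[AccentPhrase]:
    """Step 1: pick a basic sentence. Step 2: re-select ill-suited phrases."""
    # Step 1: the data-base sentence closest to the target sentence.
    best = min(database, key=lambda s: sentence_cost(target, s))
    aligned: List[Optional[AccentPhrase]] = (
        list(best) if len(best) == len(target) else [None] * len(target)
    )
    # Step 2: if a phrase fits poorly, search all accent phrases in the data-base.
    pool = [p for sentence in database for p in sentence]
    chosen = []
    for t, b in zip(target, aligned):
        if b is None or phrase_cost(t, b) > max_phrase_cost:
            b = min(pool, key=lambda p: phrase_cost(t, p))
        chosen.append(b)
    return chosen
```

In this sketch the selected phrases would then supply the waveform units from which the pitch pattern and the phonemic parameters are taken in the third and fourth steps. Those steps are not modelled here. The function expects a non-empty `database`.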

