Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

A Hybrid Statistical/RNN Approach to Prosody Synthesis for Taiwanese TTS

Sin-Horng Chen, Chen-Chung Ho

Department of Communication Engineering, Chiao Tung University, Hsinchu, Taiwan

This paper proposes a hybrid approach that incorporates statistical modeling of prosodic parameters into recurrent neural network (RNN)-based prosody synthesis for Min-Nan speech (Taiwanese). It takes the syllable as the basic synthesis unit and constructs statistical models for syllable-initial duration, syllable-final duration, inter-syllable pause duration, syllable pitch contour, and syllable log-energy level. In training, the prosodic parameters are normalized by these statistical models, and the results are used to train an RNN prosody synthesizer. In synthesis, the RNN outputs are denormalized by the same statistical models to generate all prosodic parameters required by the TTS system. The advantage of the approach is that the statistical models account for some of the affecting factors, relieving the RNN prosody synthesizer of the need to model them.
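
The normalize-then-denormalize idea can be illustrated with a minimal sketch. The abstract does not specify the form of the statistical models, so the code below assumes the simplest possible stand-in: a per-class mean and standard deviation for one prosodic parameter. The function names, the class labels, and the duration values are all hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_stats(values, labels):
    """Estimate a mean and standard deviation per class label.

    Assumption: a per-class Gaussian summary stands in for the paper's
    statistical models, whose actual form the abstract does not give.
    """
    values = np.asarray(values, dtype=float)
    label_arr = np.asarray(labels)
    stats = {}
    for lab in sorted(set(labels)):
        v = values[label_arr == lab]
        std = v.std()
        # Guard against a zero spread so normalization never divides by zero.
        stats[lab] = (v.mean(), std if std > 0 else 1.0)
    return stats

def normalize(values, labels, stats):
    """Training side: remove the class-dependent mean and scale."""
    return np.array([(v - stats[l][0]) / stats[l][1]
                     for v, l in zip(values, labels)])

def denormalize(z, labels, stats):
    """Synthesis side: restore the class-dependent mean and scale."""
    return np.array([x * stats[l][1] + stats[l][0]
                     for x, l in zip(z, labels)])

# Hypothetical syllable-final durations (ms) with a hypothetical class label.
durations = [100.0, 120.0, 200.0, 240.0]
classes = ["a", "a", "b", "b"]

stats = fit_stats(durations, classes)
targets = normalize(durations, classes, stats)    # what a model would be trained on
restored = denormalize(targets, classes, stats)   # what synthesis would emit
```

In this sketch the class-dependent variation is absorbed by `stats`, so a downstream model only has to predict the residual, normalized value. That mirrors the division of labor the abstract describes, without claiming to reproduce the authors' models.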



Bibliographic reference. Chen, Sin-Horng / Ho, Chen-Chung (2000): "A hybrid statistical/RNN approach to prosody synthesis for Taiwanese TTS", in ICSLP-2000, vol. 1, 613-616.