14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

A Targets-Based Superpositional Model of Fundamental Frequency Contours Applied to HMM-Based Speech Synthesis

Jinfu Ni, Yoshinori Shiga, Chiori Hori, Yutaka Kidawara

NICT, Japan

The superpositional model of fundamental frequency (F0) contours, as exemplified by the Fujisaki model, represents the F0 movements of speech well while keeping a clear relation to the linguistic information of utterances. HMM-based speech synthesis is therefore expected to benefit from the superpositional approach. In this paper, a targets-based superpositional model is proposed in the light of the Fujisaki model. Both accent and phrase components are parameterized by separately defined low and high targets, which allow flexible interaction between the two components. Owing to this flexible interaction, the new method consistently handles complex F0 movements such as low digging, varying declination, and final lowering simply by adjusting parameter values. This facilitates the extraction of model parameters from observed F0 contours, which is one of the major problems preventing the use of the Fujisaki model. Extraction of the target parameters is evaluated on a Japanese speech corpus, and the F0 contours generated by the model are used for HMM training in place of the original contours. A listening test of the synthetic speech indicates significant improvements in speech quality. Micro-prosodic effects are also investigated; the results show that adding micro-prosody to the generated F0 contours does not significantly improve speech quality.
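For readers unfamiliar with the superpositional idea, the following is a minimal sketch of the classic Fujisaki model that the abstract takes as its starting point, not of the targets-based model proposed in the paper, whose low and high target parameterization is not specified here. In the classic formulation, log F0 is a baseline value plus a sum of phrase components (impulse responses) and accent components (step responses). The function names, the command timings, and the constants `alpha`, `beta`, `gamma` and the 120 Hz baseline are illustrative assumptions only.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism (zero for t < 0)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_log_f0(t, base_hz, phrases, accents,
                    alpha=3.0, beta=20.0, gamma=0.9):
    """Log F0 at time t (seconds) as baseline + phrase + accent components.

    phrases: list of (onset, magnitude) phrase commands
    accents: list of (onset, offset, amplitude) accent commands
    """
    log_f0 = math.log(base_hz)
    for onset, magnitude in phrases:
        log_f0 += magnitude * phrase_response(t - onset, alpha)
    for onset, offset, amplitude in accents:
        log_f0 += amplitude * (accent_response(t - onset, beta, gamma)
                               - accent_response(t - offset, beta, gamma))
    return log_f0


# Hypothetical commands: one phrase command at 0 s, one accent from 0.3 s to 0.6 s.
phrases = [(0.0, 0.5)]
accents = [(0.3, 0.6, 0.4)]

# F0 contour in Hz sampled every 10 ms over 1.5 s.
contour = [math.exp(fujisaki_log_f0(i * 0.01, 120.0, phrases, accents))
           for i in range(150)]
```

Because the components add in the log domain, each command contributes independently to the contour, which is what keeps the link to linguistic units clear. It is also why recovering the commands from an observed contour is an inverse problem, the difficulty the abstract refers to.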


Bibliographic reference.  Ni, Jinfu / Shiga, Yoshinori / Hori, Chiori / Kidawara, Yutaka (2013): "A targets-based superpositional model of fundamental frequency contours applied to HMM-based speech synthesis", In INTERSPEECH-2013, 1052-1056.