11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

A Hierarchical F0 Modeling Method for HMM-Based Speech Synthesis

Ming Lei (1), Yijian Wu (2), Frank K. Soong (3), Zhen-Hua Ling (1), Lirong Dai (1)

(1) University of Science & Technology of China, China
(2) Microsoft, China
(3) Microsoft Research, China

Conventional state-based F0 modeling in HMM-based speech synthesis systems is good at capturing micro-prosodic features, but has difficulty characterizing long-term pitch patterns directly. This paper presents a hierarchical F0 modeling method to address this issue. In this method, different F0 models are used to model the pitch patterns of different prosodic layers (including state, phone, syllable, word, etc.) and are combined with an additive structure. In model training, the F0 model for each layer is first initialized using the residual between the original F0s and the F0s generated from the other layers as training data; the F0 models of all layers are then re-estimated simultaneously under a minimum generation error (MGE) training framework. We investigate the effectiveness of hierarchical F0 modeling with different layer settings. Experimental results show that the proposed hierarchical F0 modeling method significantly outperforms the conventional state-based F0 modeling method.
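
The additive structure and the residual-based initialization described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: each layer here is a plain lookup table of mean F0 per unit label rather than an HMM-based model, the simultaneous MGE re-estimation step is omitted, and the names `fit_layer`, `generate`, `init_hierarchy`, and `rmse`, along with the toy F0 values and labels, are hypothetical.

```python
import math
from collections import defaultdict


def fit_layer(residual, labels):
    """Fit one layer as the mean residual per unit label.

    Assumption: a per-label mean stands in for the layer's real F0 model.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(residual, labels):
        sums[label] += value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def generate(models, layer_labels):
    """Additive combination: per-frame sum of every fitted layer's output."""
    n_frames = len(next(iter(layer_labels.values())))
    return [
        sum(models[name].get(layer_labels[name][t], 0.0) for name in models)
        for t in range(n_frames)
    ]


def init_hierarchy(f0, layer_labels, order):
    """Initialize layers one by one on the residual left by the other layers."""
    models = {}
    for name in order:
        others = generate(models, layer_labels)  # all zeros for the first layer
        residual = [y - g for y, g in zip(f0, others)]
        models[name] = fit_layer(residual, layer_labels[name])
    return models


def rmse(target, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(target)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(target, predicted)) / n)


# Toy data (hypothetical): eight voiced frames, two words, two states per word.
f0 = [100.0, 110.0, 120.0, 130.0, 200.0, 210.0, 220.0, 230.0]
layer_labels = {
    "word": ["w1"] * 4 + ["w2"] * 4,
    "state": ["s1", "s1", "s2", "s2"] * 2,
}

word_only = init_hierarchy(f0, layer_labels, order=["word"])
word_and_state = init_hierarchy(f0, layer_labels, order=["word", "state"])

print(rmse(f0, generate(word_only, layer_labels)))
print(rmse(f0, generate(word_and_state, layer_labels)))
```

On this toy data the word layer captures the coarse level of each word (115 and 215), and the state layer then models the offsets that remain (-10 and +10), so the error of the summed contour drops from about 11.2 with the word layer alone to 5.0 with both layers. In the method the abstract describes, this initialization would be followed by joint re-estimation of all layers under the MGE criterion, which the sketch does not attempt.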


Bibliographic reference.  Lei, Ming / Wu, Yijian / Soong, Frank K. / Ling, Zhen-Hua / Dai, Lirong (2010): "A hierarchical F0 modeling method for HMM-based speech synthesis", In INTERSPEECH-2010, 2170-2173.