Conventional state-based F0 modeling in HMM-based speech synthesis systems is good at capturing micro-prosodic features, but has difficulty characterizing long-term pitch patterns directly. This paper presents a hierarchical F0 modeling method to address this issue. In this method, different F0 models are used to model the pitch patterns of different prosodic layers (including state, phone, syllable, word, etc.) and are combined in an additive structure. In model training, the F0 model for each layer is first initialized using the residual between the original F0s and the F0s generated from the other layers as training data; the F0 models of all layers are then re-estimated simultaneously under a minimum generation error (MGE) training framework. We investigate the effectiveness of hierarchical F0 modeling with different layer settings. Experimental results show that the proposed hierarchical F0 modeling method significantly outperforms the conventional state-based F0 modeling method.
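The additive structure and the residual-based initialization described above can be illustrated with a minimal toy sketch. It is not the authors' implementation: each layer is reduced to a per-unit mean instead of a context-clustered HMM, the MGE re-estimation step is not shown, and the function names (`fit_layer`, `generate`, `init_hierarchical`, `rmse`), the two-layer setup, and the log-F0 values are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_layer(residual, labels):
    """Toy layer model (assumption): the mean residual of each unit."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(residual, labels):
        sums[label] += value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def generate(models, layer_labels, n_frames):
    """Additive structure: sum the contribution of every trained layer."""
    out = [0.0] * n_frames
    for name, model in models.items():
        labels = layer_labels[name]
        for t in range(n_frames):
            out[t] += model.get(labels[t], 0.0)
    return out


def init_hierarchical(f0, layer_labels, order):
    """Initialize each layer on what the other layers leave unexplained."""
    n_frames = len(f0)
    models = {}
    for name in order:
        # F0 generated by the layers trained so far (zeros for the first one).
        others = generate(models, layer_labels, n_frames)
        residual = [f0[t] - others[t] for t in range(n_frames)]
        models[name] = fit_layer(residual, layer_labels[name])
    return models


def rmse(reference, predicted):
    """Root mean square error between two equal-length sequences."""
    total = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    return math.sqrt(total / len(reference))


# Hypothetical frame-level log-F0 values and per-frame unit labels.
f0 = [5.0, 5.2, 5.4, 5.6, 4.8, 4.6, 4.4, 4.2]
layer_labels = {
    "syllable": ["s1", "s1", "s1", "s1", "s2", "s2", "s2", "s2"],
    "state": ["a", "a", "b", "b", "c", "c", "d", "d"],
}

# Long-term layer first, then the finer layer on its residual.
models = init_hierarchical(f0, layer_labels, ["syllable", "state"])
syllable_only = generate({"syllable": models["syllable"]}, layer_labels, len(f0))
both_layers = generate(models, layer_labels, len(f0))

print("syllable-only RMSE:", rmse(f0, syllable_only))
print("two-layer RMSE:", rmse(f0, both_layers))
```

In this toy setup the syllable layer absorbs the overall pitch level of each syllable and the state layer models only the remaining local deviation. The joint re-estimation of all layers under the MGE criterion, which the method applies after this initialization, would replace the fixed per-unit means with parameters updated together.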
Bibliographic reference. Lei, Ming / Wu, Yijian / Soong, Frank K. / Ling, Zhen-Hua / Dai, Lirong (2010): "A hierarchical F0 modeling method for HMM-based speech synthesis", In INTERSPEECH-2010, 2170-2173.