Superpositional modeling of fundamental frequency contours for HMM-based speech synthesis

Keikichi Hirose, Hiroya Hashimoto, Daisuke Saito, Nobuaki Minematsu

Statistical parametric speech synthesis technologies, such as HMM-based and DNN-based ones, have attracted special attention from researchers because of their ability to generate speech in various voice qualities and styles. In these methods, all acoustic parameters (except durational ones) are handled frame by frame, which is not appropriate for prosodic features. Although the relation between adjacent frames is taken into account, this is not enough: prosodic features are related to words, phrases, sentences, and even paragraphs, and should be viewed over a wider time span. One possible way to handle these features well in the speech synthesis process is to model fundamental frequency (F0) movements and to apply the model's constraints. Among the several models of F0 contours, the generation process model of F0 contours is ideal for this purpose, since it represents the hierarchical structure of prosody well as a superposition of phrase and accent components while keeping a clear relationship with linguistic information. A method is developed that decomposes F0 contours into three layers based on the model and handles them as different streams in the HMM-based speech synthesis process. The advantage of the method is confirmed through objective and subjective evaluations. Issues of flexible control of prosody are also addressed.
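
The generation process model of F0 contours is commonly known as the Fujisaki model. As a minimal sketch of the superposition idea, assuming its standard formulation, the log-F0 contour can be written as a constant baseline plus phrase components (impulse responses) and accent components (step responses). The constants `ALPHA`, `BETA`, and `GAMMA` below are typical textbook values rather than values taken from the paper, and all function names and command values are hypothetical, chosen only for illustration.

```python
import math

# Assumed typical constants (not from the paper).
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling of the accent component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism; zero before onset."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, base_hz, phrases, accents):
    """Log F0 at time t as baseline + phrase components + accent components.

    phrases: list of (magnitude, onset_time)
    accents: list of (amplitude, onset_time, offset_time)
    """
    phrase_part = sum(a * phrase_response(t - t0) for a, t0 in phrases)
    accent_part = sum(
        a * (accent_response(t - t1) - accent_response(t - t2))
        for a, t1, t2 in accents
    )
    return math.log(base_hz) + phrase_part + accent_part


def f0_contour(times, base_hz, phrases, accents):
    """F0 values in Hz at the given time points (seconds)."""
    return [math.exp(log_f0(t, base_hz, phrases, accents)) for t in times]


# Hypothetical example: one phrase command and two accent commands.
times = [i * 0.01 for i in range(200)]
contour = f0_contour(
    times,
    base_hz=120.0,
    phrases=[(0.5, 0.0)],
    accents=[(0.4, 0.3, 0.6), (0.3, 0.9, 1.3)],
)
```

This sketch only synthesizes a contour from given commands. It does not reproduce the decomposition into three layers or the stream-wise HMM training described in the abstract.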

DOI: 10.21437/SpeechProsody.2016-158

Cite as

Hirose, K., Hashimoto, H., Saito, D., Minematsu, N. (2016) Superpositional modeling of fundamental frequency contours for HMM-based speech synthesis. Proc. Speech Prosody 2016, 771-775.

author={Keikichi Hirose and Hiroya Hashimoto and Daisuke Saito and Nobuaki Minematsu},
title={Superpositional modeling of fundamental frequency contours for HMM-based speech synthesis},
booktitle={Speech Prosody 2016},