Model-Based Parametric Prosody Synthesis with Deep Neural Network

Hao Liu, Heng Lu, Xu Shao, Yi Xu

Conventional statistical parametric speech synthesis (SPSS) captures only frame-wise acoustic observations, computing probability densities at the HMM state level to build statistical acoustic models combined with decision trees. It is therefore a purely data-driven statistical approach, with no explicit integration of the articulatory mechanisms identified in speech production research. The present study explores an alternative paradigm, model-based parametric prosody synthesis (MPPS), which integrates the dynamic mechanisms of human speech production as a core component of F0 generation. In this paradigm, contextual variation in prosody is processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. Here the motor model is target approximation (TA), which generates syllable-sized F0 contours from only three motor parameters, each associated with linguistic functions. In this study, we simulate this two-stage process by linking the TA model to a deep neural network (DNN), which learns the "linguistic-motor" mapping given the "motor-acoustic" mapping provided by TA-based syllable-wise F0 production. The proposed prosody modeling system outperforms the HMM-based baseline system in both objective and subjective evaluations.
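The "motor-acoustic" stage described above can be illustrated with a minimal sketch of quantitative target approximation (qTA), in which F0 asymptotically approaches a linear pitch target via a third-order critically damped system. The three motor parameters are the target's slope, its height, and the strength (rate) of approximation. The parameter names (`m`, `b`, `lam`) and the initial-state handling below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def qta_f0(m, b, lam, duration, x0, v0=0.0, a0=0.0, n=100):
    """Sketch of a qTA-style syllable-sized F0 contour.

    m, b       -- slope and height of the linear pitch target T(t) = m*t + b
    lam        -- strength of target approximation (larger = faster approach)
    duration   -- syllable duration in seconds
    x0, v0, a0 -- F0, velocity, and acceleration at syllable onset, carried
                  over from the preceding syllable's offset for continuity
    """
    t = np.linspace(0.0, duration, n)
    # Transient coefficients chosen so that x(0) = x0, x'(0) = v0, x''(0) = a0
    c1 = x0 - b
    c2 = v0 + c1 * lam - m
    c3 = (a0 + 2.0 * lam * c2 - lam**2 * c1) / 2.0
    # Target term plus exponentially decaying transient (critically damped)
    return (m * t + b) + (c1 + c2 * t + c3 * t**2) * np.exp(-lam * t)
```

With a strong approximation rate, the contour starts at the onset state and converges to the target by syllable offset, e.g. from 0 semitones toward a static target of -2 semitones over 0.3 s.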

DOI: 10.21437/Interspeech.2016-1325

Cite as

Liu, H., Lu, H., Shao, X., Xu, Y. (2016) Model-Based Parametric Prosody Synthesis with Deep Neural Network. Proc. Interspeech 2016, 2313-2317.

@inproceedings{liu16_interspeech,
  author={Hao Liu and Heng Lu and Xu Shao and Yi Xu},
  title={Model-Based Parametric Prosody Synthesis with Deep Neural Network},
  booktitle={Interspeech 2016},
  year={2016},
  pages={2313--2317},
  doi={10.21437/Interspeech.2016-1325}
}