In conventional HMM-based TTS, the microstructure of the F0 contour is modeled at the state level via a (clustered) decision tree. However, decision-tree-based state-level modeling has difficulty capturing the long-term structure of speech prosody, for example at the intonation phrase level, because of its greedy search and because training data are usually too sparse to cover the large, combinatorial number of typically long prosodic contexts in a phrase or sentence. In this study, we adopt a finite number of Discrete Cosine Transform (DCT) coefficients to capture the smoothed trend of the F0 patterns of intonation phrases and to normalize the effect of variable phrase duration. We then use the DCT-smoothed contours to model phrase intonation with a decision tree or a deep neural network (DNN). The remaining detail, the residual F0, is then accommodated by training a state-level model in a Hierarchical Prosody Model (HPM) framework. The phrase-level models are used to predict the intonation phrase F0 contours, which are then combined with the predicted state-level F0 residuals to produce the final F0 contours. Both the decision-tree-based and the DNN-based F0 predictors, when working together with the state-level F0 residual predictors, outperform the standard state-level HMM F0 models.
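The decomposition described above, a phrase-level contour kept as a fixed number of DCT coefficients plus a residual left for the state-level model, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the orthonormal DCT-II convention, the choice of five coefficients, and the toy contour are all assumptions made for illustration, and the input is assumed to be a continuous (for example, interpolated log-F0) contour for one intonation phrase.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a sequence (assumed convention)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct2(coeffs, n):
    """Inverse of dct2 at n points; coefficients beyond len(coeffs) are taken as zero."""
    out = []
    for t in range(n):
        v = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            v += scale * c * math.cos(math.pi * (t + 0.5) * k / n)
        out.append(v)
    return out

def smooth_and_residual(f0, num_coeffs):
    """Split a phrase F0 contour into a DCT-smoothed trend and a residual.

    Returns (coeffs, smooth, residual), where coeffs has a fixed length
    num_coeffs regardless of how many frames the phrase contains.
    """
    coeffs = dct2(f0)[:num_coeffs]      # fixed-size phrase-level parameters
    smooth = idct2(coeffs, len(f0))     # smoothed phrase-level trend
    residual = [a - b for a, b in zip(f0, smooth)]  # detail for a lower-level model
    return coeffs, smooth, residual

# Toy phrase contour (hypothetical values): a slow arch plus frame-level jitter.
f0 = [5.0 + 0.3 * math.sin(math.pi * t / 19) + 0.02 * (-1) ** t for t in range(20)]
coeffs, smooth, residual = smooth_and_residual(f0, 5)
```

In this sketch the truncated coefficient vector plays the role of the phrase-level target that a decision tree or DNN would predict, and the residual is what would remain for the state-level model. Adding the smoothed trend and the residual recovers the original contour exactly.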
Bibliographic reference. Yin, Xiang / Lei, Ming / Qian, Yao / Soong, Frank K. / He, Lei / Ling, Zhen-Hua / Dai, Li-Rong (2014): "Modeling DCT parameterized F0 trajectory at intonation phrase level with DNN or decision tree", In INTERSPEECH-2014, 2273-2277.