In HMM-based TTS, the segmental quality of synthesized speech is quite acceptable, but intonation, especially at the sentence level, tends to be somewhat bland. The maximum likelihood (ML) criterion used in HMM training and parameter trajectory generation is partially responsible for this blandness. In addition, the F0 trajectory generated in this way has a smaller dynamic range than that of natural speech, so the synthesized speech does not sound lively. We propose to use multiple additive regression trees, a gradient-based tree-boosting algorithm, to produce a more natural F0 trajectory. The trees are trained in successive stages to minimize the squared error between natural and predicted F0 values. Additive tree modeling is integrated with MSD-HMM, which is an ideal model for characterizing the partially continuous (voiced/unvoiced) F0 contour. Experimental results in both Mandarin and English TTS trials show that the proposed approach not only increases the dynamic range of the generated F0 trajectory but also improves other objective measures (RMSE, correlation coefficient, voiced/unvoiced swapping errors) and subjective quality measures.
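The stage-wise training described in the abstract can be illustrated with a minimal sketch of least-squares gradient boosting. This is not the authors' implementation: the single scalar input feature, the depth-1 trees (stumps), the shrinkage value, the toy F0 values, and all function names are assumptions made for illustration, and the sketch covers voiced frames only, leaving out the MSD-HMM integration.

```python
# Minimal sketch of stage-wise additive regression trees (least-squares boosting).
# Assumptions: one hypothetical scalar feature per frame, depth-1 trees (stumps),
# voiced frames only. Not the setup used in the paper.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def fit_additive_trees(xs, ys, n_stages=50, shrinkage=0.1):
    """Fit stumps in successive stages, each one to the current residuals."""
    base = sum(ys) / len(ys)  # stage 0: constant prediction (the mean)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_stages):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + shrinkage * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, shrinkage

def predict(model, x):
    base, stumps, shrinkage = model
    return base + shrinkage * sum(predict_stump(s, x) for s in stumps)

def rmse(model, xs, ys):
    return (sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)) ** 0.5

# Toy data: hypothetical frame index -> F0 in Hz (made-up values).
toy_x = list(range(10))
toy_f0 = [120, 125, 140, 160, 180, 175, 150, 130, 115, 110]
model = fit_additive_trees(toy_x, toy_f0)
print("training RMSE:", rmse(model, toy_x, toy_f0))
```

With zero stages the model predicts only the mean F0, which is a flat contour. Each added stage fits the remaining residual, so the predicted contour moves away from the mean and its range on the training data widens, which is the intuition behind using boosting to counter an over-smoothed trajectory.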
Bibliographic reference. Qian, Yao / Liang, Hui / Soong, Frank K. (2008): "Generating natural F0 trajectory with additive trees", In INTERSPEECH-2008, 2126-2129.