HMM-based text-to-speech (TTS) systems can produce high-quality synthetic speech with flexible modeling of spectral and prosodic parameters. However, the prosodic features generated by HMM-based speech synthesis, such as F0 and duration trajectories, are often excessively smoothed and lack prosodic variance. In HMM-based TTS, durations are typically modeled statistically using state duration probability distributions, and durations for unseen contexts are predicted without high-level linguistic knowledge. The F0 trajectory is generated by multi-space probability distribution HMMs (MSD-HMMs) as a weighted bias term. In this approach, discrete distributions model the voiced/unvoiced (VU) decision, and continuous Gaussian distributions model F0 within voiced regions. Because F0 values are assumed to be undefined in unvoiced regions, and because of the special structure of the MSD-HMM, the generated F0 values are limited in accuracy. In this paper, to improve prosodic feature generation over the standard HMM framework, an F0 generation process model is used to re-estimate F0 values in regions with pitch tracking errors as well as in unvoiced regions. Prior knowledge of the VU status of each Mandarin phoneme is imposed and used for the VU decision. We also design a set of syntactic features to improve Mandarin phoneme duration prediction.
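As an illustration of what an F0 generation process model computes, the sketch below evaluates a contour in the style of the command-response (Fujisaki) model, in which log F0 is a baseline plus phrase and tone/accent components. The abstract does not give the formulation, so the function names, the constants `alpha`, `beta`, `gamma`, and all command values here are assumptions chosen for illustration, not the authors' settings. Because the model is continuous in time, it yields F0 values in unvoiced regions too, which is the property the abstract relies on.

```python
import math

def phrase_response(t, alpha=3.0):
    """Phrase component: impulse response of a critically damped 2nd-order system."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)

def tone_response(t, beta=20.0, gamma=0.9):
    """Tone/accent component: step response, clipped at the ceiling gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def f0_contour(times, base_f0, phrase_cmds, tone_cmds):
    """Return F0 (Hz) at each time in `times` (seconds).

    phrase_cmds: list of (onset, magnitude) impulses.
    tone_cmds:   list of (onset, offset, amplitude) pedestal commands;
                 negative amplitudes are allowed here (an assumption
                 made to represent falling Mandarin tone shapes).
    """
    contour = []
    for t in times:
        log_f0 = math.log(base_f0)
        for onset, magnitude in phrase_cmds:
            log_f0 += magnitude * phrase_response(t - onset)
        for onset, offset, amplitude in tone_cmds:
            log_f0 += amplitude * (tone_response(t - onset) - tone_response(t - offset))
        contour.append(math.exp(log_f0))
    return contour

# Hypothetical commands: 2 s of speech sampled every 5 ms.
times = [i * 0.005 for i in range(400)]
f0 = f0_contour(
    times,
    base_f0=120.0,
    phrase_cmds=[(0.0, 0.5)],
    tone_cmds=[(0.3, 0.6, 0.4), (0.9, 1.2, -0.2)],
)
```

In a re-estimation setting, the command parameters would be fitted to the reliable voiced frames, and the resulting continuous contour would then supply F0 values where pitch tracking failed or the frame was unvoiced; the fitting step is not shown here.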
Index Terms: Mandarin speech synthesis, F0 generation, duration modeling, generation process model, HMM-based TTS
Cite as: Wang, M., Wen, M., Saito, D., Hirose, K., Minematsu, N. (2010) Improved generation of prosodic features in HMM-based Mandarin speech synthesis. Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7), 359-364
@inproceedings{wang10b_ssw, author={Miaomiao Wang and Miaomiao Wen and Daisuke Saito and Keikichi Hirose and Nobuaki Minematsu}, title={{Improved generation of prosodic features in HMM-based Mandarin speech synthesis}}, year=2010, booktitle={Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7)}, pages={359--364} }