Appropriate phoneme durations are essential for high-quality speech synthesis. In hidden Markov model-based text-to-speech (HMM-TTS), durations are typically modeled statistically using state duration probability distributions and duration prediction for unseen contexts. Use of rich context features enables synthesis without high-level linguistic knowledge. In this paper we analyze the accuracy of state duration modeling against phone duration modeling using simple prediction techniques. In addition to the decision tree-based techniques, regression techniques for rich context features with high collinearity are discussed and evaluated.
Cite as: Silén, H., Helander, E., Nurminen, J., Gabbouj, M. (2010) Analysis of duration prediction accuracy in HMM-based speech synthesis. Proc. Speech Prosody 2010, paper 510.
@inproceedings{silen10_speechprosody,
  author={Hanna Silén and Elina Helander and Jani Nurminen and Moncef Gabbouj},
  title={{Analysis of duration prediction accuracy in HMM-based speech synthesis}},
  year=2010,
  booktitle={Proc. Speech Prosody 2010},
  pages={paper 510}
}