9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Duration Refinement by Jointly Optimizing State and Longer Unit Likelihood

Boyang Gao, Yao Qian, Zhizheng Wu, Frank K. Soong

Microsoft Research Asia, China

We refine the duration model in HMM-based TTS by extending the work of Wu [1]. The model is refined by jointly maximizing the duration likelihoods of state, phone and syllable units. Both Gaussian and gamma distributions are employed. In synthesis, the state durations are generated by the same joint optimization procedure. By considering the durations of states and longer units jointly, the accumulation of errors in the estimated state durations is regulated in the optimization procedure. Experiments on Mandarin and English databases show that the refined model yields more accurate duration predictions than the baseline state duration model. The phone RMSE is reduced by 2.2 ms and 1.1 ms, or 11% and 5.6%, in English and Mandarin synthesis, respectively. A perceptual test on synthesized English and Mandarin speech further confirms that the refined duration model outperforms the baseline system.
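The abstract does not state the objective function, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes Gaussian duration models at just two levels (states and one enclosing phone), treats durations as continuous values in frames, and leaves out the syllable level and the gamma case. Under those assumptions, maximizing the sum of the state log-likelihoods and the phone log-likelihood has a closed form: each state mean is shifted in proportion to its variance. The function and parameter names are hypothetical.

```python
def refine_state_durations(state_means, state_vars, phone_mean, phone_var):
    """Jointly maximize Gaussian state and phone duration log-likelihoods.

    Assumed objective (illustrative, not taken from the paper):
        -sum_i (d_i - m_i)^2 / (2 * v_i) - (sum_i d_i - M)^2 / (2 * V)
    where m_i, v_i are state means/variances and M, V are the phone
    mean/variance. Setting the gradient to zero gives
        d_i = m_i + v_i * (M - sum(m)) / (V + sum(v)).
    """
    total_mean = sum(state_means)
    total_var = sum(state_vars)
    # Shared correction factor: the mismatch between the phone-level mean
    # and the summed state means, scaled by the combined variance.
    rho = (phone_mean - total_mean) / (phone_var + total_var)
    # States with larger variance absorb more of the correction.
    return [m + v * rho for m, v in zip(state_means, state_vars)]


# Hypothetical numbers: three states whose means sum to 12 frames,
# while the phone-level model expects 15 frames.
durations = refine_state_durations([3.0, 5.0, 4.0], [1.0, 4.0, 1.0], 15.0, 6.0)
print(durations, sum(durations))
```

In this toy case the summed state duration moves from 12 frames toward the phone-level mean of 15 without reaching it, which is the sense in which a longer-unit term can keep state-level errors from accumulating.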


  1. Y. Wu and R. Wang, "HMM-Based Trainable Speech Synthesis for Chinese," Journal of Chinese Information Processing, vol. 20, pp. 75-81, 2006.


Bibliographic reference.  Gao, Boyang / Qian, Yao / Wu, Zhizheng / Soong, Frank K. (2008): "Duration refinement by jointly optimizing state and longer unit likelihood", In INTERSPEECH-2008, 2266-2269.