ISCA Archive Interspeech 2008

Duration refinement by jointly optimizing state and longer unit likelihood

Boyang Gao, Yao Qian, Zhizheng Wu, Frank K. Soong

We refine the duration model in HMM-based TTS by extending the work of Wu [1]. The model is refined by jointly maximizing the duration likelihoods of state, phone and syllable units. Both Gaussian and gamma distributions are employed. In synthesis, the state durations are generated by the same joint optimization procedure. Because the durations of states and longer units are considered jointly, the optimization regulates the accumulation of errors in the estimated state durations. Experiments on Mandarin and English databases show that the refined model yields more accurate duration predictions than the baseline state duration model. Phone RMSE is reduced by 2.2 ms and 1.1 ms, or 11% and 5.6%, in English and Mandarin synthesis, respectively. A perceptual test on synthesized English and Mandarin speech further confirms that the refined duration model outperforms the baseline system.
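To make the idea of joint optimization concrete, the following is a minimal sketch for the simplest case only: Gaussian state durations combined with a single Gaussian phone-level duration. It is not the paper's formulation, which also covers syllable units and gamma distributions. The function name, the `weight` parameter and the example numbers are all assumptions made for illustration. Under these assumptions, maximizing the sum of the state log-likelihoods plus the weighted phone log-likelihood has a closed-form solution in which each state mean is shifted in proportion to its variance.

```python
def refine_state_durations(state_means, state_vars, phone_mean, phone_var, weight=1.0):
    """Jointly maximize Gaussian state and phone duration log-likelihoods.

    Hypothetical objective (illustrative, not taken from the paper):
        sum_i -(d_i - m_i)^2 / (2 * v_i)  -  weight * (sum_i d_i - M)^2 / (2 * V)
    Setting the gradient to zero gives d_i = m_i + rho * v_i.
    """
    total_mean = sum(state_means)
    total_var = sum(state_vars)
    # rho spreads the mismatch between the phone mean and the summed state means;
    # states with larger variance absorb more of the correction.
    rho = weight * (phone_mean - total_mean) / (phone_var + weight * total_var)
    return [m + rho * v for m, v in zip(state_means, state_vars)]


# Made-up example values, in frames: three states of one phone.
durations = refine_state_durations(
    state_means=[3.0, 5.0, 4.0],
    state_vars=[1.0, 4.0, 1.0],
    phone_mean=15.0,
    phone_var=6.0,
)
print(durations, sum(durations))
```

In this sketch the refined total duration falls between the sum of the state means and the phone-level mean, which illustrates how a longer-unit term can keep errors in the individual state durations from accumulating.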

[1] Y. Wu and R. Wang, "HMM-Based Trainable Speech Synthesis for Chinese," Journal of Chinese Information Processing, vol. 20, pp. 75-81, 2006.

doi: 10.21437/Interspeech.2008-556

Cite as: Gao, B., Qian, Y., Wu, Z., Soong, F.K. (2008) Duration refinement by jointly optimizing state and longer unit likelihood. Proc. Interspeech 2008, 2266-2269, doi: 10.21437/Interspeech.2008-556

  author={Boyang Gao and Yao Qian and Zhizheng Wu and Frank K. Soong},
  title={{Duration refinement by jointly optimizing state and longer unit likelihood}},
  booktitle={Proc. Interspeech 2008},