We refine the duration model in HMM-based TTS by extending the work of Wu. The model is refined by jointly maximizing the duration likelihoods of state, phone and syllable units. Both Gaussian and gamma distributions are employed. In synthesis, the state durations are generated by the same joint optimization procedure. Because the durations of states and longer units are considered jointly, the accumulation of errors in the estimated state durations is regulated within the optimization. Experiments on Mandarin and English databases show that the refined model yields more accurate duration predictions than the baseline state duration model. The phone RMSE is reduced by 2.2 ms (11%) in English synthesis and by 1.1 ms (5.6%) in Mandarin synthesis. A perceptual test on synthesized English and Mandarin speech further confirms that the refined duration model outperforms the baseline system.
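To make the idea of joint optimization concrete, the following is a minimal sketch of a simplified case, not the procedure of the paper itself. It assumes only two levels (states and one phone), Gaussian duration distributions, and equal weighting of the two log-likelihood terms; the syllable level and the gamma distributions are left out. Under these assumptions, maximizing the sum of the state log-likelihoods and the phone log-likelihood (evaluated at the summed state durations) has a closed-form solution. The function name, parameter names and numeric values are hypothetical.

```python
def refine_state_durations(state_means, state_vars, phone_mean, phone_var):
    """Jointly maximize Gaussian state and phone duration log-likelihoods.

    Assumed objective (illustrative, not taken from the paper):
        sum_i log N(d_i; m_i, v_i) + log N(sum_i d_i; M, V)
    Setting the gradient to zero gives
        d_i = m_i + v_i * (M - sum(m)) / (V + sum(v)).
    """
    total_mean = sum(state_means)
    total_var = sum(state_vars)
    # Shared correction factor: how far the phone-level mean is from the
    # sum of the state means, scaled by the combined variance.
    rho = (phone_mean - total_mean) / (phone_var + total_var)
    # States with larger variance absorb more of the correction.
    return [m + v * rho for m, v in zip(state_means, state_vars)]


# Hypothetical values, in frames: three states and one phone-level model.
durations = refine_state_durations(
    state_means=[3.0, 5.0, 4.0],
    state_vars=[1.0, 4.0, 1.0],
    phone_mean=15.0,
    phone_var=6.0,
)
print(durations)       # [3.25, 6.0, 4.25]
print(sum(durations))  # 13.5
```

In this toy example the state means sum to 12 frames while the phone model prefers 15, so the jointly optimal total (13.5) lies between the two, and the high-variance middle state is stretched the most. This illustrates how the longer-unit term can keep state-level errors from accumulating.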
Bibliographic reference. Gao, Boyang / Qian, Yao / Wu, Zhizheng / Soong, Frank K. (2008): "Duration refinement by jointly optimizing state and longer unit likelihood", In INTERSPEECH-2008, 2266-2269.