11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

An HMM Trajectory Tiling (HTT) Approach to High Quality TTS

Yao Qian (1), Zhi-Jie Yan (1), Yijian Wu (2), Frank K. Soong (1), Xin Zhuang (1), Shengyi Kong (1)

(1) Microsoft Research, China
(2) Microsoft, China

State-of-the-art HMM-based speech synthesis can produce highly intelligible speech, but the output still carries an intrinsic vocoded quality because of its simple excitation model. In this paper, we propose a new HMM trajectory tiling (HTT) approach to high-quality TTS. The trajectory generated by the refined HMM is used to guide the search for the closest waveform segment “tiles”, which are then used to render highly intelligible and natural-sounding speech. Normalized distances between the HMM trajectory and those of waveform unit candidates are used to construct a unit sausage. Normalized cross-correlation is then used to find the best unit sequence in the sausage; this sequence provides the segment tiles that most closely track the HMM trajectory guide. Tested on two British English databases, our approach renders natural-sounding speech without sacrificing the high intelligibility achieved by HMM-based TTS. These results are confirmed subjectively by the corresponding AB preference and intelligibility tests.
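The two-stage search summarized above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the per-dimension `scale` used for normalization, the `beam` width, the dictionary fields `feat`, `head`, and `tail`, and the use of zero-lag cross-correlation over fixed-length boundary windows are all assumptions made for illustration. Stage one prunes the candidates at each position by normalized distance to the guide trajectory to form the sausage; stage two runs a Viterbi search that maximizes the summed normalized cross-correlation between the tail of each unit and the head of its successor.

```python
import math

def normalized_distance(guide, cand, scale):
    # Euclidean distance with each dimension divided by an assumed scale
    # (for example a per-dimension standard deviation).
    return math.sqrt(sum(((g - c) / s) ** 2 for g, c, s in zip(guide, cand, scale)))

def build_sausage(guide_traj, candidates, scale, beam=2):
    # For each guide position, keep the `beam` candidates closest to the guide.
    sausage = []
    for guide, cands in zip(guide_traj, candidates):
        ranked = sorted(cands, key=lambda u: normalized_distance(guide, u["feat"], scale))
        sausage.append(ranked[:beam])
    return sausage

def ncc(x, y):
    # Zero-lag normalized cross-correlation of two equal-length sample windows.
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den > 0 else 0.0

def best_path(sausage):
    # Viterbi search: maximize the summed NCC between the tail window of one
    # unit and the head window of the next unit along the path.
    scores = [0.0] * len(sausage[0])
    back = []
    for prev, cur in zip(sausage, sausage[1:]):
        new_scores, ptrs = [], []
        for unit in cur:
            options = [scores[i] + ncc(p["tail"], unit["head"]) for i, p in enumerate(prev)]
            best = max(range(len(options)), key=options.__getitem__)
            new_scores.append(options[best])
            ptrs.append(best)
        scores = new_scores
        back.append(ptrs)
    # Trace back from the best-scoring final unit.
    idx = max(range(len(scores)), key=scores.__getitem__)
    path = [idx]
    for ptrs in reversed(back):
        idx = ptrs[idx]
        path.append(idx)
    path.reverse()
    return [sausage[t][i] for t, i in enumerate(path)]
```

In this sketch the guide trajectory decides which units may enter the sausage, while waveform continuity alone decides which of the surviving units are concatenated.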

