This paper presents a rich context modeling approach to high-quality HMM-based speech synthesis. We first analyze the oversmoothing problem in conventional HMMs based on decision tree tying, and then propose to model the training speech tokens with rich context models. A special training procedure is adopted for reliable estimation of the rich context model parameters. In synthesis, a context-based pre-selection is followed by a search algorithm that determines the optimal rich context model sequence, which generates natural and crisp output speech. Experimental results show that spectral envelopes synthesized by the rich context models have crisper formant structures and evolve with richer details than those obtained by the conventional models. The improvement in speech quality is also perceived by listeners in a subjective preference test, in which 76% of the sentences synthesized using rich context modeling are preferred.
Bibliographic reference. Yan, Zhi-Jie / Qian, Yao / Soong, Frank K. (2009): "Rich context modeling for high quality HMM-based TTS", In INTERSPEECH-2009, 1755-1758.