Hybrid speech synthesis, which combines hidden Markov models with unit selection, has become a mainstream approach in state-of-the-art TTS systems. However, the traditional Viterbi algorithm minimizes a cost function globally, so the search can end up selecting some poor-quality units with large local errors, which listeners can hardly tolerate. In Mandarin and many other languages, the naturalness of regions of consecutive voiced speech segments (CVS) is especially important to the overall quality of the synthetic speech. Consequently, in this paper we propose a hierarchical Viterbi algorithm that involves two rounds of Viterbi search: one for the sub-paths within the CVS regions, and the other for the utterance path connecting all the sub-paths. We define a CVS region as a region formed by two or more voiced phones over which the observed pitch takes continuous values. Subjective evaluations suggest that, in a Mandarin hybrid speech synthesis system, the hierarchical Viterbi algorithm outperforms the traditional algorithm in both the naturalness and the speech quality of the synthetic speech.
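The two-round search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the cost functions or how sub-paths are passed to the second round, so this sketch assumes that round one keeps a single best sub-path per CVS region and that round two runs an ordinary Viterbi search with those positions fixed. The names `viterbi`, `hierarchical_viterbi`, `target_cost`, `join_cost`, and the toy cost table are all hypothetical.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Unit = Hashable  # hypothetical: any hashable id for a candidate unit
TargetCost = Callable[[int, Unit], float]  # (position, unit) -> cost
JoinCost = Callable[[Unit, Unit], float]   # (previous unit, unit) -> cost


def viterbi(cands: Sequence[Sequence[Unit]], target_cost: TargetCost,
            join_cost: JoinCost, offset: int = 0) -> Tuple[float, List[Unit]]:
    """Standard Viterbi search: minimize summed target and join costs.

    cands[i] holds the (non-empty) candidate list for position offset + i.
    """
    # best[j] = (cost, path) of the cheapest path ending in cands[i][j]
    best = [(target_cost(offset, u), [u]) for u in cands[0]]
    for i in range(1, len(cands)):
        new_best = []
        for u in cands[i]:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda t: t[0],
            )
            new_best.append((cost + target_cost(offset + i, u), path + [u]))
        best = new_best
    return min(best, key=lambda t: t[0])


def hierarchical_viterbi(cands, cvs_regions, target_cost, join_cost):
    """Two rounds of Viterbi; cvs_regions are (start, end) with end exclusive."""
    restricted = [list(c) for c in cands]
    # Round 1 (assumed form): best sub-path inside each CVS region.
    for start, end in cvs_regions:
        _, sub_path = viterbi(cands[start:end], target_cost, join_cost, start)
        for k, unit in enumerate(sub_path):
            restricted[start + k] = [unit]
    # Round 2: utterance-level search that connects the fixed sub-paths.
    return viterbi(restricted, target_cost, join_cost)


# Toy data (invented for illustration): positions 1-2 form one CVS region.
JOIN = {("a0", "b0"): 0.0, ("a0", "b1"): 5.0, ("a1", "b0"): 5.0,
        ("a1", "b1"): 4.0, ("b0", "c0"): 2.0, ("b0", "c1"): 3.0,
        ("b1", "c0"): 0.5, ("b1", "c1"): 2.0}
cands = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]


def target_cost(position, unit):
    return 0.0  # toy assumption: all target costs are zero


def join_cost(prev, unit):
    return JOIN[(prev, unit)]


global_cost, global_path = viterbi(cands, target_cost, join_cost)
hier_cost, hier_path = hierarchical_viterbi(cands, [(1, 3)], target_cost, join_cost)
print(global_path, global_cost)
print(hier_path, hier_cost)
```

In this toy case the global search accepts a join cost of 2.0 inside the CVS region because it lowers the total, whereas the two-round search keeps the in-region join at 0.5 and pays a higher cost at the region boundary. A fuller variant might pass several candidate sub-paths per region to the second round instead of one.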
Bibliographic reference. Zhang, Ran / Wen, Zhengqi / Tao, Jianhua / Li, Ya / Liu, Bing / Lou, Xiaoyan (2014): "A hierarchical viterbi algorithm for Mandarin hybrid speech synthesis system", In INTERSPEECH-2014, 795-799.