We propose a sub-band speech synthesis approach for building a high-quality Text-to-Speech (TTS) system: a sample-based spectrum is used in the high-frequency band, and a spectrum generated by HMM-based TTS is used in the low-frequency band. Here, a sample-based spectrum is a spectrum selected from a phoneme database as the one most similar to the spectrum generated by HMM-based speech synthesis. The key idea is to compensate for the over-smoothing caused by statistical procedures by introducing a sample-based spectrum, especially in the high-frequency band. Listening test results show that the proposed method outperforms HMM-based speech synthesis in terms of clarity and is at the same level in terms of smoothness. In addition, a preference test comparing the proposed method, HMM-based speech synthesis, and waveform speech synthesis using 80 min of speech data shows that the proposed method is the most preferred.
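The two steps named in the abstract, selecting the database spectrum closest to the HMM-generated one and then combining the two bands, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `select_sample_spectrum` and `combine_subbands`, the use of Euclidean distance as the similarity measure, the hard splice at a single `cutoff_bin`, and the toy four-bin spectra. The abstract does not state the actual distance measure, band boundary, or spectral representation.

```python
import numpy as np


def select_sample_spectrum(generated, database):
    """Pick the database spectrum nearest to the generated one.

    Assumption: similarity is plain Euclidean distance over all bins.
    """
    distances = np.linalg.norm(database - generated, axis=1)
    return database[np.argmin(distances)]


def combine_subbands(generated, sample, cutoff_bin):
    """Keep the generated spectrum below cutoff_bin, the sample above it.

    Assumption: a hard splice at one frequency bin, with no cross-fade.
    """
    combined = generated.copy()
    combined[cutoff_bin:] = sample[cutoff_bin:]
    return combined


# Toy data: one generated frame and a three-entry "phoneme database".
generated = np.array([1.0, 2.0, 3.0, 4.0])
database = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 3.5, 4.5],
    [9.0, 9.0, 9.0, 9.0],
])

sample = select_sample_spectrum(generated, database)
combined = combine_subbands(generated, sample, cutoff_bin=2)
print(combined)
```

In this toy case the second database entry is selected, so the low two bins come from the generated frame and the high two bins from that entry. A practical system would likely operate frame by frame on a perceptually motivated spectral representation and smooth the band boundary, but those details would need to come from the paper itself.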
Bibliographic reference. Inai, Tadashi / Hara, Sunao / Abe, Masanobu / Ijima, Yusuke / Miyazaki, Noboru / Mizuno, Hideyuki (2015): "Sub-band text-to-speech combining sample-based spectrum with statistically generated spectrum", In INTERSPEECH-2015, 264-268.