16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

Sub-Band Text-to-Speech Combining Sample-Based Spectrum with Statistically Generated Spectrum

Tadashi Inai (1), Sunao Hara (1), Masanobu Abe (1), Yusuke Ijima (2), Noboru Miyazaki (2), Hideyuki Mizuno (2)

(1) Okayama University, Japan
(2) NTT Corporation, Japan

We propose a sub-band speech synthesis approach for building a high-quality Text-to-Speech (TTS) system: a sample-based spectrum is used in the high-frequency band, and a spectrum generated by HMM-based TTS is used in the low-frequency band. Here, a sample-based spectrum is a spectrum selected from a phoneme database so that it is the most similar to the spectrum generated by HMM-based speech synthesis. The key idea is to compensate for the over-smoothing caused by statistical procedures by introducing a sample-based spectrum, especially in the high-frequency band. Listening test results show that the proposed method outperforms HMM-based speech synthesis in terms of clarity and is at the same level in terms of smoothness. In addition, results of a preference test comparing the proposed method, HMM-based speech synthesis, and waveform speech synthesis using 80 min of speech data show that the proposed method is the most preferred.
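The sub-band idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the similarity measure, the unit of selection, or the crossover frequency, so the squared Euclidean distance over the full band, the per-frame selection, and the `split_bin` parameter are all assumptions made for illustration, as are the function names and the toy data.

```python
from typing import List, Sequence

Spectrum = List[float]  # log-magnitude values, one per frequency bin


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equally long spectra.

    Assumption: the abstract only says "most similar"; this measure is a stand-in.
    """
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_sample(generated: Spectrum, database: List[Spectrum]) -> Spectrum:
    """Return the database spectrum closest to the statistically generated one."""
    return min(database, key=lambda s: squared_distance(generated, s))


def combine_sub_bands(generated: Spectrum, sample: Spectrum, split_bin: int) -> Spectrum:
    """Take bins below split_bin from the generated spectrum and the rest from the sample.

    Assumption: split_bin is a hypothetical crossover point, not a value from the paper.
    """
    return generated[:split_bin] + sample[split_bin:]


# Toy data: one frame with 6 frequency bins.
generated = [1.0, 0.9, 0.8, 0.3, 0.3, 0.3]  # flat, over-smoothed high band
database = [
    [0.2, 0.1, 0.0, 0.9, 0.1, 0.8],
    [1.0, 1.0, 0.7, 0.6, 0.1, 0.5],
]

sample = select_sample(generated, database)
hybrid = combine_sub_bands(generated, sample, split_bin=3)
print(hybrid)  # [1.0, 0.9, 0.8, 0.6, 0.1, 0.5]
```

In this toy frame the low band keeps the generated values, while the high band takes the finer structure of the selected sample, which is the over-smoothing compensation the abstract describes.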

Bibliographic reference. Inai, Tadashi / Hara, Sunao / Abe, Masanobu / Ijima, Yusuke / Miyazaki, Noboru / Mizuno, Hideyuki (2015): "Sub-band text-to-speech combining sample-based spectrum with statistically generated spectrum", in INTERSPEECH-2015, 264-268.