The intelligibility of HMM-based TTS can reach that of the original speech. However, HMM-based TTS is far from natural. In contrast, unit selection TTS is currently the most natural-sounding TTS. However, its intelligibility and its naturalness in terms of segmental duration and timing are not stable. Additionally, unit selection needs to store a huge amount of data for concatenation. Recently, hybrid approaches between these two TTS methods, i.e. HMM trajectory tiling (HTT) TTS, have been studied to take advantage of both unit selection and HMM-based TTS. However, such methods still require a huge amount of data for rendering. In this paper, a hybrid TTS combining unit selection, HMM-based TTS, and Temporal Decomposition (TD) is proposed, with the aim of taking advantage of both unit selection and HMM-based TTS under limited data conditions. Here, TD is a sparse representation of speech that decomposes a spectral or prosodic sequence into two mutually independent components: static event targets and corresponding dynamic event functions. Previous studies show that the dynamic event functions are related to the perception of speech intelligibility, a core part of the linguistic (content) information, while the static event targets convey non-linguistic (style) information. Therefore, by borrowing the concept of unit selection to render the event targets of the spectral sequence, and by directly taking the prosodic sequences and the dynamic event functions of the spectral sequence generated by HMM-based TTS, the proposed hybrid TTS can reach the naturalness of unit selection and the intelligibility of HMM-based TTS. Owing to the sparse representation of TD, the proposed hybrid TTS also needs only a small amount of data for rendering, which makes it suitable for limited data conditions.
Experimental results on a small Vietnamese dataset, used to simulate a limited data condition, show that the proposed hybrid TTS outperformed all of HMM-based TTS, unit selection, and HTT TTS under limited data conditions.
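The recombination described in the abstract can be illustrated with a minimal sketch. It assumes the usual TD reconstruction, in which each frame is a weighted sum of event targets with the event functions as weights. The function names, the piecewise-linear event functions, and all numeric values below are hypothetical placeholders for illustration; they are not the authors' implementation or data.

```python
import numpy as np

def linear_event_functions(event_frames, n_frames):
    """Return K event functions sampled over n_frames, shape (K, n_frames).

    Assumption for illustration only: each function equals 1 at its own
    event frame and falls linearly to 0 at the neighbouring events, so only
    adjacent functions overlap and they sum to 1 at every frame.
    """
    event_frames = np.asarray(event_frames, dtype=float)
    frames = np.arange(n_frames, dtype=float)
    one_hot = np.eye(len(event_frames))
    return np.stack([np.interp(frames, event_frames, row) for row in one_hot])

def td_reconstruct(targets, phi):
    """Rebuild a frame sequence (n_frames, P) from targets (K, P) and phi (K, n_frames)."""
    return phi.T @ targets

# Hypothetical event functions standing in for those derived from the
# HMM-generated spectral sequence (they carry the timing information).
hmm_phi = linear_event_functions(event_frames=[0, 4, 9], n_frames=10)

# Hypothetical event targets standing in for those chosen by a
# unit-selection-style search over stored natural-speech targets.
unit_targets = np.array([[1.0, 0.0],
                         [0.0, 2.0],
                         [3.0, 1.0]])

# Hybrid rendering: HMM-side event functions, unit-selection-side targets.
hybrid_frames = td_reconstruct(unit_targets, hmm_phi)
print(hybrid_frames.shape)
```

In this toy setup only 3 target vectors are stored for 10 frames, which hints at why a TD-based representation may need less data for rendering than storing full frame sequences.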
Index Terms: TTS, unit selection, HMM-based, Temporal Decomposition, HTT
Cite as: Phung, T.-N., Luong, C.M., Akagi, M. (2013) A hybrid TTS between unit selection and HMM-based TTS under limited data conditions. Proc. 8th ISCA Workshop on Speech Synthesis (SSW 8), 279-284
@inproceedings{phung13_ssw, author={Trung-Nghia Phung and Chi Mai Luong and Masato Akagi}, title={{A hybrid TTS between unit selection and HMM-based TTS under limited data conditions}}, year=2013, booktitle={Proc. 8th ISCA Workshop on Speech Synthesis (SSW 8)}, pages={279--284} }