This paper introduces frame-based Gaussian process regression (GPR) into phone/syllable duration modeling for Thai speech synthesis. The GPR model is designed to predict frame-level acoustic features from the corresponding frame information, which includes the relative position within each unit of the utterance structure and linguistic information such as tone type and part of speech. Although GPR-based prediction can be applied to a phone duration model, a phone duration model alone is not always sufficient to generate natural-sounding speech. Specifically, in some languages, including Thai, syllable durations affect the perception of sentence structure. In this paper, we propose a duration prediction technique using a multi-level model that predicts at both the syllable and phone levels. In this technique, syllable durations are predicted first and are then used as additional context in the phone-level model to generate the phone durations used for synthesis. Objective and subjective evaluation results show that GPR-based modeling with the multi-level duration model outperforms conventional HMM-based speech synthesis.
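The two-stage idea described above (predict syllable durations first, then feed them to the phone-level model as extra context) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a small unit-level GP regressor with a fixed RBF kernel rather than the frame-based GPR of the paper, and every name (`TinyGPR`, `phone_row`, `SYL_SCALE`), every feature layout, and all toy numbers below are assumptions made up for illustration. Training the phone model on observed syllable durations is likewise an assumption.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two feature vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

class TinyGPR:
    """GP regression (posterior mean only) with fixed hyperparameters."""
    def __init__(self, length=1.0, noise=1e-2):
        self.length, self.noise = length, noise

    def fit(self, X, y):
        self.X = X
        self.mean = sum(y) / len(y)  # constant prior mean
        K = [[rbf(a, b, self.length) + (self.noise if i == j else 0.0)
              for j, b in enumerate(X)] for i, a in enumerate(X)]
        self.alpha = solve(K, [v - self.mean for v in y])
        return self

    def predict(self, x):
        return self.mean + sum(rbf(x, xi, self.length) * a
                               for xi, a in zip(self.X, self.alpha))

# Assumed scaling: durations are in ms; dividing by 100 puts the
# syllable-duration context on a scale similar to the other features.
SYL_SCALE = 100.0

def phone_row(phone_feats, syl_dur_ms):
    """Append the syllable duration as an extra phone-level context."""
    return list(phone_feats) + [syl_dur_ms / SYL_SCALE]

def train(syl_X, syl_y, phone_X, phone_y):
    syl_model = TinyGPR().fit(syl_X, syl_y)
    rows, targets = [], []
    for feats, durs, syl_dur in zip(phone_X, phone_y, syl_y):
        for f, d in zip(feats, durs):
            rows.append(phone_row(f, syl_dur))  # observed syllable duration
            targets.append(d)
    return syl_model, TinyGPR().fit(rows, targets)

def predict_phone_durations(syl_model, phone_model, syl_X, phone_X):
    out = []
    for s_feats, p_feats in zip(syl_X, phone_X):
        syl_dur = syl_model.predict(s_feats)         # stage 1: syllable level
        out.append([phone_model.predict(phone_row(f, syl_dur))
                    for f in p_feats])               # stage 2: phone level
    return out

# Toy data, made-up numbers for illustration only.
# Syllable features: [tone_id, relative position in phrase]; durations in ms.
syl_X = [[0, 0.0], [1, 0.5], [2, 1.0]]
syl_y = [200.0, 250.0, 350.0]
# Phone features per syllable: [is_vowel, relative position in syllable].
phone_X = [[[0, 0.0], [1, 1.0]] for _ in syl_X]
phone_y = [[80.0, 120.0], [90.0, 160.0], [120.0, 230.0]]

syl_model, phone_model = train(syl_X, syl_y, phone_X, phone_y)
durations = predict_phone_durations(syl_model, phone_model, syl_X, phone_X)
print(durations)
```

In this sketch the only coupling between the two levels is the extra input dimension built by `phone_row`, so the phone-level model can produce different durations for the same phone context when the predicted syllable duration changes. A practical system would also optimize the kernel hyperparameters, which are fixed here.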
Bibliographic reference. Moungsri, Decha / Koriyama, Tomoki / Kobayashi, Takao (2015): "Duration prediction using multi-level model for GPR-based speech synthesis", In INTERSPEECH-2015, 1591-1595.