In this paper, we evaluate a statistical parametric speech synthesis framework based on Gaussian process regression (GPR) and compare it with frameworks based on hidden Markov models (HMMs) and deep neural networks (DNNs). Recently, novel frameworks using deep architectures have been proposed to improve on the performance of HMM-based speech synthesis and have been shown to be effective. GPR-based speech synthesis is another alternative to the HMM-based framework; in it, frame-level acoustic features are predicted from frame-level linguistic features, as in the DNN-based framework. First, we examine the clustering level of the speech segments used for GPR-based synthesis, such as state, phone, mora, and accent phrase. Then we compare the modeling architecture and performance of GPR with those of DNN and HMM for statistical parametric speech synthesis. Experimental results show that the GPR-based speech synthesis system outperforms both the HMM- and DNN-based systems when trained on a relatively small amount of data, around 40 minutes.
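To make the frame-level mapping concrete, the following is a minimal sketch of generic GPR prediction, not the system evaluated in the paper. It assumes a zero-mean Gaussian process with a squared-exponential kernel, fixed hyperparameters, and random toy vectors standing in for linguistic and acoustic features; the names `rbf_kernel`, `gpr_predict`, `length_scale`, and `noise_var` are illustrative choices, and the paper's kernel, segment clustering, and training procedure may differ.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


def gpr_predict(x_train, y_train, x_test, noise_var=1e-2, length_scale=1.0):
    """Posterior mean and latent variance of a zero-mean GP at x_test."""
    k_tt = rbf_kernel(x_train, x_train, length_scale)
    k_st = rbf_kernel(x_test, x_train, length_scale)
    # Cholesky factor of the noisy training covariance K + noise_var * I.
    chol = np.linalg.cholesky(k_tt + noise_var * np.eye(len(x_train)))
    # alpha = (K + noise_var * I)^{-1} y via two triangular-system solves.
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_st @ alpha
    v = np.linalg.solve(chol, k_st.T)
    prior_var = rbf_kernel(x_test, x_test, length_scale).diagonal()
    var = prior_var - np.sum(v**2, axis=0)
    return mean, var


# Toy stand-ins (assumptions): 50 frames of 4-dimensional "linguistic" input
# and a 1-dimensional "acoustic" target per frame.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(50, 4))
y_train = np.sin(x_train.sum(axis=1, keepdims=True))
x_test = x_train[:5]

mean, var = gpr_predict(x_train, y_train, x_test)
```

Exact GPR of this kind factorizes an N x N matrix over all N training frames, so its cost grows cubically with N. This may be one reason to model clustered speech segments rather than all frames jointly, although the sketch does not implement any clustering.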
Bibliographic reference. Koriyama, Tomoki / Kobayashi, Takao (2015): "A comparison of speech synthesis systems based on GPR, HMM, and DNN with a small amount of training data", In INTERSPEECH-2015, 3496-3500.