14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

Statistical Nonparametric Speech Synthesis Using Sparse Gaussian Processes

Tomoki Koriyama, Takashi Nose, Takao Kobayashi

Tokyo Institute of Technology, Japan

This paper proposes a statistical nonparametric speech synthesis technique based on sparse Gaussian process regression (GPR). In our previous study, we proposed GPR-based speech synthesis in which each frame of a synthesis unit is modeled by Gaussian process regression. Preliminary experiments on synthesizing several phones, including both vowels and consonants, showed the potential of the technique. In this paper, the previous work is extended to full-sentence speech synthesis using sparse GPs and context modification. Specifically, cluster-based sparse Gaussian processes such as local GPs and the partially independent conditional (PIC) approximation are examined as computationally feasible approaches. Moreover, the frame-level context is extended to include position context relative not only to the current phone but also to adjacent phones, so that the generated speech parameters change smoothly. Objective and subjective evaluation results show that the proposed technique outperforms HMM-based speech synthesis with minimum generation error training.
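
The cluster-based idea mentioned in the abstract can be illustrated with a minimal sketch of local GP regression: the training data are split into clusters, an independent GP is fit to each cluster, and a test input is predicted by the GP of its nearest cluster. This is not the authors' implementation. The scalar inputs, the squared-exponential kernel, the nearest-centroid assignment, and all function names (`rbf`, `fit_local_gps`, `predict`) are assumptions made for illustration; the paper works with frame-level context features, and the PIC approximation, which additionally couples clusters through inducing points, is not covered here.

```python
import math

def rbf(a, b, length=1.0, var=1.0):
    """Squared-exponential kernel between two scalar inputs (assumed kernel)."""
    return var * math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L^T) x = b by forward and backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def fit_local_gps(clusters, noise=1e-2):
    """Fit one independent GP per cluster.

    clusters: list of (xs, ys) pairs, one pair per cluster.
    Returns a list of (xs, alpha, centroid), where alpha = (K + noise*I)^-1 y.
    """
    models = []
    for xs, ys in clusters:
        K = [[rbf(a, b) + (noise if i == j else 0.0)
              for j, b in enumerate(xs)]
             for i, a in enumerate(xs)]
        alpha = chol_solve(cholesky(K), ys)
        models.append((xs, alpha, sum(xs) / len(xs)))
    return models

def predict(models, x):
    """Predictive mean at x, using only the GP of the nearest cluster centroid."""
    xs, alpha, _ = min(models, key=lambda m: abs(m[2] - x))
    return sum(rbf(x, xi) * a for xi, a in zip(xs, alpha))

# Toy usage: two clusters of samples from sin(x).
clusters = [
    ([0.0, 0.5, 1.0], [math.sin(v) for v in (0.0, 0.5, 1.0)]),
    ([3.0, 3.5, 4.0], [math.sin(v) for v in (3.0, 3.5, 4.0)]),
]
models = fit_local_gps(clusters)
print(predict(models, 0.75))
```

Because each cluster is inverted separately, the cost grows with the cube of the cluster size rather than the cube of the full training set size, which is the motivation for cluster-based sparse approximations in general.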

Bibliographic reference. Koriyama, Tomoki / Nose, Takashi / Kobayashi, Takao (2013): "Statistical nonparametric speech synthesis using sparse Gaussian processes", In INTERSPEECH-2013, 1072-1076.