This paper proposes a new F0 model for speech synthesis based on a parameterization of the logF0 contour of each syllable. The parameterization consists of the Nth-order discrete cosine transform (DCT) of the contour plus additional parameters such as the gradient of the syllable average pitch. A statistical model of the syllable pitch contour is then created by clustering the parameterized vectors with a decision tree. Similar statistical models are also created for linguistic levels other than the syllable. For synthesis, the statistical model of each level is used to define a log-likelihood function for the input text. These functions are weighted and summed into a global log-likelihood function, which is then maximized with respect to the DCT coefficients of the syllable model. The final logF0 contour is obtained by applying the inverse transform to the syllable DCT coefficients. A subjective test showed a clear preference for the proposed model over our previous HMM-based baseline.
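The DCT step can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an orthonormal DCT-II over the voiced frames of a single syllable, and it omits the additional parameters (such as the gradient of the syllable average pitch), the decision-tree clustering, and the likelihood maximization. The function names `dct_basis`, `parameterize`, and `reconstruct`, and the example contour, are hypothetical.

```python
import numpy as np


def dct_basis(num_frames: int, order: int) -> np.ndarray:
    """Return the first `order` rows of the orthonormal DCT-II matrix.

    Assumes order <= num_frames, so the rows are orthonormal.
    """
    n = np.arange(num_frames)            # frame index within the syllable
    k = np.arange(order)[:, None]        # DCT coefficient index
    basis = np.cos(np.pi * (n + 0.5) * k / num_frames)
    basis *= np.sqrt(2.0 / num_frames)
    basis[0] /= np.sqrt(2.0)             # scale the constant row to unit norm
    return basis


def parameterize(log_f0, order: int) -> np.ndarray:
    """Map a syllable logF0 contour to its first `order` DCT coefficients."""
    log_f0 = np.asarray(log_f0, dtype=float)
    return dct_basis(len(log_f0), order) @ log_f0


def reconstruct(coeffs, num_frames: int) -> np.ndarray:
    """Inverse transform: rebuild a contour of `num_frames` frames."""
    coeffs = np.asarray(coeffs, dtype=float)
    return dct_basis(num_frames, len(coeffs)).T @ coeffs


# Hypothetical rising contour: 10 frames from 120 Hz to 180 Hz.
log_f0 = np.log(np.linspace(120.0, 180.0, 10))
coeffs = parameterize(log_f0, order=5)
smoothed = reconstruct(coeffs, num_frames=len(log_f0))
```

Because the basis rows are orthonormal, truncating to a few coefficients gives the least-squares approximation of the contour within that subspace, and the first coefficient is proportional to the mean logF0 of the syllable.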
Bibliographic reference. Latorre, Javier / Akamine, Masami (2008): "Multilevel parametric-base F0 model for speech synthesis", In INTERSPEECH-2008, 2274-2277.