Fifth ISCA ITRW on Speech Synthesis
June 14-16, 2004
In this paper, we describe research in fundamental frequency modeling based on a statistical learning technique called additive models. A two-layer additive F0 model consists of a long-term, intonational phrase-level component, and a short-term, accentual phrase-level component. It can be learned from the data using a backfitting algorithm, an optimizer of a penalized leastsquare criterion defined on the model. It estimates two components simultaneously by iteratively applying cubic spline smoothers. To investigate the further flexibility of the model, we incorporated a third additive term that represents a contextual effect on an accentual phrase, and confirmed the improvements in terms of RMS errors. Experimental results on a 7,000 utterance Japanese speech corpus shows an achievement of F0 RMS errors of 28.5 and 29.3 Hz on the training and test data, respectively, with corresponding correlation coefficients of 0.81 and 0.79.
Bibliographic reference. Sakai, Shinsuke (2004): "F0 modeling with multi-layer additive modeling based on a statistical learning technique", In SSW5-2004, 151-154.