Fifth ISCA ITRW on Speech Synthesis

June 14-16, 2004
Pittsburgh, PA, USA

F0 Modeling with Multi-Layer Additive Modeling Based on a Statistical Learning Technique

Shinsuke Sakai

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA

In this paper, we describe research on fundamental frequency (F0) modeling based on a statistical learning technique called additive models. A two-layer additive F0 model consists of a long-term, intonational phrase-level component and a short-term, accentual phrase-level component. The model is learned from data with a backfitting algorithm, which optimizes a penalized least-squares criterion defined on the model and estimates the two components simultaneously by iteratively applying cubic spline smoothers. To investigate the flexibility of the model further, we incorporated a third additive term that represents a contextual effect on an accentual phrase, and confirmed an improvement in RMS error. Experimental results on a 7,000-utterance Japanese speech corpus show F0 RMS errors of 28.5 Hz and 29.3 Hz on the training and test data, respectively, with corresponding correlation coefficients of 0.81 and 0.79.
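
The backfitting idea summarized above can be illustrated with a small sketch. It is not the paper's implementation: it assumes a toy two-term model y = alpha + f1(x1) + f2(x2), where x1 and x2 are hypothetical normalized positions within an intonational phrase and an accentual phrase, it uses synthetic data, and it swaps in a simple Gaussian kernel smoother in place of the cubic spline smoothers named in the abstract. All function names and bandwidth values are illustrative assumptions.

```python
import math


def kernel_smooth(x, r, bandwidth):
    """Gaussian kernel (Nadaraya-Watson) smoother evaluated at each x[i].

    Stand-in for the cubic spline smoother mentioned in the abstract.
    """
    smoothed = []
    for xi in x:
        weights = [math.exp(-0.5 * ((xi - xj) / bandwidth) ** 2) for xj in x]
        total = sum(weights)  # at least 1.0, since the point weights itself
        smoothed.append(sum(w * rj for w, rj in zip(weights, r)) / total)
    return smoothed


def center(values):
    """Subtract the mean so the additive component stays identifiable."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def backfit(x1, x2, y, bw1, bw2, n_iter=20):
    """Fit y ~ alpha + f1(x1) + f2(x2) by backfitting.

    Each pass smooths the partial residuals of one component while the
    other component is held fixed, then swaps roles.
    """
    n = len(y)
    alpha = sum(y) / n
    f1 = [0.0] * n
    f2 = [0.0] * n
    for _ in range(n_iter):
        resid = [y[i] - alpha - f2[i] for i in range(n)]
        f1 = center(kernel_smooth(x1, resid, bw1))
        resid = [y[i] - alpha - f1[i] for i in range(n)]
        f2 = center(kernel_smooth(x2, resid, bw2))
    return alpha, f1, f2


def rms_error(y, y_hat):
    """Root-mean-square error between observed and fitted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def make_toy_data(n=120, ap_len=12):
    """Synthetic F0-like contour (assumption, not corpus data):
    a slow declination over the phrase plus a repeating local accent bump."""
    x1 = [i / (n - 1) for i in range(n)]                  # long-term position
    x2 = [(i % ap_len) / (ap_len - 1) for i in range(n)]  # short-term position
    y = [200.0 - 40.0 * a + 30.0 * math.sin(math.pi * b) for a, b in zip(x1, x2)]
    return x1, x2, y


x1, x2, y = make_toy_data()
alpha, f1, f2 = backfit(x1, x2, y, bw1=0.1, bw2=0.15)
fitted = [alpha + a + b for a, b in zip(f1, f2)]
print("RMS error, constant model:", rms_error(y, [alpha] * len(y)))
print("RMS error, additive model:", rms_error(y, fitted))
```

A third additive term, such as the contextual effect described in the abstract, would enter the same loop as one more partial-residual smoothing step.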


Bibliographic reference. Sakai, Shinsuke (2004): "F0 modeling with multi-layer additive modeling based on a statistical learning technique", in SSW5-2004, 151-154.