This paper introduces our ongoing work on generative modeling of speech fundamental frequency (F0) contours for estimating prosodic features from raw speech data. The present F0 contour model is formulated by translating the Fujisaki model, a well-founded mathematical model representing the control mechanism of vocal fold vibration, into a probabilistic model described as a discrete-time stochastic process. The motivation behind this formulation is twofold. One is to derive a general parameter estimation framework for the Fujisaki model, allowing for the introduction of powerful statistical methods. The other is to construct an automatically trainable version of the Fujisaki model so that, in the future, it can be used to develop a statistical speaking style conversion system or be incorporated into existing text-to-speech synthesis systems to improve the naturalness and intelligibility of computer-generated speech. We also briefly introduce a generative model of F0 contours of singing voice developed in the same spirit.
Bibliographic reference. Kameoka, Hirokazu / Yoshizato, Kota / Ishihara, Tatsuma / Ohishi, Yasunori / Kashino, Kunio / Sagayama, Shigeki (2013): "Generative modeling of speech F0 contours", In INTERSPEECH-2013, 1826-1830.