15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Prosody Contour Prediction with Long Short-Term Memory, Bi-Directional, Deep Recurrent Neural Networks

Raul Fernandez (1), Asaf Rendel (2), Bhuvana Ramabhadran (1), Ron Hoory (2)

(1) IBM T.J. Watson Research Center, USA
(2) IBM Research Haifa, Israel

Deep Neural Networks (DNNs) have been shown to outperform other baseline models, providing state-of-the-art performance, in the task of predicting prosodic targets from text in a speech-synthesis system. However, prosody prediction can be affected by an interaction of short- and long-term contextual factors that a static model relying on a fixed-size context window can fail to capture. In this work, we examine recurrent neural networks (RNNs), which are deep in time and can store state information from an arbitrarily long input history when making a prediction. We show that RNNs improve on DNNs of comparable size in terms of various objective metrics for a variety of prosodic streams (notably, a relative reduction of about 6% in F0 mean-square error accompanied by a relative increase of about 14% in F0 variance), as well as in terms of perceptual quality assessed through mean-opinion-score listening tests.
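
As a rough illustration of the contrast drawn in the abstract, the sketch below runs a bi-directional recurrent layer over a sequence of input frames, so that each per-frame prediction can depend on the entire past and future input rather than on a fixed-size window. It is not the authors' model: it uses a plain tanh recurrence instead of LSTM cells, a single layer, and random untrained weights, and all names and dimensions (`rnn_pass`, `bidirectional_predict`, `init_params`, 4 input features, 8 hidden units, 1 output) are hypothetical choices made for illustration.

```python
import numpy as np

def rnn_pass(x, W_in, W_rec, b):
    """Run a plain tanh recurrent layer over x of shape (T, d_in).

    Returns the hidden states, shape (T, d_h). The state at step t
    summarizes all inputs x[0..t], not just a fixed window.
    """
    d_h = W_rec.shape[0]
    h = np.zeros(d_h)
    states = np.empty((x.shape[0], d_h))
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ W_in + h @ W_rec + b)
        states[t] = h
    return states

def bidirectional_predict(x, params):
    """Predict one output vector per frame from forward and backward states."""
    fwd = rnn_pass(x, *params["fwd"])              # left-to-right context
    bwd = rnn_pass(x[::-1], *params["bwd"])[::-1]  # right-to-left context
    h = np.concatenate([fwd, bwd], axis=1)         # shape (T, 2 * d_h)
    W_out, b_out = params["out"]
    return h @ W_out + b_out                       # shape (T, d_out)

def init_params(d_in, d_h, d_out, seed=0):
    """Random, untrained weights (assumption: for illustration only)."""
    rng = np.random.default_rng(seed)

    def layer():
        return (rng.normal(0.0, 0.5, (d_in, d_h)),
                rng.normal(0.0, 0.5, (d_h, d_h)),
                np.zeros(d_h))

    out = (rng.normal(0.0, 0.5, (2 * d_h, d_out)), np.zeros(d_out))
    return {"fwd": layer(), "bwd": layer(), "out": out}

if __name__ == "__main__":
    params = init_params(d_in=4, d_h=8, d_out=1)
    frames = np.random.default_rng(1).normal(size=(6, 4))  # 6 frames of toy features
    print(bidirectional_predict(frames, params).shape)
```

In this sketch, perturbing the first input frame changes the prediction at the last frame (through the forward state), and perturbing the last frame changes the prediction at the first (through the backward state). A feed-forward network with a fixed context window shorter than the sequence would show neither effect.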

Bibliographic reference.  Fernandez, Raul / Rendel, Asaf / Ramabhadran, Bhuvana / Hoory, Ron (2014): "Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks", In INTERSPEECH-2014, 2268-2272.