16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

Using Deep Bidirectional Recurrent Neural Networks for Prosodic-Target Prediction in a Unit-Selection Text-to-Speech System

Raul Fernandez (1), Asaf Rendel (2), Bhuvana Ramabhadran (1), Ron Hoory (2)

(1) IBM T.J. Watson Research Center, USA
(2) IBM Research Haifa, Israel

Deeply stacked Bidirectional Recurrent Neural Networks (BiRNNs) can capture complex short- and long-term context dependencies between predictors and targets. Their recurrent hidden layers accumulate information from all preceding and future observations, so each predicted target depends non-linearly on the entire observation sequence. This property makes them attractive for tasks such as predicting prosodic contours for text-to-speech systems, where the surface prosody can result from the interaction between local and non-local features. Although previous work has demonstrated that they attain state-of-the-art performance for this task within a parametric synthesis framework, their use within unit-selection synthesis systems remains unexplored. In this work we deploy this class of models within a unit-selection system, investigate their effect on the outcome of the unit search, and perceptually evaluate this approach against the decision-tree-based baseline.
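
The bidirectional mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' model. It assumes plain tanh recurrent units, random untrained weights, and pure-Python lists, and every name in it (`rnn_pass`, `birnn_layer`, `build_layers`, `deep_birnn`) is hypothetical. The output layer that would map hidden states to prosodic targets is omitted. The sketch only shows how stacking forward and backward passes makes the representation at each time step depend on the whole input sequence.

```python
import math
import random


def rnn_pass(inputs, w_in, w_rec, bias, reverse=False):
    """Run a simple tanh RNN over a sequence; return one hidden state per step."""
    hidden_size = len(bias)
    state = [0.0] * hidden_size
    steps = range(len(inputs))
    order = reversed(steps) if reverse else steps
    states = [None] * len(inputs)
    for t in order:
        x = inputs[t]
        new_state = []
        for j in range(hidden_size):
            total = bias[j]
            total += sum(w_in[j][i] * x[i] for i in range(len(x)))
            total += sum(w_rec[j][k] * state[k] for k in range(hidden_size))
            new_state.append(math.tanh(total))
        state = new_state
        states[t] = state  # stored at its original time index
    return states


def birnn_layer(inputs, fwd_params, bwd_params):
    """Concatenate a left-to-right pass and a right-to-left pass per time step."""
    fwd = rnn_pass(inputs, *fwd_params)
    bwd = rnn_pass(inputs, *bwd_params, reverse=True)
    return [f + b for f, b in zip(fwd, bwd)]


def random_params(input_size, hidden_size, rng):
    """Untrained weights, for illustration only (an assumption of this sketch)."""
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
            for _ in range(hidden_size)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    bias = [0.0] * hidden_size
    return w_in, w_rec, bias


def build_layers(input_size, hidden_size, depth, seed=0):
    """Create parameters for a stack of bidirectional layers."""
    rng = random.Random(seed)
    layers = []
    size = input_size
    for _ in range(depth):
        layers.append((random_params(size, hidden_size, rng),
                       random_params(size, hidden_size, rng)))
        size = 2 * hidden_size  # the next layer sees both directions
    return layers


def deep_birnn(inputs, layers):
    """Feed the sequence through each bidirectional layer in turn."""
    out = inputs
    for fwd_params, bwd_params in layers:
        out = birnn_layer(out, fwd_params, bwd_params)
    return out
```

In this sketch, changing only the last input vector of a sequence also changes the stacked representation at the first time step, because the backward pass carries information from future observations to earlier positions.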

Bibliographic reference.  Fernandez, Raul / Rendel, Asaf / Ramabhadran, Bhuvana / Hoory, Ron (2015): "Using deep bidirectional recurrent neural networks for prosodic-target prediction in a unit-selection text-to-speech system", In INTERSPEECH-2015, 1606-1610.