Systems that predict prosodic events, such as pitch accents and phrasal boundaries, often rely on machine learning models that represent context by combining input features aggregated over a finite, and usually short, window of observations. Dynamic models go a step further by explicitly modeling the state sequence, but even then, many practical implementations are limited to a low-order finite-state machine. This Markovian assumption, however, does not properly capture the interaction between short- and long-term contextual factors that is known to affect the realization and placement of these prosodic events. Bidirectional Recurrent Neural Networks (BiRNNs) are a class of models that overcome this limitation by predicting the outputs as a function of a state variable that accumulates information over the entire input sequence, and by stacking several layers to form a deep architecture able to extract more structure from the input features. These models have already demonstrated state-of-the-art performance on some prosodic regression tasks. In this work we examine a new application of BiRNNs to the task of classifying categorical prosodic events, and demonstrate that they outperform baseline systems.
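The idea of a stacked bidirectional recurrent classifier can be illustrated with a minimal NumPy sketch. This is not the architecture evaluated in the paper: the plain tanh recurrence, the layer sizes, the random untrained weights, and all names (`rnn_pass`, `birnn_layer`, `deep_birnn_classify`, `init_params`) are assumptions made purely for illustration. The sketch only shows how a forward and a backward state are concatenated at each position, how layers are stacked, and how a per-position softmax yields class probabilities for a categorical event such as accent versus no accent.

```python
import numpy as np

def rnn_pass(x, W_in, W_rec, b, reverse=False):
    """Run a simple tanh RNN over x of shape (T, d_in); return states (T, d_hid)."""
    T, d_hid = x.shape[0], W_rec.shape[0]
    h = np.zeros(d_hid)
    states = np.zeros((T, d_hid))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        # The state accumulates information from every step already visited.
        h = np.tanh(x[t] @ W_in + h @ W_rec + b)
        states[t] = h
    return states

def birnn_layer(x, fwd, bwd):
    """Concatenate left-to-right and right-to-left states at each position."""
    return np.concatenate(
        [rnn_pass(x, *fwd), rnn_pass(x, *bwd, reverse=True)], axis=1
    )

def deep_birnn_classify(x, layers, W_out, b_out):
    """Stack bidirectional layers, then apply a per-position softmax."""
    h = x
    for fwd, bwd in layers:
        h = birnn_layer(h, fwd, bwd)
    logits = h @ W_out + b_out
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def init_params(rng, d_in, d_hid, scale=0.3):
    """Hypothetical random initialisation; a real system would train these."""
    return (
        rng.normal(scale=scale, size=(d_in, d_hid)),
        rng.normal(scale=scale, size=(d_hid, d_hid)),
        np.zeros(d_hid),
    )

# Illustrative sizes only: 6 positions (e.g. words), 4 features, 2 classes.
rng = np.random.default_rng(0)
T, d_in, d_hid, n_classes = 6, 4, 8, 2
layers = [
    (init_params(rng, d_in, d_hid), init_params(rng, d_in, d_hid)),
    (init_params(rng, 2 * d_hid, d_hid), init_params(rng, 2 * d_hid, d_hid)),
]
W_out = rng.normal(scale=0.3, size=(2 * d_hid, n_classes))
b_out = np.zeros(n_classes)

x = rng.normal(size=(T, d_in))
probs = deep_birnn_classify(x, layers, W_out, b_out)

# Perturb only the last position and recompute: the prediction at the first
# position can change, because the backward state carries information leftward.
x_perturbed = x.copy()
x_perturbed[-1] += 1.0
probs_perturbed = deep_birnn_classify(x_perturbed, layers, W_out, b_out)
```

Comparing `probs[0]` with `probs_perturbed[0]` shows the contrast with a short fixed context window: a change at the end of the sequence can influence the prediction at its start.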
Bibliographic reference. Rosenberg, Andrew / Fernandez, Raul / Ramabhadran, Bhuvana (2015): "Modeling phrasing and prominence using deep recurrent learning", In INTERSPEECH-2015, 3066-3070.