Auditory-Visual Speech Processing (AVSP) 2013

Annecy, France
August 29 - September 1, 2013

Predicting Head Motion from Prosodic and Linguistic Features

Angelika Hönemann (1), Diego Evin (2,3), Alejandro J. Hadad (3), Hansjörg Mixdorff (1), Sascha Fagel (4)

(1) Beuth University Berlin, Berlin, Germany
(2) INIGEM - Universidad de Buenos Aires - CONICET, Buenos Aires, Argentina
(3) Universidad Nacional de Entre Ríos, Facultad de Ingeniería, Oro Verde, Argentina
(4) zoobe message entertainment GmbH, Berlin, Germany

This paper describes an approach to predicting non-verbal cues from speech-related features. Our previous investigations of audiovisual speech showed strong correlations between the two modalities. In this work we developed two models based on different kinds of recurrent artificial neural networks, Elman and NARX, to predict head motion activity parameters from linguistic and prosodic inputs, and compared their performance. The prosodic inputs comprised F0 and intensity, while the linguistic inputs comprised the former plus additional information such as the types of syllables and phrases and various relations between them. Using speaker-specific models for six subjects, performance measured in terms of root mean square error (RMSE) showed that there are significant differences between the models with respect to the input parameters, and that the NARX network outperformed the Elman network on the prediction task.
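
The abstract does not specify the networks' topologies, tap-delay depths, feature frame rates, or training procedure, so the following is only a minimal sketch of the NARX idea it describes: at each frame, the network predicts a head-motion parameter from a window of past prosodic inputs (F0, intensity) plus its own fed-back past predictions, and prediction quality is scored with RMSE. All dimensions, names (narx_forward, D_X, D_Y), and the random toy data are illustrative assumptions, not the paper's actual setup; a real model would be trained rather than use random weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the abstract gives none of these values.
N_FRAMES = 200      # length of one utterance in analysis frames
N_PROSODIC = 2      # F0 and intensity per frame
N_HIDDEN = 16
D_X = 3             # tap-delay depth over exogenous (prosodic) inputs
D_Y = 2             # tap-delay depth over fed-back predictions

# Toy stand-ins for real data: prosodic features and one
# head-motion activity parameter per frame.
x = rng.standard_normal((N_FRAMES, N_PROSODIC))
y_true = rng.standard_normal(N_FRAMES)

# Randomly initialised weights; training (e.g. backpropagation,
# through time for the Elman variant) is omitted from this sketch.
W_in = rng.standard_normal((N_HIDDEN, N_PROSODIC * D_X + D_Y)) * 0.1
b_h = np.zeros(N_HIDDEN)
w_out = rng.standard_normal(N_HIDDEN) * 0.1

def narx_forward(x, w_in, b_h, w_out):
    """NARX forward pass in closed-loop mode: each frame sees the
    last D_X input frames plus its own last D_Y predictions."""
    y_pred = np.zeros(len(x))
    for t in range(len(x)):
        # Tapped delay line over the inputs, zero-padded at the start.
        taps = [x[t - k] if t - k >= 0 else np.zeros(N_PROSODIC)
                for k in range(D_X)]
        # Tapped delay line over the network's own past outputs.
        fb = [y_pred[t - k] if t - k >= 1 else 0.0
              for k in range(1, D_Y + 1)]
        z = np.concatenate([np.concatenate(taps), fb])
        h = np.tanh(w_in @ z + b_h)
        y_pred[t] = w_out @ h
    return y_pred

y_pred = narx_forward(x, W_in, b_h, w_out)

# RMSE, the performance measure reported in the paper.
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
print(f"RMSE: {rmse:.3f}")
```

An Elman network differs from this sketch in that it feeds back a hidden state vector rather than tapped past outputs; the output feedback in the NARX loop is one plausible reason for the performance difference the abstract reports.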

Index Terms: predicting head motion, audiovisual speech, time-delayed NARX, Elman NN, linguistic vs. prosodic features

Bibliographic reference. Hönemann, Angelika / Evin, Diego / Hadad, Alejandro J. / Mixdorff, Hansjörg / Fagel, Sascha (2013): "Predicting head motion from prosodic and linguistic features", In AVSP-2013, 27-30.