11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

Context-Sensitive Multimodal Emotion Recognition from Speech and Facial Expression Using Bidirectional LSTM Modeling

Martin Wöllmer (1), Angeliki Metallinou (2), Florian Eyben (1), Björn Schuller (1), Shrikanth S. Narayanan (2)

(1) Technische Universität München, Germany
(2) University of Southern California, USA

In this paper, we apply a context-sensitive technique for multimodal emotion recognition based on feature-level fusion of acoustic and visual cues. We use bidirectional Long Short-Term Memory (BLSTM) networks which, unlike most other emotion recognition approaches, exploit long-range contextual information for modeling the evolution of emotion within a conversation. We focus on recognizing dimensional emotional labels, which enables us to classify both prototypical and non-prototypical emotional expressions contained in a large audio-visual database. Subject-independent experiments on various classification tasks reveal that the BLSTM network approach generally prevails over standard classification techniques such as Hidden Markov Models or Support Vector Machines, and achieves F1-measures on the order of 72%, 65%, and 55% for the discrimination of three clusters in emotional space and the distinction between three levels of valence and activation, respectively.
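Feature-level fusion, as used in the abstract above, typically means concatenating frame-synchronous acoustic and visual feature vectors into a single joint observation per time step before classification. The following minimal NumPy sketch illustrates only this fusion step under that assumption; the function name and dimensions are illustrative, not taken from the paper.

```python
import numpy as np

def fuse_features(acoustic: np.ndarray, visual: np.ndarray) -> np.ndarray:
    """Feature-level fusion: concatenate frame-aligned acoustic and visual
    feature vectors into one joint vector per time step.

    acoustic: (T, D_a) array, visual: (T, D_v) array -> (T, D_a + D_v) array.
    """
    # Both streams must be synchronized to the same number of frames T.
    assert acoustic.shape[0] == visual.shape[0], "streams must be frame-aligned"
    return np.concatenate([acoustic, visual], axis=1)

# Toy example: 5 frames, 3 acoustic and 2 visual features per frame.
audio = np.random.randn(5, 3)
video = np.random.randn(5, 2)
fused = fuse_features(audio, video)
print(fused.shape)  # (5, 5)
```

The fused sequence would then be fed frame by frame to a sequence classifier such as a BLSTM, which can exploit both past and future context when assigning emotion labels.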


Bibliographic reference: Wöllmer, Martin / Metallinou, Angeliki / Eyben, Florian / Schuller, Björn / Narayanan, Shrikanth S. (2010): "Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling", in Proceedings of INTERSPEECH 2010, pp. 2362-2365.