In this paper, we apply a context-sensitive technique for multimodal emotion recognition based on feature-level fusion of acoustic and visual cues. We use bidirectional Long Short-Term Memory (BLSTM) networks which, unlike most other emotion recognition approaches, exploit long-range contextual information for modeling the evolution of emotion within a conversation. We focus on recognizing dimensional emotional labels, which enables us to classify both prototypical and non-prototypical emotional expressions contained in a large audio-visual database. Subject-independent experiments on various classification tasks reveal that the BLSTM network approach generally prevails over standard classification techniques such as Hidden Markov Models or Support Vector Machines, achieving F1-measures on the order of 72%, 65%, and 55% for the discrimination of three clusters in emotional space, three levels of valence, and three levels of activation, respectively.
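To make the modeling idea concrete, the sketch below shows a minimal bidirectional LSTM classifier over feature-level-fused acoustic and visual frame sequences, written in PyTorch. It is an illustration under stated assumptions, not the authors' implementation: the feature dimensionalities, hidden size, single-layer topology, and per-frame class outputs are all placeholder choices rather than the configuration used in the paper.

    import torch
    import torch.nn as nn

    class BLSTMEmotionClassifier(nn.Module):
        """Sketch of a BLSTM over feature-level-fused audio-visual frames.
        All layer sizes below are illustrative assumptions, not the
        paper's actual configuration."""

        def __init__(self, acoustic_dim=39, visual_dim=20,
                     hidden_dim=128, num_classes=3):
            super().__init__()
            # Feature-level fusion: acoustic and visual features are
            # concatenated per frame before entering the network.
            self.blstm = nn.LSTM(input_size=acoustic_dim + visual_dim,
                                 hidden_size=hidden_dim,
                                 batch_first=True,
                                 bidirectional=True)
            # Forward and backward hidden states are concatenated,
            # hence 2 * hidden_dim inputs to the output layer.
            self.out = nn.Linear(2 * hidden_dim, num_classes)

        def forward(self, acoustic, visual):
            # acoustic: (batch, frames, acoustic_dim)
            # visual:   (batch, frames, visual_dim)
            fused = torch.cat([acoustic, visual], dim=-1)
            states, _ = self.blstm(fused)   # (batch, frames, 2*hidden_dim)
            return self.out(states)         # per-frame class scores

    # Hypothetical usage with random stand-in features:
    model = BLSTMEmotionClassifier()
    acoustic = torch.randn(4, 100, 39)      # e.g. MFCC-style acoustic frames
    visual = torch.randn(4, 100, 20)        # e.g. facial-feature frames
    logits = model(acoustic, visual)        # shape: (4, 100, 3)

Because the recurrence runs in both directions, each frame's prediction can draw on context from the whole sequence, which is the long-range contextual modeling property the abstract highlights.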
Bibliographic reference. Wöllmer, Martin / Metallinou, Angeliki / Eyben, Florian / Schuller, Björn / Narayanan, Shrikanth S. (2010): "Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling", In INTERSPEECH-2010, 2362-2365.