INTERSPEECH 2012
13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Emotion Recognition using Acoustic and Lexical Features

Viktor Rozgić (1), Sankaranarayanan Ananthakrishnan (1), Shirin Saleem (1), Rohit Kumar (1), Aravind Namandi Vembu (2), Rohit Prasad (1)

(1) Speech Language and Multimedia Technologies, Raytheon BBN Technologies, Cambridge, MA, USA
(2) Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA, USA

In this paper, we present an innovative approach to utterance-level emotion recognition that fuses acoustic features with lexical features extracted from automatic speech recognition (ASR) output. The acoustic features combine: (1) a novel set of features derived from segmental Mel Frequency Cepstral Coefficients (MFCCs) scored against emotion-dependent Gaussian mixture models (GMMs), and (2) statistical functionals of low-level descriptors such as intensity, fundamental frequency, jitter, and shimmer. These acoustic features are fused with two types of lexical features extracted from the ASR output: (1) presence/absence of word stems, and (2) bag-of-words sentiment categories. The combined feature set is used to train support vector machines (SVMs) for emotion classification. We demonstrate the efficacy of our approach on four-way emotion recognition using the University of Southern California's Interactive Emotional Dyadic Motion Capture (USC-IEMOCAP) corpus. Our experiments show that fusing acoustic and lexical features delivers an emotion recognition accuracy of 65.7%, outperforming the best previously reported results on this challenging dataset.
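To make the fusion pipeline concrete, the following is a minimal illustrative sketch in scikit-learn, not the authors' implementation: synthetic data stands in for IEMOCAP, the MFCC dimensionality, GMM settings, and vocabulary are assumptions, and the low-level-descriptor functionals and sentiment categories used in the paper are omitted. It shows only the core idea of scoring utterance frames against per-emotion GMMs, concatenating the resulting likelihoods with binary word-stem features, and training an SVM.

```python
# Hypothetical sketch of the acoustic-lexical fusion idea (not the paper's code).
# All data below is synthetic; dimensions and hyperparameters are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

rng = np.random.default_rng(0)
EMOTIONS = ["angry", "happy", "sad", "neutral"]  # four-way task, as in the paper
N_MFCC = 13  # typical MFCC dimensionality (assumption)

def fake_utterance(label_idx):
    """Synthetic stand-in for an utterance: MFCC frames, transcript, label."""
    frames = rng.normal(loc=label_idx, scale=1.0, size=(50, N_MFCC))
    word = "furious great gloomy okay".split()[label_idx]
    return frames, " ".join([word] * 3), label_idx

train = [fake_utterance(i % 4) for i in range(200)]

# Step 1: fit one emotion-dependent GMM on all MFCC frames of that class.
gmms = []
for k in range(len(EMOTIONS)):
    frames_k = np.vstack([f for f, _, y in train if y == k])
    gmms.append(GaussianMixture(n_components=4, covariance_type="diag",
                                random_state=0).fit(frames_k))

def acoustic_features(frames):
    # Average frame log-likelihood under each emotion GMM -> 4-dim feature.
    return np.array([g.score(frames) for g in gmms])

# Step 2: binary presence/absence of words from the (here, synthetic) transcript.
vectorizer = CountVectorizer(binary=True)
lex_train = vectorizer.fit_transform([t for _, t, _ in train]).toarray()

# Step 3: concatenate both feature types and train the SVM classifier.
X = np.hstack([np.array([acoustic_features(f) for f, _, _ in train]), lex_train])
y = np.array([lab for _, _, lab in train])
clf = SVC(kernel="rbf").fit(X, y)

test_frames, test_text, _ = fake_utterance(2)
x = np.hstack([acoustic_features(test_frames),
               vectorizer.transform([test_text]).toarray().ravel()])
print("predicted:", EMOTIONS[int(clf.predict([x])[0])])
```

In this toy setup the per-emotion GMM log-likelihoods play the role of the paper's model-based acoustic features, and early (feature-level) fusion with the lexical vector lets a single SVM exploit both cues.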

Index Terms: emotion recognition, model-based acoustic features, lexical features


Bibliographic reference. Rozgić, Viktor / Ananthakrishnan, Sankaranarayanan / Saleem, Shirin / Kumar, Rohit / Vembu, Aravind Namandi / Prasad, Rohit (2012): "Emotion recognition using acoustic and lexical features", In INTERSPEECH-2012, 366-369.