16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

High-Level Feature Representation Using Recurrent Neural Network for Speech Emotion Recognition

Jinkyu Lee (1), Ivan Tashev (2)

(1) Yonsei University, Korea
(2) Microsoft, USA

This paper presents a speech emotion recognition system based on a recurrent neural network (RNN) model trained with an efficient learning algorithm. The proposed system accounts for long-range contextual effects and for the uncertainty of emotional label expressions. To extract high-level representations of emotional states and their temporal dynamics, a powerful learning method with a bidirectional long short-term memory (BLSTM) structure is adopted. To overcome the uncertainty of emotional labels, which arises because all frames in the same utterance are mapped to the same emotional label, the label of each frame is regarded as a sequence of random variables; these sequences are then trained with the proposed learning algorithm. The weighted accuracy of the proposed emotion recognition system improves by up to 12% over the DNN-ELM-based emotion recognition system used as a baseline.
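The architecture described above, a bidirectional LSTM producing a per-frame emotion posterior, can be sketched roughly as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the feature dimension (40), hidden size (128), and number of emotion classes (4) are hypothetical choices, and the utterance-level decision would be obtained by pooling the frame-level posteriors.

```python
import torch
import torch.nn as nn

class BLSTMEmotionTagger(nn.Module):
    """Frame-level emotion classifier built on a bidirectional LSTM.

    Illustrative sketch only: input size, hidden size, and class count
    are assumptions, not values from the paper.
    """
    def __init__(self, n_features=40, hidden=128, n_classes=4):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden,
                             batch_first=True, bidirectional=True)
        # One set of class logits per frame; an utterance-level label
        # can then be derived by pooling over frames.
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        h, _ = self.blstm(x)   # (batch, frames, 2 * hidden)
        return self.out(h)     # (batch, frames, n_classes)

model = BLSTMEmotionTagger()
frames = torch.randn(2, 100, 40)  # 2 utterances, 100 frames, 40 features
logits = model(frames)
print(tuple(logits.shape))        # (2, 100, 4)
```

Treating each frame's label as a random variable would then amount to training against these per-frame posteriors rather than forcing every frame to the hard utterance label.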


Bibliographic reference.  Lee, Jinkyu / Tashev, Ivan (2015): "High-level feature representation using recurrent neural network for speech emotion recognition", In INTERSPEECH-2015, 1537-1540.