Representation Learning for Speech Emotion Recognition

Sayan Ghosh, Eugene Laksana, Louis-Philippe Morency, Stefan Scherer


Speech emotion recognition is an important problem with applications as varied as human-computer interfaces and affective computing. Previous approaches to emotion recognition have mostly focused on extracting carefully engineered features and training simple classifiers for the emotion task. There has been limited effort at representation learning for affect recognition, where features are learnt directly from the signal waveform or spectrum. Prior work also does not investigate the effect of transfer learning from affective attributes such as valence and activation to categorical emotions. In this paper, we investigate emotion recognition from spectrogram features extracted from the speech and glottal flow signals; spectrogram encoding is performed by a stacked autoencoder, and an RNN (Recurrent Neural Network) is used for classification of four primary emotions. We perform two experiments to improve RNN training: (1) Representation Learning: model training on the glottal flow signal to investigate the effect of speaker- and phonetic-invariant features on classification performance; (2) Transfer Learning: RNN training on valence and activation, which is then adapted to a four-emotion classification task. On the USC-IEMOCAP dataset, our proposed approach achieves performance comparable to state-of-the-art speech emotion recognition systems.
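
The transfer-learning idea in the abstract (train an RNN on valence and activation, then adapt it to four emotion classes) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the adaptation is done, so the sketch assumes one common approach, keeping the recurrent layer and replacing the output head. The helper names (`init_rnn`, `rnn_forward`, `adapt_head`), the layer sizes, and the random placeholder input are all hypothetical, and training itself is omitted.

```python
import numpy as np

def init_rnn(n_in, n_hidden, n_out, rng):
    """Randomly initialise a single-layer Elman RNN with a linear output head."""
    return {
        "W_in": rng.normal(0.0, 0.1, (n_hidden, n_in)),
        "W_rec": rng.normal(0.0, 0.1, (n_hidden, n_hidden)),
        "b": np.zeros(n_hidden),
        "W_out": rng.normal(0.0, 0.1, (n_out, n_hidden)),
        "b_out": np.zeros(n_out),
    }

def rnn_forward(params, frames):
    """Run the RNN over a (T, n_in) sequence; return scores from the last hidden state."""
    h = np.zeros(params["b"].shape)
    for x in frames:
        h = np.tanh(params["W_in"] @ x + params["W_rec"] @ h + params["b"])
    return params["W_out"] @ h + params["b_out"]

def adapt_head(params, n_out, rng):
    """Copy the input and recurrent weights, then attach a fresh output head (assumed strategy)."""
    adapted = {name: value.copy() for name, value in params.items()}
    n_hidden = params["W_out"].shape[1]
    adapted["W_out"] = rng.normal(0.0, 0.1, (n_out, n_hidden))
    adapted["b_out"] = np.zeros(n_out)
    return adapted

rng = np.random.default_rng(0)

# Source task: two outputs, standing in for valence and activation.
source = init_rnn(n_in=64, n_hidden=32, n_out=2, rng=rng)
# (Training on valence/activation labels would happen here; omitted in this sketch.)

# Target task: four outputs, one per emotion category.
target = adapt_head(source, n_out=4, rng=rng)

# Placeholder for a sequence of 50 encoded spectrogram frames (hypothetical sizes).
frames = rng.normal(size=(50, 64))
scores = rnn_forward(target, frames)
```

In practice the adapted network would then be fine-tuned on the categorical emotion labels; whether the reused layers are frozen or updated during that step is a design choice the abstract does not specify.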


DOI: 10.21437/Interspeech.2016-692

Cite as

Ghosh, S., Laksana, E., Morency, L.-P., Scherer, S. (2016) Representation Learning for Speech Emotion Recognition. Proc. Interspeech 2016, 3603-3607.

Bibtex
@inproceedings{Ghosh+2016,
author={Sayan Ghosh and Eugene Laksana and Louis-Philippe Morency and Stefan Scherer},
title={Representation Learning for Speech Emotion Recognition},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-692},
url={http://dx.doi.org/10.21437/Interspeech.2016-692},
pages={3603--3607}
}