Estimating continuous emotional states from speech as a function of time has traditionally been framed as a regression problem. In this paper, we present a novel approach that moves the problem into the classification domain by discretizing the training labels at different resolutions. We employ a multi-task deep bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) trained with a cost-sensitive cross-entropy loss to model these labels jointly. We introduce an emotion decoding algorithm that incorporates long- and short-term temporal properties of the signal to produce more robust time-series estimates. We show that our proposed approach achieves competitive audio-only performance on the RECOLA dataset relative to previously published works as well as other strong regression baselines. This work provides a link between regression and classification, and contributes an alternative approach for continuous emotion recognition.
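The label discretization and multi-task setup described above can be illustrated with a minimal PyTorch sketch. The bin counts, label range, network size, and per-class weighting below are illustrative assumptions, not the paper's exact configuration; the full text specifies the actual resolutions, architecture, and cost-sensitive scheme.

```python
import torch
import torch.nn as nn

# Illustrative bin counts per discretization resolution (assumed values,
# not necessarily those used in the paper).
RESOLUTIONS = [2, 4, 8]

def discretize(labels, num_bins, lo=-1.0, hi=1.0):
    """Map continuous labels in [lo, hi] to integer bin indices 0..num_bins-1."""
    edges = torch.linspace(lo, hi, num_bins + 1)[1:-1]  # interior bin edges
    return torch.bucketize(labels, edges)

class MultiTaskBLSTM(nn.Module):
    """Shared deep BLSTM trunk with one classification head per resolution."""
    def __init__(self, input_dim, hidden_dim=128, num_layers=2):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                             batch_first=True, bidirectional=True)
        self.heads = nn.ModuleList(
            nn.Linear(2 * hidden_dim, k) for k in RESOLUTIONS)

    def forward(self, x):                        # x: (batch, time, input_dim)
        h, _ = self.blstm(x)                     # h: (batch, time, 2*hidden_dim)
        return [head(h) for head in self.heads]  # per-task frame-level logits

def multitask_loss(logits_per_task, labels, class_weights):
    """Sum of class-weighted (cost-sensitive) cross-entropy losses over tasks."""
    total = 0.0
    for logits, k, w in zip(logits_per_task, RESOLUTIONS, class_weights):
        targets = discretize(labels, k)          # (batch, time) bin indices
        ce = nn.CrossEntropyLoss(weight=w)       # w: per-class cost weights
        total = total + ce(logits.reshape(-1, k), targets.reshape(-1))
    return total
```

In this sketch the per-resolution losses are simply summed, so coarser tasks act as auxiliary supervision for finer ones while the shared BLSTM trunk models temporal context in both directions.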
Cite as: Le, D., Aldeneh, Z., Provost, E.M. (2017) Discretized Continuous Speech Emotion Recognition with Multi-Task Deep Recurrent Neural Network. Proc. Interspeech 2017, 1108-1112, doi: 10.21437/Interspeech.2017-94
@inproceedings{le17b_interspeech,
  author={Duc Le and Zakaria Aldeneh and Emily Mower Provost},
  title={{Discretized Continuous Speech Emotion Recognition with Multi-Task Deep Recurrent Neural Network}},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={1108--1112},
  doi={10.21437/Interspeech.2017-94}
}