Discretized Continuous Speech Emotion Recognition with Multi-Task Deep Recurrent Neural Network

Duc Le, Zakaria Aldeneh, Emily Mower Provost


Estimating continuous emotional states from speech as a function of time has traditionally been framed as a regression problem. In this paper, we present a novel approach that moves the problem into the classification domain by discretizing the training labels at different resolutions. We employ a multi-task deep bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) trained with cost-sensitive cross-entropy loss to model these labels jointly. We introduce an emotion decoding algorithm that incorporates long- and short-term temporal properties of the signal to produce more robust time-series estimates. We show that our proposed approach achieves competitive audio-only performance on the RECOLA dataset relative to previously published works as well as other strong regression baselines. This work provides a link between regression and classification, and contributes an alternative approach for continuous emotion recognition.
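
To make the core idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: continuous labels are uniformly binned at several resolutions, a deep BLSTM trunk feeds one softmax head per resolution, and each task is trained with class-weighted (cost-sensitive) cross-entropy. The bin counts, dimensions, and inverse-frequency weighting are illustrative assumptions, and the closing expected-value decoding is only one plausible way to map frame posteriors back to a continuous trace; the paper's decoding algorithm additionally exploits long- and short-term temporal structure.

import torch
import torch.nn as nn

def discretize(labels, num_bins, lo=-1.0, hi=1.0):
    # Map continuous annotations in [lo, hi] to integer classes
    # {0, ..., num_bins - 1} by uniform binning.
    edges = torch.linspace(lo, hi, num_bins + 1)[1:-1]  # interior bin edges
    return torch.bucketize(labels, edges)

class MultiResolutionBLSTM(nn.Module):
    # Deep BLSTM trunk with one classification head per label resolution.
    def __init__(self, input_dim, hidden_dim, num_layers, resolutions):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                             batch_first=True, bidirectional=True)
        self.heads = nn.ModuleList(
            [nn.Linear(2 * hidden_dim, k) for k in resolutions])

    def forward(self, x):
        h, _ = self.blstm(x)                  # (batch, frames, 2 * hidden_dim)
        return [head(h) for head in self.heads]

# Illustrative training step (all hyperparameters are assumptions).
resolutions = [2, 4, 8]                       # bins per task
model = MultiResolutionBLSTM(input_dim=40, hidden_dim=128,
                             num_layers=2, resolutions=resolutions)
x = torch.randn(8, 300, 40)                   # (batch, frames, features)
y_cont = torch.rand(8, 300) * 2 - 1           # continuous labels in [-1, 1]

logits_per_task = model(x)
loss = 0.0
for k, logits in zip(resolutions, logits_per_task):
    targets = discretize(y_cont, k)
    weights = torch.ones(k)                   # e.g., inverse class frequency
    ce = nn.CrossEntropyLoss(weight=weights)  # cost-sensitive cross-entropy
    loss = loss + ce(logits.reshape(-1, k), targets.reshape(-1))
loss.backward()

# Decoding sketch (an assumption, not the paper's algorithm): take the
# expected bin center under the finest head's posterior at each frame.
probs = torch.softmax(logits_per_task[-1], dim=-1)
edges = torch.linspace(-1.0, 1.0, resolutions[-1] + 1)
centers = (edges[:-1] + edges[1:]) / 2        # bin midpoints
estimate = (probs * centers).sum(dim=-1)      # (batch, frames) continuous trace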


DOI: 10.21437/Interspeech.2017-94

Cite as: Le, D., Aldeneh, Z., Provost, E.M. (2017) Discretized Continuous Speech Emotion Recognition with Multi-Task Deep Recurrent Neural Network. Proc. Interspeech 2017, 1108-1112, DOI: 10.21437/Interspeech.2017-94.


@inproceedings{Le2017,
  author={Duc Le and Zakaria Aldeneh and Emily Mower Provost},
  title={Discretized Continuous Speech Emotion Recognition with Multi-Task Deep Recurrent Neural Network},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={1108--1112},
  doi={10.21437/Interspeech.2017-94},
  url={http://dx.doi.org/10.21437/Interspeech.2017-94}
}