Continuous Emotion Recognition in Speech — Do We Need Recurrence?

Maximilian Schmitt, Nicholas Cummins, Björn W. Schuller

Emotion recognition in speech is a meaningful task in affective computing and human-computer interaction. As human emotion is a frequently changing state, it is usually represented as a densely sampled time series of emotional dimensions, typically arousal and valence. Recurrent neural network (RNN) architectures are employed by default when modelling these contours with deep learning approaches. However, the amount of temporal context actually required is questionable, and it has not yet been clarified whether modelling long-term dependencies is beneficial at all. In this contribution, we demonstrate that RNNs are not necessary to accomplish the task of time-continuous emotion recognition. Indeed, our results indicate that deep neural networks built from less complex convolutional layers can provide more accurate models. We highlight the pros and cons of recurrent and non-recurrent approaches and evaluate our methods on the public SEWA database, which was used as a benchmark in the 2017 and 2018 editions of the Audio-Visual Emotion Challenge.
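The core contrast is between an RNN, whose hidden state can in principle carry unbounded history, and a convolutional layer, whose output at each frame depends only on a fixed local window of the input. The minimal NumPy sketch below is illustrative only (not the architecture from the paper): a single 1-D temporal convolution mapping a sequence of acoustic feature vectors to a frame-wise contour such as arousal; `features`, `kernel`, and the padding scheme are assumptions for the example.

```python
import numpy as np

def temporal_conv(features, kernel, bias=0.0):
    """Frame-wise contour prediction via one 1-D temporal convolution.

    features: (T, D) array, one D-dim acoustic feature vector per frame.
    kernel:   (K, D) filter; K is the receptive field in frames.
    Returns a length-T prediction contour (e.g. arousal). Each output
    frame sees only K frames of context -- in contrast to an RNN, no
    information can propagate beyond this fixed window.
    """
    T, D = features.shape
    K = kernel.shape[0]
    pad = K // 2  # same-length output via symmetric zero-padding
    padded = np.pad(features, ((pad, pad), (0, 0)))
    out = np.empty(T)
    for t in range(T):
        out[t] = np.sum(padded[t:t + K] * kernel) + bias
    return out
```

Stacking several such layers grows the receptive field linearly (or exponentially with dilation), so even a non-recurrent network can cover seconds of context while remaining simpler to train than an RNN.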

DOI: 10.21437/Interspeech.2019-2710

Cite as: Schmitt, M., Cummins, N., Schuller, B.W. (2019) Continuous Emotion Recognition in Speech — Do We Need Recurrence?. Proc. Interspeech 2019, 2808-2812, DOI: 10.21437/Interspeech.2019-2710.

@inproceedings{Schmitt2019,
  author={Maximilian Schmitt and Nicholas Cummins and Björn W. Schuller},
  title={{Continuous Emotion Recognition in Speech — Do We Need Recurrence?}},
  booktitle={Proc. Interspeech 2019},
  year={2019},
  pages={2808--2812},
  doi={10.21437/Interspeech.2019-2710}
}