ISCA Archive Interspeech 2019

Robust Speech Emotion Recognition Under Different Encoding Conditions

Christopher Oates, Andreas Triantafyllopoulos, Ingmar Steiner, Björn W. Schuller

In an era where large speech corpora annotated for emotion are hard to come by, and especially ones where emotion is expressed freely instead of being acted, the importance of using free online sources for collecting such data cannot be overstated. Most of those sources, however, contain encoded audio due to storage and bandwidth constraints, often at very low bitrates. In addition, with the increased industry interest in voice-based applications, it is inevitable that speech emotion recognition (SER) algorithms will soon find their way into production environments, where the audio might be encoded at a different bitrate than the one available during training. Our contribution is threefold. First, we show that encoded audio still contains enough relevant information for robust SER. Next, we investigate the effects of mismatched encoding conditions in the training and test sets, both for traditional machine learning algorithms built on hand-crafted features and for modern end-to-end methods. Finally, we investigate the robustness of those algorithms in the multi-condition scenario, where the training set is augmented with encoded audio but still differs from the test set. Our results indicate that end-to-end methods are more robust even in the more challenging scenario of mismatched conditions.

doi: 10.21437/Interspeech.2019-1658

Cite as: Oates, C., Triantafyllopoulos, A., Steiner, I., Schuller, B.W. (2019) Robust Speech Emotion Recognition Under Different Encoding Conditions. Proc. Interspeech 2019, 3935-3939, doi: 10.21437/Interspeech.2019-1658

  author={Christopher Oates and Andreas Triantafyllopoulos and Ingmar Steiner and Björn W. Schuller},
  title={{Robust Speech Emotion Recognition Under Different Encoding Conditions}},
  booktitle={Proc. Interspeech 2019},