Towards Robust Speech Emotion Recognition Using Deep Residual Networks for Speech Enhancement

Andreas Triantafyllopoulos, Gil Keren, Johannes Wagner, Ingmar Steiner, Björn W. Schuller


The use of deep learning (DL) architectures for speech enhancement has recently improved the robustness of voice applications under diverse noise conditions. These improvements are usually evaluated based on the perceptual quality of the enhanced audio or on the performance of automatic speech recognition (ASR) systems. We are interested instead in the usefulness of these algorithms in the field of speech emotion recognition (SER), and specifically in whether an enhancement architecture can effectively remove noise while preserving enough information for an SER algorithm to accurately identify emotion in speech. We first show how a scalable DL architecture can be trained to enhance audio signals in a large number of unseen environments, and then show how this enhancement improves the noise robustness of common SER pipelines. Our results show that incorporating a speech enhancement architecture is beneficial, especially in low signal-to-noise ratio (SNR) conditions.


DOI: 10.21437/Interspeech.2019-1811

Cite as: Triantafyllopoulos, A., Keren, G., Wagner, J., Steiner, I., Schuller, B.W. (2019) Towards Robust Speech Emotion Recognition Using Deep Residual Networks for Speech Enhancement. Proc. Interspeech 2019, 1691-1695, DOI: 10.21437/Interspeech.2019-1811.


@inproceedings{Triantafyllopoulos2019,
  author={Andreas Triantafyllopoulos and Gil Keren and Johannes Wagner and Ingmar Steiner and Björn W. Schuller},
  title={{Towards Robust Speech Emotion Recognition Using Deep Residual Networks for Speech Enhancement}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1691--1695},
  doi={10.21437/Interspeech.2019-1811},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1811}
}