Learning Temporal Clusters Using Capsule Routing for Speech Emotion Recognition

Md. Asif Jalal, Erfan Loweimi, Roger K. Moore, Thomas Hain


Emotion recognition from speech plays a significant role in adding emotional intelligence to machines and making human-machine interaction more natural. One of the key challenges from a machine learning standpoint is to extract patterns that bear maximum correlation with the emotion information encoded in the speech signal while being as insensitive as possible to the other types of information it carries. In this paper, we propose a novel temporal modelling framework for robust emotion classification using a bidirectional long short-term memory network (BLSTM), a CNN and Capsule networks. The BLSTM deals with the temporal dynamics of the speech signal by effectively representing forward/backward contextual information, while the CNN, along with the dynamic routing of the Capsule net, learns temporal clusters; together these provide a state-of-the-art technique for classifying the extracted patterns. The proposed approach was compared with a wide range of architectures on the FAU-Aibo and RAVDESS corpora, and remarkable gains over state-of-the-art systems were obtained. For FAU-Aibo and RAVDESS, accuracies of 77.6% and 56.2% were achieved, respectively, which are 3% and 14% (absolute) higher than the best-reported results for the respective tasks.
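
The abstract does not give implementation details, so the following is only a minimal sketch of the generic routing-by-agreement procedure commonly used in Capsule networks, written in NumPy. It is not the authors' code. The names `squash`, `dynamic_routing` and `u_hat`, the tensor shapes, and the choice of three routing iterations are all assumptions made for illustration. In a setup like the one described, the input capsules would presumably be derived from CNN features over time, and the coupling coefficients show how strongly each input is assigned to each output capsule, which is the sense in which routing behaves like a soft clustering.

```python
import numpy as np


def squash(s, axis=-1, eps=1e-9):
    """Scale each vector so its norm lies in [0, 1) while keeping its direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def dynamic_routing(u_hat, num_iters=3):
    """Generic routing-by-agreement (illustrative, not the paper's exact setup).

    u_hat: prediction vectors, shape (num_in, num_out, dim_out)
    Returns the output capsules, shape (num_out, dim_out), and the
    coupling coefficients, shape (num_in, num_out).
    """
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))  # routing logits, start uniform
    for _ in range(num_iters):
        # Softmax over output capsules: each input distributes one unit of "vote".
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        # Weighted sum of predictions for each output capsule, then squash.
        s = np.einsum("ij,ijd->jd", c, u_hat)
        v = squash(s)
        # Raise the logit where a prediction agrees with the output capsule.
        b = b + np.einsum("ijd,jd->ij", u_hat, v)
    return v, c


# Hypothetical sizes: 20 input capsules, 4 output capsules of dimension 8.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(20, 4, 8))
v, c = dynamic_routing(u_hat)
```

Each row of `c` sums to one, so it can be read as a soft assignment of an input capsule to the output capsules; the norm of each row of `v` is below one and is typically interpreted as the presence score of the corresponding class or cluster.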


DOI: 10.21437/Interspeech.2019-3068

Cite as: Jalal, M.A., Loweimi, E., Moore, R.K., Hain, T. (2019) Learning Temporal Clusters Using Capsule Routing for Speech Emotion Recognition. Proc. Interspeech 2019, 1701-1705, DOI: 10.21437/Interspeech.2019-3068.


@inproceedings{Jalal2019,
  author={Md. Asif Jalal and Erfan Loweimi and Roger K. Moore and Thomas Hain},
  title={{Learning Temporal Clusters Using Capsule Routing for Speech Emotion Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1701--1705},
  doi={10.21437/Interspeech.2019-3068},
  url={http://dx.doi.org/10.21437/Interspeech.2019-3068}
}