Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning

Yuanchao Li, Tianyu Zhao, Tatsuya Kawahara


Accurately recognizing emotion from speech is a necessary yet challenging task due to the variability in speech and emotion. In this paper, we propose a speech emotion recognition (SER) method using end-to-end (E2E) multitask learning with self-attention to address several issues. First, we extract features directly from the speech spectrogram instead of using traditional hand-crafted features, to better represent emotion. Second, we adopt a self-attention mechanism to focus on the emotionally salient periods of speech utterances. Finally, considering the features shared between the emotion and gender classification tasks, we incorporate gender classification as an auxiliary task through multitask learning, so that it shares useful information with the emotion classification task. Evaluation on IEMOCAP (a commonly used database for SER research) demonstrates that the proposed method outperforms the state-of-the-art methods and improves overall accuracy by an absolute 7.7% over the best existing result.


 DOI: 10.21437/Interspeech.2019-2594

Cite as: Li, Y., Zhao, T., Kawahara, T. (2019) Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. Proc. Interspeech 2019, 2803-2807, DOI: 10.21437/Interspeech.2019-2594.


@inproceedings{Li2019,
  author={Yuanchao Li and Tianyu Zhao and Tatsuya Kawahara},
  title={{Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={2803--2807},
  doi={10.21437/Interspeech.2019-2594},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2594}
}