Self-Attention for Speech Emotion Recognition

Lorenzo Tarantino, Philip N. Garner, Alexandros Lazaridis


Speech Emotion Recognition (SER) has been shown to benefit from many of the recent advances in deep learning, including recurrent-based and attention-based neural network architectures. Nevertheless, performance still falls short of that of humans. In this work, we investigate whether SER could benefit from the self-attention and global windowing of the transformer model. We show on the IEMOCAP database that this is indeed the case. Finally, we investigate whether using the distribution of possibly conflicting annotations in the training data as soft targets could outperform majority voting. We show that this performance increases with the agreement level of the annotators.
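The abstract names two techniques: self-attention over the whole utterance (a global attention window) and training against the annotators' label distribution as soft targets rather than the majority vote. Below is a minimal sketch of both ideas, assuming a PyTorch implementation; it is not the authors' code, and the model dimensions, the four-class emotion set, and every identifier are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionSER(nn.Module):
    """Self-attention over frame-level features, pooled to an utterance-level label."""
    def __init__(self, feat_dim=40, model_dim=64, num_heads=4, num_classes=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, model_dim)
        # Multi-head self-attention with a global window: every frame
        # can attend to every other frame in the utterance.
        self.attn = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(model_dim, num_classes)

    def forward(self, x):
        # x: (batch, frames, feat_dim), e.g. log-mel filterbank features
        h = self.proj(x)
        h, _ = self.attn(h, h, h)   # self-attention: queries = keys = values
        h = h.mean(dim=1)           # average-pool over frames
        return self.classifier(h)   # unnormalised class scores

def soft_target_loss(logits, annotation_counts):
    # Soft targets: normalise the (possibly conflicting) annotation counts
    # into a distribution and minimise cross-entropy against it, which for
    # fixed targets reduces to a KL-divergence objective.
    target = annotation_counts / annotation_counts.sum(dim=1, keepdim=True)
    return F.kl_div(F.log_softmax(logits, dim=1), target, reduction="batchmean")

if __name__ == "__main__":
    model = SelfAttentionSER()
    frames = torch.randn(2, 120, 40)           # two utterances, 120 frames each
    counts = torch.tensor([[3., 1., 0., 0.],   # 3 of 4 annotators agree
                           [2., 2., 0., 0.]])  # an evenly split utterance
    loss = soft_target_loss(model(frames), counts)
    loss.backward()
    print(loss.item())

With hard (majority-vote) targets the second utterance above would be forced to a single class; the soft-target loss instead rewards the model for reproducing the annotators' disagreement, which is the comparison the abstract describes.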


DOI: 10.21437/Interspeech.2019-2822

Cite as: Tarantino, L., Garner, P.N., Lazaridis, A. (2019) Self-Attention for Speech Emotion Recognition. Proc. Interspeech 2019, 2578-2582, DOI: 10.21437/Interspeech.2019-2822.


@inproceedings{Tarantino2019,
  author={Lorenzo Tarantino and Philip N. Garner and Alexandros Lazaridis},
  title={{Self-Attention for Speech Emotion Recognition}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={2578--2582},
  doi={10.21437/Interspeech.2019-2822},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2822}
}