Speech Emotion Recognition in Dyadic Dialogues with Attentive Interaction Modeling

Jinming Zhao, Shizhe Chen, Jingjun Liang, Qin Jin


In dyadic human-human interactions, a more complex interaction scenario, a person's emotional state can be influenced both by the evolution of their own emotions and by the interlocutor's behaviors. However, previous speech emotion recognition studies infer the speaker's emotional state mainly from the target speech segment, without considering these two contextual factors. In this paper, we propose an Attentive Interaction Model (AIM) that captures both self-context and interlocutor-context to enhance speech emotion recognition in dyadic dialogues. The model learns, via a self-attention mechanism, to dynamically focus on the relevant long-term contexts of the speaker and the interlocutor, and fuses the adaptively selected context with the present behavior to predict the current emotional state. We carry out extensive experiments on the IEMOCAP corpus for dimensional emotion recognition in arousal and valence. Our model performs on par with baselines for arousal recognition and significantly outperforms them for valence recognition, which demonstrates its effectiveness in selecting useful contexts for emotion recognition in dyadic interactions.
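
To make the described architecture concrete, the following is a minimal PyTorch sketch of attending over self- and interlocutor-context and fusing the attended context with the current utterance. The module names, feature dimensions, concatenation-based fusion, and the classifier head are illustrative assumptions for this sketch, not the authors' released implementation.

# Minimal sketch of an attentive interaction model for dyadic dialogues.
# All names, dimensions, and the fusion strategy are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveInteractionModel(nn.Module):
    """Attends over self- and interlocutor-context, then fuses with the current utterance."""

    def __init__(self, feat_dim=128, num_classes=3):
        super().__init__()
        self.query = nn.Linear(feat_dim, feat_dim)   # current utterance -> attention query
        self.key = nn.Linear(feat_dim, feat_dim)     # context utterances -> attention keys
        self.value = nn.Linear(feat_dim, feat_dim)   # context utterances -> attention values
        self.classifier = nn.Sequential(
            nn.Linear(3 * feat_dim, feat_dim),       # [current; self ctx; interlocutor ctx]
            nn.ReLU(),
            nn.Linear(feat_dim, num_classes),        # e.g. low/mid/high arousal or valence (assumed)
        )

    def attend(self, current, context):
        # current: (batch, feat_dim); context: (batch, ctx_len, feat_dim)
        q = self.query(current).unsqueeze(1)                         # (batch, 1, feat_dim)
        k = self.key(context)                                        # (batch, ctx_len, feat_dim)
        v = self.value(context)
        scores = torch.matmul(q, k.transpose(1, 2)) / k.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)                          # focus on relevant context turns
        return torch.matmul(weights, v).squeeze(1)                   # (batch, feat_dim)

    def forward(self, current, self_context, other_context):
        self_ctx = self.attend(current, self_context)                # relevant self history
        other_ctx = self.attend(current, other_context)              # relevant interlocutor history
        fused = torch.cat([current, self_ctx, other_ctx], dim=-1)
        return self.classifier(fused)                                # emotion prediction

# Toy usage: 4 dialogues in a batch, 6 past turns per party, 128-dim utterance features.
model = AttentiveInteractionModel()
current = torch.randn(4, 128)
logits = model(current, torch.randn(4, 6, 128), torch.randn(4, 6, 128))
print(logits.shape)  # torch.Size([4, 3])
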


DOI: 10.21437/Interspeech.2019-2103

Cite as: Zhao, J., Chen, S., Liang, J., Jin, Q. (2019) Speech Emotion Recognition in Dyadic Dialogues with Attentive Interaction Modeling. Proc. Interspeech 2019, 1671-1675, DOI: 10.21437/Interspeech.2019-2103.


@inproceedings{Zhao2019,
  author={Jinming Zhao and Shizhe Chen and Jingjun Liang and Qin Jin},
  title={{Speech Emotion Recognition in Dyadic Dialogues with Attentive Interaction Modeling}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={1671--1675},
  doi={10.21437/Interspeech.2019-2103},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2103}
}