ISCA Archive Interspeech 2021

Time-Frequency Representation Learning with Graph Convolutional Network for Dialogue-Level Speech Emotion Recognition

Jiaxing Liu, Yaodong Song, Longbiao Wang, Jianwu Dang, Ruiguo Yu

As speech emotion recognition (SER) develops, dialogue-level SER (DSER) is better aligned with real-world scenarios. In this paper, we propose a DSER approach that includes two stages of representation learning: intra-utterance representation learning and inter-utterance representation learning. In the intra-utterance stage, the traditional convolutional neural network (CNN) has demonstrated great success. However, the basic design of a CNN restricts its ability to model both the local and the global information in the spectrogram. Therefore, we propose a novel local-global representation learning method for the intra-utterance stage. The local information is learned by a time-frequency convolutional neural network (TFCNN), which we published previously. To model global information, we propose a time-frequency capsule neural network (TFCap), which can extract more stable global time-frequency information directly from spectrograms. In the inter-utterance stage, a graph convolutional network (GCN) is introduced to explore the relations between utterances in a dialogue. Our proposed methods were evaluated on the IEMOCAP database. The proposed time-frequency based method in the intra-utterance stage achieves an absolute increase of 9.35% compared to the CNN. By integrating the GCN in the inter-utterance stage, the proposed approach achieves a further absolute increase of 4.05% compared to the intra-utterance-stage model.
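To make the inter-utterance stage more concrete, the following is a minimal sketch of a single generic graph convolution applied to per-utterance embeddings. It is not the authors' implementation: the abstract does not specify how the dialogue graph is built, so the `window_adjacency` helper (linking utterances that are at most `window` turns apart), the embedding sizes, and the random untrained weights are all assumptions chosen for illustration. The propagation rule is the commonly used symmetric-normalized form, ReLU(D^{-1/2}(A + I)D^{-1/2} X W).

```python
import numpy as np

def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a symmetric adjacency matrix."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # node degrees (>= 1 due to self-loops)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def window_adjacency(num_utterances, window=1):
    """Hypothetical dialogue graph: link utterances at most `window` turns apart."""
    idx = np.arange(num_utterances)
    dist = np.abs(idx[:, None] - idx[None, :])
    return ((dist > 0) & (dist <= window)).astype(float)

def gcn_layer(features, adj, weight):
    """One graph convolution: ReLU(A_norm @ X @ W)."""
    return np.maximum(normalized_adjacency(adj) @ features @ weight, 0.0)

rng = np.random.default_rng(0)
utterance_feats = rng.normal(size=(5, 8))   # 5 utterances, 8-dim placeholder embeddings
weight = rng.normal(size=(8, 4))            # untrained placeholder weights
adj = window_adjacency(5, window=1)         # assumed graph structure
out = gcn_layer(utterance_feats, adj, weight)
print(out.shape)
```

In this sketch each utterance's output row mixes its own embedding with those of its neighbours in the dialogue, which is the sense in which a GCN can exploit relations between utterances. In a real system the placeholder embeddings would come from the intra-utterance stage and the weights would be trained.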


doi: 10.21437/Interspeech.2021-2067

Cite as: Liu, J., Song, Y., Wang, L., Dang, J., Yu, R. (2021) Time-Frequency Representation Learning with Graph Convolutional Network for Dialogue-Level Speech Emotion Recognition. Proc. Interspeech 2021, 4523-4527, doi: 10.21437/Interspeech.2021-2067

@inproceedings{liu21o_interspeech,
  author={Jiaxing Liu and Yaodong Song and Longbiao Wang and Jianwu Dang and Ruiguo Yu},
  title={{Time-Frequency Representation Learning with Graph Convolutional Network for Dialogue-Level Speech Emotion Recognition}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={4523--4527},
  doi={10.21437/Interspeech.2021-2067}
}