Speech Emotion Recognition by Combining Amplitude and Phase Information Using Convolutional Neural Network

Lili Guo, Longbiao Wang, Jianwu Dang, Linjuan Zhang, Haotian Guan, Xiangang Li


Previous studies of speech emotion recognition apply a convolutional neural network (CNN) directly to the amplitude spectrogram to extract features. A CNN combined with bidirectional long short-term memory (BLSTM) has become the state-of-the-art model. However, this model ignores phase information, whose importance in the speech processing field is attracting growing attention. In this paper, we propose feature extraction from the amplitude spectrogram and phase information using a CNN for speech emotion recognition. The modified group delay cepstral coefficient (MGDCC) and relative phase are used as phase information. First, we analyze the influence of phase information on speech emotion recognition. Then, we design a CNN-based feature representation using amplitude and phase information. Finally, experiments were conducted on EmoDB to validate the effectiveness of phase information. By integrating the amplitude spectrogram with phase information, the relative emotion recognition error rate is reduced by over 33% compared with using only the amplitude-based feature.
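To make the two feature streams concrete, the following is a minimal sketch of extracting a frame's amplitude spectrum together with a modified group delay (MGD) representation, the phase-derived quantity underlying MGDCC. This is not the paper's exact pipeline: the function name, the floor used in place of cepstral smoothing, and the alpha/gamma values are illustrative assumptions (the constants are typical in the MGD literature).

```python
import numpy as np

def frame_features(x, n_fft=512, alpha=0.4, gamma=0.9):
    """Amplitude spectrum and simplified modified group delay for one frame.

    Illustrative sketch only: the paper's MGDCC additionally applies
    cepstral smoothing to the denominator spectrum and a DCT to obtain
    cepstral coefficients; here a small floor stands in for smoothing.
    """
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)           # spectrum of x[n]
    Y = np.fft.rfft(n * x, n_fft)       # spectrum of n * x[n]
    amplitude = np.abs(X)

    # Group delay via the product spectrum:
    #   tau(w) = (Xr*Yr + Xi*Yi) / |S(w)|^(2*gamma)
    denom = np.maximum(amplitude, 1e-8) ** (2.0 * gamma)
    tau = (X.real * Y.real + X.imag * Y.imag) / denom

    # Modified group delay compresses the dynamic range.
    mgd = np.sign(tau) * np.abs(tau) ** alpha
    return amplitude, mgd

# Usage on a 25 ms frame of a 440 Hz tone at 16 kHz:
frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
amp, mgd = frame_features(frame)
```

Stacking these per-frame vectors over time yields the amplitude and phase "images" that the CNN consumes in place of the amplitude spectrogram alone.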


DOI: 10.21437/Interspeech.2018-2156

Cite as: Guo, L., Wang, L., Dang, J., Zhang, L., Guan, H., Li, X. (2018) Speech Emotion Recognition by Combining Amplitude and Phase Information Using Convolutional Neural Network. Proc. Interspeech 2018, 1611-1615, DOI: 10.21437/Interspeech.2018-2156.


@inproceedings{Guo2018,
  author={Lili Guo and Longbiao Wang and Jianwu Dang and Linjuan Zhang and Haotian Guan and Xiangang Li},
  title={Speech Emotion Recognition by Combining Amplitude and Phase Information Using Convolutional Neural Network},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1611--1615},
  doi={10.21437/Interspeech.2018-2156},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2156}
}