Emotion Recognition from Variable-Length Speech Segments Using Deep Learning on Spectrograms

Xi Ma, Zhiyong Wu, Jia Jia, Mingxing Xu, Helen Meng, Lianhong Cai


In this work, an approach to emotion recognition is proposed for variable-length speech segments by applying deep neural networks directly to spectrograms. The spectrogram carries comprehensive paralinguistic information that is useful for emotion recognition. We extract such information from spectrograms and accomplish the emotion recognition task by combining Convolutional Neural Networks (CNNs) with Recurrent Neural Networks (RNNs). To handle variable-length speech segments, we propose a specially designed neural network structure that accepts variable-length speech sentences directly as input. Compared to traditional methods that split a sentence into smaller fixed-length segments, our method avoids the accuracy degradation introduced by the segmentation process. We evaluated the emotion recognition model on the IEMOCAP dataset over four emotions. Experimental results demonstrate that the proposed method outperforms the fixed-length neural network on both weighted accuracy (WA) and unweighted accuracy (UA).
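The abstract describes a CNN front end over spectrograms followed by an RNN that summarizes the time axis, so the network can consume utterances of any length without fixed-size segmentation. The paper does not specify its layer configuration here; the following is a minimal illustrative sketch in PyTorch (all layer sizes are assumptions, not the authors' architecture). The key trick shown is pooling only along the frequency axis, so the time dimension stays variable for the recurrent layer.

```python
# Hypothetical CNN+RNN sketch (NOT the authors' exact model): convolutions
# extract local spectro-temporal features, a GRU summarizes the variable
# time axis, and a linear layer classifies four emotions.
import torch
import torch.nn as nn

class CRNNEmotion(nn.Module):
    def __init__(self, n_mels=128, n_classes=4, hidden=64):
        super().__init__()
        # Pool only along frequency (kernel (2, 1)) so the number of
        # time frames is preserved for the recurrent layer.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
        )
        self.rnn = nn.GRU(32 * (n_mels // 4), hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, spec):                  # spec: (batch, 1, n_mels, frames)
        h = self.cnn(spec)                    # (batch, 32, n_mels // 4, frames)
        h = h.permute(0, 3, 1, 2).flatten(2)  # (batch, frames, features)
        _, last = self.rnn(h)                 # final hidden state summarizes time
        return self.fc(last[-1])              # (batch, n_classes) logits

model = CRNNEmotion()
for frames in (80, 200):                      # two different utterance lengths
    logits = model(torch.randn(2, 1, 128, frames))
    print(tuple(logits.shape))                # (2, 4) for either length
```

Because the recurrent layer absorbs the time dimension, the same forward pass works for an 80-frame and a 200-frame spectrogram, which is the property the paper exploits to avoid splitting sentences into fixed-length chunks.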


 DOI: 10.21437/Interspeech.2018-2228

Cite as: Ma, X., Wu, Z., Jia, J., Xu, M., Meng, H., Cai, L. (2018) Emotion Recognition from Variable-Length Speech Segments Using Deep Learning on Spectrograms. Proc. Interspeech 2018, 3683-3687, DOI: 10.21437/Interspeech.2018-2228.


@inproceedings{Ma2018,
  author={Xi Ma and Zhiyong Wu and Jia Jia and Mingxing Xu and Helen Meng and Lianhong Cai},
  title={Emotion Recognition from Variable-Length Speech Segments Using Deep Learning on Spectrograms},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3683--3687},
  doi={10.21437/Interspeech.2018-2228},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2228}
}