Predicting Arousal and Valence from Waveforms and Spectrograms Using Deep Neural Networks

Zixiaofan Yang, Julia Hirschberg


Automatic recognition of spontaneous emotion in conversational speech is an important yet challenging problem. In this paper, we propose a deep neural network model to track continuous emotion changes in the two-dimensional arousal-valence space by combining inputs from raw waveform signals and spectrograms, both of which have been shown to be useful for emotion recognition. The architecture contains a set of convolutional neural network (CNN) layers and bidirectional long short-term memory (BLSTM) layers to account for both temporal and spectral variation and to model contextual content. Experimental results on predicting valence and arousal on the SEMAINE and RECOLA databases show that, by exploiting waveforms and spectrograms as input, the proposed model significantly outperforms models using hand-engineered features. We also compare the effects of waveforms vs. spectrograms and find that waveforms are better at capturing arousal, while spectrograms are better at capturing valence. Moreover, combining information from both inputs further improves performance.
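The dual-input pipeline described above can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: it uses NumPy, a single unidirectional LSTM instead of the paper's BLSTM stack, and scalar per-frame waveform features for brevity; all layer sizes, kernel shapes, and weight initializations are illustrative assumptions.

```python
# Hedged sketch (not the paper's code): convolutional features from a raw
# waveform and a spectrogram are fused per frame, then a recurrent layer
# emits per-frame [arousal, valence] predictions.
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    """Valid 1-D convolution of signal x with kernel w."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

def waveform_features(signal, frame_len, kernel):
    """Conv + max-pool each waveform frame into one scalar feature."""
    n_frames = len(signal) // frame_len
    feats = []
    for f in range(n_frames):
        frame = signal[f * frame_len:(f + 1) * frame_len]
        feats.append(conv1d(frame, kernel).max())  # max-pool over time
    return np.array(feats)

def lstm_forward(x_seq, Wx, Wh, b, h0, c0):
    """Plain LSTM over a sequence of fused feature vectors
    (the paper uses BLSTM; one direction shown for brevity)."""
    h, c, outs = h0, c0, []
    H = len(h0)
    for x in x_seq:
        z = Wx @ x + Wh @ h + b
        i, f, o = (1.0 / (1.0 + np.exp(-z[s:s + H])) for s in (0, H, 2 * H))
        g = np.tanh(z[3 * H:])
        c = f * c + i * g
        h = o * np.tanh(c)
        outs.append(h)
    return np.array(outs)

# Toy inputs: 4 frames of 160-sample waveform and 4 spectrogram frames.
frame_len, n_frames, n_bins, H = 160, 4, 8, 5
wave = rng.standard_normal(frame_len * n_frames)
spec = rng.standard_normal((n_frames, n_bins))

wave_feat = waveform_features(wave, frame_len, rng.standard_normal(9))  # (4,)
spec_feat = spec @ rng.standard_normal((n_bins, 3))  # conv-like projection, (4, 3)
fused = np.hstack([wave_feat[:, None], spec_feat])   # fused per-frame features, (4, 4)

D = fused.shape[1]
h = lstm_forward(fused,
                 0.1 * rng.standard_normal((4 * H, D)),
                 0.1 * rng.standard_normal((4 * H, H)),
                 np.zeros(4 * H), np.zeros(H), np.zeros(H))

av = h @ rng.standard_normal((H, 2))  # per-frame [arousal, valence]
print(av.shape)
```

In this sketch the two input streams are reduced to the same frame rate before fusion; the per-frame predictions correspond to the continuous arousal-valence traces the model is trained to track.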


DOI: 10.21437/Interspeech.2018-2397

Cite as: Yang, Z., Hirschberg, J. (2018) Predicting Arousal and Valence from Waveforms and Spectrograms Using Deep Neural Networks. Proc. Interspeech 2018, 3092-3096, DOI: 10.21437/Interspeech.2018-2397.


@inproceedings{Yang2018,
  author={Zixiaofan Yang and Julia Hirschberg},
  title={Predicting Arousal and Valence from Waveforms and Spectrograms Using Deep Neural Networks},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3092--3096},
  doi={10.21437/Interspeech.2018-2397},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2397}
}