Emotion Identification from Raw Speech Signals Using DNNs

Mousmita Sarma, Pegah Ghahremani, Daniel Povey, Nagendra Kumar Goel, Kandarpa Kumar Sarma, Najim Dehak


We investigate a number of Deep Neural Network (DNN) architectures for emotion identification on the IEMOCAP database. First, we compare feature-extraction front-ends: high-dimensional MFCC input (equivalent to filterbanks) versus frequency-domain and time-domain approaches to learning filters as part of the network. We obtain the best results with the time-domain filter-learning approach. Next, we investigate different ways to aggregate information over the duration of an utterance: approaches with a single label per utterance and time aggregation inside the network, and approaches where the label is repeated for each frame. Having a label on every frame seemed to work best. The best architecture that we tried interleaves TDNN-LSTM with time-restricted self-attention and achieves a weighted accuracy of 70.6%, versus 61.8% for the best previously published system, which used 257-dimensional Fourier log-energies as input.
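
The per-frame labelling scheme and the weighted-accuracy metric mentioned above can be illustrated with a small sketch. It is not taken from the paper: the function names are hypothetical, averaging frame posteriors to reach an utterance-level decision is an assumption (the abstract does not say how frame outputs are combined at test time), and "weighted accuracy" is taken in its common sense for IEMOCAP, the fraction of all utterances classified correctly, with "unweighted accuracy" (mean per-class recall) shown for contrast.

```python
from collections import Counter


def frame_labels(utt_label, num_frames):
    """Repeat one utterance-level label for every frame (per-frame targets)."""
    return [utt_label] * num_frames


def utterance_decision(frame_posteriors):
    """Average per-frame class posteriors and return the arg-max class.

    Averaging is an assumed aggregation rule, used here for illustration only.
    """
    num_classes = len(frame_posteriors[0])
    totals = [0.0] * num_classes
    for posterior in frame_posteriors:
        for cls, prob in enumerate(posterior):
            totals[cls] += prob
    means = [t / len(frame_posteriors) for t in totals]
    return max(range(num_classes), key=means.__getitem__)


def weighted_accuracy(refs, hyps):
    """Overall accuracy: correct utterances divided by all utterances."""
    correct = sum(r == h for r, h in zip(refs, hyps))
    return correct / len(refs)


def unweighted_accuracy(refs, hyps):
    """Mean of per-class recalls, so every class counts equally."""
    totals = Counter(refs)
    hits = Counter(r for r, h in zip(refs, hyps) if r == h)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


if __name__ == "__main__":
    # Toy data: two classes, three frames of posteriors for one utterance.
    print(frame_labels(1, 3))
    print(utterance_decision([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]))
    # Toy reference and hypothesis labels for four utterances.
    refs, hyps = [0, 0, 0, 1], [0, 0, 1, 1]
    print(weighted_accuracy(refs, hyps), unweighted_accuracy(refs, hyps))
```

On class-imbalanced data the two metrics can differ noticeably, so it is worth checking which one a published figure refers to before comparing systems.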


DOI: 10.21437/Interspeech.2018-1353

Cite as: Sarma, M., Ghahremani, P., Povey, D., Goel, N.K., Sarma, K.K., Dehak, N. (2018) Emotion Identification from Raw Speech Signals Using DNNs. Proc. Interspeech 2018, 3097-3101, DOI: 10.21437/Interspeech.2018-1353.


@inproceedings{Sarma2018,
  author={Mousmita Sarma and Pegah Ghahremani and Daniel Povey and Nagendra Kumar Goel and Kandarpa Kumar Sarma and Najim Dehak},
  title={Emotion Identification from Raw Speech Signals Using DNNs},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3097--3101},
  doi={10.21437/Interspeech.2018-1353},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1353}
}