Improving Emotion Identification Using Phone Posteriors in Raw Speech Waveform Based DNN

Mousmita Sarma, Pegah Ghahremani, Daniel Povey, Nagendra Kumar Goel, Kandarpa Kumar Sarma, Najim Dehak


We propose to exploit phone posteriors as an additional feature in a Deep Neural Network (DNN) that recognizes emotions from the raw speech waveform. The proposed DNN setup takes a time-domain approach, learning filters within the network. The frame-level phone posteriors are appended to the feature representation learned by the network, and the combined features are fed to the temporal context modeling layers, which interleave TDNN-LSTM with time-restricted self-attention. Using phone posteriors, we achieve a 16.48% relative error rate improvement on the IEMOCAP categorical problem (with a final weighted accuracy of 75.03%) compared to a DNN setup that uses only the learned time-domain features for temporal context modeling. Further, we study the effect of learning emotion categories by leveraging dimensional primitives in a multi-task learning DNN model.
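
The relative error rate improvement quoted above can be unpacked with a little arithmetic. The sketch below assumes that the error rate is 100% minus the weighted accuracy and that the relative improvement is the reduction in error divided by the baseline error. The function names are illustrative, and the baseline accuracy it computes (about 70.1%) is implied by the two reported numbers under these assumptions; the abstract itself does not state it.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Relative error rate reduction in percent, given accuracies in percent.

    Assumes error rate = 100 - accuracy (an assumption, not stated in the abstract).
    """
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


def implied_baseline_accuracy(new_acc, rel_reduction):
    """Invert the formula above to recover the baseline accuracy in percent."""
    new_err = 100.0 - new_acc
    baseline_err = new_err / (1.0 - rel_reduction / 100.0)
    return 100.0 - baseline_err


# Figures reported in the abstract: 75.03% weighted accuracy, 16.48% relative improvement.
implied = implied_baseline_accuracy(75.03, 16.48)
print(f"Implied baseline weighted accuracy: {implied:.2f}%")

# Round trip: plugging the implied baseline back in recovers the reported improvement.
print(f"Relative error reduction: {relative_error_reduction(implied, 75.03):.2f}%")
```

Read this way, the phone posteriors account for a gain of roughly five points of absolute weighted accuracy over the setup that uses only learned time-domain features.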


DOI: 10.21437/Interspeech.2019-2093

Cite as: Sarma, M., Ghahremani, P., Povey, D., Goel, N.K., Sarma, K.K., Dehak, N. (2019) Improving Emotion Identification Using Phone Posteriors in Raw Speech Waveform Based DNN. Proc. Interspeech 2019, 3925-3929, DOI: 10.21437/Interspeech.2019-2093.


@inproceedings{Sarma2019,
  author={Mousmita Sarma and Pegah Ghahremani and Daniel Povey and Nagendra Kumar Goel and Kandarpa Kumar Sarma and Najim Dehak},
  title={{Improving Emotion Identification Using Phone Posteriors in Raw Speech Waveform Based DNN}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3925--3929},
  doi={10.21437/Interspeech.2019-2093},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2093}
}