Speech Emotion Recognition Using Spectrogram & Phoneme Embedding

Promod Yenigalla, Abhay Kumar, Suraj Tripathi, Chirag Singh, Sibsambhu Kar, Jithendra Vepa


This paper proposes a speech emotion recognition method based on phoneme sequences and spectrograms. Both representations retain the emotional content of speech, which is lost when speech is converted to text. We performed experiments with several deep neural network architectures using phonemes and spectrograms as inputs. Three of those architectures are presented here; they achieved better accuracy than state-of-the-art methods on a benchmark dataset. A combined phoneme-and-spectrogram CNN model proved the most accurate at recognizing emotions on the IEMOCAP data, improving overall accuracy and average class accuracy by more than 4% over existing state-of-the-art methods.
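As a rough illustration of the spectrogram input described above, the sketch below computes a log-magnitude STFT spectrogram from a raw waveform with NumPy. The frame length, hop size, and FFT size here are hypothetical defaults; the paper does not specify its exact front-end settings in this abstract.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Log-magnitude STFT spectrogram of a 1-D waveform.

    Parameters (frame_len, hop, n_fft) are illustrative assumptions,
    not the paper's actual configuration.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the waveform into overlapping windowed frames
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum per frame; small epsilon avoids log(0)
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(spec + 1e-8)  # shape: (n_frames, n_fft // 2 + 1)

# Usage: 1 second of a 440 Hz tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
S = log_spectrogram(wave)
print(S.shape)  # → (98, 257)
```

In the paper's combined model, a 2-D image like `S` would feed the spectrogram CNN branch, while a separate branch consumes the phoneme-embedding sequence.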


DOI: 10.21437/Interspeech.2018-1811

Cite as: Yenigalla, P., Kumar, A., Tripathi, S., Singh, C., Kar, S., Vepa, J. (2018) Speech Emotion Recognition Using Spectrogram & Phoneme Embedding. Proc. Interspeech 2018, 3688-3692, DOI: 10.21437/Interspeech.2018-1811.


@inproceedings{Yenigalla2018,
  author={Promod Yenigalla and Abhay Kumar and Suraj Tripathi and Chirag Singh and Sibsambhu Kar and Jithendra Vepa},
  title={Speech Emotion Recognition Using Spectrogram \& Phoneme Embedding},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3688--3692},
  doi={10.21437/Interspeech.2018-1811},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1811}
}