ISCA Archive AVSP 2019

Multi-Modal Speech Emotion Recognition Using Speech Embeddings and Audio Features

Krishna D N, Sai Sumith Reddy

In this work, we propose a multi-modal emotion recognition model to improve speech emotion recognition performance. We use two parallel bidirectional LSTM (Bi-LSTM) networks, an acoustic encoder (ENC1) and a speech embedding encoder (ENC2). The acoustic encoder takes a sequence of speech features as input, and the speech embedding encoder takes a sequence of speech embeddings as input. The hidden representations at the last time step of both Bi-LSTMs are concatenated and passed to a classifier, which predicts the emotion label for the utterance. The speech embeddings are learned with the encoder-decoder framework described in [1], using skip-gram [2] training. These embeddings have been shown to outperform word embeddings (Word2Vec) on many word similarity benchmarks. They have also been shown to capture semantic information present in speech and to handle speech variability, which plain text cannot. We compare our model with a word embedding based model, in which Word2Vec embeddings are fed to ENC2 and speech features to ENC1, and observe that the speech embedding based model gives better results. We also compare our system with previous multi-modal emotion recognition models that use text and speech features, and obtain an absolute improvement of 2.59% over the previous systems [8] on the IEMOCAP dataset. Finally, we compare speech embedding dimensions of 50, 100, 200, and 300, and observe that the 50-dimensional speech embedding achieves 68.59% accuracy.
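
The two-encoder fusion described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the class names (LSTM, BiLSTMEncoder, FusionClassifier), the hidden size, the number of emotion classes, and the random, untrained weights are all assumptions made for the example. It also assumes that the "last time step" representation of a Bi-LSTM is the final state of the forward pass concatenated with the final state of the backward pass, which the abstract does not spell out.

```python
import math
import random


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def _rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LSTM:
    """Single-direction LSTM with random (untrained) weights; illustrative only."""

    GATES = "ifoc"  # input, forget, output, candidate

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix per gate, applied to the concatenation [x_t ; h_{t-1}].
        self.W = {k: _rand_matrix(rng, hidden_dim, input_dim + hidden_dim)
                  for k in self.GATES}
        self.b = {k: [0.0] * hidden_dim for k in self.GATES}

    def last_hidden(self, seq):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in seq:
            z = list(x) + h
            pre = {k: [a + b for a, b in zip(_matvec(self.W[k], z), self.b[k])]
                   for k in self.GATES}
            i = [_sigmoid(v) for v in pre["i"]]
            f = [_sigmoid(v) for v in pre["f"]]
            o = [_sigmoid(v) for v in pre["o"]]
            cand = [math.tanh(v) for v in pre["c"]]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, cand)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h


class BiLSTMEncoder:
    """Runs one LSTM left-to-right and one right-to-left over the sequence."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.fwd = LSTM(input_dim, hidden_dim, rng)
        self.bwd = LSTM(input_dim, hidden_dim, rng)

    def encode(self, seq):
        # Assumption: use the final state of each direction (2 * hidden_dim values).
        return self.fwd.last_hidden(seq) + self.bwd.last_hidden(list(reversed(seq)))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class FusionClassifier:
    """ENC1 (acoustic) and ENC2 (speech embedding) fused by concatenation."""

    def __init__(self, acoustic_dim, embed_dim, hidden_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.enc1 = BiLSTMEncoder(acoustic_dim, hidden_dim, rng)
        self.enc2 = BiLSTMEncoder(embed_dim, hidden_dim, rng)
        # Linear layer over the concatenated encoder outputs (4 * hidden_dim values).
        self.W_out = _rand_matrix(rng, num_classes, 4 * hidden_dim)

    def predict_proba(self, acoustic_seq, embedding_seq):
        fused = self.enc1.encode(acoustic_seq) + self.enc2.encode(embedding_seq)
        return softmax(_matvec(self.W_out, fused))
```

Calling FusionClassifier(acoustic_dim=8, embed_dim=50, hidden_dim=4, num_classes=4).predict_proba(frames, embeddings) returns one probability per emotion class for the utterance. The two input sequences may have different lengths, since each encoder reduces its own sequence to a fixed-size vector before the concatenation. Only the embedding size of 50 comes from the abstract; the other sizes are placeholders.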


doi: 10.21437/AVSP.2019-4

Cite as: N, K.D., Reddy, S.S. (2019) Multi-Modal Speech Emotion Recognition Using Speech Embeddings and Audio Features. Proc. The 15th International Conference on Auditory-Visual Speech Processing, 16-20, doi: 10.21437/AVSP.2019-4

@inproceedings{n19_avsp,
  author={Krishna D N and Sai Sumith Reddy},
  title={{Multi-Modal Speech Emotion Recognition Using Speech Embeddings and Audio Features}},
  year=2019,
  booktitle={Proc. The 15th International Conference on Auditory-Visual Speech Processing},
  pages={16--20},
  doi={10.21437/AVSP.2019-4}
}