A proposal for emotion recognition using speech features, transfer learning and convolutional neural networks

Roberto Móstoles, David Griol, Zoraida Callejas, Fernando Fernández-Martínez

In this paper, we present a proposal for emotion recognition using audio speech signal features, consisting of two functionally independent systems. First, a voice activity detection (VAD) module acts as a filter prior to the emotion classification task. It extracts features from the input audio and uses an SVM classifier to predict the presence of voice activity. Second, the speech emotion classifier (EMO) transforms the power spectrum of the signal to a Mel scale and extracts a feature vector from it using a convolutional neural network. Emotion labels are assigned using this vector and a KNN classifier. The RAVDESS dataset has been used to train the models, obtaining a maximum accuracy of 93.57% when classifying 8 emotions.
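The abstract does not provide implementation details, but a minimal sketch of the described two-stage pipeline could look like the following. Everything specific in this sketch is an assumption rather than the authors' configuration: a 16 kHz sampling rate, simple energy/spectral statistics as VAD features, an ImageNet-pretrained ResNet-18 as the transferred CNN feature extractor, 128 Mel bands, and k=5 for the KNN classifier.

```python
# Hedged sketch of a VAD (SVM) + EMO (log-Mel -> pretrained CNN -> KNN) pipeline.
# All design choices here (features, ResNet-18, 128 Mel bands, k=5) are assumptions,
# not the configuration reported in the paper.
import numpy as np
import librosa
import torch
from torchvision import models
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

SR = 16000  # assumed sampling rate

# ---------- VAD stage: utterance-level statistics + SVM ----------
def vad_features(y, sr=SR):
    """Simple energy and spectral statistics used to decide speech vs. non-speech."""
    rms = librosa.feature.rms(y=y)[0]
    zcr = librosa.feature.zero_crossing_rate(y)[0]
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
    return np.array([rms.mean(), rms.std(),
                     zcr.mean(), zcr.std(),
                     centroid.mean(), centroid.std()])

vad_clf = SVC(kernel="rbf")  # fit on labelled speech / non-speech examples

# ---------- EMO stage: log-Mel spectrogram -> pretrained CNN -> KNN ----------
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
cnn = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()  # drop the FC head

def emo_embedding(y, sr=SR):
    """Map an utterance to a 512-d embedding via a log-Mel 'image' and the CNN."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    logmel = librosa.power_to_db(mel, ref=np.max)
    # Normalise to [0, 1] and replicate to 3 channels for the ImageNet model.
    img = (logmel - logmel.min()) / (logmel.max() - logmel.min() + 1e-8)
    x = torch.tensor(img, dtype=torch.float32).unsqueeze(0).repeat(3, 1, 1)
    with torch.no_grad():
        emb = cnn(x.unsqueeze(0)).flatten()
    return emb.numpy()

emo_clf = KNeighborsClassifier(n_neighbors=5)  # fit on embeddings of the 8 emotion classes

# Example usage after both classifiers have been fit on training data:
#   y, _ = librosa.load("utterance.wav", sr=SR)
#   if vad_clf.predict([vad_features(y)])[0] == 1:
#       emotion = emo_clf.predict([emo_embedding(y)])[0]
```

In this sketch the CNN is frozen and only used as a fixed feature extractor, which matches the transfer-learning framing of the title; whether the authors fine-tune the network or use a different backbone is not stated in the abstract.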

doi: 10.21437/IberSPEECH.2021-12

Móstoles, R., Griol, D., Callejas, Z., Fernández-Martínez, F. (2021) A proposal for emotion recognition using speech features, transfer learning and convolutional neural networks. Proc. IberSPEECH 2021, 56-60, doi: 10.21437/IberSPEECH.2021-12.