ISCA Archive AVSP 2019

Learning Salient Features for Multimodal Emotion Recognition with Recurrent Neural Networks and Attention Based Fusion

Darshana Priyasad, Tharindu Fernando, Simon Denman, Sridha Sridharan, Clinton Fookes

Automatic emotion recognition is a challenging task since emotion is communicated through multiple modalities. Deep Convolutional Neural Networks (DCNNs) and transfer learning have shown success in automatic emotion recognition across different modalities. However, significant improvement in accuracy is still required for practical applications. Existing methods remain ineffective at modelling the temporal relationships within emotional expressions, and at identifying salient features from the different modalities and fusing them to improve accuracy. In this paper, we present an automatic emotion recognition system using audio and visual modalities. VGG19 models are used to capture frame-level facial features, followed by a Long Short-Term Memory (LSTM) network that captures their temporal distribution at the segment level. A separate VGG19 model captures auditory features from Mel Frequency Cepstral Coefficients (MFCCs). The extracted auditory and visual features are fused, and a Deep Neural Network (DNN) with attention performs classification, with final decisions obtained by majority voting. Voice Activity Detection (VAD) on the audio stream improves performance by reducing outliers during learning. The system is evaluated using Leave-One-Subject-Out (LOSO) and K-fold cross-validation, and it outperforms state-of-the-art methods on two challenging databases.
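
The following is a minimal PyTorch sketch of the pipeline the abstract describes (VGG19 frame features, an LSTM over the frame sequence, a second VGG19 over MFCC spectrograms, and attention-based fusion before a DNN classifier). Layer sizes, the attention formulation, the number of emotion classes, and input resolutions are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
from torchvision import models


class AudioVisualEmotionNet(nn.Module):
    def __init__(self, num_classes=7, lstm_hidden=256, fused_dim=512):
        super().__init__()
        # Visual branch: VGG19 extracts frame-level facial features.
        vgg_v = models.vgg19(weights=None)
        self.visual_cnn = nn.Sequential(vgg_v.features, vgg_v.avgpool, nn.Flatten())
        # LSTM models the temporal distribution of frame features over a segment.
        self.lstm = nn.LSTM(input_size=512 * 7 * 7, hidden_size=lstm_hidden, batch_first=True)

        # Audio branch: a separate VGG19 over MFCC "images" (here assumed 3-channel, 224x224).
        vgg_a = models.vgg19(weights=None)
        self.audio_cnn = nn.Sequential(vgg_a.features, vgg_a.avgpool, nn.Flatten())
        self.audio_proj = nn.Linear(512 * 7 * 7, lstm_hidden)

        # Soft attention over the fused audio-visual feature vector (assumed form).
        self.attention = nn.Sequential(
            nn.Linear(2 * lstm_hidden, 2 * lstm_hidden),
            nn.Softmax(dim=-1),
        )
        # DNN classifier on the attended features.
        self.classifier = nn.Sequential(
            nn.Linear(2 * lstm_hidden, fused_dim),
            nn.ReLU(),
            nn.Linear(fused_dim, num_classes),
        )

    def forward(self, frames, mfcc):
        # frames: (batch, time, 3, 224, 224); mfcc: (batch, 3, 224, 224)
        b, t = frames.shape[:2]
        frame_feats = self.visual_cnn(frames.reshape(b * t, *frames.shape[2:]))
        frame_feats = frame_feats.reshape(b, t, -1)
        _, (h_n, _) = self.lstm(frame_feats)          # last hidden state summarises the segment
        visual_feat = h_n[-1]                         # (batch, lstm_hidden)

        audio_feat = self.audio_proj(self.audio_cnn(mfcc))  # (batch, lstm_hidden)

        fused = torch.cat([visual_feat, audio_feat], dim=-1)
        attended = fused * self.attention(fused)      # attention-weighted fusion
        return self.classifier(attended)              # segment-level emotion logits

At inference, segment-level logits from an utterance would be converted to class predictions and combined by majority voting, mirroring the decision scheme the abstract describes; VAD would be applied to the audio stream before MFCC extraction.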


doi: 10.21437/AVSP.2019-5

Cite as: Priyasad, D., Fernando, T., Denman, S., Sridharan, S., Fookes, C. (2019) Learning Salient Features for Multimodal Emotion Recognition with Recurrent Neural Networks and Attention Based Fusion. Proc. The 15th International Conference on Auditory-Visual Speech Processing, 21-26, doi: 10.21437/AVSP.2019-5

@inproceedings{priyasad19_avsp,
  author={Darshana Priyasad and Tharindu Fernando and Simon Denman and Sridha Sridharan and Clinton Fookes},
  title={{Learning Salient Features for Multimodal Emotion Recognition with Recurrent Neural Networks and Attention Based Fusion}},
  year=2019,
  booktitle={Proc. The 15th International Conference on Auditory-Visual Speech Processing},
  pages={21--26},
  doi={10.21437/AVSP.2019-5}
}