Speech Emotion Recognition from Variable-Length Inputs with Triplet Loss Function

Jian Huang, Ya Li, Jianhua Tao, Zhen Lian


Automatic emotion recognition is a crucial element in understanding human behavior and interaction. Prior work on speech emotion recognition has focused on exploring various feature sets and models. In contrast, we propose a triplet framework based on a Long Short-Term Memory (LSTM) neural network for speech emotion recognition. The system learns a mapping from acoustic features to discriminative embedding features, which then serve as the input to an SVM at test time. The proposed model is trained with a triplet loss and a supervised loss simultaneously: the triplet loss shortens intra-class distances and lengthens inter-class distances, while the supervised loss incorporates class label information. To handle variable-length inputs, we explore three different strategies that also make better use of temporal dynamics. Our experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database show that the proposed methods improve performance. We demonstrate the promise of the triplet framework for speech emotion recognition and present our analysis.
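
The joint objective described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything specific here is an assumption: the hinge form of the triplet loss, the Euclidean distance, the `margin` and `weight` values, the use of softmax cross-entropy as the supervised loss, and all function names. The embeddings stand in for the fixed-size vectors an LSTM would produce from each utterance.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form, not taken from the paper).

    It is zero once the negative is at least `margin` farther from the
    anchor than the positive is, which pulls same-class embeddings
    together and pushes different-class embeddings apart.
    """
    d_ap = euclidean(anchor, positive)   # intra-class distance
    d_an = euclidean(anchor, negative)   # inter-class distance
    return max(0.0, d_ap - d_an + margin)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def joint_loss(anchor, positive, negative, logits, label,
               margin=1.0, weight=1.0):
    """Triplet loss plus a weighted supervised loss (weight is hypothetical)."""
    return (triplet_loss(anchor, positive, negative, margin)
            + weight * cross_entropy(logits, label))


if __name__ == "__main__":
    anchor = [0.0, 0.0]      # embedding of an utterance
    positive = [0.0, 1.0]    # same emotion class as the anchor
    negative = [0.0, 1.5]    # different emotion class
    logits = [0.0, 0.0]      # class scores for the anchor
    print(joint_loss(anchor, positive, negative, logits, label=0))
```

In this toy example the negative is only 0.5 farther from the anchor than the positive, so the triplet term stays positive until training widens that gap to the margin. The supervised term is what ties the embedding space to the emotion labels.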


DOI: 10.21437/Interspeech.2018-1432

Cite as: Huang, J., Li, Y., Tao, J., Lian, Z. (2018) Speech Emotion Recognition from Variable-Length Inputs with Triplet Loss Function. Proc. Interspeech 2018, 3673-3677, DOI: 10.21437/Interspeech.2018-1432.


@inproceedings{Huang2018,
  author={Jian Huang and Ya Li and Jianhua Tao and Zhen Lian},
  title={Speech Emotion Recognition from Variable-Length Inputs with Triplet Loss Function},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3673--3677},
  doi={10.21437/Interspeech.2018-1432},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1432}
}