EUROSPEECH 2003 - INTERSPEECH 2003
This paper presents and discusses speaker-dependent emotion recognition with a large set of statistical features. Speaker-dependent emotion recognition currently achieves the best accuracy. Recognition was performed on the English, Slovenian, Spanish, and French InterFace emotional speech databases. All databases include nine speakers. The InterFace databases contain a neutral speaking style and six emotions: disgust, surprise, joy, fear, anger, and sadness. Speech features for emotion recognition were determined in two steps: in the first step, acoustical features were defined, and in the second, statistical features were calculated from the acoustical features. The acoustical features comprise pitch, the derivative of pitch, energy, the derivative of energy, the duration of speech segments, jitter, and shimmer. The statistical features are statistical representations of the acoustical features. In a previous study, the feature vector comprised 26 elements; in this study, the feature vector comprises 144 elements. The new feature set is called the large set of statistical features. Emotion recognition was performed using artificial neural networks. Significant improvement was achieved for all speakers except the Slovenian male and the second English male speaker, for whom the improvement was about 2%. The large set of statistical features improves the accuracy of emotion recognition by about 18% on average.
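The two-step feature construction described above can be sketched as follows. This is a hypothetical illustration only: the abstract does not enumerate which statistics make up the 144-element vector, so the particular statistics, function names, and toy contours below are assumptions, not the authors' actual feature set.

```python
import statistics

def statistical_features(contour):
    # Step 2: summarize one acoustic contour with a fixed set of
    # statistics. This set is assumed for illustration; the paper's
    # full set yields a 144-element vector.
    return [
        statistics.mean(contour),
        statistics.pstdev(contour),
        min(contour),
        max(contour),
        statistics.median(contour),
        max(contour) - min(contour),
    ]

def delta(contour):
    # Derivative of a contour, approximated by frame-to-frame differences.
    return [b - a for a, b in zip(contour, contour[1:])]

def feature_vector(contours):
    # Concatenate the statistics of every acoustic contour
    # (pitch, delta-pitch, energy, delta-energy, ...) into one flat vector.
    vec = []
    for c in contours:
        vec.extend(statistical_features(c))
    return vec

# Toy contours standing in for step 1 (acoustic feature extraction).
pitch = [120.0, 125.0, 130.0, 128.0]
energy = [0.5, 0.7, 0.6, 0.8]
vec = feature_vector([pitch, delta(pitch), energy, delta(energy)])
print(len(vec))  # 4 contours x 6 statistics = 24 elements
```

In the paper's setup, a vector of this kind, per utterance, would then be fed to the artificial neural network classifier.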
Bibliographic reference. Hozjan, Vladimir / Kacic, Zdravko (2003): "Improved emotion recognition with large set of statistical features", In EUROSPEECH-2003, 133-136.