EUROSPEECH 2003 - INTERSPEECH 2003
For emotion recognition, we selected pitch, log energy, formant, mel-band energies, and mel frequency cepstral coefficients (MFCCs) as the base features, and added velocity/ acceleration of pitch and MFCCs to form feature streams. We extracted statistics used for discriminative classifiers, assuming that each stream is a one-dimensional signal. Extracted features were analyzed by using quadratic discriminant analysis (QDA) and support vector machine (SVM). Experimental results showed that pitch and energy were the most important factors. Using two different kinds of databases, we compared emotion recognition performance of various classifiers: SVM, linear discriminant analysis (LDA), QDA and hidden Markov model (HMM). With the text-independent SUSAS database, we achieved the best accuracy of 96.3% for stressed/neutral style classification and 70.1% for 4-class speaking style classification using Gaussian SVM, which is superior to the previous results. With the speaker-independent AIBO database, we achieved 42.3% accuracy for 5-class emotion recognition.
Bibliographic reference. Kwon, Oh-Wook / Chan, Kwokleung / Hao, Jiucang / Lee, Te-Won (2003): "Emotion recognition by speech signals", In EUROSPEECH-2003, 125-128.