We present a comparison of novel concepts for the robust fusion of prosodic and verbal cues in speech emotion recognition. From each spoken phrase, 276 acoustic features are extracted; for linguistic content analysis we use the Bag-of-Words text representation. This allows acoustic and linguistic features to be integrated into a single vector prior to final classification. Extensive feature selection is performed with filter- and wrapper-based methods: optimal feature sets are obtained via SVM-SFFS, and single-feature relevance is assessed by information gain ratio. Overall classification is realised by diverse ensemble approaches, with Kernel Machines, Decision Trees, Bayesian classifiers, and memory-based learners among the base classifiers. Acoustics-only tests were run on a database comprising 39 speakers for speaker-independent accuracy analysis; additionally, the public Berlin Emotional Speech database is used. A further database of 4,221 movie-related phrases forms the basis for evaluating the combined acoustic and linguistic analysis. Overall, remarkable performance is observed in the discrimination of seven discrete emotions.
Cite as: Schuller, B., Müller, R., Lang, M., Rigoll, G. (2005) Speaker independent emotion recognition by early fusion of acoustic and linguistic features within ensembles. Proc. Interspeech 2005, 805-808, doi: 10.21437/Interspeech.2005-379
@inproceedings{schuller05_interspeech,
  author={Björn Schuller and Ronald Müller and Manfred Lang and Gerhard Rigoll},
  title={{Speaker independent emotion recognition by early fusion of acoustic and linguistic features within ensembles}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={805--808},
  doi={10.21437/Interspeech.2005-379}
}
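The early-fusion idea in the abstract — concatenating acoustic features with a Bag-of-Words vector into one joint vector before classification — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the tiny vocabulary, and the two-value acoustic vector are invented for the example (the paper uses 276 acoustic features and ensembles of trained classifiers downstream).

```python
# Hypothetical sketch of early fusion of acoustic and linguistic features.
# All names and values here are illustrative, not from the paper.

def bag_of_words(phrase, vocabulary):
    # Count how often each vocabulary word occurs in the spoken phrase's
    # transcript (a simple Bag-of-Words representation).
    tokens = phrase.lower().split()
    return [tokens.count(word) for word in vocabulary]

def early_fusion(acoustic_features, phrase, vocabulary):
    # Early fusion: acoustic and linguistic features are joined into a
    # single vector prior to any classification step.
    return list(acoustic_features) + bag_of_words(phrase, vocabulary)

# Example: two made-up acoustic features (e.g. energy, mean pitch in Hz)
# fused with counts over a toy three-word vocabulary.
vocab = ["great", "terrible", "fine"]
fused = early_fusion([0.42, 180.0], "this is great really great", vocab)
# fused == [0.42, 180.0, 2, 0, 0]
```

A classifier (in the paper, e.g. an SVM within an ensemble) would then be trained on such fused vectors rather than on the acoustic or linguistic features alone.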